Hi Aniket,

Thanks for your great contribution. The feature of ingesting streaming data
into CarbonData would be very useful for real-time query scenarios.

Some inputs from my side:

1. I agree with approach 2 for the streaming file format; query performance
must be ensured.
2. Will compaction be supported for streaming-ingested data, so that indexes
can be added?
--------------------------------------------------------------------------------------------
CarbonData shall use write optimized format (instead of multi-layered 
indexed columnar format) to support ingestion of streaming data into a 
CarbonData table. 

3. Which streaming processing systems will the first version of the streaming
ingestion feature support?
Structured Streaming and Kafka? Any others?

Regards
Liang


Aniket Adnaik wrote
> Hi All,
> 
> I would like to open up a discussion for new feature to support streaming
> ingestion in CarbonData.
> 
> Please refer to design document(draft) in the link below.
>       https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M
> /view?usp=sharing
> 
> Your comments/suggestions are welcome.
> Here are some high level points.
> 
> Rationale:
> The current ways of adding user data to a CarbonData table are the LOAD
> statement and the INSERT INTO statement with a SELECT query. These methods
> add a bulk of data to the CarbonData table as a new segment; essentially,
> they are batch insertions of bulk data. However, with the increasing demand
> for real-time data analytics with streaming frameworks, CarbonData needs a
> way to insert streaming data continuously into a CarbonData table. It needs
> support for continuous, faster ingestion that makes the data available for
> querying.
> 
> CarbonData can leverage our newly introduced V3 format to append
> streaming data to an existing carbon table.
> 
> 
> Requirements:
> 
> Following are some high-level requirements:
> 1.  CarbonData shall create a new segment (Streaming Segment) for each
> streaming session. Concurrent streaming ingestion into the same table will
> create separate streaming segments.
> 
> 2.  CarbonData shall use write optimized format (instead of multi-layered
> indexed columnar format) to support ingestion of streaming data into a
> CarbonData table.
> 
> 3.  CarbonData shall create a streaming segment folder and open a streaming
> data file in append mode to write data. CarbonData should avoid creating
> multiple small files by appending to an existing file.
> 
> 4.  The data stored in the new streaming segment shall be available for
> query after it is written to disk (hflush/hsync). In other words, CarbonData
> readers should be able to query the data written to the streaming segment so
> far.
> 
> 5.  CarbonData should acknowledge the write operation status back to the
> output sink/upper-layer streaming engine so that, in the case of a write
> failure, the streaming engine can restart the operation and maintain
> exactly-once delivery semantics.
> 
> 6.  CarbonData Compaction process shall support compacting data from
> write-optimized streaming segment to regular read optimized columnar
> CarbonData format.
> 
> 7.  CarbonData readers should maintain the read consistency by means of
> using timestamp.
> 
> 8.  Maintain durability - in the case of a write failure, CarbonData should
> be able to recover to the latest commit status. This may require maintaining
> the source and destination offsets of the last commits in metadata.
> 
> This feature can be done in phases:
> 
> Phase 1: Add the basic framework and writer support to allow Spark
> Structured Streaming into CarbonData. This phase may or may not have append
> support. Add reader support to read streaming data files.
> 
> Phase 2: Add append support if not done in Phase 1. Maintain append
> offsets and metadata information.
> 
> Phase 3: Add support for external streaming frameworks such as Kafka
> streaming using Spark Structured Streaming; maintain
> topics/partitions/offsets and support fault tolerance.
> 
> Phase 4: Add support for other streaming frameworks, such as Flink, Beam,
> etc.
> 
> Phase 5: Future support for an in-memory cache for buffering streaming
> data, support for union with Spark Structured Streaming so queries can be
> served directly from it, and support for time-series data.
> 
> Best Regards,
> Aniket
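To make the durability discussion concrete, here is a toy sketch of how
requirements 4, 5, and 8 above could fit together: appends are flushed to
disk before the write is acknowledged, a separate metadata file records the
last committed offset, and a restarted writer truncates any torn write back
to that offset. Local file appends with fsync stand in for HDFS hflush/hsync;
all class and file names here are invented for illustration and are not
CarbonData APIs.

```python
import os

class StreamingSegmentWriter:
    """Toy model of an append-only streaming segment with commit offsets.
    Not CarbonData code; os.fsync stands in for HDFS hflush/hsync."""

    def __init__(self, segment_path, meta_path):
        self.segment_path = segment_path
        self.meta_path = meta_path
        self._recover()

    def _last_committed(self):
        if not os.path.exists(self.meta_path):
            return 0
        with open(self.meta_path) as f:
            return int(f.read().strip() or 0)

    def _recover(self):
        # Durability (req. 8): on restart, truncate any bytes written past
        # the last commit so the streaming engine can safely replay them.
        committed = self._last_committed()
        if os.path.exists(self.segment_path):
            with open(self.segment_path, "r+b") as f:
                f.truncate(committed)
        else:
            open(self.segment_path, "wb").close()

    def append_batch(self, records):
        # Append and flush to disk (req. 4), then atomically record the new
        # committed offset before acknowledging success (req. 5).
        with open(self.segment_path, "ab") as f:
            for rec in records:
                f.write((rec + "\n").encode())
            f.flush()
            os.fsync(f.fileno())  # stand-in for hflush/hsync
            committed = f.tell()
        tmp = self.meta_path + ".tmp"
        with open(tmp, "w") as m:
            m.write(str(committed))
        os.replace(tmp, self.meta_path)  # atomic commit of the offset
        return committed

    def read_committed(self):
        # Readers see only data up to the last committed offset.
        committed = self._last_committed()
        with open(self.segment_path, "rb") as f:
            data = f.read(committed)
        return data.decode().splitlines()
```

If the process dies between the fsync and the offset commit, the engine never
receives an acknowledgement, replays the batch, and recovery truncates the
uncommitted bytes first, which is one way to keep delivery exactly-once.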





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9803.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.