Hi Aniket,

Thanks for your great contribution. The feature of ingesting streaming data into CarbonData would be very useful for some real-time query scenarios.
Some inputs from my side:

1. I agree with approach 2 for the streaming file format; query performance must be ensured.

2. Will compaction be supported for streaming-ingested data, to add the index, or not?

3. For the first version of the streaming ingestion feature, which kinds of streaming processing systems will be supported? Structured Streaming and Kafka? Any others?

Regards,
Liang

Aniket Adnaik wrote
> Hi All,
>
> I would like to open up a discussion for a new feature to support streaming
> ingestion in CarbonData.
>
> Please refer to the design document (draft) in the link below.
> https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing
>
> Your comments/suggestions are welcome.
> Here are some high-level points.
>
> Rationale:
> The current ways of adding user data to a CarbonData table are via the LOAD
> statement or via a SELECT query with an INSERT INTO statement. These methods
> add a bulk of data into the CarbonData table as a new segment; basically,
> each is a batch insertion of a bulk of data. However, with the increasing
> demand for real-time data analytics with streaming frameworks, CarbonData
> needs a way to insert streaming data continuously into a CarbonData table.
> CarbonData needs support for continuous and faster ingestion into a
> CarbonData table, making the data available for querying.
>
> CarbonData can leverage our newly introduced V3 format to append
> streaming data to an existing carbon table.
>
> Requirements:
>
> Following are some high-level requirements:
>
> 1. CarbonData shall create a new segment (streaming segment) for each
> streaming session. Concurrent streaming ingestion into the same table will
> create separate streaming segments.
>
> 2. CarbonData shall use a write-optimized format (instead of the
> multi-layered indexed columnar format) to support ingestion of streaming
> data into a CarbonData table.
>
> 3. CarbonData shall create a streaming segment folder and open a streaming
> data file in append mode to write data. CarbonData should avoid creating
> multiple small files by appending to an existing file.
>
> 4. The data stored in the new streaming segment shall be available for
> query after it is written to disk (hflush/hsync). In other words, CarbonData
> readers should be able to query the data written to the streaming segment
> so far.
>
> 5. CarbonData should acknowledge the write operation status back to the
> output sink/upper-layer streaming engine, so that in the case of a write
> failure, the streaming engine can restart the operation and maintain
> exactly-once delivery semantics.
>
> 6. The CarbonData compaction process shall support compacting data from the
> write-optimized streaming segment to the regular read-optimized columnar
> CarbonData format.
>
> 7. CarbonData readers should maintain read consistency by means of
> timestamps.
>
> 8. Maintain durability: in case of a write failure, CarbonData should be
> able to recover to the latest commit status. This may require maintaining
> the source and destination offsets of the last commits in metadata.
>
> This feature can be done in phases:
>
> Phase 1: Add the basic framework and writer support to allow Spark
> Structured Streaming into CarbonData. This phase may or may not have append
> support. Add reader support to read streaming data files.
>
> Phase 2: Add append support if not done in Phase 1. Maintain append
> offsets and metadata information.
>
> Phase 3: Add support for external streaming frameworks such as Kafka
> streaming using Spark Structured Streaming; maintain
> topics/partitions/offsets and support fault tolerance.
>
> Phase 4: Add support for other streaming frameworks, such as Flink, Beam,
> etc.
>
> Phase 5: Future support for an in-memory cache for buffering streaming
> data, support for union with Spark Structured Streaming to serve directly
> from Spark Structured Streaming, and support for time-series data.
>
> Best Regards,
> Aniket

--
View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9803.html
Sent from the Apache CarbonData Mailing List archive at Nabble.com.
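To illustrate the exactly-once delivery point (requirements 5 and 8 in the quoted list), here is a minimal, hypothetical Java sketch of one common way to achieve it: the sink durably records the source offset of each committed batch, and on replay after a failure it skips any batch whose offset it has already committed (idempotent replay). The class and method names are invented for illustration and are not CarbonData APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a streaming-segment sink that tracks the last
// committed source offset so that replayed batches are deduplicated,
// giving exactly-once semantics on top of at-least-once delivery.
class StreamingCommitLog {
    private long lastCommittedOffset = -1;          // offset of the last durable commit
    private final List<String> segmentRows = new ArrayList<>();

    /** Returns true if the batch was written, false if it was a replayed duplicate. */
    boolean commitBatch(long batchOffset, List<String> rows) {
        if (batchOffset <= lastCommittedOffset) {
            return false;                           // batch already committed before the failure
        }
        segmentRows.addAll(rows);                   // append rows to the streaming segment
        lastCommittedOffset = batchOffset;          // then record the new commit offset
        return true;
    }

    long lastCommittedOffset() { return lastCommittedOffset; }

    int rowCount() { return segmentRows.size(); }
}

public class ExactlyOnceDemo {
    public static void main(String[] args) {
        StreamingCommitLog log = new StreamingCommitLog();
        log.commitBatch(0, List.of("a", "b"));
        log.commitBatch(1, List.of("c"));
        // Simulated restart: the engine replays batch 1; the sink ignores it,
        // so no duplicate rows reach the segment.
        boolean written = log.commitBatch(1, List.of("c"));
        System.out.println(written + " " + log.rowCount() + " " + log.lastCommittedOffset());
        // prints: false 3 1
    }
}
```

In a real implementation, the offset would be persisted together with the segment data (e.g. in the segment metadata) so that both survive a crash atomically; this sketch keeps everything in memory only to show the dedup logic.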