Hi All,

I would like to open up a discussion for a new feature to support streaming ingestion in CarbonData.
Please refer to the design document (draft) at the link below:
https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing

Your comments/suggestions are welcome. Here are some high-level points.

Rationale:
The current ways of adding user data to a CarbonData table are the LOAD statement and a SELECT query with INSERT INTO. These methods add data in bulk into a new segment of the table; essentially, they are batch insertions. However, with the increasing demand for real-time analytics on streaming frameworks, CarbonData needs a way to insert streaming data continuously into a CarbonData table. CarbonData needs support for continuous, faster ingestion into a CarbonData table, and it must make the ingested data available for querying. CarbonData can leverage the newly introduced V3 format to append streaming data to an existing carbon table.

Requirements:
Following are some high-level requirements:
1. CarbonData shall create a new segment (streaming segment) for each streaming session. Concurrent streaming ingestion into the same table will create separate streaming segments.
2. CarbonData shall use a write-optimized format (instead of the multi-layered indexed columnar format) to support ingestion of streaming data into a CarbonData table.
3. CarbonData shall create the streaming segment folder and open a streaming data file in append mode to write data. CarbonData should avoid creating many small files by appending to an existing file (see the append/hflush sketch at the end of this mail).
4. The data stored in the new streaming segment shall be available for query as soon as it is written to disk (hflush/hsync). In other words, CarbonData readers should be able to query the data written to the streaming segment so far.
5. CarbonData should acknowledge the write operation status back to the output sink/upper-layer streaming engine so that, in case of a write failure, the streaming engine can restart the operation and maintain exactly-once delivery semantics.
6. The CarbonData compaction process shall support compacting data from the write-optimized streaming segment into the regular read-optimized columnar CarbonData format.
7. CarbonData readers should maintain read consistency by means of timestamps.
8. Maintain durability: in case of a write failure, CarbonData should be able to recover to the latest commit status. This may require maintaining the source and destination offsets of the last commit in metadata (see the commit-metadata sketch at the end of this mail).

This feature can be done in phases:
Phase 1: Add the basic framework and writer support to allow Spark Structured Streaming into CarbonData (see the writer sketch at the end of this mail). This phase may or may not have append support. Add reader support to read streaming data files.
Phase 2: Add append support if not done in Phase 1. Maintain append offsets and metadata information.
Phase 3: Add support for external streaming sources such as Kafka via Spark Structured Streaming; maintain topics/partitions/offsets and support fault tolerance.
Phase 4: Add support for other streaming frameworks, such as Flink, Beam, etc.
Phase 5: Future support for an in-memory cache for buffering streaming data, support for union with Spark Structured Streaming to serve results directly from the stream, and support for time-series data.

Best Regards,
Aniket
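P.S. To make the discussion more concrete, here are a few rough sketches in Scala. They are illustrative only; the sink name, options, and metadata fields are assumptions made for this proposal, not existing APIs.

Sketch 1 - how a user might push a Spark Structured Streaming query into a CarbonData table (Phase 1). The "carbondata" sink and the dbName/tableName options are placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object CarbonStreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CarbonStreamingIngest").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("name", StringType),
      StructField("salary", DoubleType)))

    // Any structured streaming source works; a file source keeps the sketch simple.
    val input = spark.readStream.schema(schema).csv("/tmp/streaming_input")

    // Hypothetical CarbonData streaming sink: each micro-batch is appended to the
    // table's streaming segment and the result is acknowledged back to the engine.
    val query = input.writeStream
      .format("carbondata")                             // assumed sink name
      .option("dbName", "default")                      // assumed option
      .option("tableName", "sales_stream")              // assumed option
      .option("checkpointLocation", "/tmp/carbon_ckpt") // standard structured streaming option
      .start()

    query.awaitTermination()
  }
}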
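Sketch 2 - the append + hflush idea behind requirements 3 and 4, expressed with plain HDFS APIs; the streaming-segment file layout used here is only an assumption:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StreamingSegmentAppendSketch {
  // Append one micro-batch worth of rows to the segment's single data file.
  def appendBatch(segmentFile: String, rows: Seq[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    val path = new Path(segmentFile)

    // Open the existing streaming data file in append mode (create it once per
    // segment) instead of producing a new small file per micro-batch.
    val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
    try {
      rows.foreach(row => out.writeBytes(row + "\n"))
      // hflush makes the appended bytes visible to readers before the file is
      // closed, so queries can see the streaming segment written so far.
      out.hflush()
    } finally {
      out.close()
    }
  }
}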
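Sketch 3 - one possible shape for the commit metadata mentioned in requirement 8. The field names are purely illustrative:

// Persisted once per successful append; on recovery the writer ignores bytes past
// fileOffset and asks the source to replay from sourceOffsets, which is what gives
// the exactly-once behaviour asked for in requirement 5.
case class StreamingCommit(
    segmentId: String,                 // streaming segment this commit belongs to
    batchId: Long,                     // structured streaming micro-batch id
    sourceOffsets: Map[String, Long],  // e.g. Kafka topic-partition -> offset
    fileOffset: Long,                  // byte offset in the data file after the append
    commitTimeMs: Long)                // timestamp used for read consistency (req. 7)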