Hi All,

I would like to open a discussion on a new feature to support streaming
ingestion in CarbonData.

Please refer to the design document (draft) at the link below:
      https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing

Your comments/suggestions are welcome.
Here are some high-level points.

Rationale:
The current ways of adding user data to a CarbonData table are the LOAD
statement and INSERT INTO ... SELECT. Both methods add a bulk of data to the
table as a new segment; essentially, they are batch insertion of bulk data.
However, with the increasing demand for real-time data analytics on top of
streaming frameworks, CarbonData needs a way to ingest streaming data
continuously, with faster ingestion into the CarbonData table, and make it
available for querying.
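
For reference, the existing batch paths look roughly like this (standard
CarbonData SQL issued through spark.sql; the table and file names are made up
for illustration):

  import org.apache.spark.sql.SparkSession

  // Assumes CarbonData is already wired into the session
  // (e.g. via a CarbonSession in the Spark 2.x integration).
  val spark = SparkSession.builder().appName("carbon-batch-load").getOrCreate()

  // Bulk load from files; each LOAD creates a new segment.
  spark.sql("LOAD DATA INPATH 'hdfs://namenode/data/sales.csv' INTO TABLE sales")

  // Bulk insert from another table; the result also lands in a new segment.
  spark.sql("INSERT INTO TABLE sales SELECT * FROM sales_staging")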

CarbonData can leverage the newly introduced V3 format to append
streaming data to an existing carbon table.


Requirements:

Following are some high-level requirements:
1.  CarbonData shall create a new segment (Streaming Segment) for each
streaming session. Concurrent streaming ingestion into the same table will
create separate streaming segments.

2.  CarbonData shall use a write-optimized format (instead of the
multi-layered indexed columnar format) to support ingestion of streaming data
into a CarbonData table.

3.  CarbonData shall create a streaming segment folder and open a streaming
data file in append mode to write data. CarbonData should avoid creating many
small files by appending to an existing file (see the sketch after this list).

4.  The data stored in the new streaming segment shall be available for query
once it is flushed to disk (hflush/hsync). In other words, CarbonData readers
should be able to query the data written to the streaming segment so far.

5.  CarbonData should acknowledge the write operation status back to the
output sink/upper-layer streaming engine so that, in the case of a write
failure, the streaming engine can retry the operation and maintain
exactly-once delivery semantics.

6.  The CarbonData compaction process shall support compacting data from the
write-optimized streaming segment into the regular read-optimized columnar
CarbonData format (an illustration follows after this list).

7.  CarbonData readers should maintain read consistency by means of a
timestamp.

8.  Maintain durability: in case of a write failure, CarbonData should be
able to recover to the latest committed state. This may require maintaining
the source and destination offsets of the last commit in metadata.
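
To make requirements 3 and 4 concrete, here is a minimal sketch of how the
streaming writer could append to a single data file and make the rows visible
with hflush. This is only an illustration: the segment path, file layout and
row encoding below are placeholders, not the actual design.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val conf = new Configuration()
  val fs = FileSystem.get(conf)

  // Hypothetical layout: one data file per streaming segment/session.
  val streamingFile = new Path("/carbon/store/default/sales/Segment_stream_0/part-0")

  // Re-open the same file in append mode so successive micro-batches do not
  // create many small files (requirement 3).
  val out =
    if (fs.exists(streamingFile)) fs.append(streamingFile)
    else fs.create(streamingFile)

  // Placeholder for a batch of rows in the write-optimized format.
  val encodedBatch: Array[Byte] = "row1\nrow2\n".getBytes("UTF-8")
  out.write(encodedBatch)

  // hflush makes the appended bytes visible to readers before the batch is
  // acknowledged back to the streaming engine (requirement 4).
  out.hflush()
  out.close()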
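
For requirement 6, the expectation is that the existing compaction trigger
could be extended to pick up streaming segments. The line below is only an
illustration using today's DDL (issued through spark.sql as in the earlier
batch example); whether streaming segments fall under MINOR/MAJOR compaction
or need a dedicated option is an open design point.

  // A streaming segment would be rewritten into the regular read-optimized
  // columnar format as part of this step.
  spark.sql("ALTER TABLE sales COMPACT 'MAJOR'")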

This feature can be delivered in phases:

Phase-1: Add the basic framework and writer support to allow Spark Structured
Streaming into CarbonData. This phase may or may not have append support.
Add reader support to read streaming data files. (A rough user-side sketch
follows after the phase list.)

Phase-2: Add append support if not done in Phase-1. Maintain append
offsets and metadata information.

Phase-3: Add support for external streaming sources such as Kafka, using
Spark Structured Streaming; maintain topics/partitions/offsets and support
fault tolerance.

Phase-4: Add support for other streaming frameworks, such as Flink, Beam,
etc.

Phase-5: Future support for an in-memory cache to buffer streaming data,
support for union with Spark Structured Streaming so that data can be served
directly from it, and support for time series data.
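
To illustrate what Phase-1 and Phase-3 could look like from the user side,
here is a rough sketch of Spark Structured Streaming reading from Kafka and
writing into a carbon table. The "carbondata" sink format and its options are
purely hypothetical at this point; they stand in for whatever the final
design exposes.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("carbon-streaming-ingest").getOrCreate()

  // Phase-3: consume a Kafka topic with Spark Structured Streaming.
  val input = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "sales_events")
    .load()
    .selectExpr("CAST(value AS STRING) AS line")

  // Phase-1: write each micro-batch into a streaming segment of the table.
  val query = input.writeStream
    .format("carbondata")                       // hypothetical sink name
    .option("dbName", "default")                // placeholder option
    .option("tableName", "sales")               // placeholder option
    .option("checkpointLocation", "/tmp/carbon-checkpoint")
    .start()

  query.awaitTermination()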

Best Regards,
Aniket
