Hi Liang,

Thanks. Please see my comments on your questions below.
2. Whether to support compaction for streaming ingested data to add index, or not?

AA>> Yes. Eventually we would need the streaming data files to be compacted
into the regular read-optimized CarbonData format. Compaction can be
triggered based on the number of files in the streaming segment.

3. For the first version of the streaming ingestion feature, which kind of
streaming processing system will be supported? Structured Streaming and
Kafka? Any other?

AA>> For the first phase we can support the file source and the socket
source. For Kafka as a streaming source, some additional functionality needs
to be covered, such as partitioning, Kafka offset management, and consistency
with carbon streaming ingestion, so we may defer it to a later phase.
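To make the phase-1 scope a bit more concrete, below is a rough, untested
sketch of a Structured Streaming job that reads from a socket source and
writes into a CarbonData table. The "carbondata" sink format name and its
options (dbName, tableName) are placeholders for the proposed writer, not an
existing API.

import org.apache.spark.sql.SparkSession

object CarbonStreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngestSketch")
      .getOrCreate()

    // Phase-1 source: the socket source (or a file source) that Spark
    // Structured Streaming already provides out of the box.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Hypothetical CarbonData streaming sink; each micro-batch would be
    // appended to the write-optimized streaming segment of the target table.
    val query = lines.writeStream
      .format("carbondata")                        // placeholder sink name
      .option("checkpointLocation", "/tmp/carbon_ckpt")
      .option("dbName", "default")                 // assumed option names
      .option("tableName", "stream_table")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}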
Best Regards,
Aniket

On Wed, Mar 29, 2017 at 2:00 AM, Liang Chen <chenliang6...@gmail.com> wrote:
> Hi Aniket
>
> Thanks for your great contribution. The feature of ingesting streaming data
> into CarbonData would be very useful for some real-time query scenarios.
>
> Some inputs from my side:
>
> 1. I agree with approach 2 for the streaming file format; the query
> performance must be ensured.
>
> 2. Whether to support compaction for streaming ingested data to add index,
> or not?
> --------------------------------------------------------------------------
> CarbonData shall use write optimized format (instead of multi-layered
> indexed columnar format) to support ingestion of streaming data into a
> CarbonData table.
>
> 3. For the first version of the streaming ingestion feature, which kind of
> streaming processing system will be supported?
> Structured Streaming and Kafka? Any other?
>
> Regards
> Liang
>
>
> Aniket Adnaik wrote
> > Hi All,
> >
> > I would like to open up a discussion for a new feature to support
> > streaming ingestion in CarbonData.
> >
> > Please refer to the design document (draft) in the link below.
> > https://drive.google.com/file/d/0B71_EuXTdDi8MlFDU2tqZU9BZ3M/view?usp=sharing
> >
> > Your comments/suggestions are welcome.
> > Here are some high level points.
> >
> > Rationale:
> > The current ways of adding user data to a CarbonData table are the LOAD
> > statement and a SELECT query with an INSERT INTO statement. These methods
> > add a bulk of data into the CarbonData table as a new segment; basically,
> > it is a batch insertion of a bulk of data. However, with the increasing
> > demand for real-time data analytics with streaming frameworks, CarbonData
> > needs a way to insert streaming data continuously into a CarbonData
> > table. CarbonData needs support for continuous and faster ingestion into
> > a CarbonData table, making the data available for querying.
> >
> > CarbonData can leverage the newly introduced V3 format to append
> > streaming data to an existing carbon table.
> >
> > Requirements:
> > Following are some high level requirements:
> >
> > 1. CarbonData shall create a new segment (Streaming Segment) for each
> > streaming session. Concurrent streaming ingestion into the same table
> > will create separate streaming segments.
> >
> > 2. CarbonData shall use a write-optimized format (instead of the
> > multi-layered indexed columnar format) to support ingestion of streaming
> > data into a CarbonData table.
> >
> > 3. CarbonData shall create a streaming segment folder and open a
> > streaming data file in append mode to write data. CarbonData should
> > avoid creating multiple small files by appending to an existing file.
> >
> > 4. The data stored in the new streaming segment shall be available for
> > query after it is written to disk (hflush/hsync). In other words,
> > CarbonData readers should be able to query the data written to the
> > streaming segment so far.
> >
> > 5. CarbonData should acknowledge the write operation status back to the
> > output sink/upper-layer streaming engine, so that in the case of a write
> > failure the streaming engine can restart the operation and maintain
> > exactly-once delivery semantics.
> >
> > 6. The CarbonData compaction process shall support compacting data from
> > the write-optimized streaming segment into the regular read-optimized
> > columnar CarbonData format.
> >
> > 7. CarbonData readers should maintain read consistency by means of a
> > timestamp.
> >
> > 8. Maintain durability - in case of a write failure, CarbonData should be
> > able to recover to the latest commit status. This may require maintaining
> > the source and destination offsets of the last commits in metadata.
> >
> > This feature can be done in phases:
> >
> > Phase 1: Add the basic framework and writer support to allow Spark
> > Structured Streaming into CarbonData. This phase may or may not have
> > append support. Add reader support to read streaming data files.
> >
> > Phase 2: Add append support if not done in phase 1. Maintain append
> > offsets and metadata information.
> >
> > Phase 3: Add support for external streaming frameworks such as Kafka
> > streaming using Spark Structured Streaming; maintain
> > topics/partitions/offsets and support fault tolerance.
> >
> > Phase 4: Add support for other streaming frameworks, such as Flink,
> > Beam, etc.
> >
> > Phase 5: Future support for an in-memory cache for buffering streaming
> > data, support for union with Spark Structured Streaming to serve data
> > directly from Spark Structured Streaming, and support for time series
> > data.
> >
> > Best Regards,
> > Aniket
>
>
> --
> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-New-Feature-Streaming-Ingestion-into-CarbonData-tp9724p9803.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
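PS: Regarding requirement 4 in the proposal quoted above (data in the
streaming segment becoming queryable once it has been hflush'ed/hsync'ed),
the behaviour we are relying on is roughly the following. This is only an
illustrative sketch against the plain Hadoop FileSystem API with a made-up
file path, assuming an HDFS-like filesystem that supports append and hflush;
it is not CarbonData code.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HflushVisibilitySketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())
    // Illustrative path of a data file inside a streaming segment.
    val file = new Path("/carbon/store/default/stream_table/Segment_streaming/part-0")

    // Writer side: create the file (or append to it for a later batch) and
    // flush after every committed batch.
    val out = if (fs.exists(file)) fs.append(file) else fs.create(file)
    out.writeBytes("batch-1 rows\n")
    out.hflush() // readers can now see these bytes even though the file is
                 // still open for writing

    // Reader side: an independent reader opened after hflush() sees the
    // data written so far.
    val in = fs.open(file)
    val buf = new Array[Byte](8192)
    val n = in.read(buf)
    if (n > 0) println(new String(buf, 0, n))
    in.close()

    out.close()
  }
}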