Hi Indhu,

Apologies for the late reply. Please find my answers inline below -

1. For the multi-table merge scenario, does it support concurrent CDC or
sequential CDC to the target table?

> In phase 1, we are supporting the scenario where multiple tables (all with
the same schema) are pushed to the same topic, and the events from that
topic are ingested in a single iteration.

In later phases, when we support ingesting tables with different schemas,
we plan to support concurrent ingestion only.
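To make the phase 1 flow a bit more concrete, here is a minimal sketch (not
the tool's actual implementation) of how same-schema CDC events from several
tables, all landing on one Kafka topic, could be consumed and handled in a
single micro-batch iteration. The topic name, broker address, schema fields
and checkpoint path are assumptions for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object SingleTopicCdcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cdc-single-topic-sketch")
      .getOrCreate()

    // Shared schema for all source tables pushed to the topic (assumed fields).
    val eventSchema = new StructType()
      .add("record_key", StringType)
      .add("op_type", StringType)   // e.g. INSERT / UPDATE / DELETE
      .add("payload", StringType)
      .add("event_ts", TimestampType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "cdc_events")                    // assumed topic
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).as("e"))
      .select("e.*")

    // Each micro-batch corresponds to one ingestion iteration; the actual
    // merge into the target CarbonData table is omitted here.
    val query = events.writeStream
      .foreachBatch { (batch: org.apache.spark.sql.DataFrame, batchId: Long) =>
        batch.persist()
        println(s"Iteration $batchId ingesting ${batch.count()} events")
        batch.unpersist()
      }
      .option("checkpointLocation", "/tmp/cdc-checkpoint") // assumed path
      .start()

    query.awaitTermination()
  }
}
```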

2. In failure scenarios (e.g. the streamer tool is killed or crashes), how
can we ensure data is not duplicated when the Kafka ingestion is restarted?

> For this scenario, we have a few configurable options like `--deduplicate`
and `--combine-before-upsert`. If they are set to true, the following
operations are performed -
a) The incoming batch of records is deduplicated for events with the same
record key.
b) For the INSERT operation type, records that already exist in the target
table are removed from the incoming batch.

Hence the combination of Spark Streaming checkpointing and these options
helps ensure there are no duplicates.
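As a rough illustration of what those two options do to a single incoming
batch, here is a sketch under assumed column names (`record_key`,
`event_ts`, `payload`); it also assumes deduplication keeps the latest event
per key, and it is not the tool's actual code.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupSketch {

  // a) Keep a single event per record key (here: the latest by event_ts).
  def deduplicate(batch: DataFrame): DataFrame = {
    val latestFirst = Window.partitionBy("record_key").orderBy(col("event_ts").desc)
    batch.withColumn("rn", row_number().over(latestFirst))
      .filter(col("rn") === 1)
      .drop("rn")
  }

  // b) For INSERT operations, drop records whose key already exists in the target.
  def combineBeforeUpsert(batch: DataFrame, target: DataFrame): DataFrame = {
    batch.join(target.select("record_key"), Seq("record_key"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy data: an incoming batch with a duplicate key, plus an existing target table.
    val incoming = Seq(
      ("k1", "v1-old", 1L), ("k1", "v1-new", 2L), ("k2", "v2", 1L)
    ).toDF("record_key", "payload", "event_ts")
    val target = Seq(("k2", "v2-existing", 0L)).toDF("record_key", "payload", "event_ts")

    val toInsert = combineBeforeUpsert(deduplicate(incoming), target)
    toInsert.show() // only k1 with its latest payload remains
  }
}
```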

Hope that answers your queries.

On Mon, Sep 6, 2021 at 7:06 PM Nihal ojha <nihalnit...@gmail.com> wrote:

> +1,  good idea to implement streamer tool.
>
> Regards
> Nihal
>
> On 2021/08/31 17:47:35, Akash Nilugal <akashnilu...@gmail.com> wrote:
> > Hi Community,
> >
> > OLTP systems like Mysql are used heavily for storing transactional data
> in
> > real-time and the same data is later used for doing fraud detection and
> > taking various data-driven business decisions. Since OLTP systems are not
> > suited for analytical queries due to their row-based storage, there is a
> > need to store this primary data into big data storage in a way that data
> on
> > DFS is an exact replica of the data present in Mysql. Traditional ways
> for
> > capturing data from primary databases, like Apache Sqoop, use pull-based
> > CDC approaches which put additional load on the primary databases. Hence
> > log-based CDC solutions became increasingly popular. However, there are 2
> > aspects to this problem. We should be able to incrementally capture the
> > data changes from primary databases and should be able to incrementally
> > ingest the same in the data lake so that the overall latency decreases.
> The
> > former is taken care of using log-based CDC systems like Maxwell and
> > Debezium. Here we are proposing a solution for the second aspect using
> > Apache Carbondata.
> >
> > Carbondata streamer tool enables users to incrementally ingest data from
> > various sources, like Kafka and DFS into their data lakes. The tool comes
> > with out-of-the-box support for almost all types of schema evolution use
> > cases. Currently, this tool can be launched as a spark application either
> > in continuous mode or a one-time job.
> >
> > Further details are present in the design document. Please review the
> > design and help to improve it. I'm attaching the link to the google doc,
> > you can directly comment on that. Any suggestions and improvements are
> most
> > welcome.
> >
> >
> https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing
> >
> > Thanks
> >
> > Regards,
> > Akash R Nilugal
> >
>
