Hi Community,

OLTP systems like MySQL are heavily used for storing transactional data in real time, and the same data is later used for fraud detection and for taking various data-driven business decisions. Since OLTP systems are not suited for analytical queries due to their row-based storage, this primary data needs to be stored in big data storage in such a way that the data on DFS is an exact replica of the data present in MySQL. Traditional ways of capturing data from primary databases, like Apache Sqoop, use pull-based CDC approaches which put additional load on the primary databases; hence log-based CDC solutions have become increasingly popular. However, there are two aspects to this problem: we should be able to incrementally capture the data changes from primary databases, and we should be able to incrementally ingest those changes into the data lake, so that the overall latency decreases. The former is taken care of by log-based CDC systems like Maxwell and Debezium. Here we are proposing a solution for the second aspect using Apache CarbonData.
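To make the second aspect a bit more concrete, below is a rough Scala sketch of the kind of incremental apply the proposed tool would automate: reading Debezium-style change events from Kafka and merging each micro-batch into a lake table. This is only an illustration, not the tool's actual implementation; the topic name, JSON layout, column names, and target table are assumptions, and the MERGE INTO step assumes the target table format supports Spark SQL merge.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Illustrative sketch only: incrementally apply Debezium-style change events
// from Kafka into a lake table. All names here are hypothetical.
object IncrementalCdcIngest {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cdc-incremental-ingest")
      .getOrCreate()

    // Read raw change events from Kafka (topic name is an assumption).
    val changes = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "mysql.inventory.orders")
      .load()
      .selectExpr("CAST(value AS STRING) AS payload")

    // Parse the change payload into columns (schema is an assumption).
    // For deletes the "after" image is null, so fall back to "before" for the key.
    val parsed = changes.select(
      coalesce(
        get_json_object(col("payload"), "$.after.id"),
        get_json_object(col("payload"), "$.before.id")).cast("long").as("id"),
      get_json_object(col("payload"), "$.after.amount").cast("double").as("amount"),
      get_json_object(col("payload"), "$.op").as("op"))

    // For each micro-batch, upsert or delete rows in the target table,
    // assuming the table format supports Spark SQL MERGE INTO.
    val query = parsed.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.createOrReplaceTempView("changes")
        batch.sparkSession.sql(
          """MERGE INTO lake_db.orders t
            |USING changes s
            |ON t.id = s.id
            |WHEN MATCHED AND s.op = 'd' THEN DELETE
            |WHEN MATCHED THEN UPDATE SET *
            |WHEN NOT MATCHED THEN INSERT *""".stripMargin)
      }
      .start()

    query.awaitTermination()
  }
}

The real handling of ordering, schema changes, and delete semantics is what the streamer tool and the design document cover; the sketch only shows the basic shape of an incremental merge.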
The CarbonData streamer tool enables users to incrementally ingest data from various sources, such as Kafka and DFS, into their data lakes. The tool comes with out-of-the-box support for almost all common schema evolution use cases. Currently, the tool can be launched as a Spark application, either in continuous mode or as a one-time job. Further details are present in the design document.

Please review the design and help improve it. I'm attaching the link to the Google doc; you can comment on it directly. Any suggestions and improvements are most welcome.

https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing

Thanks and regards,
Akash R Nilugal