+1 for the feature.
On 2021/08/31 17:47:35, Akash Nilugal <akashnilu...@gmail.com> wrote:
> Hi Community,
>
> OLTP systems like MySQL are used heavily for storing transactional data in
> real time, and the same data is later used for fraud detection and for
> making data-driven business decisions. Since OLTP systems are not suited
> for analytical queries due to their row-based storage, there is a need to
> store this primary data in big data storage such that the data on DFS is
> an exact replica of the data in MySQL. Traditional tools for capturing
> data from primary databases, such as Apache Sqoop, use pull-based CDC
> approaches that put additional load on the primary databases. Hence
> log-based CDC solutions have become increasingly popular. However, there
> are two aspects to this problem: we should be able to incrementally
> capture the data changes from primary databases, and we should be able to
> incrementally ingest those changes into the data lake so that the overall
> latency decreases. The former is handled by log-based CDC systems like
> Maxwell and Debezium. Here we are proposing a solution for the second
> aspect using Apache CarbonData.
>
> The CarbonData streamer tool enables users to incrementally ingest data
> from various sources, such as Kafka and DFS, into their data lakes. The
> tool comes with out-of-the-box support for almost all types of schema
> evolution use cases. Currently, the tool can be launched as a Spark
> application, either in continuous mode or as a one-time job.
>
> Further details are in the design document. Please review the design and
> help improve it. I'm attaching the link to the Google doc; you can
> comment on it directly. Any suggestions and improvements are most welcome.
>
> https://docs.google.com/document/d/1x66X5LU5silp4wLzjxx2Hxmt78gFRLF_8IocapoXxJk/edit?usp=sharing
>
> Thanks
>
> Regards,
> Akash R Nilugal
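
For anyone trying to picture the second aspect (the incremental ingest) concretely, here is a minimal sketch in Scala, assuming Debezium-style change events arriving on a Kafka topic as plain JSON envelopes (JSON converter with schemas disabled). The topic name, columns, and output path are illustrative assumptions, and this is not the streamer tool's implementation; the actual tool would merge updates and deletes into CarbonData tables rather than append to parquet as shown here.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object CdcIngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cdc-ingest-sketch")
          .getOrCreate()

        // A Debezium envelope carries the new row image under "after" and
        // the operation type under "op" (c = create, u = update,
        // d = delete). The columns below are made up for illustration.
        val payloadSchema = new StructType()
          .add("op", StringType)
          .add("after", new StructType()
            .add("id", LongType)
            .add("name", StringType))

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "mysql.inventory.customers") // assumed topic
          .load()
          .select(from_json(col("value").cast("string"),
            payloadSchema).as("e"))
          .select(col("e.op"), col("e.after.*"))

        // Each micro-batch is written with ordinary batch APIs; a real CDC
        // ingest would upsert/merge here instead of dropping deletes and
        // appending, so the target stays an exact replica of the source.
        val query = events.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            batch.filter(col("op") =!= "d")
              .drop("op")
              .write.mode("append")
              .parquet("/warehouse/customers_replica") // assumed path
          }
          .start()

        query.awaitTermination()
      }
    }

foreachBatch is the interesting hook here: it hands each micro-batch to ordinary batch code, which is what makes per-batch merge semantics (and hence an exact, low-latency replica) possible on top of a streaming source.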