Hi Community,
CarbonData supports update and delete through Spark. An update is
essentially a delete followed by an insert, and a delete is just a delete.
However, both are implemented with Spark APIs and actions on collections
(map, partition, etc.) that run as Spark jobs.
Spark therefore adds overhead: task serialization, job execution on
remote nodes, shuffle, and so on.
As a result, even simple updates and deletes take a long time in
CarbonData.
CarbonData 2.1.0 supports update and delete in the SDK, implemented at
the carbon file format level.
We can reuse that implementation for simple updates and deletes on
transactional tables, using plain Java code and bypassing Spark
completely. This avoids the Spark overhead and makes updates and
deletes faster.
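
To make the "update is delete + insert" idea concrete, here is a minimal
in-memory sketch in plain Java. It is only an illustration: the class
DeleteInsertSketch and its methods are hypothetical names and are not
CarbonData SDK APIs, and it does not reflect the actual carbon file
layout. It only shows a delete as a marker on existing rows and an
update as a delete followed by an append, all in a single process.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; not CarbonData's classes or file format.
public class DeleteInsertSketch {
    // Append-only row store: each row is {id, value}.
    private final List<String[]> rows = new ArrayList<>();
    // Marks rows that are logically deleted; the rows themselves are not rewritten.
    private final BitSet deleted = new BitSet();

    public void insert(String id, String value) {
        rows.add(new String[] {id, value});
    }

    // Delete: mark every live row with the given id and return how many were marked.
    public int delete(String id) {
        int count = 0;
        for (int i = 0; i < rows.size(); i++) {
            if (!deleted.get(i) && rows.get(i)[0].equals(id)) {
                deleted.set(i);
                count++;
            }
        }
        return count;
    }

    // Update: delete the old versions, then append one new row per deleted row.
    public int update(String id, String newValue) {
        int count = delete(id);
        for (int i = 0; i < count; i++) {
            insert(id, newValue);
        }
        return count;
    }

    // Scan: return only rows that are not marked as deleted.
    public List<String> scan() {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (!deleted.get(i)) {
                out.add(rows.get(i)[0] + "=" + rows.get(i)[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        DeleteInsertSketch table = new DeleteInsertSketch();
        table.insert("1", "a");
        table.insert("2", "b");
        table.update("1", "z"); // marks the old row for id 1, appends a new one
        table.delete("2");      // marks the row for id 2
        System.out.println(table.scan()); // prints [1=z]
    }
}
```

Nothing here is serialized, shipped to remote nodes, or shuffled, which
is the kind of cost the proposal aims to avoid for simple operations.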
I have added an initial V1 design document. Please review it and share
your comments, inputs, and suggestions.
https://docs.google.com/document/d/1-M6xPKZG8l6yAu0c9qo3jdUKhpXHWgUR-h8HeUUmk8M/edit?usp=sharing
Thanks,
Regards,
Akash R Nilugal