Re: Improve carbondata CDC performance

David CaiQiang Thu, 18 Feb 2021 01:04:45 -0800

I mean you can push your logic into CarbonDataSourceScan as a dynamic runtime
filter.


Actually, CarbonDataSourceScan already used min/max zoom maps as an index
filter to prune blocklist (in the CarbonScanRDD.getPartition method). 

We can do more things on the join query. Here I assume the source table is
much smaller than the target table.

1. when the join broadcast the source table
    1.1 when the join columns contain the partition keys of the target
table,  it can reuse the result of the broadcast to prune the partitions of
the target table.
    1.2 when the join query has some filters on the target table, use
min/max zoom maps to prune the block list of the target table
    1.3 when the join query has some filters on the source table, it can use
min/max zoom maps of join columns to match the result of the broadcast

2. when the join doesn't broadcast the source table
    2.1 when the join query has some filters on the target table, use
min/max zoom maps to prune the block list of the target table
    2.2 join source table with min/max zoom maps of the target table to get
the new block list.

In the future, it better to move all pruning logics of the driver side into
one place and invoke them in CarbonDataSourceScan to get input partitions
for ScanRDD. (include min/max index, si, partition pruning, and dynamic
filters)



-----
Best Regards
David Cai
--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: Improve carbondata CDC performance

Reply via email to