Re: Improve carbondata CDC performance
Hi David, thanks for your suggestion. I checked the query you suggested locally, and it runs as a *BroadcastNestedLoopJoin*. It chooses that plan because the local dataset is small, but on a cluster, as the data size grows, it falls back to a cartesian product again. How about implementing our own search logic in a distributed way using an interval tree data structure? It would be faster and would not have much impact. Please share any other suggestions.

Thanks.
Regards,
Akash R
-- Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
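The interval-tree idea above could be sketched roughly as follows (a minimal, single-node sketch in Java; `IntervalTree`, its fields, and the use of `long` keys are illustrative assumptions, not CarbonData code): each target-table file contributes a [min, max] interval for the join column, and a stab query with a source-table value returns the files that may contain it, pruning whole subtrees via the augmented `maxHigh` field.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: an interval tree over block-level [min, max]
// ranges, used to find which target-table files may contain a source value.
class IntervalTree {
    static class Node {
        final long low, high;    // block min/max for one file
        final String filePath;
        long maxHigh;            // max 'high' in this subtree (augmentation)
        Node left, right;
        Node(long low, long high, String filePath) {
            this.low = low; this.high = high; this.filePath = filePath;
            this.maxHigh = high;
        }
    }

    private Node root;

    void insert(long low, long high, String filePath) {
        root = insert(root, new Node(low, high, filePath));
    }

    private Node insert(Node cur, Node n) {
        if (cur == null) return n;
        if (n.low < cur.low) cur.left = insert(cur.left, n);
        else cur.right = insert(cur.right, n);
        cur.maxHigh = Math.max(cur.maxHigh, n.high);
        return cur;
    }

    // Collect every file whose [min, max] range contains 'value'.
    List<String> stab(long value) {
        List<String> hits = new ArrayList<>();
        stab(root, value, hits);
        return hits;
    }

    private void stab(Node cur, long value, List<String> hits) {
        if (cur == null || value > cur.maxHigh) return;  // whole subtree too low
        stab(cur.left, value, hits);
        if (cur.low <= value && value <= cur.high) hits.add(cur.filePath);
        if (value >= cur.low) stab(cur.right, value, hits);  // right lows only grow
    }
}
```

Building the tree once over the target table's block metadata makes each source-value probe O(log n + k) instead of scanning all file ranges, which is where the win over the cartesian product would come from.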
Re: [Discussion] Taking the inputs for Segment Interface Refactoring
+1 - Best Regards David Cai
Re: Improve carbondata CDC performance
+1, you can go ahead with the implementation. How about using the following SQL instead of the cartesian join?

SELECT df.filePath
FROM targetTableBlocks df
WHERE EXISTS (SELECT 1 FROM srcTable
              WHERE srcTable.value BETWEEN df.min AND df.max)

- Best Regards
David Cai
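What the EXISTS query above computes can be mirrored in a few lines of plain Java (a hedged sketch; `FileRange`, `prune`, and the `long` value type are illustrative names and assumptions, not CarbonData APIs): a target file survives only if at least one source value falls inside that file's block-level [min, max] range.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the semi-join semantics of:
//   SELECT df.filePath FROM targetTableBlocks df
//   WHERE EXISTS (SELECT 1 FROM srcTable s
//                 WHERE s.value BETWEEN df.min AND df.max)
class CdcPruneSketch {
    static class FileRange {
        final String filePath;
        final long min, max;
        FileRange(String filePath, long min, long max) {
            this.filePath = filePath; this.min = min; this.max = max;
        }
    }

    static List<String> prune(List<FileRange> files, List<Long> srcValues) {
        List<String> kept = new ArrayList<>();
        for (FileRange f : files) {
            for (long v : srcValues) {
                // EXISTS: one matching source value is enough to keep the file.
                if (f.min <= v && v <= f.max) { kept.add(f.filePath); break; }
            }
        }
        return kept;
    }
}
```

Because EXISTS is a semi-join, each target file is emitted at most once, which is what lets the optimizer avoid materializing a full cartesian product.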
Re: Improve carbondata CDC performance
I mean you can push your logic into CarbonDataSourceScan as a dynamic runtime filter. Actually, CarbonDataSourceScan already uses min/max zone maps as an index filter to prune the block list (in the CarbonScanRDD.getPartition method). We can do more on the join query. Here I assume the source table is much smaller than the target table.

1. When the join broadcasts the source table:
1.1 When the join columns contain the partition keys of the target table, it can reuse the broadcast result to prune the partitions of the target table.
1.2 When the join query has filters on the target table, use min/max zone maps to prune the block list of the target table.
1.3 When the join query has filters on the source table, it can use min/max zone maps of the join columns to match the broadcast result.

2. When the join doesn't broadcast the source table:
2.1 When the join query has filters on the target table, use min/max zone maps to prune the block list of the target table.
2.2 Join the source table with the min/max zone maps of the target table to get the new block list.

In the future, it would be better to move all driver-side pruning logic into one place and invoke it in CarbonDataSourceScan to get the input partitions for the ScanRDD (including the min/max index, SI, partition pruning, and dynamic filters).

- Best Regards
David Cai
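The layering described above can be sketched as two successive filters over the target table's block metadata (a hedged sketch under the broadcast assumption; `Block`, `partitionKey`, and `prune` are illustrative names, not CarbonData APIs): partition pruning from the broadcast result first (step 1.1), then zone-map pruning against the join-column values (steps 1.2/2.2).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of layered driver-side pruning: drop partitions that
// cannot match the broadcast source side, then drop blocks whose [min, max]
// range matches no join value.
class DynamicPruneSketch {
    static class Block {
        final String path, partitionKey;
        final long min, max;
        Block(String path, String partitionKey, long min, long max) {
            this.path = path; this.partitionKey = partitionKey;
            this.min = min; this.max = max;
        }
    }

    static List<String> prune(List<Block> blocks,
                              Set<String> broadcastPartitions,  // partition keys seen on the source side
                              List<Long> joinValues) {          // join-column values from the source side
        List<String> kept = new ArrayList<>();
        for (Block b : blocks) {
            if (!broadcastPartitions.contains(b.partitionKey)) continue;  // 1.1 partition pruning
            for (long v : joinValues) {
                if (b.min <= v && v <= b.max) { kept.add(b.path); break; }  // zone-map pruning
            }
        }
        return kept;
    }
}
```

Ordering matters: partition pruning is the cheaper test and can discard whole directories before any block-level min/max comparison runs.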
Re: [DISCUSSION] Display the segment ID when carbondata load is successful
Hi, I think that after a load it is enough to return only the segment ID the data was loaded to, whether or not auto load merge is enabled. I will add one more reason apart from what @areyouokfreejoe mentioned:

1. Because the user already cares about each load, auto load merge is mostly disabled in their application logic and the user handles compaction themselves. Auto load merge is based only on segment number, not on any business relation between the segments. So if they enable auto load merge, several segments that have no relation beyond adjacent segment IDs will be compacted. After this kind of compaction, all the information in the pre-compaction segments is lost, and this is not what the user wants. If any load is special, then to avoid losing information after compaction, that load should only merge with segments that share the same special property, which is known only to the application; carbon currently has no place to store this information. So only the user can control which segments get compacted, by triggering custom compaction with the IDs of the segments that share that special property.