Re: Improve carbondata CDC performance

2021-02-18 Thread akashrn5
Hi David,

Thanks for your suggestion.

I checked the query you suggested locally; it is planned as a
*BroadcastNestedLoopJoin*.
It takes that plan locally because the dataset is small, but in a cluster,
once the data size grows, it falls back to a Cartesian product again.

How about implementing our own search logic in a distributed way using an
interval tree data structure? It would be faster and would not have much impact.
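
For illustration, here is a minimal sketch of the idea in Scala (all names
here, such as Block and filePath, are hypothetical and not existing
CarbonData code): build an interval tree over each target block's min/max
for the join column, then probe it with source-table values to find the
candidate blocks.

case class Block(filePath: String, min: Long, max: Long)

sealed trait IntervalTree {
  def stab(v: Long): List[Block] // blocks whose [min, max] contains v
}

case object Leaf extends IntervalTree {
  def stab(v: Long): List[Block] = Nil
}

case class Node(center: Long, spanning: List[Block],
                left: IntervalTree, right: IntervalTree) extends IntervalTree {
  def stab(v: Long): List[Block] = {
    val here = spanning.filter(b => b.min <= v && v <= b.max)
    if (v < center) here ++ left.stab(v)
    else if (v > center) here ++ right.stab(v)
    else here
  }
}

object IntervalTree {
  def build(blocks: List[Block]): IntervalTree = blocks match {
    case Nil => Leaf
    case bs =>
      // the median of the range midpoints keeps the tree roughly balanced
      val center = bs.map(b => (b.min + b.max) / 2).sorted.apply(bs.size / 2)
      val (leftOnly, rest) = bs.partition(_.max < center)
      val (rightOnly, spanning) = rest.partition(_.min > center)
      Node(center, spanning, build(leftOnly), build(rightOnly))
  }
}

The tree could be built once on the driver from the block metadata and
broadcast, so each executor probes it locally for its partition of the
source table.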

Others, please share your suggestions.

Thanks

Regards,
Akash R





Re: [Discussion] Taking the inputs for Segment Interface Refactoring

2021-02-18 Thread David CaiQiang
+1 



-
Best Regards
David Cai


Re: Improve carbondata CDC performance

2021-02-18 Thread David CaiQiang
+1, you can go ahead and finish the implementation.

How about using the following SQL instead of the Cartesian join?

SELECT df.filePath
FROM targetTableBlocks df
WHERE EXISTS (SELECT 1 FROM srcTable
              WHERE srcTable.value BETWEEN df.min AND df.max)
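
A hedged sketch of trying this out from Spark (assuming targetTableBlocks,
with columns filePath/min/max, and srcTable are already registered as temp
views; both names are illustrative, not existing CarbonData tables):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdc-block-prune")
  .master("local[*]")
  .getOrCreate()

val pruned = spark.sql(
  """SELECT df.filePath
    |FROM targetTableBlocks df
    |WHERE EXISTS (SELECT 1 FROM srcTable
    |              WHERE srcTable.value BETWEEN df.min AND df.max)""".stripMargin)

// shows whether the optimizer picked a broadcast or a cartesian plan
pruned.explain()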



-
Best Regards
David Cai


Re: Improve carbondata CDC performance

2021-02-18 Thread David CaiQiang
I mean you can push your logic into CarbonDataSourceScan as a dynamic runtime
filter.

Actually, CarbonDataSourceScan already uses min/max zone maps as an index
filter to prune the block list (in the CarbonScanRDD.getPartition method).

We can do more things with the join query. Here I assume the source table is
much smaller than the target table.

1. When the join broadcasts the source table:
    1.1 when the join columns contain the partition keys of the target
table, it can reuse the broadcast result to prune the partitions of the
target table;
    1.2 when the join query has filters on the target table, use min/max
zone maps to prune the block list of the target table;
    1.3 when the join query has filters on the source table, it can use the
min/max zone maps of the join columns to match against the broadcast result.

2. When the join doesn't broadcast the source table:
    2.1 when the join query has filters on the target table, use min/max
zone maps to prune the block list of the target table;
    2.2 join the source table with the min/max zone maps of the target table
to get the new block list (see the sketch below).
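
To make case 2.2 concrete, here is a hedged sketch (the block-metadata
DataFrame and its columns are assumptions for illustration, not an existing
CarbonData API):

import org.apache.spark.sql.DataFrame

// Semi-join the target table's per-block min/max metadata with the
// source table, keeping only blocks whose range can contain at least
// one source value; the survivors become the new block list.
def pruneBlockList(blockMeta: DataFrame, // columns: filePath, min, max
                   source: DataFrame     // column: value (the join key)
                  ): DataFrame = {
  blockMeta
    .join(source,
      source("value").between(blockMeta("min"), blockMeta("max")),
      "left_semi")
    .select("filePath")
}

Because blockMeta holds only per-block metadata, it stays small even for a
large target table, so this join is cheap compared to a row-level join.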

In the future, it would be better to move all driver-side pruning logic into
one place and invoke it in CarbonDataSourceScan to get the input partitions
for the scan RDD (including the min/max index, SI (secondary index),
partition pruning, and dynamic filters).



-
Best Regards
David Cai


Re: [DISCUSSION] Display the segment ID when carbondata load is successful

2021-02-18 Thread Yahui Liu
Hi,

I think that after a load, returning only the segment ID that the data was
loaded into is enough, whether auto load merge is enabled or not. I will add
one more reason apart from the ones @areyouokfreejoe mentioned:
1. Because the user already cares about each load, auto load merge is mostly
disabled in their application logic, and the user handles compaction
themselves. Auto load merge is based only on segment numbers, not on any
business relation between the segments. So if they enable auto load merge,
several segments that have no relation beyond adjacent segment IDs will be
compacted. After this kind of compaction, all the per-segment information
from before the compaction is lost, which is not what the user wants. If a
load is special, then in order not to lose any information after compaction,
that load should only be merged with segments that share the same special
property, which is known only to the application; Carbon currently has no
place to store this information. So only the user can control which segments
are compacted, by triggering custom compaction with the IDs of the segments
that share the same special property.
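
For reference, custom compaction is triggered with an explicit segment ID
list along these lines (assuming an active SparkSession named spark; the
table name and segment IDs are illustrative):

// compact only the segments the application knows belong together
spark.sql("ALTER TABLE sales COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (1, 2, 4)")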



