Hi all,

Currently CarbonData supports compaction for all sort scopes based on taskIds, i.e., we group the partitions (carbondata files) of different segments that have the same taskId into one task and then compact them. This is not the correct way to handle compaction for Range Sort, where each segment's data is divided into its own set of ranges. Grouping by taskId can combine data from different ranges into one range, which is incorrect.
For example, Seg_0 has three ranges, (0-100), (100-200) and (200-300), and Seg_1 has two ranges, (50-150) and (250-300). If we combine them based on taskIds, we get a wrong grouping after compaction.

We can solve this problem by merging the overlapping intervals to obtain new intervals (ranges). We can then assign each task approximately the same amount of data by dividing on the basis of the sizes of the ranges. After that, each task can continue with the normal data load flow for Range Column.

Any suggestions from the community will be greatly appreciated. I will upload the design doc shortly.

Thanks and regards,
Manish Nalla
EI BigData Kernel, Huawei Technologies India Pvt. Ltd
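
P.S. To make the proposal concrete, here is a rough sketch in Python of the two steps described above (merge overlapping ranges, then hand out the merged ranges to tasks by size). This is not CarbonData code. The function names `merge_overlapping` and `assign_to_tasks`, the per-range sizes, and the choice to treat ranges as half-open intervals (so that (100-200) and (200-300) do not overlap) are all assumptions made for illustration. The balancing step is only one simple interpretation of "dividing on the basis of sizes": it packs consecutive merged ranges into tasks and does not split a range.

```python
from typing import List, Tuple

# (start, end, size); ranges are ASSUMED to be half-open [start, end),
# and the size values below are made up for the example.
Range = Tuple[int, int, int]


def merge_overlapping(ranges: List[Range]) -> List[Range]:
    """Merge overlapping ranges from all segments and sum their sizes."""
    merged: List[Range] = []
    for start, end, size in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlaps the last merged range: extend it and add the size.
            prev_start, prev_end, prev_size = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_size + size)
        else:
            merged.append((start, end, size))
    return merged


def assign_to_tasks(merged: List[Range], num_tasks: int) -> List[List[Range]]:
    """Pack consecutive merged ranges into at most num_tasks groups,
    starting a new group once the current one would exceed the average size."""
    total = sum(size for _, _, size in merged)
    target = total / num_tasks
    tasks: List[List[Range]] = [[]]
    current = 0
    for r in merged:
        if tasks[-1] and current + r[2] > target and len(tasks) < num_tasks:
            tasks.append([])
            current = 0
        tasks[-1].append(r)
        current += r[2]
    return tasks


# The ranges from the example above, each with a hypothetical size of 10.
seg_0 = [(0, 100, 10), (100, 200, 10), (200, 300, 10)]
seg_1 = [(50, 150, 10), (250, 300, 10)]

merged = merge_overlapping(seg_0 + seg_1)
tasks = assign_to_tasks(merged, num_tasks=2)
print(merged)  # [(0, 200, 30), (200, 300, 20)]
print(tasks)   # [[(0, 200, 30)], [(200, 300, 20)]]
```

With these assumptions the five input ranges collapse into two merged ranges, and each task gets one of them. If range boundaries are inclusive instead, the comparison `start < merged[-1][1]` would become `<=` and the example would merge into a single range (0-300).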
