Re: [PR] [HUDI-7957] fix data skew when writing with bulk_insert + bucket_inde… [hudi]

via GitHub Tue, 09 Jul 2024 05:50:36 -0700


xicm commented on PR #11578:
URL: https://github.com/apache/hudi/pull/11578#issuecomment-2217636975


   > > @danny0405 @xicm From the discussion results and unit test situations,
   > > we can conclude that in the case of consecutive partitions, the new
   > > algorithm is more stable than the old algorithm. However, in the case of
   > > non-consecutive partitions, both algorithms exhibit some fluctuations.
   > > For example, with a parallelism of 10 and 5 buckets, the new algorithm
   > > can only achieve half of the parallelism for partitions 01 and 03,
   > > whereas the old algorithm can fully utilize the parallelism. With a
   > > parallelism of 20 and 5 buckets, and the same partitions, the new
   > > algorithm can achieve half of the parallelism, but the old algorithm can
   > > only achieve a quarter. The specific outcome depends on the initial
   > > position of the partitions.
   >
   > Yes, the result of the two algorithms depends on the hash value of the
   > partition, which is almost random. Is there a better way to get the
   > partition number?
   
   @KnightChess 
   
   The old algorithm has an overflow problem. If we fix the overflow, the old
   algorithm is the better of the two:
   ```
   int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
   int globalIndex = Math.abs(partitionIndex) + curBucket;
   return BucketIdentifier.mod(globalIndex, parallelism);
   ```
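
A minimal, self-contained sketch of the snippet above, so the mapping can be tried outside Hudi. It rests on several assumptions: the class name `BucketPartitionSketch` and the method `sparkPartition` are hypothetical, `mod` is a stand-in for `BucketIdentifier.mod` and is assumed to behave as a non-negative modulo, and the `Math.max` guard and the `long` arithmetic are additions for this sketch rather than part of the proposal.

```java
import java.util.TreeSet;

class BucketPartitionSketch {
    // Hypothetical stand-in for BucketIdentifier.mod, assumed here to be a non-negative modulo.
    static int mod(long x, int y) {
        return (int) Math.floorMod(x, (long) y);
    }

    static int sparkPartition(String partition, int curBucket, int parallelism, int bucketNum) {
        // Clearing the sign bit keeps the hash non-negative even for Integer.MIN_VALUE,
        // where Math.abs would still return a negative number.
        int hash = partition.hashCode() & Integer.MAX_VALUE;
        // Guard added for this sketch: parallelism < bucketNum would otherwise divide by zero.
        int step = Math.max(1, parallelism / bucketNum);
        // Long arithmetic (also an addition) so the multiplication cannot wrap around.
        long partitionIndex = (long) hash / step * bucketNum;
        return mod(partitionIndex + curBucket, parallelism);
    }

    public static void main(String[] args) {
        int parallelism = 10;
        int bucketNum = 5;
        // Print the set of Spark partitions that each table partition's buckets map to.
        for (String partition : new String[] {"01", "03"}) {
            TreeSet<Integer> targets = new TreeSet<>();
            for (int bucket = 0; bucket < bucketNum; bucket++) {
                targets.add(sparkPartition(partition, bucket, parallelism, bucketNum));
            }
            System.out.println(partition + " -> " + targets);
        }
    }
}
```

Running `main` with other `parallelism` and `bucketNum` values shows how many distinct Spark partitions a given set of table partitions actually reaches, which is the quantity compared in the discussion above.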


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
