xicm commented on PR #11578: URL: https://github.com/apache/hudi/pull/11578#issuecomment-2217636975
> > @danny0405 @xicm From the discussion and the unit test results, we can conclude that for consecutive partitions the new algorithm is more stable than the old one. For non-consecutive partitions, however, both algorithms show some fluctuation. For example, with a parallelism of 10 and 5 buckets, the new algorithm achieves only half of the parallelism for partitions 01 and 03, whereas the old algorithm fully utilizes it. With a parallelism of 20 and 5 buckets, and the same partitions, the new algorithm achieves half of the parallelism, but the old algorithm achieves only a quarter. The specific outcome depends on the initial position of the partitions.
>
> Yes, the result of both algorithms depends on the hash value of the partition, which is almost random. Is there a better way to get the partition number?

@KnightChess The old algorithm has an overflow problem. If we fix the overflow, the old algorithm is better.

```
int partitionIndex = (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
int globalIndex = Math.abs(partitionIndex) + curBucket;
return BucketIdentifier.mod(globalIndex, parallelism);
```
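
As a rough check of the snippet above against the cases quoted in the thread, here is a minimal standalone sketch; it is not code from the PR. It assumes the partition paths are the literal strings `"01"` and `"03"`, and it uses a hypothetical local `mod` helper in place of `BucketIdentifier.mod`, assumed to be a non-negative modulo. The class and method names are illustrative only.

```java
import java.util.Set;
import java.util.TreeSet;

class BucketPartitionSketch {

  // Hypothetical stand-in for BucketIdentifier.mod (assumption: non-negative modulo).
  public static int mod(int x, int y) {
    return Math.floorMod(x, y);
  }

  // The snippet from the comment, with partition path and bucket id passed in explicitly.
  public static int taskIndex(String partition, int curBucket, int bucketNum, int parallelism) {
    int partitionIndex =
        (partition.hashCode() & Integer.MAX_VALUE) / (parallelism / bucketNum) * bucketNum;
    int globalIndex = Math.abs(partitionIndex) + curBucket;
    return mod(globalIndex, parallelism);
  }

  // Collects the distinct task indices hit by every bucket of the given partitions.
  public static Set<Integer> tasksUsed(int bucketNum, int parallelism, String... partitions) {
    Set<Integer> used = new TreeSet<>();
    for (String partition : partitions) {
      for (int bucket = 0; bucket < bucketNum; bucket++) {
        used.add(taskIndex(partition, bucket, bucketNum, parallelism));
      }
    }
    return used;
  }

  public static void main(String[] args) {
    // "01".hashCode() is 1537 and "03".hashCode() is 1539.
    System.out.println(tasksUsed(5, 10, "01", "03").size()); // prints 10
    System.out.println(tasksUsed(5, 20, "01", "03").size()); // prints 5
  }
}
```

Under these assumptions the two partitions cover all 10 tasks at a parallelism of 10, but only 5 of 20 tasks at a parallelism of 20, which matches the "fully utilize" and "a quarter" figures quoted above. The sketch requires `parallelism >= bucketNum`; otherwise `parallelism / bucketNum` is 0 and the division throws an `ArithmeticException`.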