eric9204 commented on issue #11288: URL: https://github.com/apache/hudi/issues/11288#issuecomment-2132903234
This issue comes down to hash collisions and the way the modulo operation distributes hash values across subtasks.

First, bucket IDs are consecutive, while partition paths may be sequential or effectively random. Each combination of partition and bucket should ideally yield a distinct hash value. When a weak hash function is used and its output is then reduced modulo the parallelism, collisions become more likely, so after the modulo step the data is distributed unevenly across subtasks.

Second, the modulo operation maps the combined hash of partition path and bucket into the contiguous range 0 to `parallelism - 1`. When the parallelism is a prime number, the modulo operation tends to spread values somewhat more evenly, because a prime rarely shares a common factor with regular patterns in the hash values. In practice, however, the parallelism is usually not prime. Hash values with the same remainder are then grouped into the same subtask, and any pattern that shares a factor with the parallelism leaves some subtasks overloaded and others nearly idle. In short, the modulo step amplifies data skew when the parallelism is not prime.
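
As a rough illustration of the modulo effect described above, the following Python sketch counts how many keys land on each subtask for a composite and a prime parallelism. The helper `subtask_loads` and the input hashes (multiples of 4) are hypothetical and are not taken from Hudi's partitioner; they only stand in for combined hash values that happen to share a common stride.

```python
from collections import Counter


def subtask_loads(hash_values, parallelism):
    """Count how many keys land on each subtask under hash % parallelism."""
    counts = Counter(h % parallelism for h in hash_values)
    return [counts.get(i, 0) for i in range(parallelism)]


# Hypothetical combined hashes that all share a stride of 4 (assumption).
hashes = [4 * k for k in range(120)]

# Composite parallelism: 4 and 8 share the factor 4, so only
# 8 / 4 = 2 of the 8 subtasks ever receive data.
composite = subtask_loads(hashes, 8)

# Prime parallelism: 4 and 7 share no factor, so all 7 subtasks are used.
prime = subtask_loads(hashes, 7)

print("parallelism=8:", composite)
print("parallelism=7:", prime)
```

With these inputs, parallelism 8 places all 120 keys on two subtasks, while parallelism 7 spreads them almost evenly. Real hash values are less regular than this, so the effect in a live job would be milder, but the mechanism is the same.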