eric9204 commented on issue #11288:
URL: https://github.com/apache/hudi/issues/11288#issuecomment-2132903234

   This issue is caused primarily by hash collisions and by the way the modulo 
operation distributes hash values across subtasks.
   
   First, bucket IDs are consecutive integers, while partition paths may be 
sequential or effectively random. Ideally, each combination of partition path 
and bucket would produce a distinct, well-spread hash value. With a weak hash 
function, however, many combinations share the same remainder once the hash is 
taken modulo the parallelism, so they collide on the same subtask while other 
subtasks receive little or nothing. After the modulo step, data is therefore 
unevenly distributed across subtasks.
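
   As a rough illustration of this first point, the sketch below counts how 
many (partition, bucket) pairs land on each subtask. The combining rule 
`(partition_hash + bucket_id) % parallelism` and the partition hash values are 
assumptions made up for the example; they are not taken from Hudi's actual 
partitioner.

```python
from collections import Counter


def assign_subtask(partition_hash, bucket_id, parallelism):
    # Hypothetical combining rule, used only for illustration.
    return (partition_hash + bucket_id) % parallelism


def loads(partition_hashes, num_buckets, parallelism):
    # Count how many (partition, bucket) pairs each subtask receives.
    counts = Counter(
        assign_subtask(p, b, parallelism)
        for p in partition_hashes
        for b in range(num_buckets)
    )
    return [counts.get(i, 0) for i in range(parallelism)]


# Made-up partition hashes that are all multiples of 4 (a "weak" hash).
partition_hashes = [8, 16, 24, 40]

skewed = loads(partition_hashes, num_buckets=2, parallelism=4)
print(skewed)  # [4, 4, 0, 0]: subtasks 2 and 3 stay idle
```

Because every partition hash has remainder 0 modulo 4, only the bucket ID 
decides the subtask, and with 2 buckets half of the subtasks never receive 
data.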
   
   Second, the modulo operation maps the combined hash of partition path and 
bucket onto the integers from 0 to `parallelism - 1`. When the parallelism is 
a prime number, the distribution tends to be somewhat more even: a prime 
shares no common factor with a regular stride in the hash values (unless the 
stride is a multiple of that prime), so such patterns do not collapse onto a 
few remainders. In practice, however, the parallelism is often not prime. Hash 
values with the same remainder are always routed to the same subtask, so any 
regularity in the hashes that shares a factor with the parallelism 
concentrates data on a subset of subtasks. In short, the modulo step amplifies 
data skew when the parallelism is not prime.
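
   The effect of prime versus non-prime parallelism can be sketched with a 
small experiment. The hash values here are hypothetical (every value is a 
multiple of 4) and only stand in for a hash function with a regular stride; 
the helper name `subtask_loads` is likewise made up for this example.

```python
from collections import Counter


def subtask_loads(hashes, parallelism):
    # Number of hash values routed to each subtask by `hash % parallelism`.
    counts = Counter(h % parallelism for h in hashes)
    return [counts.get(i, 0) for i in range(parallelism)]


# Hypothetical hash values that all share the factor 4.
hashes = [4 * k for k in range(56)]

non_prime = subtask_loads(hashes, 8)  # 8 shares the factor 4 with the stride
prime = subtask_loads(hashes, 7)      # 7 shares no factor with the stride

print(non_prime)  # [28, 0, 0, 0, 28, 0, 0, 0]
print(prime)      # [8, 8, 8, 8, 8, 8, 8]
```

With parallelism 8, only remainders 0 and 4 are reachable, so two subtasks 
carry all the data. With parallelism 7, the same inputs spread evenly. A 
better-mixed hash function would presumably reduce the skew in both cases.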


