Hi Team, I'm using repartition followed by sortWithinPartitions to maintain key-based ordering within each partition, but I'm seeing data skew across partitions. I have 96 partitions and roughly 500 distinct keys. In the Spark UI, a few partitions receive very little data while others are heavily loaded.
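For context, part of this imbalance is expected even with a uniform hash: with only ~500 distinct keys hashed into 96 partitions, each key lands in exactly one partition, so partition sizes vary by chance (and any key-frequency skew makes it worse). The stdlib sketch below simulates this, using Python's built-in hash as a stand-in for Spark's Murmur3 (the key names and partition count are placeholders matching my setup):

```python
from collections import Counter

num_partitions = 96
keys = [f"key_{i}" for i in range(500)]  # 500 distinct keys, as in my job

# Simulate hash partitioning: each key goes to hash(key) % num_partitions.
# Python's hash() is only an illustrative stand-in for Spark's Murmur3.
partition_of = {k: hash(k) % num_partitions for k in keys}

counts = Counter(partition_of.values())
keys_per_partition = [counts.get(p, 0) for p in range(num_partitions)]

# Even with uniform hashing, the spread of keys per partition is uneven.
print("min keys in a partition:", min(keys_per_partition))
print("max keys in a partition:", max(keys_per_partition))
```

Running this typically shows some partitions with 1-2 keys and others with 9+, i.e. several-fold imbalance before any real-world key-frequency skew is even considered.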
This looks like a hash-partitioning problem. Can anyone suggest a better hashing technique or partitioning approach (e.g. salting hot keys, or range partitioning) to mitigate this? Thanks in advance for your help.