Hi Team, I'm using repartition followed by sortWithinPartitions to maintain key-based ordering within each partition, but I'm seeing data skew across partitions. I have 96 partitions and roughly 500 distinct keys. In the Spark UI, a few partitions receive very little data while others are heavily loaded.
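For context, part of this imbalance is expected even with a uniform hash: with only ~500 distinct keys hashed into 96 partitions, each key lands in exactly one partition, so partition sizes vary by chance (and any key-frequency skew makes it worse). The stdlib sketch below simulates this, using Python's built-in hash as a stand-in for Spark's Murmur3 (the key names and partition count are placeholders matching my setup):

```python
from collections import Counter

num_partitions = 96
keys = [f"key_{i}" for i in range(500)]  # 500 distinct keys, as in my job

# Simulate hash partitioning: each key goes to hash(key) % num_partitions.
# Python's hash() is only an illustrative stand-in for Spark's Murmur3.
partition_of = {k: hash(k) % num_partitions for k in keys}

counts = Counter(partition_of.values())
keys_per_partition = [counts.get(p, 0) for p in range(num_partitions)]

# Even with uniform hashing, the spread of keys per partition is uneven.
print("min keys in a partition:", min(keys_per_partition))
print("max keys in a partition:", max(keys_per_partition))
```

Running this typically shows some partitions with 1-2 keys and others with 9+, i.e. several-fold imbalance before any real-world key-frequency skew is even considered.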
This looks like a hash-partitioning problem. Can anyone suggest a better hashing technique or partitioning approach (e.g. salting hot keys, or range partitioning) to mitigate this? Thanks in advance for your help.