When Spark writes data to storage systems like HDFS or S3, it can produce a
large number of small files. A common way to solve this problem is to
repartition by the output key before writing the results. However, this may
cause data skew. If the number of distinct values of the repartition key is
small, we can use a custom partitioner to tackle the skew. But what if it is
unbounded? Is there any method to address the data skew after repartitioning?
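
For illustration, one workaround in that spirit is to salt the repartition
key on the DataFrame API so a hot key is spread over several output
partitions. A rough sketch (the column name, paths, and numSalts value are
just placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("salted-write").getOrCreate()
    val df = spark.read.parquet("s3://bucket/input")  // placeholder path

    // Add a random salt so each hot key is split across up to numSalts
    // shuffle partitions instead of landing in a single one.
    val numSalts = 8
    val salted = df
      .withColumn("salt", (rand() * numSalts).cast("int"))
      .repartition(col("partition_key"), col("salt"))
      .drop("salt")

    salted.write.partitionBy("partition_key").parquet("s3://bucket/output")

But this only trades skew for tuning: picking numSalts by hand does not work
when the key space is unbounded, which is why something adaptive seems needed.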

One way I can think of is to use AQE. Maybe we could add a new implementation
of CustomShuffleReaderRule to let Spark automatically split large partitions,
just like what Spark does in OptimizeSkewedJoin.
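
As far as I know, AQE currently only splits skewed partitions for joins,
driven by settings like these (the size value below is illustrative):

    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
    spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")

The proposed rule would apply the same kind of split to the final shuffle
before the write, not just to join inputs.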
