Hi folks,

We are writing a DataFrame with a partitionBy() on the write: df.write.partitionBy('col').parquet('output')
The job is running very slowly because, internally, each partition is sorted before output to the final location begins. This sort isn't useful in any way here, since the number of output files stays the same either way.

I was wondering whether we can have Spark instead keep multiple file pointers open, append data to the right file as it arrives, and close all the pointers when it's done. That would eliminate the sort, which should reduce the memory footprint and speed up the job.

We could implement a custom source, but I can't see whether we can really control this behavior in the sink. If anyone has any suggestions, please let me know.
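To make the idea more concrete, here is a rough sketch of the writer behavior I have in mind. This is plain Python and purely illustrative (not Spark internals); the dict-shaped rows, the 'col' partition key, and the file names are all made up. The point is just: one open handle per partition value, rows appended as they arrive, everything closed at the end, no upfront sort.

    import os

    def write_without_sort(rows, output_dir, partition_key="col"):
        """rows: an iterable of dicts; writes one file per distinct partition value."""
        writers = {}  # partition value -> open file handle
        try:
            for row in rows:
                value = row[partition_key]
                if value not in writers:
                    # Lay files out Hive-style, e.g. output/col=A/part-00000.txt
                    part_dir = os.path.join(output_dir, f"{partition_key}={value}")
                    os.makedirs(part_dir, exist_ok=True)
                    writers[value] = open(os.path.join(part_dir, "part-00000.txt"), "w")
                # Append the row to its partition's file as soon as it arrives,
                # so no sort on the partition column is needed first.
                writers[value].write(f"{row}\n")
        finally:
            # Close every open writer once all rows have been consumed.
            for handle in writers.values():
                handle.close()

    # e.g. write_without_sort([{"col": "A", "x": 1}, {"col": "B", "x": 2}], "output")

Thanks,
Nikhil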