Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs and define a custom OutputFormat and then use `saveAsHadoopFile` to be able to achieve this? On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote: > Hi folks, > > We are writing a dataframe and doing a partitionby() on it. >

Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Hi folks, We are writing a dataframe and doing a partitionby() on it. df.write.partitionBy('col').parquet('output') Job is running super slow because internally per partition it is doing a sort before starting to output to the final location. This sort isn't useful in any way since # of files