subject:"Partition by on dataframe causing a Sort"

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal

Is it possible to use MultipleOutputs with a custom OutputFormat, and then call `saveAsHadoopFile`, to achieve this?

On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote:
> Hi folks,
>
> We are writing a dataframe and doing a partitionBy() on it.
>

Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal

Hi folks,

We are writing a dataframe and doing a partitionBy() on it.

df.write.partitionBy('col').parquet('output')

The job is running super slow because internally, per partition, it does a sort before starting to output to the final location. This sort isn't useful in any way since # of files
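
To make the trade-off behind that sort concrete, here is a toy model in plain Python. It is not Spark's code: `write_sorted`, `write_unsorted`, and the sample `rows` are hypothetical names invented for this sketch, and in-memory lists stand in for output files. It assumes the sort exists so that rows with the same partition value arrive together, which lets a writer keep a single output file open at a time. Skipping the sort gives the same files but keeps one writer open per distinct value.

```python
from collections import defaultdict

_NOTHING_OPEN = object()  # sentinel: no writer has been opened yet


def write_sorted(rows, key):
    """Sort by the partition key, then write one partition at a time."""
    files = {}
    current = _NOTHING_OPEN
    peak_open = 0
    for row in sorted(rows, key=lambda r: r[key]):
        if row[key] != current:
            # The previous writer can be closed here: after sorting,
            # its partition value never appears again.
            current = row[key]
            files[current] = []
        files[current].append(row)
        peak_open = max(peak_open, 1)  # never more than one writer open
    return files, peak_open


def write_unsorted(rows, key):
    """Skip the sort and keep a writer open for every value seen so far."""
    files = defaultdict(list)
    peak_open = 0
    for row in rows:
        files[row[key]].append(row)
        peak_open = max(peak_open, len(files))  # writers held open at once
    return dict(files), peak_open


# Hypothetical sample data; 'col' mirrors the column name in the post.
rows = [
    {"col": "b", "v": 1},
    {"col": "a", "v": 2},
    {"col": "b", "v": 3},
    {"col": "c", "v": 4},
]

sorted_files, sorted_peak = write_sorted(rows, "col")
unsorted_files, unsorted_peak = write_unsorted(rows, "col")
print(sorted_peak, unsorted_peak)
```

In this model both strategies produce the same per-value groups. The sorted path pays for a sort, while the unsorted path pays in simultaneously open writers, which grows with the number of distinct values in the partition column.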