Is it possible to use MultipleOutputs with a custom OutputFormat and
then call `saveAsHadoopFile` to achieve this?
On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote:
Hi folks,
We are writing a DataFrame and doing a partitionBy() on it.
df.write.partitionBy('col').parquet('output')
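
To make the question concrete, here is a rough pure-Python model of how a single task could lay out its rows under `output/col=<value>/`, with and without a sort on the partition column. It is only a sketch and not Spark's actual writer code: the function names `write_sorted` and `write_concurrent` are made up for illustration, and "files" are just in-memory lists. The point it shows is that sorting lets a task keep one output file open at a time, while skipping the sort means one open writer per distinct value seen by that task.

```python
from collections import defaultdict


def write_sorted(rows, key):
    """Sort rows by the partition key, then write one 'file' at a time.

    Returns (files, max_open), where max_open is the largest number of
    files that had to be open simultaneously.
    """
    files = {}
    max_open = 0
    current = object()  # sentinel that never equals a real key value
    for row in sorted(rows, key=lambda r: r[key]):
        value = row[key]
        if value != current:
            # The previous file is finished here; only the new one stays open.
            current = value
            files[f"{key}={value}"] = []
            max_open = 1
        files[f"{key}={value}"].append(row)
    return files, max_open


def write_concurrent(rows, key):
    """Skip the sort and keep one open 'file' per distinct key value."""
    files = defaultdict(list)
    for row in rows:
        files[f"{key}={row[key]}"].append(row)
    # Every file stays open until the end, so all of them are open at once.
    return dict(files), len(files)


# Hypothetical sample rows, for illustration only.
rows = [
    {"col": "b", "v": 1},
    {"col": "a", "v": 2},
    {"col": "b", "v": 3},
]
sorted_files, sorted_open = write_sorted(rows, "col")
concurrent_files, concurrent_open = write_concurrent(rows, "col")
```

Both functions produce the same set of output files in this model; they differ only in how many files are open at once, which is the trade-off behind the sort.
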
The job is running super slow because internally, for each partition, it does a
sort before starting to write to the final location. This sort isn't
useful in any way since # of files