Hi folks,

We are writing a DataFrame with a partitionBy() on the write: df.write.partitionBy('col').parquet('output')
The job is running very slowly because, internally, each partition is sorted before output to the final location begins. This sort isn't useful in any way here, since the number of output files stays the same either way.

I was wondering whether we can have Spark instead keep multiple file pointers open, append data to the right file as it arrives, and close all the pointers when it's done. That would eliminate the sort, which should reduce the memory footprint and speed up the job.

We could implement a custom source, but I can't see whether we can really control this behavior in the sink. If anyone has any suggestions, please let me know.
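To make the idea more concrete, here is a rough sketch of the writer behavior I have in mind. This is plain Python and purely illustrative (not Spark internals); the dict-shaped rows, the 'col' partition key, and the file names are all made up. The point is just: one open handle per partition value, rows appended as they arrive, everything closed at the end, no upfront sort.

    import os

    def write_without_sort(rows, output_dir, partition_key="col"):
        """rows: an iterable of dicts; writes one file per distinct partition value."""
        writers = {}  # partition value -> open file handle
        try:
            for row in rows:
                value = row[partition_key]
                if value not in writers:
                    # Lay files out Hive-style, e.g. output/col=A/part-00000.txt
                    part_dir = os.path.join(output_dir, f"{partition_key}={value}")
                    os.makedirs(part_dir, exist_ok=True)
                    writers[value] = open(os.path.join(part_dir, "part-00000.txt"), "w")
                # Append the row to its partition's file as soon as it arrives,
                # so no sort on the partition column is needed first.
                writers[value].write(f"{row}\n")
        finally:
            # Close every open writer once all rows have been consumed.
            for handle in writers.values():
                handle.close()

    # e.g. write_without_sort([{"col": "A", "x": 1}, {"col": "B", "x": 2}], "output")

Thanks,
Nikhil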