Re: partitionBy creating lot of small files

2022-06-04 Thread Enrico Minack
You refer to df.write.partitionBy, which creates a directory for each value of "col" and, in the worst case, writes one file per DataFrame partition into each directory. So the number of output files is controlled by the cardinality of "col", which is determined by your data and hence out of your control, and the number of partitions of
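
To make the file-count argument above concrete, here is a minimal sketch in plain Python. It does not use Spark. It only models the behaviour described in the message, assuming one output file per (DataFrame partition, value of "col") pair that actually occurs. The function names and the sample data are hypothetical and exist only for illustration.

```python
def count_output_files(partitions, key):
    """Count output files in a simplified model of a partitioned write.

    Assumption: each in-memory partition writes one file into the
    directory of every distinct key value it contains.
    """
    files = set()
    for part_id, rows in enumerate(partitions):
        for row in rows:
            # One file per (partition, key value) pair that occurs.
            files.add((part_id, row[key]))
    return len(files)


def worst_case_files(partitions, key):
    """Upper bound: every partition contains every distinct key value."""
    distinct_values = {row[key] for rows in partitions for row in rows}
    return len(distinct_values) * len(partitions)


# Hypothetical data: three partitions, two distinct values of "col".
sample_partitions = [
    [{"col": "a"}, {"col": "b"}],
    [{"col": "a"}],
    [{"col": "a"}, {"col": "b"}],
]

actual = count_output_files(sample_partitions, "col")
bound = worst_case_files(sample_partitions, "col")
print(f"files written: {actual}, worst case: {bound}")
```

In this model the count can only shrink if fewer partitions contain each value of "col", which is why the layout of the DataFrame partitions matters as much as the cardinality of the column.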

partitionBy creating lot of small files

2022-06-04 Thread Nikhil Goyal
Hi all, is there a way to use dataframe.partitionBy("col") and control the number of output files without doing a full repartition? The thing is that some partitions have more data while others have less. Doing a .repartition is a costly operation. We want to control the size of the output files. Is it