Hi Spark devs,

I sent an email some time ago about a problem I have: merging a large
number of small files with Spark. Currently I am using Hive with
CombineHiveInputFormat, where I can control the size of the output files
with the max split size parameter (which CombineHiveInputFormat uses to
coalesce the input splits). My first attempt in Spark was to use
coalesce(), but since coalesce() only considers the target number of
partitions, the output file sizes varied wildly.
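
To make the problem concrete, here is roughly what I am doing today (the
paths and the partition count are made up for illustration). coalesce()
knows nothing about the byte sizes of the underlying splits, so if the
splits are skewed, the merged partitions (and the output files) are
skewed too:

import org.apache.spark.{SparkConf, SparkContext}

object MergeSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-small-files"))
    // Read a directory containing many small files.
    val input = sc.textFile("hdfs:///warehouse/small_files/*")
    // coalesce() only targets a partition *count*; the resulting
    // partition (and output file) sizes are uncontrolled.
    val merged = input.coalesce(100)
    merged.saveAsTextFile("hdfs:///warehouse/merged")
  }
}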

What I think would be useful is an optional PartitionCoalescer parameter
(a new interface) on the coalesce() method (or maybe a new method?) that
callers can implement for custom coalescing strategies. For my use case I
have already implemented a SizeBasedPartitionCoalescer that coalesces
partitions by looking at their sizes and applying a max split size
parameter, similar to CombineHiveInputFormat (I also had to expose
HadoopRDD to get access to the individual split sizes, etc.). A rough
sketch of the idea follows below.
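
To make the proposal concrete, here is a sketch of what the interface and
a size-based implementation could look like. This is illustrative, not my
actual patch: the interface shape and the partitionSize callback (standing
in for whatever mechanism ends up exposing the HadoopRDD split sizes) are
all up for discussion:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// Hypothetical shape of the proposed interface: given the parent RDD,
// return groups of parent partitions, each group to be merged into a
// single partition of the coalesced RDD.
trait PartitionCoalescer {
  def coalesce(parent: RDD[_]): Array[Seq[Partition]]
}

// Illustrative size-based strategy: pack parent partitions into groups
// until a group reaches maxSplitSize bytes, similar in spirit to what
// CombineHiveInputFormat does with input splits.
class SizeBasedPartitionCoalescer(
    maxSplitSize: Long,
    partitionSize: Partition => Long) extends PartitionCoalescer {

  override def coalesce(parent: RDD[_]): Array[Seq[Partition]] = {
    val groups = ArrayBuffer[Seq[Partition]]()
    var current = ArrayBuffer[Partition]()
    var currentBytes = 0L
    for (p <- parent.partitions) {
      val bytes = partitionSize(p)
      // Close the current group once adding this partition would push
      // it past the max split size.
      if (current.nonEmpty && currentBytes + bytes > maxSplitSize) {
        groups += current.toSeq
        current = ArrayBuffer[Partition]()
        currentBytes = 0L
      }
      current += p
      currentBytes += bytes
    }
    if (current.nonEmpty) groups += current.toSeq
    groups.toArray
  }
}

coalesce() (or a new overload) would then build its coalesced partitions
from the groups returned here; how the coalescer gets passed in is
exactly the part I would like feedback on.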

What do you think about such a change? Could it be useful to other users
as well? Or is there an easier way to accomplish the same merge logic? If
you think it may be useful, I already have an implementation and would be
happy to work with the community to contribute it.

Thanks,
Nezih
