I think this can be useful.

The only thing is that we are slowly migrating to the Dataset/DataFrame
API, leaving RDD mostly as-is as a lower-level API. Maybe we should do
both? In either case it would be great to discuss the API on a pull
request. Cheers.

On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
nyigitb...@netflix.com.invalid> wrote:

> Hi Spark devs,
>
> I sent an email some time ago about a problem I have: merging a large
> number of small files with Spark. Currently I am using Hive with
> CombineHiveInputFormat, and I can control the size of the output files
> with the max split size parameter (which CombineHiveInputFormat uses to
> coalesce the input splits). My first attempt was to use coalesce(), but
> since coalesce() only considers the target number of partitions, the
> output file sizes varied wildly.
>
> What I think can be useful is to have an optional PartitionCoalescer
> parameter (a new interface) in the coalesce() method (or maybe we can add
> a new method?) that callers can implement for custom coalescing
> strategies. For my use case I have already implemented a
> SizeBasedPartitionCoalescer that coalesces partitions by looking at their
> sizes and using a max split size parameter, similar to
> CombineHiveInputFormat (I also had to expose HadoopRDD to get access to
> the individual split sizes, etc.).
>
> What do you guys think about such a change? Could it be useful to other
> users as well? Or do you think there is an easier way to accomplish the
> same merge logic? If it sounds useful, I already have an implementation
> and would be happy to work with the community to contribute it.
>
> Thanks,
> Nezih
>
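
A rough sketch of what a pluggable coalescer along these lines could look
like (everything below is illustrative rather than a final API; the trait
name, the optional partitionCoalescer parameter, and the partitionSize
callback are placeholders whose exact shape would be settled on the pull
request):

import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuffer

// Hypothetical hook: decide how the parent RDD's partitions are grouped
// into the partitions of the coalesced RDD.
trait PartitionCoalescer extends Serializable {
  def coalesce(maxPartitions: Int, parent: RDD[_]): Array[Seq[Partition]]
}

// Illustrative size-based strategy: greedily pack parent partitions until a
// group reaches roughly maxSplitSize bytes. The partitionSize callback is a
// stand-in for however split sizes are obtained (e.g. from HadoopRDD).
class SizeBasedPartitionCoalescer(
    maxSplitSize: Long,
    partitionSize: Partition => Long) extends PartitionCoalescer {

  override def coalesce(maxPartitions: Int, parent: RDD[_]): Array[Seq[Partition]] = {
    val groups = ArrayBuffer.empty[Seq[Partition]]
    var current = ArrayBuffer.empty[Partition]
    var currentBytes = 0L
    for (p <- parent.partitions) {
      val bytes = partitionSize(p)
      if (current.nonEmpty && currentBytes + bytes > maxSplitSize) {
        groups += current.toSeq
        current = ArrayBuffer.empty[Partition]
        currentBytes = 0L
      }
      current += p
      currentBytes += bytes
    }
    if (current.nonEmpty) groups += current.toSeq
    groups.toArray
  }
}

// Possible call site, if coalesce() grew an optional parameter:
//   rdd.coalesce(numPartitions, shuffle = false,
//     partitionCoalescer = Some(new SizeBasedPartitionCoalescer(128L << 20, splitSizeOf)))

The greedy packing above is just the simplest strategy that respects a size
cap; a real implementation could also take locality into account, as
CombineHiveInputFormat does.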
