Using the right email for Nezih

On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com> wrote:

> I think this can be useful.
>
> The only thing is that we are slowly migrating to the Dataset/DataFrame
> API and leaving RDD mostly as is as a lower-level API. Maybe we should
> do both? In either case, it would be great to discuss the API on a pull
> request. Cheers.
>
> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
> nyigitb...@netflix.com.invalid> wrote:
>
>> Hi Spark devs,
>>
>> Some time ago I sent an email about my problem: I want to merge a large
>> number of small files with Spark. Currently I am using Hive with the
>> CombineHiveInputFormat, and I can control the size of the output files
>> with the max split size parameter (which CombineHiveInputFormat uses to
>> coalesce the input splits). My first attempt in Spark was to use
>> coalesce(), but since coalesce() only considers the target number of
>> partitions, the output file sizes varied wildly.
>>
>> What I think could be useful is an optional PartitionCoalescer parameter
>> (a new interface) on the coalesce() method (or perhaps a new method?)
>> that callers can implement for custom coalescing strategies. For my use
>> case I have already implemented a SizeBasedPartitionCoalescer that
>> coalesces partitions by comparing their sizes against a max split size
>> parameter, similar to the CombineHiveInputFormat (I also had to expose
>> HadoopRDD to get access to the individual split sizes, etc.). A rough
>> sketch of what such an interface could look like is appended below the
>> quoted thread.
>>
>> What do you think about such a change? Could it be useful to other users
>> as well? Or is there an easier way to accomplish the same merge logic?
>> If you think it may be useful, I already have an implementation and
>> would be happy to work with the community to contribute it.
>>
>> Thanks,
>> Nezih
>>
>
>
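
For reference, here is a rough sketch of what such a pluggable coalescing
interface might look like. It is only an illustration of the idea in the
proposal above: the PartitionCoalescer and SizeBasedPartitionCoalescer names
come from the email, but the method signature, the PartitionGroup helper, the
partitionSizes callback, and the optional coalesce() parameter shown in the
usage comment are assumptions, not an actual Spark API.

  import org.apache.spark.rdd.RDD
  import scala.collection.mutable.ArrayBuffer

  // A group of parent partition indices to be merged into one output partition.
  case class PartitionGroup(partitionIndices: Seq[Int])

  // Callers implement this to plug a custom coalescing strategy into coalesce().
  trait PartitionCoalescer {
    def coalesce(maxPartitions: Int, parent: RDD[_]): Seq[PartitionGroup]
  }

  // Illustrative size-based strategy: pack parent partitions into groups until
  // each group reaches roughly maxSplitSize bytes, similar to how
  // CombineHiveInputFormat coalesces input splits. Per-partition sizes are
  // supplied by the caller (e.g. derived from HadoopRDD split sizes).
  class SizeBasedPartitionCoalescer(
      maxSplitSize: Long,
      partitionSizes: Int => Long) extends PartitionCoalescer {

    override def coalesce(maxPartitions: Int, parent: RDD[_]): Seq[PartitionGroup] = {
      val groups = ArrayBuffer.empty[PartitionGroup]
      var current = ArrayBuffer.empty[Int]
      var currentBytes = 0L
      for (p <- parent.partitions) {
        val size = partitionSizes(p.index)
        // Close the current group once adding this partition would exceed the limit.
        if (current.nonEmpty && currentBytes + size > maxSplitSize) {
          groups += PartitionGroup(current.toSeq)
          current = ArrayBuffer.empty[Int]
          currentBytes = 0L
        }
        current += p.index
        currentBytes += size
      }
      if (current.nonEmpty) groups += PartitionGroup(current.toSeq)
      groups.toSeq
    }
  }

  // Hypothetical usage, assuming coalesce() grew an optional parameter:
  //   rdd.coalesce(numPartitions, shuffle = false,
  //     partitionCoalescer = Some(new SizeBasedPartitionCoalescer(maxSplitSize, sizeOf)))
  // versus today's rdd.coalesce(numPartitions), which only targets a partition
  // count and therefore cannot control output file sizes.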
