Sure, here is the JIRA: <https://issues.apache.org/jira/browse/SPARK-14042>, and here is the PR: <https://github.com/apache/spark/pull/11865>.
Nezih

On Sat, Apr 2, 2016 at 10:40 PM Hemant Bhanawat <hemant9...@gmail.com> wrote:

> correcting email id for Nezih
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat <hemant9...@gmail.com>
> wrote:
>
>> Hi Nezih,
>>
>> Can you share JIRA and PR numbers?
>>
>> This partial de-coupling of data partitioning strategy and spark
>> parallelism would be a useful feature for any data store.
>>
>> Hemant
>>
>> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
>> www.snappydata.io
>>
>> On Fri, Apr 1, 2016 at 10:33 PM, Nezih Yigitbasi <
>> nyigitb...@netflix.com.invalid> wrote:
>>
>>> Hey Reynold,
>>> Created an issue (and a PR) for this change to get discussions started.
>>>
>>> Thanks,
>>> Nezih
>>>
>>> On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> Using the right email for Nezih
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> I think this can be useful.
>>>>>
>>>>> The only thing is that we are slowly migrating to the
>>>>> Dataset/DataFrame API, and leave RDD mostly as is as a lower level API.
>>>>> Maybe we should do both? In either case it would be great to discuss the
>>>>> API on a pull request. Cheers.
>>>>>
>>>>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
>>>>> nyigitb...@netflix.com.invalid> wrote:
>>>>>
>>>>>> Hi Spark devs,
>>>>>>
>>>>>> I have sent an email about my problem some time ago where I want to
>>>>>> merge a large number of small files with Spark. Currently I am using Hive
>>>>>> with the CombineHiveInputFormat and I can control the size of the
>>>>>> output files with the max split size parameter (which is used for
>>>>>> coalescing the input splits by the CombineHiveInputFormat). My first
>>>>>> attempt was to use coalesce(), but since coalesce only considers the
>>>>>> target number of partitions the output file sizes were varying wildly.
>>>>>>
>>>>>> What I think can be useful is to have an optional PartitionCoalescer
>>>>>> parameter (a new interface) in the coalesce() method (or maybe we
>>>>>> can add a new method ?) that the callers can implement for custom
>>>>>> coalescing strategies; for my use case I have already implemented a
>>>>>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>>>>>> their sizes and by using a max split size parameter, similar to the
>>>>>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access
>>>>>> to the individual split sizes etc.).
>>>>>>
>>>>>> What do you guys think about such a change, can it be useful to other
>>>>>> users as well? Or do you think that there is an easier way to accomplish
>>>>>> the same merge logic? If you think it may be useful, I already have
>>>>>> an implementation and I will be happy to work with the community to
>>>>>> contribute it.
>>>>>>
>>>>>> Thanks,
>>>>>> Nezih
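
For readers following the thread, the size-based coalescing idea in the original proposal can be illustrated with a small standalone sketch. This is not the SizeBasedPartitionCoalescer from the PR and it does not use any Spark API. The function name `coalesce_by_size`, its parameters, and the greedy strategy are assumptions chosen only to show how grouping by a max split size differs from grouping by a target partition count.

```python
from typing import List, Sequence


def coalesce_by_size(split_sizes: Sequence[int], max_split_size: int) -> List[List[int]]:
    """Greedily group consecutive split indices by size.

    Hypothetical helper, not part of Spark. A group is closed as soon as
    adding the next split would push its total above max_split_size. A single
    split larger than the limit ends up alone in its own group.
    """
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for index, size in enumerate(split_sizes):
        # Close the current group if this split would overflow it.
        if current and current_size + size > max_split_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size

    # Flush the last, possibly partially filled, group.
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    # Illustrative split sizes (e.g. in MB); the values are made up.
    sizes = [40, 10, 30, 50, 5, 5, 60, 20, 90, 10]
    for group in coalesce_by_size(sizes, max_split_size=100):
        print(group, sum(sizes[i] for i in group))
```

Each group here would correspond to one coalesced partition, so the number of output partitions falls out of the data sizes instead of being fixed up front. A real implementation would presumably also weigh data locality, which this sketch ignores.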