Correcting email id for Nezih.

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

On Sun, Apr 3, 2016 at 11:09 AM, Hemant Bhanawat <hemant9...@gmail.com> wrote:

> Hi Nezih,
>
> Can you share JIRA and PR numbers?
>
> This partial de-coupling of data partitioning strategy and spark
> parallelism would be a useful feature for any data store.
>
> Hemant
>
> Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
> www.snappydata.io
>
> On Fri, Apr 1, 2016 at 10:33 PM, Nezih Yigitbasi <
> nyigitb...@netflix.com.invalid> wrote:
>
>> Hey Reynold,
>> Created an issue (and a PR) for this change to get discussions started.
>>
>> Thanks,
>> Nezih
>>
>> On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Using the right email for Nezih
>>>
>>> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <r...@databricks.com>
>>> wrote:
>>>
>>>> I think this can be useful.
>>>>
>>>> The only thing is that we are slowly migrating to the Dataset/DataFrame
>>>> API, and leave RDD mostly as is as a lower level API. Maybe we should do
>>>> both? In either case it would be great to discuss the API on a pull
>>>> request. Cheers.
>>>>
>>>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
>>>> nyigitb...@netflix.com.invalid> wrote:
>>>>
>>>>> Hi Spark devs,
>>>>>
>>>>> I have sent an email about my problem some time ago where I want to
>>>>> merge a large number of small files with Spark. Currently I am using Hive
>>>>> with the CombineHiveInputFormat and I can control the size of the
>>>>> output files with the max split size parameter (which is used for
>>>>> coalescing the input splits by the CombineHiveInputFormat). My first
>>>>> attempt was to use coalesce(), but since coalesce only considers the
>>>>> target number of partitions the output file sizes were varying wildly.
>>>>>
>>>>> What I think can be useful is to have an optional PartitionCoalescer
>>>>> parameter (a new interface) in the coalesce() method (or maybe we can
>>>>> add a new method?) that the callers can implement for custom coalescing
>>>>> strategies. For my use case I have already implemented a
>>>>> SizeBasedPartitionCoalescer that coalesces partitions by looking at
>>>>> their sizes and by using a max split size parameter, similar to the
>>>>> CombineHiveInputFormat (I also had to expose HadoopRDD to get access
>>>>> to the individual split sizes etc.).
>>>>>
>>>>> What do you guys think about such a change, can it be useful to other
>>>>> users as well? Or do you think that there is an easier way to accomplish
>>>>> the same merge logic? If you think it may be useful, I already have
>>>>> an implementation and I will be happy to work with the community to
>>>>> contribute it.
>>>>>
>>>>> Thanks,
>>>>> Nezih
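
For readers following the thread, the size-based coalescing idea described in the original message can be illustrated with a small sketch. This is plain Python, not Spark code, and it is not the SizeBasedPartitionCoalescer implementation mentioned above, which is not shown in the thread. The names Split and coalesce_by_size are hypothetical, and the greedy "pack consecutive splits until the limit is reached" rule is only one assumed strategy for honoring a max split size.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    """Hypothetical stand-in for an input partition and its size in bytes."""
    index: int
    size_bytes: int


def coalesce_by_size(splits: List[Split], max_split_size: int) -> List[List[Split]]:
    """Greedily pack consecutive splits into groups of at most max_split_size bytes.

    A single split larger than the limit ends up in a group of its own.
    """
    if max_split_size <= 0:
        raise ValueError("max_split_size must be positive")

    groups: List[List[Split]] = []
    current: List[Split] = []
    current_size = 0

    for split in splits:
        # Close the current group if adding this split would exceed the limit.
        if current and current_size + split.size_bytes > max_split_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(split)
        current_size += split.size_bytes

    # Flush the last, possibly partially filled, group.
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    sizes = [10, 20, 30, 200, 5, 5, 90]  # illustrative sizes, not real data
    splits = [Split(i, s) for i, s in enumerate(sizes)]
    for group in coalesce_by_size(splits, max_split_size=100):
        print([s.index for s in group], sum(s.size_bytes for s in group))
```

Unlike coalescing to a fixed number of partitions, the number of output groups here falls out of the input sizes and the size limit, which is why the resulting group sizes stay close to the limit instead of varying with the input distribution.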