Thank you! It's a very nice improvement :). However, my situation is a bit different -- the code there tries to give each coalesced partition roughly the same *number of parent partitions*, while in my case the parent partitions could be quite imbalanced, and I am trying to give each coalesced partition roughly the same *SIZE*.
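
To make the goal concrete, here is a minimal sketch (plain Python, no Spark calls) of the grouping I have in mind. It assumes the host and size of every parent partition are already known, which is exactly the information I am missing; `coalesce_by_size` and the tuple layout are made-up names for illustration, not anything in Spark.

```python
from collections import defaultdict


def coalesce_by_size(partitions, num_groups):
    """Group parent partitions into host-local groups of roughly equal size.

    partitions: list of (partition_index, host, size) tuples (hypothetical layout).
    num_groups: desired number of coalesced partitions; a target, not a guarantee.
    Returns a list of (host, [partition_index, ...], total_size) tuples.
    """
    total = sum(size for _, _, size in partitions)
    # Size each coalesced partition should aim for.
    target = max(1, total // max(1, num_groups))

    # Only partitions on the same host are merged, so no data has to move.
    by_host = defaultdict(list)
    for index, host, size in partitions:
        by_host[host].append((index, size))

    groups = []
    for host in sorted(by_host):
        current, current_size = [], 0
        # Largest first: a partition bigger than the target ends up alone.
        for index, size in sorted(by_host[host], key=lambda p: -p[1]):
            if current and current_size + size > target:
                groups.append((host, current, current_size))
                current, current_size = [], 0
            current.append(index)
            current_size += size
        if current:
            groups.append((host, current, current_size))
    return groups


if __name__ == "__main__":
    parents = [(0, "a", 100), (1, "a", 10), (2, "a", 10),
               (3, "b", 60), (4, "b", 50), (5, "b", 10)]
    for group in coalesce_by_size(parents, 4):
        print(group)
```

This is only a greedy heuristic: it never merges across hosts, so the number of groups and their sizes can drift from the target when a single host holds very little or very much data.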
Of course, this requires the sizes of the parent partitions to be known -- which is not a problem in my case, as I would always generate and cache them. This is probably not a common case, so I am happy to write my own (hacky) code to work around it -- but I need the location of each cached partition...

By the way: is it possible to assign preferred locations to a ParallelCollectionRDD (e.g. RDDs generated by sc.parallelize)? Sorry if it is a silly question...

Best,
Wenlei

On Mon, Sep 2, 2013 at 12:28 AM, Reynold Xin <[email protected]> wrote:

> Does this help you? https://github.com/mesos/spark/pull/832
>
> --
> Reynold Xin, AMPLab, UC Berkeley
> http://rxin.org
>
> On Mon, Sep 2, 2013 at 3:24 PM, Wenlei Xie <[email protected]> wrote:
>
>> Hi,
>>
>> I am wondering if it is possible to get the partition position of a cached
>> RDD? I am asking this because I am trying to avoid shuffling when
>> performing a coalesce operation. The sizes of my partitions could be quite
>> imbalanced, so CoalescedRDD would probably not be a good solution in my
>> case.
>>
>> Thank you!
>>
>> Best,
>> Wenlei
>>
>> --
>> Wenlei Xie (谢文磊)
>>
>> Department of Computer Science
>> 5132 Upson Hall, Cornell University
>> Ithaca, NY 14853, USA
>> Phone: (607) 255-5577
>> Email: [email protected]

--
Wenlei Xie (谢文磊)

Department of Computer Science
5132 Upson Hall, Cornell University
Ithaca, NY 14853, USA
Phone: (607) 255-5577
Email: [email protected]
