Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Koert Kuipers
shoot me an email if you need any help with spark-sorted. it does not (yet?) have a java api, so you will have to work in scala On Mon, May 4, 2015 at 4:05 PM, Burak Yavuz brk...@gmail.com wrote: I think this Spark Package may be what you're looking for!

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Imran Rashid
oh wow, that is a really interesting observation, Marco Jerry. I wonder if this is worth exposing in combineByKey()? I think Jerry's proposed workaround is all you can do for now -- use reflection to side-step the fact that the methods you need are private. On Mon, Apr 27, 2015 at 8:07 AM,

Re: ReduceByKey and sorting within partitions

2015-05-04 Thread Burak Yavuz
I think this Spark Package may be what you're looking for! http://spark-packages.org/package/tresata/spark-sorted Best, Burak On Mon, May 4, 2015 at 12:56 PM, Imran Rashid iras...@cloudera.com wrote: oh wow, that is a really interesting observation, Marco Jerry. I wonder if this is worth

Re: ReduceByKey and sorting within partitions

2015-04-29 Thread Marco
On 04/27/2015 06:00 PM, Ganelin, Ilya wrote: Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index

Re: ReduceByKey and sorting within partitions

2015-04-27 Thread Saisai Shao
Hi Marco, As I know, current combineByKey() does not expose the related argument where you could set keyOrdering on the ShuffledRDD, since ShuffledRDD is package private, if you can get the ShuffledRDD through reflection or other way, the keyOrdering you set will be pushed down to shuffle. If you

RE: ReduceByKey and sorting within partitions

2015-04-27 Thread Ganelin, Ilya
Marco - why do you want data sorted both within and across partitions? If you need to take an ordered sequence across all your data you need to either aggregate your RDD on the driver and sort it, or use zipWithIndex to apply an ordered index to your data that matches the order it was stored on