Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
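For reference, a minimal sketch of what that call looks like (the SparkContext setup, the sample data, and the partitioner are illustrative, not from the thread):

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(new SparkConf().setAppName("repartition-sort"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3)))

// One shuffle: each record is routed to its target partition and the
// records within each partition are sorted by key.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))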
On Tue, May 5, 2015 at 9:45 AM Marius Danciu wrote:
Hi Imran,
Yes, that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is assigned to partition 0, but then I see
this record arriving in both YARN containers (I see it in the logs).
Basically I need to emulate a Hadoop map-reduce job in Spark, and
groupByKey seemed the natural choice.
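A hedged sketch of that map-reduce emulation pattern (the input data and the sum reduction are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mr-emulation"))
// Hypothetical mapper output: (key, value) pairs.
val mapped = sc.parallelize(Seq(("k1", 1), ("k2", 5), ("k1", 2)))
// groupByKey plays the role of the Hadoop shuffle/grouping step:
// all values for a given key are gathered into one Iterable.
val reduced = mapped.groupByKey().mapValues(_.sum)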
Hi Marius,
I am also a little confused -- are you saying that MyPartitioner is
basically something like:
import org.apache.spark.Partitioner

class MyPartitioner extends Partitioner {
  // Route every record to a single partition.
  def numPartitions: Int = 1
  def getPartition(key: Any): Int = 0
}
??
If so, I don't understand how you'd ever end up with data in two partitions.
Indeed, then every record would end up in partition 0.
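One way to check where records actually land is to trace them with mapPartitionsWithIndex. This sketch assumes the MyPartitioner class shown above and an illustrative pair RDD:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-trace"))
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val traced = pairs
  .partitionBy(new MyPartitioner)
  .mapPartitionsWithIndex((idx, iter) =>
    iter.map { case (k, v) => s"partition $idx: ($k, $v)" })
// With the one-partition MyPartitioner above, every line should read "partition 0".
traced.collect().foreach(println)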
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 9:53 AM
To: Silvio Fiorito, user
Subject: Re: Spark partitioning question
Thank you Silvio,
I am aware of the groupByKey limitations, and it is subject to replacement.
I did try repartitionAndSortWithinPartitions, but then I end up with maybe
too much shuffling: one shuffle from groupByKey and another from the repartition.
My expectation was that since N records are partitioned to the same
partition, they would all be processed in the same place.
Hi Marius,
What’s the expected output?
I would recommend avoiding groupByKey if possible, since it forces
all records for each key to go to a single executor, which may overload it.
Also, if you need to sort and repartition, try using
repartitionAndSortWithinPartitions to do it in one shot.
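To illustrate the difference (data and partitioner are made up for the example): doing the repartition and the sort separately costs two shuffles, while repartitionAndSortWithinPartitions does both in one. Note that sortByKey produces a total order across partitions, whereas the one-shot version only sorts within each partition.

import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

val sc = new SparkContext(new SparkConf().setAppName("one-shot-sort"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
val part = new HashPartitioner(8)

// Two shuffles: one for partitionBy, another for sortByKey.
val twoStep = pairs.partitionBy(part).sortByKey()

// One shuffle: partitioning and per-partition sorting happen together.
val oneShot = pairs.repartitionAndSortWithinPartitions(part)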