Doesn't repartition call coalesce(shuffle=true)?

On Jun 18, 2015 6:53 PM, "Du Li" <l...@yahoo-inc.com.invalid> wrote:
> I got the same problem with rdd.repartition() in my streaming app, which
> generated a few huge partitions and many tiny partitions. The resulting
> high data skew makes the processing time of a batch unpredictable and
> often longer than the batch interval. I eventually solved the problem by
> using rdd.coalesce() instead, which however is expensive as it yields a
> lot of shuffle traffic and also takes a long time.
>
> Du
>
> On Thursday, June 18, 2015 1:00 AM, Al M <alasdair.mcbr...@gmail.com>
> wrote:
>
> Thanks for the suggestion. Repartition didn't help us unfortunately. It
> still puts everything into the same partition.
>
> We did manage to improve the situation by making a new partitioner that
> extends HashPartitioner. It treats certain "exception" keys differently.
> These keys, which are known to appear very often, are assigned random
> partitions instead of using the existing partitioning mechanism.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Shuffle-produces-one-huge-partition-and-many-tiny-partitions-tp23358p23387.html
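
For anyone trying the "exception keys" approach Al describes, here is a minimal sketch of the idea in plain Python. It is not Spark code and not Al's actual implementation: the class name SkewAwarePartitioner, the hot_keys set, the seed parameter and the partition_sizes helper are all hypothetical names made up for illustration, and crc32 stands in for whatever hash the real partitioner uses. Ordinary keys are hashed as usual; keys known to be very frequent get a random partition on every call.

```python
import random
import zlib


class SkewAwarePartitioner:
    """Hash partitioner that scatters known hot keys over random partitions.

    Hypothetical illustration only; this is not a Spark class.
    """

    def __init__(self, num_partitions, hot_keys, seed=None):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self.hot_keys = frozenset(hot_keys)
        self._rng = random.Random(seed)

    def get_partition(self, key):
        if key in self.hot_keys:
            # Hot key: pick a partition at random so its records spread out.
            return self._rng.randrange(self.num_partitions)
        # Ordinary key: stable hash, so the same key always maps to the
        # same partition (crc32 is used because Python's built-in hash()
        # of strings changes between processes).
        return zlib.crc32(repr(key).encode("utf-8")) % self.num_partitions


def partition_sizes(keys, partitioner):
    """Count how many records each partition would receive."""
    sizes = [0] * partitioner.num_partitions
    for key in keys:
        sizes[partitioner.get_partition(key)] += 1
    return sizes


if __name__ == "__main__":
    keys = ["hot"] * 8000 + ["k%d" % i for i in range(800)]
    print("plain hash :", partition_sizes(keys, SkewAwarePartitioner(8, set())))
    print("skew-aware :", partition_sizes(keys, SkewAwarePartitioner(8, {"hot"})))
```

The trade-off is that records sharing a hot key no longer land in the same partition, so any per-key aggregation on those keys needs a second step that merges the partial results.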