Re: coalesce ending up very unbalanced - but why?

2016-12-16 Thread vaquar khan
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Regards, vaquar khan

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Vaibhav Sinha
Hi, I see similar behaviour in an exactly similar scenario at my deployment as well. I am using Scala, so the behaviour is not limited to pyspark. In my case, 9 out of 10 partitions are of a similar size (~38 GB each) and the final one is significantly larger (~59 GB). Prime number

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hello, We have done some tests here, and it seems that when we use a prime number of partitions the data is spread more evenly. This has to do with the hash partitioning and the Java hash algorithm. I don't know what your data looks like or how this works in Python, but if you (can) implement a partitioner, or
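
A minimal PySpark sketch of that suggestion, assuming a keyed RDD (the keys, values, and partition count here are purely illustrative): in PySpark a custom partitioner is just a function passed to partitionBy(), so the built-in hash can be swapped for something that spreads the real key distribution more evenly.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("custom-partitioner-sketch").getOrCreate()
    sc = spark.sparkContext

    # Illustrative keyed RDD; the real keys/values are not shown in the thread.
    pairs = sc.parallelize([(i, i * i) for i in range(100000)])

    NUM_PARTITIONS = 71  # a prime, as suggested above

    def spread_partitioner(key):
        # partitionBy() takes this integer modulo NUM_PARTITIONS; replacing
        # Python's built-in hash() here is where a better spread for the
        # actual key distribution could be plugged in.
        return hash(key)

    partitioned = pairs.partitionBy(NUM_PARTITIONS, spread_partitioner)

    # How evenly did the keys land?
    print(sorted(partitioned.glom().map(len).collect()))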

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
Since it's pyspark, it's just using the default hash partitioning, I believe. Trying a prime number (71, so that there are enough CPUs) doesn't seem to change anything. Out of curiosity, why did you suggest that? Googling "spark coalesce prime" doesn't give me any clue :-) Adrian
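
For the curious, a small driver-side sketch (keys are illustrative) of how a hash-modulo assignment spreads keys across 71 partitions; PySpark's default partitioner, portable_hash in pyspark.rdd, behaves like Python's built-in hash() for simple keys, so this approximates what the default partitioning would do.

    from collections import Counter

    NUM_PARTITIONS = 71  # the prime tried above

    # Illustrative integer keys; swap in the job's real keys to see the
    # actual spread under hash partitioning.
    keys = range(100000)

    buckets = Counter(hash(k) % NUM_PARTITIONS for k in keys)

    # With sequential integer keys this comes out almost perfectly even,
    # which is consistent with the observation that a prime partition
    # count alone doesn't change anything here.
    print(sorted(buckets.values()))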

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hi Adrian, Which kind of partitioning are you using? Have you already tried to coalesce it to a prime number?

coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
I realise that coalesce() isn't guaranteed to be balanced, and adding a repartition() does indeed fix this (at the cost of a large shuffle). I'm trying to understand _why_ it's so uneven (hopefully it helps someone else too). This is using Spark v2.0.2 (pyspark). Essentially we're just
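
A minimal PySpark sketch of the difference being described (the data and names are illustrative, not the original job): coalesce() only merges existing partitions without a shuffle, so it inherits whatever skew the upstream layout had, while repartition() shuffles and redistributes rows roughly evenly.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

    # Illustrative data; the original job's input is not shown in the thread.
    df = spark.range(0, 1000000)

    # coalesce() only merges existing partitions (no shuffle), so any skew in
    # the upstream partition layout is carried straight into the result.
    coalesced = df.coalesce(10)

    # repartition() performs a full shuffle and redistributes rows roughly
    # evenly, which is why it fixes the imbalance at the cost of moving all
    # the data.
    rebalanced = df.repartition(10)

    # Rows per partition for each layout.
    print(coalesced.rdd.glom().map(len).collect())
    print(rebalanced.rdd.glom().map(len).collect())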