Hi, I see the same behaviour in a very similar scenario in my deployment as well. I am using Scala, so the behaviour is not limited to pyspark. In my case 9 out of 10 partitions are of similar size (~38 GB each) and the final one is significantly larger (~59 GB). A prime number of partitions is an interesting approach; I will try that out, roughly along the lines of the sketch below.
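Something like this (Scala; the S3 path and tab separator are just placeholders taken from Adrian's description below, and record counts per partition are only a rough proxy for byte sizes):

import org.apache.spark.sql.SparkSession

object PartitionSizeCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("partition-size-check").getOrCreate()

    // Placeholder input path and tab separator, based on the description below.
    val df = spark.read.option("sep", "\t")
      .csv("s3://example-rawdata-prod/data/2016-12-13/")

    // Compare coalesce(72) (what Adrian used), coalesce to a prime (71),
    // and an explicit repartition to a prime.
    val variants = Seq(
      "coalesce(72)"    -> df.coalesce(72),
      "coalesce(71)"    -> df.coalesce(71),
      "repartition(71)" -> df.repartition(71))

    for ((name, d) <- variants) {
      // Count records per partition; rows are similar width, so counts
      // should reveal whether one partition gets far more data than the rest.
      val counts = d.rdd.mapPartitions(it => Iterator(it.size)).collect().sorted
      println(s"$name: partitions=${counts.length} min=${counts.head} max=${counts.last}")
    }

    spark.stop()
  }
}

My expectation is that the repartition() variant comes out even (at the cost of the shuffle Adrian mentions), but I'd rather measure it than guess.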
Best,
Vaibhav

On 14 Dec 2016, 10:18 PM +0530, Dirceu Semighini Filho <dirceu.semigh...@gmail.com>, wrote:
> Hello,
> We have done some tests here, and it seems that when we use a prime number of
> partitions the data is spread more evenly.
> This has to do with the hash partitioning and the Java hash algorithm.
> I don't know what your data looks like or how this works in Python, but if you
> (can) implement a partitioner, or change it from the default, you will get a
> better result.
>
> Dirceu
>
> 2016-12-14 12:41 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:
> >
> > Since it's pyspark it's just using the default hash partitioning, I believe.
> > Trying a prime number (71, so that there are enough CPUs) doesn't seem to
> > change anything. Out of curiosity, why did you suggest that? Googling
> > "spark coalesce prime" doesn't give me any clue :-)
> >
> > Adrian
> >
> > On 14/12/2016 13:58, Dirceu Semighini Filho wrote:
> > > Hi Adrian,
> > > Which kind of partitioning are you using?
> > > Have you already tried to coalesce it to a prime number?
> > >
> > > 2016-12-14 11:56 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:
> > > > I realise that coalesce() isn't guaranteed to be balanced and adding a
> > > > repartition() does indeed fix this (at the cost of a large shuffle).
> > > >
> > > > I'm trying to understand _why_ it's so uneven (hopefully it helps
> > > > someone else too). This is using Spark v2.0.2 (pyspark).
> > > >
> > > > Essentially we're just reading CSVs into a DataFrame (which we persist
> > > > serialised for some calculations), then writing it back out as Parquet.
> > > > To avoid too many Parquet files I've set a coalesce of 72 (9 boxes,
> > > > 8 CPUs each).
> > > >
> > > > The writers end up with about 700-900MB each (not bad), except for one
> > > > which was at 6GB before I killed it.
> > > >
> > > > Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
> > > > this, almost all about 2MB each:
> > > > s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209-i-da71c942-389.gz
> > > > s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529-i-01d3dab021b760d29-334.gz
> > > >
> > > > (we're aware that this isn't an ideal naming convention from an S3
> > > > performance PoV).
> > > >
> > > > The actual CSV file format is:
> > > > UUID\tINT\tINT\t... (wide rows - about 300 columns)
> > > >
> > > > e.g.:
> > > > 17f9c2a7-ddf6-42d3-bada-63b845cb33a5 1481587198750 11213 ...
> > > > 1d723493-5341-450d-a506-5c96ce0697f0 1481587198751 11212 ...
> > > > 64cec96f-732c-44b8-a02e-098d5b63ad77 1481587198752 11211 ...
> > > >
> > > > The dataframe seems to be stored evenly on all the nodes (according to
> > > > the storage tab) and all the blocks are the same size. Most of the
> > > > tasks are executed at NODE_LOCAL locality (although there are a few
> > > > ANY). The oversized task is NODE_LOCAL though.
> > > >
> > > > The reading and calculations all seem evenly spread, so I'm confused why
> > > > the writes aren't, as I'd expect the input partitions to be even. What's
> > > > causing this, and what can we do about it? Maybe it's possible for
> > > > coalesce() to be a bit smarter about which partitions it coalesces -
> > > > balancing the size of the final partitions rather than the number of
> > > > source partitions in each final partition.
> > > >
> > > > Thanks for any light you can shine!
> > > >
> > > > Adrian
> >
> > --
> > Adrian Bridgett | Sysadmin Engineer, OpenSignal (http://www.opensignal.com)
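P.S. For completeness, Dirceu's suggestion of implementing a partitioner could look roughly like the sketch below (Scala; keying on the leading UUID column, the partition count of 71 and the output path are my assumptions for illustration, not something from the thread):

import org.apache.spark.Partitioner
import org.apache.spark.sql.SparkSession

// Sketch: spread rows over a prime number of partitions by hashing the
// leading UUID column explicitly, instead of relying on coalesce().
class UuidHashPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h // keep the result non-negative
  }
}

object CustomPartitionWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("custom-partition-write").getOrCreate()

    // Placeholder input path / separator taken from the description above.
    val df = spark.read.option("sep", "\t")
      .csv("s3://example-rawdata-prod/data/2016-12-13/")

    // Key each row by its first column (the UUID), repartition explicitly,
    // then drop the key again before writing.
    val rows = df.rdd
      .keyBy(row => row.getString(0))
      .partitionBy(new UuidHashPartitioner(71)) // 71 = prime close to the 72 cores
      .values

    spark.createDataFrame(rows, df.schema)
      .write
      .parquet("s3://example-rawdata-prod/out/2016-12-13/") // hypothetical output path

    spark.stop()
  }
}

Whether hashing the UUID actually balances better than the default will depend on the key distribution, so I'd treat this as something to measure rather than a fix.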