https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html
Regards, vaquar khan

On Wed, Dec 14, 2016 at 12:15 PM, Vaibhav Sinha <mail.vsi...@gmail.com> wrote:

Hi,
I see similar behaviour in an exactly similar scenario at my deployment as well. I am using Scala, so the behaviour is not limited to pyspark. In my case, 9 out of 10 partitions are of similar size, ~38 GB each, and the final one is significantly larger, ~59 GB. A prime number of partitions is an interesting approach; I will try that out.

Best,
Vaibhav.

On 14 Dec 2016, 10:18 PM +0530, Dirceu Semighini Filho <dirceu.semigh...@gmail.com> wrote:

Hello,
We have done some tests here, and it seems that when we use a prime number of partitions the data is spread more evenly. This has to do with the hash partitioning and the Java hash algorithm. I don't know what your data looks like or how this behaves in Python, but if you can implement a partitioner, or change it from the default, you will get a better result.

Dirceu

2016-12-14 12:41 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:

Since it's pyspark, it's just using the default hash partitioning, I believe. Trying a prime number (71, so that there are enough CPUs) doesn't seem to change anything. Out of curiosity, why did you suggest that? Googling "spark coalesce prime" doesn't give me any clue :-)

Adrian

On 14/12/2016 13:58, Dirceu Semighini Filho wrote:

Hi Adrian,
Which kind of partitioning are you using? Have you already tried coalescing to a prime number?

2016-12-14 11:56 GMT-02:00 Adrian Bridgett <adr...@opensignal.com>:

I realise that coalesce() isn't guaranteed to be balanced, and adding a repartition() does indeed fix this (at the cost of a large shuffle).

I'm trying to understand _why_ it's so uneven (hopefully it helps someone else too). This is using Spark v2.0.2 (pyspark).

Essentially we're just reading CSVs into a DataFrame (which we persist serialised for some calculations), then writing it back out as Parquet. To avoid too many Parquet files I've set a coalesce of 72 (9 boxes, 8 CPUs each).

The writers end up with about 700-900 MB each (not bad), except for one which was at 6 GB before I killed it.

Input data is 12000 gzipped CSV files in S3 (approx 30 GB), almost all about 2 MB each, named like this:

s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209-i-da71c942-389.gz
s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529-i-01d3dab021b760d29-334.gz

(We're aware that this isn't an ideal naming convention from an S3 performance point of view.)

The actual CSV file format is UUID\tINT\tINT\t... (wide rows, about 300 columns), e.g.:

17f9c2a7-ddf6-42d3-bada-63b845cb33a5 1481587198750 11213 ...
1d723493-5341-450d-a506-5c96ce0697f0 1481587198751 11212 ...
64cec96f-732c-44b8-a02e-098d5b63ad77 1481587198752 11211 ...

The DataFrame seems to be stored evenly on all the nodes (according to the storage tab) and all the blocks are the same size. Most of the tasks are executed at NODE_LOCAL locality (although there are a few ANY). The oversized task is NODE_LOCAL though.

The reading and calculations all seem evenly spread, so I'm confused why the writes aren't, as I'd expect the input partitions to be even. What's causing this, and what can we do?
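For reference, a minimal pyspark sketch of the pipeline as described (the output path and persistence level here are illustrative assumptions; only the input path, the serialised persist, the coalesce(72) and the Parquet write come from the description above):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # ~12000 gzipped, tab-separated CSVs (~300 columns, ~2 MB each)
    df = (spark.read
          .option("sep", "\t")
          .option("header", "false")
          .csv("s3://example-rawdata-prod/data/2016-12-13/v3.19.0/*.gz"))

    # persist for the intermediate calculations
    # (pyspark stores cached rows in pickled/serialised form in any case)
    df.persist(StorageLevel.MEMORY_AND_DISK)

    # ... calculations on df happen here ...

    # coalesce(72) only merges existing input partitions into 72 output files
    # (9 boxes x 8 CPUs); it does not shuffle, so one output partition can
    # end up far larger than the rest.
    df.coalesce(72).write.parquet("s3://example-rawdata-prod/output/2016-12-13/")

    # repartition(72) does a full shuffle and produces much more evenly
    # sized files, at the cost of moving all the data:
    # df.repartition(72).write.parquet("s3://example-rawdata-prod/output/2016-12-13/")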
Maybe it's possible for coalesce() to be a bit smarter about which partitions it coalesces: balancing the size of the final partitions rather than the number of source partitions in each final partition.

Thanks for any light you can shine!

Adrian

--
Adrian Bridgett | Sysadmin Engineer, OpenSignal <http://www.opensignal.com>
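As a rough sketch of Dirceu's suggestion (a prime partition count and a non-default partitioner), in pyspark it could look like the following; the column name, partition count and hash choice are illustrative assumptions, not a recommendation:

    import hashlib
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # DataFrame-level: repartitioning by a column hash-partitions on that column;
    # a prime count such as 71 can spread the keys more evenly.
    # ("uuid" is an illustrative column name; the CSVs above have no header.)
    # df = df.repartition(71, "uuid")

    # RDD-level: partitionBy() accepts a custom partition function in place of
    # the default hash, e.g. a stable digest of the UUID string:
    def uuid_partitioner(key):
        return int(hashlib.md5(str(key).encode("utf-8")).hexdigest(), 16)

    pairs = sc.parallelize([
        ("17f9c2a7-ddf6-42d3-bada-63b845cb33a5", 1),
        ("1d723493-5341-450d-a506-5c96ce0697f0", 2),
        ("64cec96f-732c-44b8-a02e-098d5b63ad77", 3),
    ])
    repartitioned = pairs.partitionBy(71, uuid_partitioner)
    print(repartitioned.glom().map(len).collect())   # rows per partition

Both forms of repartitioning involve a shuffle, unlike coalesce(), which only merges existing input partitions.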