I realise that coalesce() isn't guaranteed to be balanced, and adding a
repartition() does indeed fix this (at the cost of a large shuffle).
I'm trying to understand _why_ it's so uneven (hopefully it helps
someone else too). This is using spark v2.0.2 (pyspark).
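For reference, the difference in a nutshell (df being the DataFrame we
write out, and out_path a placeholder for the destination):

# coalesce: no shuffle. Each of the 72 output partitions is just a
# union of some of the existing input partitions, so the output sizes
# depend entirely on how those inputs get grouped together.
df.coalesce(72).write.parquet(out_path)

# repartition: full shuffle. Rows get redistributed across all 72
# partitions, so they come out roughly equal in size.
df.repartition(72).write.parquet(out_path)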
Essentially we're just reading CSVs into a DataFrame (which we persist
serialised for some calculations), then writing it back out as Parquet. To
avoid too many Parquet files I've set a coalesce of 72 (9 boxes, 8 CPUs each).
The writers end up with about 700-900MB each (not bad), except for one
which was at 6GB before I killed it.
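Roughly, the job looks like this (the calculation step elided, and the
output path below is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# ~12000 gzipped, tab-separated CSVs (~2MB each) from S3
df = spark.read.option("sep", "\t").csv(
    "s3://example-rawdata-prod/data/2016-12-13/v3.19.0/")

# persisted (serialised) while we run the calculations
df.persist()

# ... calculations on df ...

# 72 output files: 9 boxes x 8 CPUs each
df.coalesce(72).write.parquet("s3://example-rawdata-prod/parquet/2016-12-13/")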
Input data is 12000 gzipped CSV files in S3 (approx 30GB), named like
this, almost all about 2MB each:
s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587209-i-da71c942-389.gz
s3://example-rawdata-prod/data/2016-12-13/v3.19.0/1481587529-i-01d3dab021b760d29-334.gz
(we're aware that this isn't an ideal naming convention from an S3
performance PoV).
The actual CSV file format is:
UUID\tINT\tINT\t... (wide rows - about 300 columns)
e.g.:
17f9c2a7-ddf6-42d3-bada-63b845cb33a5 1481587198750 11213....
1d723493-5341-450d-a506-5c96ce0697f0 1481587198751 11212 ...
64cec96f-732c-44b8-a02e-098d5b63ad77 1481587198752 11211 ...
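For illustration, reading that format with an explicit tab separator and
schema looks something like this (column names are placeholders, only the
first few of the ~300 columns shown, and input_path is the S3 prefix above):

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, IntegerType)

# placeholder names; the remaining ~297 columns are all integers
schema = StructType([
    StructField("uuid", StringType()),
    StructField("ts", LongType()),
    StructField("metric_1", IntegerType()),
    # ...
])

df = spark.read.option("sep", "\t").schema(schema).csv(input_path)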
The dataframe seems to be stored evenly on all the nodes (according to
the storage tab) and all the blocks are the same size. Most of the
tasks are executed at NODE_LOCAL locality (although there are a few
ANY). The oversized task is NODE_LOCAL though.
The reading and calculations all seem evenly spread, so I'm confused why
the writes aren't, as I'd expect the input partitions to be even. What's
causing this, and what can we do about it? Maybe coalesce() could be a
bit smarter about which partitions it coalesces - balancing the size of
the final partitions rather than the number of source partitions in each
final partition.
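In case it helps anyone hitting the same thing, a quick way to see how
uneven the coalesced partitions are is to count rows per partition
(a rough proxy for output file size):

# row counts per coalesced partition; uneven counts here correspond
# to the uneven Parquet writers
counts = (df.coalesce(72).rdd
          .mapPartitionsWithIndex(lambda i, it: [(i, sum(1 for _ in it))])
          .collect())
for idx, n in sorted(counts):
    print("%d: %d" % (idx, n))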
Thanks for any light you can shine!
Adrian