Right, I figured I'd need a custom partitioner from what I've read. Documentation on this is super sparse; do you have any recommended links on solving data skew and/or writing custom partitioners in Spark 1.4?
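From what I can piece together, the core of it would look something like the sketch below. It's plain Scala (no Spark dependency, so it runs standalone); on the cluster this logic would live in a class extending org.apache.spark.Partitioner and overriding numPartitions / getPartition. The hot-key set and the "salting" idea are my guesses at the approach, not working code from my job:

```scala
// Standalone sketch of the partition-assignment logic a custom
// org.apache.spark.Partitioner would implement. The hotKeys set and
// salt count are made-up placeholders, not values from my job.
object SkewAwarePartitioning {
  // Same non-negative modulo Spark's HashPartitioner uses, so negative
  // hashCodes still map into [0, numPartitions).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Spread a known hot key over `salts` buckets by appending a random
  // suffix before hashing; cold keys hash as usual and stay deterministic.
  def saltedPartition(key: String, hotKeys: Set[String],
                      numPartitions: Int, salts: Int): Int = {
    val effectiveKey =
      if (hotKeys.contains(key)) s"${key}#${scala.util.Random.nextInt(salts)}"
      else key
    nonNegativeMod(effectiveKey.hashCode, numPartitions)
  }
}
```

Is that roughly the shape of it, or am I off base?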
I'd also love to hear whether this is an unusual problem for my type of setup, i.e. whether the cluster should be able to handle this if it were somehow configured differently.

Thank you,
Mo

Sent from my iPhone

> On Jul 6, 2015, at 8:12 PM, ayan guha <guha.a...@gmail.com> wrote:
>
> You can bump up the number of partitions via a parameter on the join
> operator. However, you have a data-skew problem, which you need to
> resolve with a reasonable partitioning function.
>
>> On 7 Jul 2015 08:57, "Mohammed Omer" <beancinemat...@gmail.com> wrote:
>> Afternoon all,
>>
>> Really loving this project and the community behind it. Thank you all for
>> your hard work.
>>
>> This past week, though, I've been having a hard time getting my first
>> deployed job to run without failing at the same point every time: right
>> after a leftOuterJoin, most partitions (600 total) are small (1-100MB),
>> while some others are large (3-6GB). The large ones consistently spill
>> 20-60GB into memory, and eventually fail.
>>
>> If I could only get the partitions to be smaller right out of the
>> leftOuterJoin, it seems like the job would run fine.
>>
>> I've tried trawling through the logs, but that hasn't been very fruitful
>> in pinpointing what, specifically, the issue is.
>>
>> Cluster setup:
>>
>> * 6 worker nodes (16 cores, 104GB memory, 500GB storage each)
>> * 1 master (same config as above)
>>
>> Running Spark on YARN, with:
>>
>> storage.memoryFraction = .3
>> --executors = 6
>> --executor-cores = 12
>> --executor-memory = kind of confusing due to YARN, but in the Spark
>> monitoring UI's Executors page each executor shows 18.8GB, though I know
>> actual usage is much larger because YARN manages various pieces. (Total
>> memory available to YARN shows 480GB, with 270GB currently used.)
>> Screenshot of the task page: http://i.imgur.com/xG3KdEl.png
>>
>> Code:
>> https://gist.github.com/momer/8bc03c60a639e5c04eda#file-spark-scala-L60
>> (see line 60 for the relevant area)
>>
>> Any pointers in the right direction, advice on articles to read, or even
>> debugging / settings recommendations would be extremely helpful. I'll put
>> a bounty on this of a $50 donation to the ASF! :D
>>
>> Thank you all for reading (and hopefully replying!),
>>
>> Mo Omer
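PS: to make sure I understand the "reasonable partition by function" suggestion, here's the replicate-and-salt join idea written out with plain Scala collections (no Spark; all names and the salt count are hypothetical). The skewed left side gets a random salt per record, the small right side is replicated once per salt value, and the join runs on the salted keys so no single hot key lands in one partition:

```scala
// Plain-collections sketch of a salted left outer join. On a real cluster
// the two Seqs would be RDDs and the groupBy/flatMap would be the join;
// this only illustrates the key-spreading idea. Assumes keys contain no '#'.
object SaltedJoinSketch {
  def saltedLeftOuterJoin[V, W](
      left: Seq[(String, V)],
      right: Seq[(String, W)],
      salts: Int): Seq[(String, (V, Option[W]))] = {
    val rnd = new scala.util.Random(42)
    // Left (skewed) side: append a random salt to every key.
    val saltedLeft = left.map { case (k, v) =>
      (s"${k}#${rnd.nextInt(salts)}", v)
    }
    // Right side: replicate each record once per possible salt, so every
    // salted left key can still find its match.
    val saltedRight = right.flatMap { case (k, w) =>
      (0 until salts).map(s => (s"${k}#${s}", w))
    }
    val rightMap = saltedRight.groupBy(_._1)
    // Join on the salted key, then strip the salt back off.
    saltedLeft.flatMap { case (sk, v) =>
      val originalKey = sk.takeWhile(_ != '#')
      rightMap.get(sk) match {
        case Some(ws) => ws.map { case (_, w) => (originalKey, (v, Some(w))) }
        case None     => Seq((originalKey, (v, None)))
      }
    }
  }
}
```

Is this roughly the partitioning trick being suggested, just expressed via the keys instead of a Partitioner subclass?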