Right, I figured I'd need a custom partitioner from what I've read so far!

Documentation on this is super sparse; do you have any recommended links on 
solving data skew and/or creating custom partitioners in Spark 1.4?
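
For reference, here's the rough shape I'm imagining (a sketch only - `hotKeys` 
is a hypothetical set of known-oversized keys, and a real version would extend 
org.apache.spark.Partitioner and be passed directly to leftOuterJoin):

```scala
// Sketch of a skew-aware partitioner. In a real Spark job this class
// would `extend org.apache.spark.Partitioner`; it is written
// dependency-free here so the routing logic stands on its own.
class SkewAwarePartitioner(base: Int, hotKeys: Set[String]) {
  // Each hot key gets its own dedicated partition after the hash band.
  private val hotIndex: Map[String, Int] =
    hotKeys.toSeq.sorted.zipWithIndex
      .map { case (k, i) => k -> (base + i) }
      .toMap

  def numPartitions: Int = base + hotKeys.size

  def getPartition(key: Any): Int = key match {
    case null => 0
    case s: String if hotIndex.contains(s) => hotIndex(s)
    case k =>
      val mod = k.hashCode % base
      if (mod < 0) mod + base else mod // hashCode can be negative
  }
}
```

The idea being that ordinary keys hash into the first `base` partitions while 
each known hot key is isolated in its own partition, so one giant partition 
can't drag the whole stage down.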

I'd also love to hear whether this is an unusual problem for my type of setup - 
that is, whether the cluster should be able to handle this if it were 
configured differently.
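
One thing I've seen suggested elsewhere (a standard trick rather than anything 
from this thread, so treat the sketch below as my own assumption): since a 
partitioner must send every record with a given key to the same partition, a 
single enormous key can't be split that way - but salting the key can, at the 
cost of replicating the smaller side of the join:

```scala
import scala.util.Random

// Hypothetical salting helpers: tag each skewed-side key with a random
// salt in [0, n), and replicate each small-side key once per salt value
// so the salted keys still match up in the join. Strip the salt after.
object Salting {
  def saltLeft(key: String, n: Int, rng: Random): String =
    s"${rng.nextInt(n)}#$key"

  def explodeRight(key: String, n: Int): Seq[String] =
    (0 until n).map(i => s"$i#$key")

  def unsalt(salted: String): String =
    salted.substring(salted.indexOf('#') + 1)
}
```

This multiplies the small side by n, so it presumably only pays off when a 
handful of keys really do dominate the data.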

Thank you,

Mo

Sent from my iPhone

> On Jul 6, 2015, at 8:12 PM, ayan guha <guha.a...@gmail.com> wrote:
> 
> You can bump up the number of partitions via a parameter on the join 
> operator. However, you have a data skew problem, which you need to resolve 
> with a reasonable partitioning function.
> 
>> On 7 Jul 2015 08:57, "Mohammed Omer" <beancinemat...@gmail.com> wrote:
>> Afternoon all,
>> 
>> Really loving this project and the community behind it. Thank you all for 
>> your hard work. 
>> 
>> This past week, though, I've been having a hard time getting my first 
>> deployed job to run without failing at the same point every time: Right 
>> after a leftOuterJoin, most partitions (600 total) are small (1-100MB), 
>> while some others are large (3-6GB). The large ones consistently spill 
>> 20-60GB into memory, and eventually fail.
>> 
>> If I could only get the partitions to be smaller, right out of the 
>> leftOuterJoin, it seems like the job would run fine.
>> 
>> I've tried trawling through the logs, but it hasn't been very fruitful in 
>> finding out what, specifically, is the issue. 
>> 
>> Cluster setup:
>> 
* 6 worker nodes (16 cores, 104GB memory, 500GB storage)
>> 
>> * 1 master (same config as above)
>> 
>> Running Spark on YARN, with:
>> 
spark.storage.memoryFraction = 0.3
>> 
--num-executors = 6
>> 
>> --executor-cores = 12
>> 
--executor-memory = kind of confusing due to YARN, but the Spark UI's 
Executors page shows each executor running with 18.8GB of memory, though I 
know actual usage is much larger because YARN manages various pieces. (Total 
memory available to YARN shows 480GB, with 270GB currently used.)
>> 
>> Screenshot of the task page: http://i.imgur.com/xG3KdEl.png
>> 
>> Code: 
>> https://gist.github.com/momer/8bc03c60a639e5c04eda#file-spark-scala-L60 (see 
>> line 60 for the relevant area)
>> 
Any pointers in the right direction - advice on articles to read, debugging 
tips, or settings recommendations - would be extremely helpful. I'll put a 
bounty on this: a $50 donation to the ASF! :D
>> 
>> Thank you all for reading (and hopefully replying!),
>> 
>> Mo Omer
