Re: Shuffle memory woes

2016-02-08 Thread Igor Berman
t;>> ------ 原始邮件 -- >>> *发件人:* "Corey Nolet";<cjno...@gmail.com>; >>> *发送时间:* 2016年2月7日(星期天) 晚上8:56 >>> *收件人:* "Igor Berman"<igor.ber...@gmail.com>; >>> *抄送:* "user"<user@spark.apache.org&

Re: Shuffle memory woes

2016-02-08 Thread Corey Nolet
u use sc.text()? If memory is not >>>> enough, Spark will spill 3-4x of input data to disk. >>>> >>>> >>>> ------ Original Message ------ >>>> *From:* "Corey Nolet" <cjno...@gmail.com> >>>> *Sent:* Sunday, February 7, 2016
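
For context, a minimal sketch of the kind of job under discussion - reading text input with the RDD API (presumably sc.textFile) and running a shuffle that spills to disk once its buffers fill. Paths, field delimiters, and partition counts below are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical job: read text input, then aggregate by key (a shuffle).
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-spill-sketch"))
    val lines = sc.textFile("hdfs:///data/input", 2000)      // minPartitions = 2000
    val counts = lines
      .map(line => (line.split('\t')(0), 1L))                // key on the first tab-separated field
      .reduceByKey(_ + _)                                    // shuffle; full buffers spill to disk
    counts.saveAsTextFile("hdfs:///data/output")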

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
So can you provide code snippets? In particular, it would be interesting to see what your transformation chain is and how many partitions there are on each side of the shuffle operation. The question is why it can't fit stuff in memory when you are shuffling - maybe your partitioner on the "reduce" side is not
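
For reference, a rough spark-shell sketch (names and numbers are hypothetical) of how to report what Igor asks for - the transformation chain and the partition count on each side of the shuffle:

    // `sc` is the SparkContext provided by the spark-shell.
    val input = sc.textFile("hdfs:///data/events")
    println(s"map-side partitions: ${input.partitions.length}")

    val pairs   = input.map(line => (line.take(8), 1L))
    val reduced = pairs.reduceByKey(_ + _, 4096)             // explicit reduce-side partition count
    println(s"reduce-side partitions: ${reduced.partitions.length}")

    // toDebugString prints the lineage, which shows where the shuffle boundaries fall.
    println(reduced.toDebugString)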

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
As for the second part of your questions - we have a fairly complex join process which requires a ton of stage orchestration from our driver. I've written some code to be able to walk down our DAG tree and execute siblings in the tree concurrently where possible (forcing cache to disk on children
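
A very rough sketch of that idea (all names are hypothetical, not Corey's actual code): materialize sibling branches of the DAG from separate threads, persisting each to disk so the later join reads cached blocks instead of recomputing the branches:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // Spark's scheduler accepts jobs from multiple threads, so sibling branches
    // can be forced (count) concurrently while their results are cached on disk.
    def materializeSiblings(siblings: Seq[RDD[(String, String)]]): Unit = {
      val jobs = siblings.map { rdd =>
        Future {
          rdd.persist(StorageLevel.DISK_ONLY)   // "forcing cache to disk on children"
          rdd.count()                           // action that triggers the job
        }
      }
      jobs.foreach(job => Await.result(job, Duration.Inf))
    }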

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
Igor, I don't think the question is "why can't it fit stuff in memory". I know why it can't fit stuff in memory - because it's a large dataset that needs to have a reduceByKey() run on it. My understanding is that when it doesn't fit into memory it needs to spill in order to consolidate
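
A back-of-the-envelope way to see why the reduce side spills (every number below is hypothetical, not from Corey's job): each reduce task must consolidate its share of the shuffled data, so its working set is roughly the shuffled size divided by the number of reduce partitions:

    val shuffledGb       = 2000.0                            // hypothetical shuffled data size
    val reducePartitions = 500.0
    val perTaskGb        = shuffledGb / reducePartitions     // ~4 GB working set per reduce task
    val executorHeapGb   = 8.0
    val shuffleFraction  = 0.9                               // heap share available to the shuffle
    val concurrentTasks  = 4.0                               // cores per executor
    val perTaskBudgetGb  = executorHeapGb * shuffleFraction / concurrentTasks   // ~1.8 GB
    // 4 GB > 1.8 GB, so every task spills repeatedly while it consolidates;
    // raising reducePartitions shrinks the per-task working set until it fits.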

Re: Shuffle memory woes

2016-02-07 Thread Sea
"Corey Nolet" <cjno...@gmail.com>; Sent: Sunday, February 7, 2016, 8:56 PM; To: "Igor Berman" <igor.ber...@gmail.com>; Cc: "user" <user@spark.apache.org>; Subject: Re: Shuffle memory woes -- As for the second part of your questions - we have a fairly complex join pro

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
--- >> *From:* "Corey Nolet" <cjno...@gmail.com> >> *Sent:* Sunday, February 7, 2016, 8:56 PM >> *To:* "Igor Berman" <igor.ber...@gmail.com> >> *Cc:* "user" <user@spark.apache.org> >> *Subject:* Re: Shuffle memory woes >> >> As

Re: Shuffle memory woes

2016-02-07 Thread Charles Chao
ough, > Spark will spill 3-4x of input data to disk. > > > ------ Original Message ------ > *From:* "Corey Nolet" <cjno...@gmail.com> > *Sent:* Sunday, February 7, 2016, 8:56 PM > *To:* "Igor Berman" <igor.ber...@gmail.com> > *Cc:* "user"

Re: Shuffle memory woes

2016-02-06 Thread Corey Nolet
Igor, Thank you for the response but unfortunately, the problem I'm referring to goes beyond this. I have set the shuffle memory fraction to be 90% and set the cache memory to be 0. Repartitioning the RDD helped a tad on the map side but didn't do much for the spilling when there was no longer
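
The settings described above roughly correspond to something like the following sketch (assuming the pre-1.6 "legacy" memory keys; partition count and paths are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.shuffle.memoryFraction", "0.9")   // shuffle memory fraction at 90%
      .set("spark.storage.memoryFraction", "0.0")   // cache memory at 0
    val sc = new SparkContext(conf)

    // Repartitioning gives each task a smaller slice to hold in memory at once.
    val repartitioned = sc.textFile("hdfs:///data/input").repartition(4000)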

Re: Shuffle memory woes

2016-02-06 Thread Igor Berman
Hi, usually you can solve this in 2 steps: make the rdd have more partitions, and play with the shuffle memory fraction. In Spark 1.6 the cache vs shuffle memory fractions are adjusted automatically. On 5 February 2016 at 23:07, Corey Nolet wrote: > I just recently had a discovery that my
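
A small sketch of those two steps (partition counts and values are hypothetical; on 1.6 the unified memory manager makes the old fractions largely unnecessary):

    // Step 1: give the rdd more partitions so each task's working set is smaller.
    val wider = sc.textFile("hdfs:///data/input").repartition(8000)

    // Step 2 (pre-1.6): shift heap from caching toward the shuffle, e.g.
    //   --conf spark.shuffle.memoryFraction=0.6
    //   --conf spark.storage.memoryFraction=0.2
    // On Spark 1.6+, execution and storage share one pool (spark.memory.fraction)
    // and the cache vs shuffle split adjusts automatically.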

Shuffle memory woes

2016-02-05 Thread Corey Nolet
I just recently had a discovery that my jobs were taking several hours to complete because of excess shuffle spills. What I found was that when I hit the high point where I didn't have enough memory for the shuffles to store all of their file consolidations at once, it could spill so many times