>>> ------ Original Message --
>>> *From:* "Corey Nolet";<cjno...@gmail.com>;
>>> *Sent:* Sunday, February 7, 2016, 8:56 PM
>>> *To:* "Igor Berman"<igor.ber...@gmail.com>;
>>> *Cc:* "user"<user@spark.apache.org>
u use sc.text() ? If memory is not
>>>> enough, spark will spill 3-4x of input data to disk.
>>>>
>>>>
>>>> -- Original Message --
>>>> *From:* "Corey Nolet";<cjno...@gmail.com>;
>>>> *Sent:* Sunday, February 7, 2016
So can you provide code snippets? It's especially interesting to see what
your transformation chain is and how many partitions there are on each side
of the shuffle operation.
The question is why it can't fit stuff in memory when you are shuffling -
maybe your partitioner on "reduce" side is not
As for the second part of your questions- we have a fairly complex join
process which requires a ton of stage orchestration from our driver. I've
written some code to be able to walk down our DAG tree and execute siblings
in the tree concurrently where possible (forcing cache to disk on children
Igor,
I don't think the question is "why can't it fit stuff in memory". I know
why it can't fit stuff in memory- because it's a large dataset that needs
to have a reduceByKey() run on it. My understanding is that when it doesn't
fit into memory it needs to spill in order to consolidate
From: "Corey Nolet";<cjno...@gmail.com>;
Sent: Sunday, February 7, 2016, 8:56 PM
To: "Igor Berman"<igor.ber...@gmail.com>;
Cc: "user"<user@spark.apache.org>;
Subject: Re: Shuffle memory woes
As for the second part of your questions- we have a fairly complex join pro
---
>> *From:* "Corey Nolet";<cjno...@gmail.com>;
>> *Sent:* Sunday, February 7, 2016, 8:56 PM
>> *To:* "Igor Berman"<igor.ber...@gmail.com>;
>> *Cc:* "user"<user@spark.apache.org>;
>> *Subject:* Re: Shuffle memory woes
>>
>> As
ough,
> spark will spill 3-4x of input data to disk.
>
>
> -- Original Message --
> *From:* "Corey Nolet";<cjno...@gmail.com>;
> *Sent:* Sunday, February 7, 2016, 8:56 PM
> *To:* "Igor Berman"<igor.ber...@gmail.com>;
> *Cc:* "user"
Igor,
Thank you for the response but unfortunately, the problem I'm referring to
goes beyond this. I have set the shuffle memory fraction to be 90% and set
the cache memory to be 0. Repartitioning the RDD helped a tad on the map
side but didn't do much for the spilling when there was no longer
Hi,
usually you can solve this in 2 steps:
1. make the RDD have more partitions
2. play with the shuffle memory fraction
In Spark 1.6, cache vs. shuffle memory fractions are adjusted automatically.
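
To make the interplay between those two knobs concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark needed). It is only an illustration, not how Spark actually accounts for memory: the function names, the "one equal share per concurrent task" model, and the example sizes are all assumptions. The `shuffle_fraction` argument stands in for a setting such as `spark.shuffle.memoryFraction` in pre-1.6 releases.

```python
import math


def per_task_shuffle_budget(heap_bytes, shuffle_fraction, concurrent_tasks):
    """Assumed model: the shuffle share of the heap is split evenly across tasks."""
    return heap_bytes * shuffle_fraction / concurrent_tasks


def suggest_partitions(shuffle_bytes, heap_bytes, shuffle_fraction, concurrent_tasks):
    """Smallest partition count for which an average partition fits the per-task budget.

    Hypothetical helper: it ignores skew, serialization overhead and object
    headers, so treat the result as a lower bound to experiment from.
    """
    budget = per_task_shuffle_budget(heap_bytes, shuffle_fraction, concurrent_tasks)
    return max(1, math.ceil(shuffle_bytes / budget))


GIB = 2 ** 30

# Example numbers (made up): 100 GiB shuffled, 8 GiB executor heap, 4 tasks per executor.
low_fraction = suggest_partitions(100 * GIB, 8 * GIB, 0.2, 4)
high_fraction = suggest_partitions(100 * GIB, 8 * GIB, 0.9, 4)
print(low_fraction, high_fraction)
```

Under this model a larger shuffle fraction lowers the number of partitions needed, and more partitions compensate for a smaller fraction. The resulting count would then be passed to something like `repartition(n)` or the `numPartitions` argument of `reduceByKey`.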
On 5 February 2016 at 23:07, Corey Nolet wrote:
> I just recently had a discovery that my
I just recently had a discovery that my jobs were taking several hours to
complete because of excess shuffle spills. What I found was that when I
hit the high point where I didn't have enough memory for the shuffles to
store all of their file consolidations at once, it could spill so many
times