RE: Spilled Records

Saurabh Dutta Tue, 22 Feb 2011 15:26:36 -0800

Even with 4 GB of RAM you should be able to reduce spills; I don't think that 
should be an issue. What you need to do is write the program efficiently and 
configure the parameters correctly. There are no perfect values for these; the 
right values depend on the kind of tasks you're running.


What you need to do is:

1. Write your map and reduce functions to use as little memory as possible. 
They should not use an unbounded amount of memory. For example, you can do 
this by not accumulating values in a map.

2. Write a combiner function and specify the minimum number of spill files 
needed for the combiner to run:
        min.num.spills.for.combine (default 3)

3. Tune the variables appropriately. Buffering is used to minimize disk writes:

        – io.sort.mb: size of the map-side buffer used to store and merge map 
output before spilling to disk.

        – fs.inmemory.size.mb: size of the reduce-side buffer used to store and 
merge the outputs of multiple maps before spilling to disk.
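
To illustrate points 1 and 2, here is a minimal plain-Java sketch of what a 
combiner does to a sorted run of map output. It is not Hadoop's API: the class 
CombinerSketch and the method combine are hypothetical names, and the 
word-count style (key, count) records are an assumption. It keeps only one 
running sum in memory, so its state does not grow with the input.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CombinerSketch {

    /**
     * Sums the counts of adjacent records that share a key.
     * The input must already be sorted by key, as a spill is.
     */
    public static List<Map.Entry<String, Integer>> combine(
            List<Map.Entry<String, Integer>> sortedRecords) {
        List<Map.Entry<String, Integer>> combined = new ArrayList<>();
        String currentKey = null;
        int runningSum = 0;
        for (Map.Entry<String, Integer> record : sortedRecords) {
            // Key changed: emit the finished group and start a new one.
            if (currentKey != null && !currentKey.equals(record.getKey())) {
                combined.add(new SimpleEntry<>(currentKey, runningSum));
                runningSum = 0;
            }
            currentKey = record.getKey();
            runningSum += record.getValue();
        }
        // Emit the last group, if there was any input at all.
        if (currentKey != null) {
            combined.add(new SimpleEntry<>(currentKey, runningSum));
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> spill = new ArrayList<>();
        spill.add(new SimpleEntry<>("apple", 1));
        spill.add(new SimpleEntry<>("apple", 1));
        spill.add(new SimpleEntry<>("pear", 1));
        spill.add(new SimpleEntry<>("pear", 1));
        spill.add(new SimpleEntry<>("pear", 1));
        // Five records go in, two come out.
        System.out.println(combine(spill));
    }
}
```

In this sketch five records shrink to two before anything would be written, 
which is the effect a combiner has on the data that is spilled and merged.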

Rules of Thumb for Tuning

        – Set these buffers to ~70% of the Java heap size. Pick heap sizes so 
that all processes together (maps, reducers, TaskTracker, DataNode, other) use 
~80% of RAM.

        – Set it small enough to avoid swap activity, but

        – Set it large enough to minimize disk spills.

        – Ensure that io.sort.factor is set large enough to allow full use of 
buffer space.

        – Balance space for output records (default 95%) & record meta-data (5%)

                • Use io.sort.spill.percent and io.sort.record.percent
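
As a rough worked example of the 80% / 70% rules above, the sketch below does 
the arithmetic for a 4 GB machine. The class and method names are 
hypothetical, and two things are assumptions made only to keep the numbers 
simple: that six processes run on the node (2 maps, 2 reducers, TaskTracker, 
DataNode) and that they all get the same heap. Treat the result as a starting 
point to measure against, not a recommended setting.

```java
public class SpillTuningSketch {

    /** Heap per process so that all processes together use ~80% of RAM. */
    public static int heapPerProcessMb(int totalRamMb, int processCount) {
        return totalRamMb * 80 / 100 / processCount;
    }

    /** Buffer size (e.g. io.sort.mb) at ~70% of one process's heap. */
    public static int bufferMb(int heapMb) {
        return heapMb * 70 / 100;
    }

    public static void main(String[] args) {
        int totalRamMb = 4096;   // the 4 GB machine from this thread
        int processCount = 6;    // assumption: 2 maps, 2 reducers, TT, DN

        int heapMb = heapPerProcessMb(totalRamMb, processCount);
        int sortMb = bufferMb(heapMb);

        System.out.println("heap per process: " + heapMb + " MB");
        System.out.println("buffer size:      " + sortMb + " MB");
    }
}
```

With these assumptions the heap comes out at 546 MB per process and the buffer 
at 382 MB. Fewer task slots leave more heap, and so a larger buffer, for each 
task.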

There are some really good tips to tune your cluster in this presentation: 
http://www.slideshare.net/ydn/hadoop-summit-2010-tuning-hadoop-to-deliver-performance-to-your-application
________________________________________
From: maha [m...@umail.ucsb.edu]
Sent: Tuesday, February 22, 2011 12:19 PM
To: common-user@hadoop.apache.org
Subject: Re: Spilled Records

Thank you Saurabh, but the following settings didn't change the number of 
spilled records:

conf.set("mapred.job.shuffle.merge.percent", ".9");      // instead of .66
conf.set("mapred.inmem.merge.threshold", "10000000");    // instead of 1000

Is it because my memory is only 4 GB?

I'm using the pseudo distributed mode.

Thank you,
Maha

On Feb 21, 2011, at 7:46 PM, Saurabh Dutta wrote:

> Hi Maha,
>
> Spilled records come from the transient data produced during the map and 
> reduce operations. Note that it's not just the map side that generates 
> spilled records. On the reduce side, when the in-memory buffer fills to the 
> level set by mapred.job.shuffle.merge.percent or holds the threshold number 
> of map outputs (mapred.inmem.merge.threshold), its contents are merged and 
> spilled to disk.
>
> You are going in the right direction by tuning the io.sort.mb parameter; try 
> increasing it further. If that still doesn't help, try io.sort.factor and 
> fs.inmemory.size.mb. Also try the other two variables that I mentioned 
> earlier.
>
> Let us know what worked for you.
>
> Sincerely,
> Saurabh Dutta
> Impetus Infotech India Pvt. Ltd.,
> Sarda House, 24-B, Palasia, A.B.Road, Indore - 452 001
> Phone: +91-731-4269200 4623
> Fax: + 91-731-4071256
> Email: saurabh.du...@impetus.co.in
> www.impetus.com
> ________________________________________
> From: maha [m...@umail.ucsb.edu]
> Sent: Tuesday, February 22, 2011 8:21 AM
> To: common-user
> Subject: Spilled Records
>
> Hello everyone,
>
> Do spilled records mean that the sort buffer is not large enough to sort all 
> the input records, so some records are written to local disk?
>
> If so, I tried raising io.sort.mb from the default 100 to 200 and there was 
> still the same number of spilled records. Why?
>
> Could changing io.sort.record.percent to .9 instead of .8 produce 
> unexpected exceptions?
>
>
> Thank you,
> Maha
>
> ________________________________
>
> Impetus to Present Big Data -- Analytics Solutions and Strategies at TDWI 
> World Conference (Feb 13-18) in Las Vegas. We are also bringing cloud experts 
> together at CloudCamp, Delhi on Feb 12. CloudCamp is an unconference where 
> early adopters of Cloud Computing technologies exchange ideas.
>
> Click http://www.impetus.com to know more.
>
>
> NOTE: This message may contain information that is confidential, proprietary, 
> privileged or otherwise protected by law. The message is intended solely for 
> the named addressee. If received in error, please destroy and notify the 
> sender. Any use of this email is prohibited when received in error. Impetus 
> does not represent, warrant and/or guarantee, that the integrity of this 
> communication has been maintained nor that the communication is free of 
> errors, virus, interception or interference.


