Re: Spilled Records

2012-02-28 Thread Jie Li
Hello Dan,

The fact that the spilled records are exactly twice the map output records
means the map task produced more than one spill file; those spill files were
then read, merged, and written back to a single file, so each record was
spilled twice.
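
That arithmetic can be sketched with a back-of-the-envelope model (not Hadoop code; the class and method names are illustrative, and it assumes the defaults io.sort.mb = 100 and io.sort.spill.percent = 0.80 mentioned elsewhere in this thread):

```java
// Back-of-the-envelope model, not Hadoop code: names here are illustrative.
// Assumes defaults io.sort.mb = 100 and io.sort.spill.percent = 0.80.
public class SpillEstimate {

    // Estimated number of spill files a map task writes: output bytes divided
    // by the usable part of the sort buffer, rounded up.
    static long estimateSpillFiles(long mapOutputBytes, int ioSortMb, double spillPercent) {
        long usableBuffer = (long) (ioSortMb * 1024L * 1024L * spillPercent);
        return (mapOutputBytes + usableBuffer - 1) / usableBuffer; // ceiling division
    }

    // With more than one spill file, a single merge pass rewrites every record
    // once more, so the Spilled Records counter lands at exactly 2x the map
    // output records (further merge passes would push it higher).
    static long estimateSpilledRecords(long mapOutputRecords, long spillFiles) {
        return spillFiles > 1 ? 2 * mapOutputRecords : mapOutputRecords;
    }

    public static void main(String[] args) {
        // Dan's task 1 of 16: ~210 MB of map output against an 80 MB usable buffer.
        long spills = estimateSpillFiles(210_196_148L, 100, 0.80);
        System.out.println("estimated spill files: " + spills);            // 3
        System.out.println("estimated spilled records: "
                + estimateSpilledRecords(2_221_478L, spills));             // 4442956
    }
}
```

With Dan's counters this predicts three spill files and 4,442,956 spilled records, matching his report.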

I can't infer anything from the numbers of the two tasks. Could you provide
more info, such as what the application is doing?

If you like, you can also try our tool Starfish to see what's going on behind
the scenes.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish


On Tue, Feb 28, 2012 at 8:25 AM, Daniel Baptista 
daniel.bapti...@performgroup.com wrote:

 Hi All,

 I am trying to improve the performance of my hadoop cluster and would like
 to get some feedback on a couple of numbers that I am seeing.

 Below is the output from a single task (1 of 16) that took 3 minutes 40
 seconds:

 FileSystemCounters
 FILE_BYTES_READ 214,653,748
 HDFS_BYTES_READ 67,108,864
 FILE_BYTES_WRITTEN 429,278,388

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,221,478
 Spilled Records 4,442,956
 Map output bytes 210,196,148
 Combine input records 0
 Map output records 2,221,478

 And another task in the same job (16 of 16) that took 7 minutes and 19
 seconds

 FileSystemCounters
 FILE_BYTES_READ 199,003,192
 HDFS_BYTES_READ 58,434,476
 FILE_BYTES_WRITTEN 397,975,310

 Map-Reduce Framework
 Combine output records 0
 Map input records 2,086,789
 Spilled Records 4,173,578
 Map output bytes 194,813,958
 Combine input records 0
 Map output records 2,086,789

 Can anybody determine anything from these figures?

 The first task is twice as quick as the second, yet the input and output
 are comparable (certainly not double). In all of the tasks (in this and
 other jobs) the spilled records are always double the output records; surely
 this can't be 'normal'?
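
The byte counters tell the same story as the record counters, which can be cross-checked with a small sketch (CounterCheck is purely illustrative; it assumes map-side FILE_BYTES_WRITTEN is dominated by spill and merge writes):

```java
// Quick consistency check of the counters above (illustrative names; assumes
// map-side FILE_BYTES_WRITTEN is dominated by spill + merge writes).
public class CounterCheck {

    // How many times each map output byte was written to local disk.
    static double writeAmplification(long fileBytesWritten, long mapOutputBytes) {
        return (double) fileBytesWritten / mapOutputBytes;
    }

    public static void main(String[] args) {
        // Task 1 of 16: 429,278,388 / 210,196,148 ~= 2.04, i.e. every output
        // byte hit local disk about twice -- consistent with spill + one merge.
        System.out.printf("write amplification: %.2f%n",
                writeAmplification(429_278_388L, 210_196_148L));
        // Spilled records are exactly double the output records.
        System.out.println(4_442_956L == 2 * 2_221_478L);  // true
    }
}
```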

 Am I clutching at straws? (It feels like I am.)

 Thanks in advance, Dan.




RE: Spilled Records

2012-02-28 Thread Daniel Baptista
Hi Jie,

To be honest, I don't think I understand enough of what our job is doing to be
able to explain it.

Thanks for the response though, I had figured that I was grasping at straws.

I have looked at Starfish, however all our jobs are submitted via Apache Pig so
I don't know if it would be much good.

Thanks again, Dan. 





Re: Spilled Records

2012-02-28 Thread Jie Li
Hi Dan,

You might want to post your Pig script to the Pig user mailing list.
Previously I did some experiments on Pig and Hive, and I'd also be interested
in looking into your script.

Yeah, Starfish currently supports only Hadoop job-level tuning; supporting
workflows such as Pig and Hive is our top priority. We'll let you know once
we're ready.

Thanks,
Jie





RE: Spilled Records

2011-02-22 Thread Saurabh Dutta
Even if you have 4 GB RAM you should be able to optimize spills; I don't think 
it should be an issue. What you need to do is write the program efficiently and 
configure the parameters right. There are no perfect values for these; the 
right values depend on the kind of tasks you're performing.

What you need to do is:

1. Write your map and reduce functions to use as little memory as possible. 
They should not use an unbounded amount of memory; for example, avoid 
accumulating values in an in-memory map.

2. Write a combiner function and specify the minimum number of spill files 
needed for the combiner to run:
min.num.spills.for.combine (default 3)

3. Tune the variables the right way. We use buffering to minimize disk writes:

– io.sort.mb: size of the map-side buffer used to store and merge map output 
before spilling to disk. (Map-side buffer)

– fs.inmemory.size.mb: size of the reduce-side buffer used for storing and 
merging multi-map output before spilling to disk. (Reduce-side buffer)

Thumb rules for tuning:

– Set these to ~70% of the Java heap size. Pick heap sizes to utilize ~80% of 
RAM across all processes (maps, reducers, TT, DN, other).

– Set them small enough to avoid swap activity, but

– set them large enough to minimize disk spills.

– Ensure that io.sort.factor is set large enough to allow full use of the 
buffer space.

– Balance the space for output records (default 95%) and record metadata (5%).

• Use io.sort.spill.percent and io.sort.record.percent.

There are some really good tips to tune your cluster in this presentation: 
http://www.slideshare.net/ydn/hadoop-summit-2010-tuning-hadoop-to-deliver-performance-to-your-application
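
The thumb rules above can be put into numbers with a rough calculator (a sketch with hypothetical values: the ~16 bytes of accounting metadata per record is an assumption based on older MapTask internals, and SortBufferBudget is not a Hadoop API):

```java
// Rough calculator for the thumb rules above. A sketch with hypothetical
// numbers: the ~16 bytes of accounting metadata per record is an assumption
// based on older MapTask internals; SortBufferBudget is not a Hadoop API.
public class SortBufferBudget {

    // ~70% of the child JVM heap goes to io.sort.mb (integer math avoids
    // floating-point rounding surprises).
    static int ioSortMbFor(int heapMb) {
        return heapMb * 7 / 10;
    }

    // With io.sort.record.percent reserving a fraction of the buffer for
    // record metadata at ~16 bytes per record, this many records fit before
    // the metadata area alone forces a spill.
    static long maxRecordsBeforeSpill(int ioSortMb, double recordPercent) {
        long metaBytes = (long) (ioSortMb * 1024L * 1024L * recordPercent);
        return metaBytes / 16;
    }

    public static void main(String[] args) {
        int heapMb = 200;  // hypothetical child-task heap
        int ioSortMb = ioSortMbFor(heapMb);
        System.out.println("io.sort.mb = " + ioSortMb);                  // 140
        System.out.println("records before metadata fills = "
                + maxRecordsBeforeSpill(ioSortMb, 0.05));                // 458752
    }
}
```

With many small records the metadata share can fill long before the data share, which is why io.sort.record.percent sometimes matters more than raw io.sort.mb.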





Impetus to Present Big Data -- Analytics Solutions and Strategies at TDWI World 
Conference (Feb 13-18) in Las Vegas. We are also bringing cloud experts together 
at CloudCamp, Delhi on Feb 12. CloudCamp is an unconference where early 
adopters of Cloud Computing technologies exchange ideas.

Click http://www.impetus.com

Re: Spilled Records

2011-02-22 Thread maha
Thanks a bunch Saurabh!  I'd better start optimizing my code then :)

Maha


RE: Spilled Records

2011-02-21 Thread Saurabh Dutta
Hi Maha,

Spilled records have to do with the transient data written during the map and 
reduce operations. Note that it's not just the map operations that generate 
spilled records. When the in-memory buffer (controlled by 
mapred.job.shuffle.merge.percent) runs out, or the threshold number of map 
outputs (mapred.inmem.merge.threshold) is reached, the buffer is merged and 
spilled to disk.
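
A condensed sketch of that reduce-side trigger (illustrative only, not the real ReduceTask logic; the default values in the example are the ones quoted in this thread):

```java
// Condensed sketch of the reduce-side merge trigger described above;
// illustrative only, not the real ReduceTask code.
public class ShuffleMergeTrigger {

    // Merge and spill when the in-memory shuffle buffer passes the
    // mapred.job.shuffle.merge.percent threshold, or when the number of held
    // map outputs reaches mapred.inmem.merge.threshold.
    static boolean shouldMerge(long inMemBytes, long bufferBytes, int mapOutputsHeld,
                               double mergePercent, int inmemThreshold) {
        return inMemBytes >= (long) (bufferBytes * mergePercent)
                || mapOutputsHeld >= inmemThreshold;
    }

    public static void main(String[] args) {
        // With the defaults quoted in this thread (0.66 and 1000):
        System.out.println(shouldMerge(70, 100, 5, 0.66, 1000));    // true: 70% >= 66%
        System.out.println(shouldMerge(50, 100, 5, 0.66, 1000));    // false
        System.out.println(shouldMerge(10, 100, 1000, 0.66, 1000)); // true: output count hit
    }
}
```

Either condition alone is enough to trigger a merge, which is why raising just one of the two settings may not change the spill counts.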

You are going in the right direction by tuning the io.sort.mb parameter; try 
increasing it further. If it still doesn't work out, try io.sort.factor and 
fs.inmemory.size.mb. Also try the two variables mentioned above 
(mapred.job.shuffle.merge.percent and mapred.inmem.merge.threshold).

Let us know what worked for you.

Sincerely,
Saurabh Dutta
Impetus Infotech India Pvt. Ltd.,
Sarda House, 24-B, Palasia, A.B.Road, Indore - 452 001
Phone: +91-731-4269200 4623
Fax: + 91-731-4071256
Email: saurabh.du...@impetus.co.in
www.impetus.com

From: maha [m...@umail.ucsb.edu]
Sent: Tuesday, February 22, 2011 8:21 AM
To: common-user
Subject: Spilled Records

Hello every one,

 Do spilled records mean that the sort buffer is not big enough to sort all 
the input records, so some records are written to local disk?

 If so, I tried setting io.sort.mb from the default 100 to 200 and there was 
still the same # of spilled records. Why?

 Might changing io.sort.record.percent to .9 instead of .8 produce 
unexpected exceptions?


Thank you,
Maha





NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


Re: Spilled Records

2011-02-21 Thread maha
Thank you Saurabh, but the following settings didn't change the # of spilled 
records:

conf.set("mapred.job.shuffle.merge.percent", "0.9");  // instead of the default 0.66
conf.set("mapred.inmem.merge.threshold", "1000");     // instead of 1000

Is it because my memory is only 4 GB?

I'm using the pseudo-distributed mode.

Thank you,
Maha
