Re: Spilled Records

2012-02-28 Thread Jie Li
Hello Dan, The fact that the spilled records are double the output records means the map task produces more than one spill file…

RE: Spilled Records

2012-02-28 Thread Daniel Baptista
…t would be much good. Thanks again, Dan.

Re: Spilled Records

2012-02-28 Thread Jie Li
Hello Dan, The fact that the spilled records are double the output records means the map task produces more than one spill file, and these spill files are read, merged, and written to a single file, so each record is spilled twice. I can't infer anything from the numbers of the two…
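
A minimal sketch of the counter arithmetic Jie describes, assuming a single merge pass over the initial spill files (the helper below is hypothetical, not part of the Hadoop API; the sample numbers come from Daniel's counters further down):

    public class SpillMath {
        // Hypothetical helper illustrating Jie's point, assuming a
        // single merge pass over the initial spill files.
        static long expectedSpilledRecords(long mapOutputRecords, int spillFiles) {
            if (spillFiles <= 1) {
                // Everything fit in one spill: each record hits disk once.
                return mapOutputRecords;
            }
            // Each record is written once in an initial spill, then read
            // back and written again when the spills are merged into the
            // single final map output file.
            return 2 * mapOutputRecords;
        }

        public static void main(String[] args) {
            // With more than one spill file, the counter doubles:
            System.out.println(expectedSpilledRecords(2_221_478L, 2)); // 4442956
        }
    }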

Spilled Records

2012-02-28 Thread Daniel Baptista
…67,108,864
FILE_BYTES_WRITTEN: 429,278,388
Map-Reduce Framework:
  Combine output records: 0
  Map input records: 2,221,478
  Spilled Records: 4,442,956
  Map output bytes: 210,196,148
  Combine input records: 0
  Map output records: 2,221,478
And another task in the same job (16 of 16) that took 7 minutes and 19 seconds…
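
Worth noting in these counters: 4,442,956 = 2 × 2,221,478, i.e. Spilled Records is exactly twice Map output records, which is the pattern Jie Li explains above (each record written once to an initial spill file and once more during the merge).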

Re: Spilled Records

2011-02-22 Thread maha
Thank you Saurabh, but the following settings didn't change the # of spilled records:…

RE: Spilled Records

2011-02-22 Thread Saurabh Dutta
Thank you Saurabh, but the following settings didn't change the # of spilled records: conf.set("mapred.job.shuffle.merge.percent"…

Re: Spilled Records

2011-02-21 Thread maha
Thank you Saurabh, but the following settings didn't change the # of spilled records: conf.set("mapred.job.shuffle.merge.percent", ".9"); // instead of .66; conf.set("mapred.inmem.merge.threshold", "1000"); // instead of 1000. Is it because of my me…
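
For reference, the two properties tried above tune the reduce-side shuffle merge; the map-side sort buffer that produces the map task's Spilled Records count is tuned separately. A minimal sketch using the old (pre-0.21) property names that appear in this thread; the values are illustrative only:

    import org.apache.hadoop.mapred.JobConf;

    public class SpillKnobs {
        public static void main(String[] args) {
            JobConf conf = new JobConf();

            // Map-side sort buffer: these govern when a map task spills.
            conf.set("io.sort.mb", "200");             // buffer size in MB (default 100)
            conf.set("io.sort.spill.percent", "0.80"); // fill fraction that triggers a spill

            // Reduce-side shuffle merge (the settings tried above): these
            // affect spilling during the shuffle, not the map-side spills.
            conf.set("mapred.job.shuffle.merge.percent", "0.66");
            conf.set("mapred.inmem.merge.threshold", "1000");
        }
    }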

RE: Spilled Records

2011-02-21 Thread Saurabh Dutta
Hi Maha, Spilled records have to do with the transient data written during the map and reduce operations. Note that it's not just the map operations that generate spilled records. When the in-memory buffer (controlled by mapred.job.shuffle.merge.percent) runs out or reaches the threshold n…

Spilled Records

2011-02-21 Thread maha
Hello everyone, Do spilled records mean that the sort-buffer size for sorting is not enough to sort all the input records, so some records are written to local disk? If so, I tried setting my io.sort.mb from the default 100 to 200 and there was still the same # of spilled records. Why…
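
(The 2009 thread below explains the floor here: the sort buffer's contents are always flushed to disk once when the map finishes, so Spilled Records can never drop below Map output records; a larger io.sort.mb only helps when records are being spilled more than once.)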

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Mu Qiao
Thanks. It's clear now. :)

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Jothi Padmanabhan
It is true that the map writes its output to a memory buffer. But when the map process is complete, the contents of this buffer are sorted and spilled to the disk so that the TaskTracker running on that node can serve these map outputs to the requesting reducers.

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Mu Qiao
Thanks. But when I refer to "Hadoop: The Definitive Guide", chapter 6, I find that the map writes its outputs to a memory buffer (not to local disk) whose size is controlled by io.sort.mb. Only when the buffer reaches its threshold does it spill the outputs to local disk. If that is true, I can't see any…

Re: Why is Spilled Records always equal to Map output records

2009-07-14 Thread Owen O'Malley
There is no requirement that all of the reduces are running while the map is running. The dataflow is that the map writes its output to local disk and the reduces pull the map outputs when they need them. There are threads handling the sorting and spilling of records to disk, but that doesn't rem…
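
A conceptual sketch of the decoupling Owen describes, using plain file I/O rather than Hadoop's actual shuffle code: the map side finishes and leaves its output on local disk before any reducer has to be running; a reducer started later simply pulls it when ready.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    public class PullModelSketch {
        public static void main(String[] args) throws IOException {
            Path mapOutput = Files.createTempFile("map-output-", ".txt");

            // "Map side": write sorted output to local disk and exit.
            // No reducer needs to be alive at this point.
            Files.write(mapOutput, List.of("apple\t1", "banana\t2", "cherry\t3"));

            // "Reduce side": started later, it pulls the map output
            // only when it is ready to consume it.
            for (String record : Files.readAllLines(mapOutput)) {
                System.out.println("reduce sees: " + record);
            }
            Files.delete(mapOutput);
        }
    }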

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Mu Qiao
…restart the reduce jobs that use those spilled records in case of a reduce task failure. Dali On Mon, Jul 13, 2009 at 6:32 PM, Mu Qiao wrote: Thank you. But why do map outputs need to be written to disk at least once? I think my io.sort.mb is large eno…

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Dali Kilani
If I am not mistaken (I am new to this stuff), that's because you need to have a checkpoint from which you can restart the reduce jobs that use those spilled records in case of a reduce task failure. Dali

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Mu Qiao
…I notice it from the web console after I've tried to run several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records. This is good. The map outputs need…

Re: Why is Spilled Records always equal to Map output records

2009-07-13 Thread Owen O'Malley
On Jul 12, 2009, at 3:55 AM, Mu Qiao wrote: I notice it from the web console after I've tried to run several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records. This is good. The map outputs need to be written to disk at least once…

Why is Spilled Records always equal to Map output records

2009-07-12 Thread Mu Qiao
Hi everyone, I'm a beginner with Hadoop. I notice it from the web console after I've tried to run several jobs. Every one of the jobs has the number of Spilled Records equal to Map output records, even if there are only 5 map output records. In the reduce phase, there are also spill…