If I am not mistaken (I am new to this stuff), that's because the map outputs on disk act as a checkpoint: if a reduce task fails, it can be restarted and read those spilled records again instead of the maps having to be re-run.
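As a rough illustration of the rule of thumb Owen gives in the quoted message below (Spilled Records equal to Map output records means a single write to disk; 2x or more means extra passes), here is a minimal Python sketch. The function names are hypothetical and the counter values are typed in by hand from the web console; nothing here calls a Hadoop API.

```python
def spill_ratio(map_output_records: int, spilled_records: int) -> float:
    """Spilled records per map output record (hypothetical helper)."""
    if map_output_records <= 0:
        raise ValueError("map_output_records must be positive")
    return spilled_records / map_output_records


def describe(map_output_records: int, spilled_records: int) -> str:
    """Interpret the ratio using the rule of thumb from the thread."""
    ratio = spill_ratio(map_output_records, spilled_records)
    if ratio <= 1.0:
        # Each record was written to disk once: the output fit in memory.
        return "single spill: map output fit in the sort buffer"
    # More than one write per record suggests additional passes on disk.
    return f"about {ratio:.1f}x: extra passes on disk"


if __name__ == "__main__":
    print(describe(5, 5))                  # the 5-record job from the thread
    print(describe(1_000_000, 2_000_000))  # made-up numbers for a 2x case
```

If the second case shows up on a real job, raising io.sort.mb is the usual first thing to try, assuming the task heap has room for it.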
Dali

On Mon, Jul 13, 2009 at 6:32 PM, Mu Qiao <qiao...@gmail.com> wrote:

> Thank you. But why do map outputs need to be written to disk at least once? I
> think my io.sort.mb is large enough to do in-memory operations. Could you
> provide me some information about it?
>
> On Tue, Jul 14, 2009 at 1:27 AM, Owen O'Malley <omal...@apache.org> wrote:
>
> > On Jul 12, 2009, at 3:55 AM, Mu Qiao wrote:
> >
> >> I notice it from the web console after I've tried to run several jobs.
> >> Every one of the jobs has the number of Spilled Records equal to Map
> >> output records, even if there are only 5 map output records
> >
> > This is good. The map outputs need to be written to disk at least once. So
> > if they are equal, things are fitting in memory. If multiple passes are
> > needed, you'll see 2x or more spilled records.
> >
> >> In the reduce phase, there are also spilled records which is equal to
> >> reduce input records.
> >
> > This is reasonable, although 0.19 and 0.20 don't need to spill the records
> > in the reduce at all, if you make the buffer big enough.
> >
> > -- Owen
>
> --
> Best wishes,
> Qiao Mu

--
Dali Kilani
===========
Phone : (650) 492-5921 (Google Voice)
E-Fax : (775) 552-2982