Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-11 Thread Prashant Ullegaddi
Hi all, thanks for the replies. I'm using hadoop-0.18.3. We are actually indexing the ClueWeb'09 dataset in a way much similar to Nutch. Reduce creates a document (similar to a Lucene Document, but implementing Writable) by adding the fields generated by Map. This document is put into the output collector
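A minimal sketch of what such a Writable document might look like (the class name and field layout are assumptions for illustration, not the poster's actual code), against the Hadoop 0.18 io API:

// Hypothetical sketch: a minimal Writable "document" holding field
// name/value pairs, roughly analogous to a Lucene Document, usable as a
// reduce output value.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class IndexDocument implements Writable {
  private final Map<String, String> fields = new LinkedHashMap<String, String>();

  public void addField(String name, String value) {
    fields.put(name, value);
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(fields.size());
    for (Map.Entry<String, String> e : fields.entrySet()) {
      Text.writeString(out, e.getKey());
      Text.writeString(out, e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    fields.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      String name = Text.readString(in);
      String value = Text.readString(in);
      fields.put(name, value);
    }
  }
}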

RE: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Amogh Vasekar
Reduce tasks which require more than twenty minutes are not a problem. But you must emit some data periodically to inform the rest of the system that each reducer is still alive. Emitting a (k, v) output pair to the collector will reset the timer. Similarly, calling

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Peter Skomoroch
I've seen this behavior before with reduces going over 100% on big jobs. What version of Hadoop are you using? I think there are some old bugs filed for this if you search the Jira. On Thu, Jul 9, 2009 at 5:31 PM, Aaron Kimball wrote: > Reduce tasks which require more than twenty minutes are not a problem

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Peter Skomoroch
Found it: HADOOP-5210 "Reduce Task Progress shows > 100% when the total size of map outputs (for a single reducer) is high" https://issues.apache.org/jira/browse/HADOOP-5210 On Thu, Jul 9, 2009 at 5:42 PM, Peter Skomoroch wrote: > I've seen this behavior before with reduces going over 100% on big jobs

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Aaron Kimball
Reduce tasks which require more than twenty minutes are not a problem. But you must emit some data periodically to inform the rest of the system that each reducer is still alive. Emitting a (k, v) output pair to the collector will reset the timer. Similarly, calling Reporter.incrCounter() will also reset it
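To make the keep-alive advice concrete, here is a rough sketch using the old mapred API (Hadoop 0.18); the class, types, and counter names are illustrative, not from the thread:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LongRunningReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Hypothetical counter, just to show incrCounter() as a liveness signal.
  private enum Counters { RECORDS_PROCESSED }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    long processed = 0;
    while (values.hasNext()) {
      Text value = values.next();
      // ... expensive per-record work goes here ...

      // Emitting a (k, v) pair resets the task timeout.
      output.collect(key, value);

      // If a record produces no output, bumping a counter (or calling
      // reporter.progress()) also tells the framework the task is alive.
      reporter.incrCounter(Counters.RECORDS_PROCESSED, 1);
      if (++processed % 1000 == 0) {
        reporter.progress();
      }
    }
  }
}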

Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Prashant Ullegaddi
Hi Jothi, We are trying to index around 245GB of compressed data (~1TB uncompressed) on a 9-node Hadoop cluster with 8 slaves and 1 master. In Map, we are just parsing the files and passing them to Reduce. In Reduce, we are indexing the parsed data, much in the Nutch style. When we ran the job, map
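A rough sketch of the pass-through map described above (the identifiers and the tab-separated record format are assumptions for illustration, not the poster's actual job), again with the old mapred API:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text record,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Hypothetical parse step: split a record into a document id and its content.
    String line = record.toString();
    int tab = line.indexOf('\t');
    if (tab < 0) {
      return; // skip malformed records
    }
    String docId = line.substring(0, tab);
    String content = line.substring(tab + 1);

    // Pass the parsed record straight through; all indexing happens in Reduce.
    output.collect(new Text(docId), new Text(content));
  }
}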