Reduce tasks that take more than twenty minutes are not a problem in
themselves, but you must report progress periodically to let the rest of the
system know that each reducer is still alive. Emitting a (k, v) output pair
to the collector resets the timer, and calling Reporter.incrCounter() resets
the clock as well. So if you're doing a large amount of processing in a loop
before you emit your final key/value pairs, you should periodically
increment a counter so the framework can confirm that you're not deadlocked.
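For example, something along these lines (just a rough sketch against the
old org.apache.hadoop.mapred API -- the counter group/name and the batch
size of 10000 are placeholders, not anything your job has to use):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer doing long per-key work, incrementing a counter
// periodically so the framework sees progress and doesn't kill the task.
public class IndexingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    long processed = 0;
    while (values.hasNext()) {
      Text doc = values.next();
      // ... expensive indexing work on `doc` goes here ...
      processed++;
      if (processed % 10000 == 0) {
        // Incrementing any counter (or calling reporter.progress())
        // resets the task-timeout clock, just like emitting output does.
        reporter.incrCounter("indexing", "documents-processed", 10000);
      }
    }
    // Emitting to the collector also counts as progress.
    output.collect(key, new Text("indexed " + processed + " documents"));
  }
}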

I'm not sure why your progress went so high. I know that Hadoop has some
quirks related to compression: if your input is compressed, the reported
percentages can be inaccurate, because the completed/available_input ratio
is partly computed from compressed sizes.
- Aaron

On Thu, Jul 9, 2009 at 12:24 PM, Prashant Ullegaddi <prashant.ullega...@research.iiit.ac.in> wrote:

> Hi Jothi,
>
> We are trying to index around 245GB of compressed data (~1TB uncompressed)
> on a 9-node Hadoop cluster with 8 slaves and 1 master. In the map phase we
> just parse the files and pass the parsed data on to the reducers. In the
> reduce phase we index the parsed data, much like Nutch does.
>
> When we ran the job, the map phase finished in under 4 hours. But something
> strange happened with the reduces: their progress went past 100% (some
> reached 200+%!) before they were killed. Is this some kind of bug in Hadoop?
>
> Eventually they all got killed with "Task
> attempt_200907091637_0004_r_000000_0 failed to report status for 1201
> seconds. Killing!" But I suspect indexing in the reduce phase simply takes
> more than 1200 seconds. How should we work around this?
>
>
> Thanks in advance,
> Prashant,
> Search and Information Extraction Lab,
> IIIT-Hyderabad,
> INDIA.
>
>
