1) I do a lot of progress reporting.
2) Why would the job succeed when the only change in the code is

    if (NumberWrites++ % 100 == 0) context.write(key, value);

Comment out the test, allowing full writes, and the job fails. Since every
write is a report, I assume that something in the write code, or other Hadoop
code dealing with output, is failing. I do increment a counter for every
write, or in the case of the above code, every potential write. What I am
seeing is that wherever the timeout occurs, it is not in a place where I am
capable of inserting more reporting.
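[Editor's note: the sampling logic above can be sketched in plain Java. This is a hypothetical stand-in, not the poster's actual mapper; the class and method names are invented, and a Runnable replaces the real Hadoop context.write/counter calls.]

```java
// Sketch of the throttled-write pattern from the post: every record bumps
// the potential-write counter, but only every 100th record is emitted.
public class WriteThrottle {
    private long numberWrites = 0;  // counts potential writes (every record)
    private long actualWrites = 0;  // counts records actually emitted

    // Mirrors: if (NumberWrites++ % 100 == 0) context.write(key, value);
    public boolean shouldWrite() {
        return numberWrites++ % 100 == 0;
    }

    public void recordWrite() {
        actualWrites++;
    }

    public long getNumberWrites() { return numberWrites; }
    public long getActualWrites() { return actualWrites; }

    public static void main(String[] args) {
        WriteThrottle t = new WriteThrottle();
        for (int i = 0; i < 1000; i++) {
            if (t.shouldWrite()) {
                t.recordWrite();  // in the real job: context.write(key, value)
            }
        }
        // 1000 potential writes produce only 10 actual writes (i = 0, 100, ..., 900)
        System.out.println(t.getNumberWrites() + " potential, "
                + t.getActualWrites() + " actual");
    }
}
```

With the modulo test in place the mapper emits 1/100th of the output volume, which is why the counters in the two jobs below differ so sharply.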
On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lurb...@mit.edu> wrote:
> Perhaps you are not reporting progress throughout your task. If you
> happen to run a large enough job you hit the default timeout,
> mapred.task.timeout (which defaults to 10 min). Perhaps you should
> consider reporting progress in your mapper/reducer by calling
> progress() on the Reporter object. Check tip 7 of this link:
>
> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>
> Hope that helps,
> -Leo
>
> Sent from my phone
>
> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> I KNOW it is a task timeout - what I do NOT know is WHY merely cutting
>> the number of writes causes it to go away. It seems to imply that some
>> context.write operation, or something downstream from it, is taking a
>> huge amount of time, and that is all Hadoop internal code - not mine. So
>> my question is: why should increasing the number and volume of writes
>> cause a task to time out?
>>
>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <t...@supertom.com> wrote:
>>
>>> Sounds like mapred.task.timeout? The default is 10 minutes.
>>>
>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>
>>> Thanks,
>>>
>>> Tom
>>>
>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>> The map tasks fail, timing out after 600 sec.
>>>> I am processing one 9 GB file with 16,000,000 records. Each record
>>>> (think of it as a line) generates hundreds of key-value pairs.
>>>> The job is unusual in that the output of the mapper, in terms of
>>>> records or bytes, is orders of magnitude larger than the input.
>>>> I have no idea what is slowing down the job except that the problem
>>>> is in the writes.
>>>>
>>>> If I change the job to merely bypass a fraction of the context.write
>>>> statements, the job succeeds.
>>>> This is one map task that failed and one that succeeded - I cannot
>>>> understand how a write can take so long, or what else the mapper
>>>> might be doing.
>>>>
>>>> JOB FAILED WITH TIMEOUT
>>>>
>>>> Parser:
>>>>   TotalProteins           90,103
>>>>   NumberFragments         10,933,089
>>>> FileSystemCounters:
>>>>   HDFS_BYTES_READ         67,245,605
>>>>   FILE_BYTES_WRITTEN      444,054,807
>>>> Map-Reduce Framework:
>>>>   Combine output records  10,033,499
>>>>   Map input records       90,103
>>>>   Spilled Records         10,032,836
>>>>   Map output bytes        3,520,182,794
>>>>   Combine input records   10,844,881
>>>>   Map output records      10,933,089
>>>>
>>>> Same code but fewer writes:
>>>>
>>>> JOB SUCCEEDED
>>>>
>>>> Parser:
>>>>   TotalProteins           90,103
>>>>   NumberFragments         206,658,758
>>>> FileSystemCounters:
>>>>   FILE_BYTES_READ         111,578,253
>>>>   HDFS_BYTES_READ         67,245,607
>>>>   FILE_BYTES_WRITTEN      220,169,922
>>>> Map-Reduce Framework:
>>>>   Combine output records  4,046,128
>>>>   Map input records       90,103
>>>>   Spilled Records         4,046,128
>>>>   Map output bytes        662,354,413
>>>>   Combine input records   4,098,609
>>>>   Map output records      2,066,588
>>>>
>>>> Any bright ideas?
>>>>
>>>> --
>>>> Steven M. Lewis PhD
>>>> 4221 105th Ave NE
>>>> Kirkland, WA 98033
>>>> 206-384-1340 (cell)
>>>> Skype lordjoe_com

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
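[Editor's note: the advice in the thread - call progress() (or bump a counter) every N records so the tasktracker sees the task as alive - can be sketched in plain Java. This is a hypothetical helper, not Hadoop API code; the Runnable stands in for context.progress() or reporter.progress(), and the interval of 1000 is an arbitrary choice.]

```java
// Sketch of a periodic-heartbeat helper: invokes a callback once every
// REPORT_INTERVAL records processed, so a long-running loop keeps
// resetting the mapred.task.timeout clock.
public class ProgressHeartbeat {
    public static final int REPORT_INTERVAL = 1000;  // arbitrary for this sketch

    private long recordsSeen = 0;
    private long reportsSent = 0;
    private final Runnable reporter;  // stands in for context.progress()

    public ProgressHeartbeat(Runnable reporter) {
        this.reporter = reporter;
    }

    // Call once per record inside the map loop.
    public void onRecord() {
        if (recordsSeen++ % REPORT_INTERVAL == 0) {
            reporter.run();  // in a real job: context.progress()
            reportsSent++;
        }
    }

    public long getReportsSent() { return reportsSent; }

    public static void main(String[] args) {
        ProgressHeartbeat hb =
                new ProgressHeartbeat(() -> { /* heartbeat sent */ });
        for (int i = 0; i < 5000; i++) {
            hb.onRecord();  // one heartbeat at records 0, 1000, 2000, 3000, 4000
        }
        System.out.println(hb.getReportsSent() + " heartbeats");
    }
}
```

Note that in the situation described above this alone may not help: if the stall is inside context.write (e.g. combiner or spill activity triggered by the huge map output), user code never regains control to report progress, which matches Steve's observation that the timeout occurs where no reporting can be inserted.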