In my hands the problem occurs in all map jobs. An associate with a different cluster (mine has 8 nodes, his 40) reports 80% of map tasks failing with a few succeeding. I suspect some kind of an I/O wait, but fail to see how it gets to 600 sec.
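(As a stopgap while debugging, I could simply raise the timeout itself. Assuming the classic pre-YARN property name from mapred-default.html, something like this in mapred-site.xml - though that only hides the stall, it doesn't explain it:)

```xml
<!-- mapred-site.xml: raise the task timeout from the default 600000 ms -->
<!-- (stopgap only - masks the symptom rather than fixing the stall) -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes; 0 would disable the timeout -->
</property>
```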
On Wed, Jan 18, 2012 at 4:50 PM, Raj V <rajv...@yahoo.com> wrote:

> Steve
>
> Does the timeout happen for all the map jobs? Are you using some kind of
> shared storage for map outputs? Any problems with the physical disks? If
> the shuffle phase has started, could the disks be I/O waiting between the
> read and write?
>
> Raj
>
>> From: Steve Lewis <lordjoe2...@gmail.com>
>> To: common-user@hadoop.apache.org
>> Sent: Wednesday, January 18, 2012 4:21 PM
>> Subject: Re: I am trying to run a large job and it is consistently
>> failing with timeout - nothing happens for 600 sec
>>
>> 1) I do a lot of progress reporting.
>> 2) Why would the job succeed when the only change in the code is
>>
>>     if (NumberWrites++ % 100 == 0)
>>         context.write(key, value);
>>
>> Comment out the test, allowing full writes, and the job fails.
>> Since every write is a report, I assume that something in the write code,
>> or other Hadoop code dealing with output, is failing. I do increment a
>> counter for every write - or, in the case of the above code, every
>> potential write.
>> What I am seeing is that wherever the timeout occurs, it is not in a
>> place where I am capable of inserting more reporting.
>>
>> On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lurb...@mit.edu> wrote:
>>
>>> Perhaps you are not reporting progress throughout your task. If you
>>> happen to run a large enough job, you hit the default timeout,
>>> mapred.task.timeout (which defaults to 10 min). Perhaps you should
>>> consider reporting progress in your mapper/reducer by calling
>>> progress() on the Reporter object.
>>> Check tip 7 of this link:
>>>
>>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>>
>>> Hope that helps,
>>> -Leo
>>>
>>> Sent from my phone
>>>
>>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>
>>>> I KNOW it is a task timeout - what I do NOT know is WHY merely cutting
>>>> the number of writes causes it to go away. It seems to imply that some
>>>> context.write operation, or something downstream from it, is taking a
>>>> huge amount of time - and that is all Hadoop internal code, not mine.
>>>> So my question is: why should increasing the number and volume of
>>>> writes cause a task to time out?
>>>>
>>>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <t...@supertom.com> wrote:
>>>>
>>>>> Sounds like mapred.task.timeout? The default is 10 minutes.
>>>>>
>>>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>>>> The map tasks fail, timing out after 600 sec.
>>>>>> I am processing one 9 GB file with 16,000,000 records. Each record
>>>>>> (think of it as a line) generates hundreds of key-value pairs.
>>>>>> The job is unusual in that the output of the mapper, in terms of
>>>>>> records or bytes, is orders of magnitude larger than the input.
>>>>>> I have no idea what is slowing down the job, except that the problem
>>>>>> is in the writes.
>>>>>>
>>>>>> If I change the job to merely bypass a fraction of the context.write
>>>>>> statements, the job succeeds.
>>>>>> This is one map task that failed and one that succeeded - I cannot
>>>>>> understand how a write can take so long, or what else the mapper
>>>>>> might be doing.
>>>>>>
>>>>>> JOB FAILED WITH TIMEOUT
>>>>>>
>>>>>> Parser: TotalProteins 90,103; NumberFragments 10,933,089
>>>>>> FileSystemCounters: HDFS_BYTES_READ 67,245,605;
>>>>>> FILE_BYTES_WRITTEN 444,054,807
>>>>>> Map-Reduce Framework: Combine output records 10,033,499;
>>>>>> Map input records 90,103; Spilled Records 10,032,836;
>>>>>> Map output bytes 3,520,182,794; Combine input records 10,844,881;
>>>>>> Map output records 10,933,089
>>>>>>
>>>>>> Same code but fewer writes:
>>>>>>
>>>>>> JOB SUCCEEDED
>>>>>>
>>>>>> Parser: TotalProteins 90,103; NumberFragments 206,658,758
>>>>>> FileSystemCounters: FILE_BYTES_READ 111,578,253;
>>>>>> HDFS_BYTES_READ 67,245,607; FILE_BYTES_WRITTEN 220,169,922
>>>>>> Map-Reduce Framework: Combine output records 4,046,128;
>>>>>> Map input records 90,103; Spilled Records 4,046,128;
>>>>>> Map output bytes 662,354,413; Combine input records 4,098,609;
>>>>>> Map output records 2,066,588
>>>>>>
>>>>>> Any bright ideas?
>>>>>>
>>>>>> --
>>>>>> Steven M. Lewis PhD
>>>>>> 4221 105th Ave NE
>>>>>> Kirkland, WA 98033
>>>>>> 206-384-1340 (cell)
>>>>>> Skype lordjoe_com

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
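For reference, the throttle discussed upthread can be reduced to a stand-alone sketch of just the counting logic, with no Hadoop dependencies (the class and method names here are illustrative, not from the actual job, which inlines this test in the mapper):

```java
// Stand-alone sketch of the write throttle from the thread: only every
// 100th candidate record actually reaches context.write().
public class WriteThrottle {
    private long numberWrites = 0;

    // Mirrors: if (NumberWrites++ % 100 == 0) context.write(key, value);
    // Returns true on the 1st, 101st, 201st, ... call.
    public boolean shouldWrite() {
        return numberWrites++ % 100 == 0;
    }

    public static void main(String[] args) {
        WriteThrottle throttle = new WriteThrottle();
        long emitted = 0;
        for (int i = 0; i < 16_000_000; i++) { // one call per input record
            if (throttle.shouldWrite()) {
                emitted++;
            }
        }
        // 16,000,000 candidates -> 160,000 actual writes, roughly the
        // 100x cut in output volume that lets the throttled job finish.
        System.out.println(emitted); // prints 160000
    }
}
```

Note the post-increment: the counter advances on every call, whether or not the record is emitted, so the 100x reduction applies uniformly across the input rather than truncating it.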