In my hands the problem occurs in all map jobs. An associate with a different cluster (mine has 8 nodes, his 40) reports 80% of map tasks failing with a few succeeding. I suspect some kind of an I/O wait, but fail to see how it gets to 600 sec.
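(As a stopgap while debugging, I could simply raise the timeout itself. Assuming the classic pre-YARN property name from mapred-default.html, something like this in mapred-site.xml - though that only hides the stall, it doesn't explain it:)

```xml
<!-- mapred-site.xml: raise the task timeout from the default 600000 ms -->
<!-- (stopgap only - masks the symptom rather than fixing the stall) -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes; 0 would disable the timeout -->
</property>
```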
On Wed, Jan 18, 2012 at 4:50 PM, Raj V <rajv...@yahoo.com> wrote:

> Steve
>
> Does the timeout happen for all the map jobs? Are you using some kind of
> shared storage for map outputs? Any problems with the physical disks? If
> the shuffle phase has started, could the disks be I/O waiting between the
> read and write?
>
> Raj
>
>> From: Steve Lewis <lordjoe2...@gmail.com>
>> To: common-user@hadoop.apache.org
>> Sent: Wednesday, January 18, 2012 4:21 PM
>> Subject: Re: I am trying to run a large job and it is consistently
>> failing with timeout - nothing happens for 600 sec
>>
>> 1) I do a lot of progress reporting.
>> 2) Why would the job succeed when the only change in the code is
>>
>>     if (NumberWrites++ % 100 == 0)
>>         context.write(key, value);
>>
>> Comment out the test, allowing full writes, and the job fails.
>> Since every write is a report, I assume that something in the write code,
>> or other Hadoop code dealing with output, is failing. I do increment a
>> counter for every write - or, in the case of the above code, every
>> potential write.
>> What I am seeing is that wherever the timeout occurs, it is not in a
>> place where I am capable of inserting more reporting.
>>
>> On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lurb...@mit.edu> wrote:
>>
>>> Perhaps you are not reporting progress throughout your task. If you
>>> happen to run a large enough job, you hit the default timeout,
>>> mapred.task.timeout (which defaults to 10 min). Perhaps you should
>>> consider reporting progress in your mapper/reducer by calling
>>> progress() on the Reporter object.
>>> Check tip 7 of this link:
>>>
>>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>>
>>> Hope that helps,
>>> -Leo
>>>
>>> Sent from my phone
>>>
>>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>
>>>> I KNOW it is a task timeout - what I do NOT know is WHY merely cutting
>>>> the number of writes causes it to go away. It seems to imply that some
>>>> context.write operation, or something downstream from it, is taking a
>>>> huge amount of time - and that is all Hadoop internal code, not mine.
>>>> So my question is: why should increasing the number and volume of
>>>> writes cause a task to time out?
>>>>
>>>> On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <t...@supertom.com> wrote:
>>>>
>>>>> Sounds like mapred.task.timeout? The default is 10 minutes.
>>>>>
>>>>> http://hadoop.apache.org/common/docs/current/mapred-default.html
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Tom
>>>>>
>>>>> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>>>>> The map tasks fail, timing out after 600 sec.
>>>>>> I am processing one 9 GB file with 16,000,000 records. Each record
>>>>>> (think of it as a line) generates hundreds of key-value pairs.
>>>>>> The job is unusual in that the output of the mapper, in terms of
>>>>>> records or bytes, is orders of magnitude larger than the input.
>>>>>> I have no idea what is slowing down the job, except that the problem
>>>>>> is in the writes.
>>>>>>
>>>>>> If I change the job to merely bypass a fraction of the context.write
>>>>>> statements, the job succeeds.
>>>>>> This is one map task that failed and one that succeeded - I cannot
>>>>>> understand how a write can take so long, or what else the mapper
>>>>>> might be doing.
>>>>>>
>>>>>> JOB FAILED WITH TIMEOUT
>>>>>>
>>>>>> Parser: TotalProteins 90,103; NumberFragments 10,933,089
>>>>>> FileSystemCounters: HDFS_BYTES_READ 67,245,605;
>>>>>> FILE_BYTES_WRITTEN 444,054,807
>>>>>> Map-Reduce Framework: Combine output records 10,033,499;
>>>>>> Map input records 90,103; Spilled Records 10,032,836;
>>>>>> Map output bytes 3,520,182,794; Combine input records 10,844,881;
>>>>>> Map output records 10,933,089
>>>>>>
>>>>>> Same code but fewer writes:
>>>>>>
>>>>>> JOB SUCCEEDED
>>>>>>
>>>>>> Parser: TotalProteins 90,103; NumberFragments 206,658,758
>>>>>> FileSystemCounters: FILE_BYTES_READ 111,578,253;
>>>>>> HDFS_BYTES_READ 67,245,607; FILE_BYTES_WRITTEN 220,169,922
>>>>>> Map-Reduce Framework: Combine output records 4,046,128;
>>>>>> Map input records 90,103; Spilled Records 4,046,128;
>>>>>> Map output bytes 662,354,413; Combine input records 4,098,609;
>>>>>> Map output records 2,066,588
>>>>>>
>>>>>> Any bright ideas?
>>>>>>
>>>>>> --
>>>>>> Steven M. Lewis PhD
>>>>>> 4221 105th Ave NE
>>>>>> Kirkland, WA 98033
>>>>>> 206-384-1340 (cell)
>>>>>> Skype lordjoe_com

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
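For reference, the throttle discussed upthread can be reduced to a stand-alone sketch of just the counting logic, with no Hadoop dependencies (the class and method names here are illustrative, not from the actual job, which inlines this test in the mapper):

```java
// Stand-alone sketch of the write throttle from the thread: only every
// 100th candidate record actually reaches context.write().
public class WriteThrottle {
    private long numberWrites = 0;

    // Mirrors: if (NumberWrites++ % 100 == 0) context.write(key, value);
    // Returns true on the 1st, 101st, 201st, ... call.
    public boolean shouldWrite() {
        return numberWrites++ % 100 == 0;
    }

    public static void main(String[] args) {
        WriteThrottle throttle = new WriteThrottle();
        long emitted = 0;
        for (int i = 0; i < 16_000_000; i++) { // one call per input record
            if (throttle.shouldWrite()) {
                emitted++;
            }
        }
        // 16,000,000 candidates -> 160,000 actual writes, roughly the
        // 100x cut in output volume that lets the throttled job finish.
        System.out.println(emitted); // prints 160000
    }
}
```

Note the post-increment: the counter advances on every call, whether or not the record is emitted, so the 100x reduction applies uniformly across the input rather than truncating it.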