Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

Raj Vishwanthan Wed, 18 Jan 2012 19:29:14 -0800

You can try the following:
- make it into a map-only job (for debugging purposes)
- start your shuffle phase after all the maps are complete (there is a
  parameter for this)
- characterize your disks for performance
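
A minimal sketch of the first two suggestions, assuming the Hadoop 1.x property names mapred.reduce.tasks and mapred.reduce.slowstart.completed.maps, and assuming the job driver goes through ToolRunner/GenericOptionsParser so that -D options are honored. The class and method names are hypothetical; the code only builds the option strings and does not talk to a cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: builds "-D key=value" options for a debug run.
final class DebugJobOptions {

    // Map-only run: with zero reducers there is no sort/shuffle phase,
    // and map output goes straight to the job's output location.
    static Map<String, String> mapOnly() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.reduce.tasks", "0");
        return props;
    }

    // Hold reducers (and therefore the shuffle) back until all maps are done.
    // Assumed property name from Hadoop 1.x mapred-default.xml.
    static Map<String, String> shuffleAfterAllMaps() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.reduce.slowstart.completed.maps", "1.0");
        return props;
    }

    // Render the properties as generic options for the "hadoop jar" command line.
    static String toGenericOptions(Map<String, String> props) {
        StringJoiner joined = new StringJoiner(" ");
        for (Map.Entry<String, String> e : props.entrySet()) {
            joined.add("-D " + e.getKey() + "=" + e.getValue());
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println(toGenericOptions(mapOnly()));
        System.out.println(toGenericOptions(shuffleAfterAllMaps()));
    }
}
```

The two settings are meant for separate experiments: the first removes the shuffle entirely, the second keeps it but stops it from competing with running maps for disk I/O.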


Raj 


Sent from Samsung Mobile

Steve Lewis <lordjoe2...@gmail.com> wrote:

In my hands the problem occurs in all map jobs. An associate with a different
cluster (mine has 8 nodes, his has 40) reports that 80% of map tasks fail,
with a few succeeding.
I suspect some kind of I/O wait, but fail to see how it gets to 600 sec.
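
A rough arithmetic check on the counters quoted further down in this thread, as a sketch only. It assumes NumberWrites starts at 0 and that NumberFragments counts every potential write; the class and method names are hypothetical. It checks that the succeeded task's record count matches 1-in-100 sampling, and compares per-record size and total map output volume between the two tasks.

```java
// Hypothetical helper for sanity-checking the task counters from the thread.
final class CounterCheck {

    // Number of writes issued by `if (NumberWrites++ % every == 0)` over
    // `calls` invocations, assuming the counter starts at 0: ceil(calls / every).
    static long sampledWrites(long calls, long every) {
        return (calls + every - 1) / every;
    }

    // Average serialized size of one map output record.
    static double bytesPerRecord(long mapOutputBytes, long mapOutputRecords) {
        return (double) mapOutputBytes / mapOutputRecords;
    }

    public static void main(String[] args) {
        // Counter values copied from the two task reports in the thread.
        long succeededFragments = 206_658_758L;
        long succeededRecords = 2_066_588L;
        long succeededBytes = 662_354_413L;
        long failedRecords = 10_933_089L;
        long failedBytes = 3_520_182_794L;

        System.out.println("sampling matches: "
                + (sampledWrites(succeededFragments, 100) == succeededRecords));
        System.out.printf("bytes/record failed:    %.1f%n",
                bytesPerRecord(failedBytes, failedRecords));
        System.out.printf("bytes/record succeeded: %.1f%n",
                bytesPerRecord(succeededBytes, succeededRecords));
        System.out.printf("output volume ratio:    %.2f%n",
                (double) failedBytes / succeededBytes);
    }
}
```

Under those assumptions the records are about the same size in both tasks, so the difference between them is volume rather than record shape. What that implies about where the stall happens is not established by the counters alone.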

On Wed, Jan 18, 2012 at 4:50 PM, Raj V <rajv...@yahoo.com> wrote:
Steve

Does the timeout happen for all the map jobs? Are you using some kind of shared 
storage for map outputs? Any problems with the physical disks? If the shuffle 
phase has started could the disks be I/O waiting between the read and write?

Raj



>________________________________
> From: Steve Lewis <lordjoe2...@gmail.com>
>To: common-user@hadoop.apache.org
>Sent: Wednesday, January 18, 2012 4:21 PM
>Subject: Re: I am trying to run a large job and it is consistently failing 
>with timeout - nothing happens for 600 sec
>
>1) I do a lot of progress reporting
>2) Why would the job succeed when the only change in the code is
>      if(NumberWrites++ % 100 == 0)
>              context.write(key,value);
>Comment out the test, allowing full writes, and the job fails.
>Since every write is a report, I assume that something in the write code or
>other Hadoop code for dealing with output is failing. I do increment a
>counter for every write or, in the case of the above code, every potential write.
>What I am seeing is that wherever the timeout occurs, it is not in a place
>where I am capable of inserting more reporting.
>
>
>
>On Wed, Jan 18, 2012 at 4:01 PM, Leonardo Urbina <lurb...@mit.edu> wrote:
>
>> Perhaps you are not reporting progress throughout your task. If you
>> run a large enough job, you hit the default timeout,
>> mapred.task.timeout (which defaults to 10 min). Perhaps you should
>> consider reporting progress in your mapper/reducer by calling
>> progress() on the Reporter object. Check tip 7 of this link:
>>
>> http://www.cloudera.com/blog/2009/05/10-mapreduce-tips/
>>
>> Hope that helps,
>> -Leo
>>
>> Sent from my phone
>>
>> On Jan 18, 2012, at 6:46 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>>
>> > I KNOW it is a task timeout - what I do NOT know is WHY merely cutting
>> > the number of writes causes it to go away. It seems to imply that some
>> > context.write operation, or something downstream from that, is taking a
>> > huge amount of time, and that is all Hadoop internal code, not mine. So
>> > my question is: why should increasing the number and volume of writes
>> > cause a task to time out?
>> >
>> > On Wed, Jan 18, 2012 at 2:33 PM, Tom Melendez <t...@supertom.com> wrote:
>> >
>> >> Sounds like mapred.task.timeout?  The default is 10 minutes.
>> >>
>> >> http://hadoop.apache.org/common/docs/current/mapred-default.html
>> >>
>> >> Thanks,
>> >>
>> >> Tom
>> >>
>> >> On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis <lordjoe2...@gmail.com>
>> >> wrote:
>> >>> The map tasks fail, timing out after 600 sec.
>> >>> I am processing one 9 GB file with 16,000,000 records. Each record
>> >>> (think of it as a line) generates hundreds of key-value pairs.
>> >>> The job is unusual in that the output of the mapper, in terms of
>> >>> records or bytes, is orders of magnitude larger than the input.
>> >>> I have no idea what is slowing down the job except that the problem
>> >>> is in the writes.
>> >>>
>> >>> If I change the job to merely bypass a fraction of the context.write
>> >>> statements, the job succeeds.
>> >>> Below are one map task that failed and one that succeeded. I cannot
>> >>> understand how a write can take so long
>> >>> or what else the mapper might be doing.
>> >>>
>> >>> JOB FAILED WITH TIMEOUT
>> >>>
>> >>> Parser
>> >>>   TotalProteins                   90,103
>> >>>   NumberFragments             10,933,089
>> >>> FileSystemCounters
>> >>>   HDFS_BYTES_READ             67,245,605
>> >>>   FILE_BYTES_WRITTEN         444,054,807
>> >>> Map-Reduce Framework
>> >>>   Combine output records      10,033,499
>> >>>   Map input records               90,103
>> >>>   Spilled Records             10,032,836
>> >>>   Map output bytes         3,520,182,794
>> >>>   Combine input records       10,844,881
>> >>>   Map output records          10,933,089
>> >>>
>> >>> Same code but fewer writes
>> >>> JOB SUCCEEDED
>> >>>
>> >>> Parser
>> >>>   TotalProteins                   90,103
>> >>>   NumberFragments            206,658,758
>> >>> FileSystemCounters
>> >>>   FILE_BYTES_READ            111,578,253
>> >>>   HDFS_BYTES_READ             67,245,607
>> >>>   FILE_BYTES_WRITTEN         220,169,922
>> >>> Map-Reduce Framework
>> >>>   Combine output records       4,046,128
>> >>>   Map input records               90,103
>> >>>   Spilled Records              4,046,128
>> >>>   Map output bytes           662,354,413
>> >>>   Combine input records        4,098,609
>> >>>   Map output records           2,066,588
>> >>>
>> >>> Any bright ideas
>> >>> --
>> >>> Steven M. Lewis PhD
>> >>> 4221 105th Ave NE
>> >>> Kirkland, WA 98033
>> >>> 206-384-1340 (cell)
>> >>> Skype lordjoe_com



-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com



