> >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts
> >> and39 dup hosts)
- Reduce was started - you are probably running some kind of web crawler
(39 duplicate hosts, 0 slow hosts).

> >> Task attempt_200912081402_0042_r_000000_0 failed to report status for
> >> 902 seconds. Killing!

- Try increasing <name>mapred.task.timeout</name> tenfold; you are
currently using 900 seconds.

Another use case which I had before (with a Reducer): TIKA hangs forever
on some weird HTML. And another use case (BIXO): HttpClient configured
with 20 attempts and a 30-second timeout, which makes 20*30 = 600 seconds
of "non-reporting delay", so the default mapred.task.timeout of 600
seconds is not good enough.
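To make the keep-alive idea from my earlier message (quoted below)
concrete, here is a minimal, untested sketch against the old 0.20
org.apache.hadoop.mapred API. The class name and the helper method are
made up for illustration, and I use reporter.progress(), which is the
standard call for resetting the timeout clock. The timeout itself can
also be raised per job in the driver, e.g.
conf.setLong("mapred.task.timeout", 9000000L) - the value is in
milliseconds, so that is 10 x 900 seconds.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class KeepAliveReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private Thread heartbeat;  // started lazily on the first reduce() call

  // Spawn a daemon thread that pings the TaskTracker every 60 seconds so
  // the attempt is not killed while we block on an external service
  // (a SOLR update, a Lucene index merge, an HttpClient fetch, ...).
  private synchronized void startHeartbeat(final Reporter reporter) {
    if (heartbeat != null) {
      return;
    }
    heartbeat = new Thread(new Runnable() {
      public void run() {
        try {
          while (true) {
            reporter.progress();  // resets the mapred.task.timeout clock
            Thread.sleep(60L * 1000L);
          }
        } catch (InterruptedException e) {
          // close() was called; fall through and let the thread exit
        }
      }
    });
    heartbeat.setDaemon(true);  // never keep the child JVM alive
    heartbeat.start();
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output,
                     Reporter reporter) throws IOException {
    startHeartbeat(reporter);
    while (values.hasNext()) {
      // ... potentially slow call to the external service goes here ...
      output.collect(key, values.next());
    }
  }

  @Override
  public void close() throws IOException {
    if (heartbeat != null) {
      heartbeat.interrupt();  // stop pinging once the task is finished
    }
  }
}

The same trick works in a mapper; the only point is that something calls
progress() while the slow external call is in flight.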
Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca
Data Mining, Vertical Search


> -----Original Message-----
> From: Simon Willnauer [mailto:[email protected]]
> Sent: December-09-09 1:46 PM
> To: Fuad Efendi
> Cc: [email protected]
> Subject: Re: Copyphase dies on timeout
>
> Hmm, it looks like the reducer hasn't even started, and I do not have
> an external resource at all.
>
> simon
>
> On Wed, Dec 9, 2009 at 7:39 PM, Fuad Efendi <[email protected]> wrote:
> > I had similar issues before; Reduce can do some 'additional' processing
> > and have a dependency on external services.
> >
> > For instance, in my case it sends documents to SOLR (Lucene based), so
> > I have a dependency on Lucene in a reducer. And Lucene sometimes needs
> > 30-50 minutes to perform an index merge. If your timeout settings are
> > not enough, the job will be killed. You may also execute
> > reporter.keepAlive() every 60 seconds (in a separate thread spawned by
> > your reducer).
> >
> > See <name>mapred.task.timeout</name>, and use reporter.progress(),
> > keepAlive(), etc.
> >
> >
> > Fuad Efendi
> > +1 416-993-2060
> > http://www.tokenizer.ca
> > Data Mining, Vertical Search
> >
> >
> >> -----Original Message-----
> >> From: Simon Willnauer [mailto:[email protected]]
> >> Sent: December-09-09 1:31 PM
> >> To: [email protected]
> >> Subject: Copyphase dies on timeout
> >>
> >> Hi folks,
> >>
> >> I have a question regarding a map-reduce job that gets killed
> >> during the copy phase due to a timeout. I don't really understand the
> >> output completely, so I am first looking for some help to understand
> >> what is going on.
> >> There are about 50 mappers running and each mapper succeeds. The job
> >> is configured with a single reducer (I assume this could be closely
> >> related to that?!)
> >> So this is the output:
> >>
> >> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Merging
> >> 5 sorted segments
> >> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Down to
> >> the last merge-pass, with 5 segments left of total size: 216092233
> >> bytes
> >> 2009-12-09 15:01:50,727 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Merge of the 5 files in-memory
> >> complete. Local file is
> >> /data/hadoop-data/mapred/local/taskTracker/jobcache/job_200912081402_0042/attempt_200912081402_0042_r_000000_0/output/map_275.out
> >> of size 216092229
> >> 2009-12-09 15:01:50,728 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Thread waiting: Thread for
> >> merging on-disk files
> >> 2009-12-09 15:01:58,109 INFO org.apache.hadoop.mapred.ReduceTask: Read
> >> 53281055 bytes from map-output for
> >> attempt_200912081402_0042_m_000253_0
> >> 2009-12-09 15:01:58,148 INFO org.apache.hadoop.mapred.ReduceTask: Rec
> >> #1 from attempt_200912081402_0042_m_000253_0 -> (71, 103) from
> >> hadoop32
> >> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Need another 44 map output(s)
> >> where 3 is already in progress
> >> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts
> >> and41 dup hosts)
> >> 2009-12-09 15:02:40,174 INFO org.apache.hadoop.mapred.ReduceTask: Read
> >> 81297732 bytes from map-output for
> >> attempt_200912081402_0042_m_000112_0
> >> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: Rec
> >> #1 from attempt_200912081402_0042_m_000112_0 -> (129, 153) from
> >> hadoop31
> >> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts
> >> and16 dup hosts)
> >> 2009-12-09 15:02:40,223 INFO org.apache.hadoop.mapred.ReduceTask:
> >> Shuffling 82364774 bytes (82364778 raw bytes) into RAM from
> >> attempt_200912081402_0042_m_000121_0
> >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Read
> >> 82722679 bytes from map-output for
> >> attempt_200912081402_0042_m_000196_0
> >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Rec
> >> #1 from attempt_200912081402_0042_m_000196_0 -> (114, 137) from
> >> hadoop21
> >> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts
> >> and32 dup hosts)
> >> 2009-12-09 15:02:45,143 INFO org.apache.hadoop.mapred.ReduceTask:
> >> Shuffling 82639206 bytes (82639210 raw bytes) into RAM from
> >> attempt_200912081402_0042_m_000195_0
> >> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Need another 42 map output(s)
> >> where 3 is already in progress
> >> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts
> >> and39 dup hosts)
> >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Need another 42 map output(s)
> >> where 3 is already in progress
> >> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask:
> >> attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts
> >> and39 dup hosts)
> >> ......
> >>
> >> Task attempt_200912081402_0042_r_000000_0 failed to report status for
> >> 902 seconds. Killing!
> >>
> >> My first guess would be some IO problem, but I don't have any
> >> indicator of that in those logs. If somebody can clarify or give me a
> >> hint, that would be great. Even if it is something like: use more
> >> reducers because of ... <-- this is the interesting part at the moment!
> >> btw: this is hadoop-0.20
> >>
> >> simon
