Hi Chris,

I'd suggest updating to a newer version of your Hadoop distro - you're
hitting some bugs that were fixed last summer. In particular, you're
missing the "amendment" patch from MAPREDUCE-2373, as well as some
patches to MR that make the fetch retry behavior more aggressive.
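
To make the failure mode concrete: the NPE in your TaskTracker log is
thrown while logging the status of a shell executor that apparently was
never started, and because JvmRunner treats any Throwable as fatal, the
whole TaskTracker aborts. Here's a minimal, self-contained sketch of that
pattern and the kind of null guard a fix adds - the class and method
bodies below are stand-ins of mine, not the actual CDH patch:

// Hypothetical sketch, not the actual MAPREDUCE-2373 amendment patch.
// Models DefaultTaskController.logShExecStatus() being handed a null
// executor when the child JVM exits before the launch completes.
public class NullGuardSketch {

  /** Stand-in for org.apache.hadoop.util.Shell.ShellCommandExecutor. */
  static class ShellCommandExecutor {
    int getExitCode() { return 0; }
  }

  static void logShExecStatus(ShellCommandExecutor shExec) {
    // Without this guard, the call below throws the NPE in your log,
    // and JvmRunner's catch-Throwable handler aborts the TaskTracker.
    if (shExec == null) {
      System.out.println("executor never started; nothing to log");
      return;
    }
    System.out.println("exit code: " + shExec.getExitCode());
  }

  public static void main(String[] args) {
    logShExecStatus(null); // unpatched: NullPointerException here
  }
}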

-Todd

On Mon, Dec 5, 2011 at 12:45 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:
> Hi,
>
> Using: Version 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
> 8-node cluster, 64-bit CentOS
>
> We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
> jobs. When we investigate, it looks like the TaskTracker on the node being
> fetched from is not running. Looking at the logs, we see what looks like a
> self-initiated shutdown:
>
> 2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
> it ran: 0
> 2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
> Throwable in JVMRunner. Aborting TaskTracker.
> java.lang.NullPointerException
>        at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
>        at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
>        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
>        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
> 2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
> ************************************************************/
>
> Then the reducers have the following:
>
> 2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
> java.net.ConnectException: Connection refused
>  at java.net.PlainSocketImpl.socketConnect(Native Method)
>  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>  at java.net.Socket.connect(Socket.java:529)
>  at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
>  at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
>  at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
>  at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
>  at sun.net.www.http.HttpClient.New(HttpClient.java:306)
>  at sun.net.www.http.HttpClient.New(HttpClient.java:323)
>  at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
>  at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
>  at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
>  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
>  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
>  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
>  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
>  at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
>
> 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_201112050908_0169_r_000005_0: Failed fetch #2 from
> attempt_201112050908_0169_m_000002_0
> 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
> fetch map-output from attempt_201112050908_0169_m_000002_0 even after
> MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to
> the JobTracker
> 2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
> failed with too many fetch failures and insufficient progress!Killing task
> attempt_201112050908_0169_r_000005_0.
> 2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_201112050908_0169_r_000005_0 adding host had11.atlis1 to penalty
> box, next contact in 8 seconds
> 2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201112050908_0169_r_000005_0: Got 1 map-outputs from previous
> failures
>
> The job then fails.
>
> Several questions:
> 1. What is causing the TaskTracker to fail/exit? This is after running
> hundreds to thousands of jobs, so it's not just at start-up.
> 2. Why isn't Hadoop detecting that the reducers need something from a dead
> mapper and restarting the map task, even if it means aborting the reducers?
> 3. Why isn't the DataNode being used to fetch the blocks? It is still up
> and running when this happens, so shouldn't it know where the files are in
> HDFS?
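
On #3: in classic MapReduce the map outputs never go into HDFS. They sit
on the mapper's local disk (under mapred.local.dir) and are served to
reducers by the TaskTracker's embedded HTTP server, so the DataNode -
which only serves HDFS blocks - can't stand in for a dead TaskTracker.
Here's a minimal sketch of the fetch each reducer performs; I'm assuming
the stock 0.20 servlet path and default port 50060, with the host and
attempt IDs filled in from your logs:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleFetchSketch {
  public static void main(String[] args) throws Exception {
    // The reducer asks the TaskTracker (not the DataNode) for the map
    // output it needs. If the TaskTracker process is down, connect()
    // fails with exactly the "Connection refused" shown above.
    URL mapOutput = new URL("http://had11.atlis1:50060/mapOutput"
        + "?job=job_201112050908_0169"
        + "&map=attempt_201112050908_0169_m_000002_0"
        + "&reduce=5");
    HttpURLConnection conn = (HttpURLConnection) mapOutput.openConnection();
    conn.connect(); // java.net.ConnectException if the TaskTracker is dead

    InputStream in = conn.getInputStream();
    try {
      byte[] buf = new byte[8192];
      long total = 0;
      for (int n; (n = in.read(buf)) != -1; ) {
        total += n;
      }
      System.out.println("fetched " + total + " bytes of map output");
    } finally {
      in.close();
    }
  }
}

Once enough of these fetch failures are reported, the JobTracker's only
remedy is to re-run the map on another node - which is also the answer to
your #2, and the behavior the more aggressive retry patches tune.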
>
> Thanks,
>
> Chris



-- 
Todd Lipcon
Software Engineer, Cloudera
