Hi Chris,

I'd suggest updating to a newer version of your Hadoop distro - you're hitting some bugs that were fixed last summer. In particular, you're missing the "amendment" patch from MAPREDUCE-2373, as well as some patches to MR that make the fetch retry behavior more aggressive.
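[Editor's note: the NullPointerException in the quoted trace below comes from a logging call inside DefaultTaskController's launch path. The following is an illustrative sketch of that general failure mode - hypothetical names, not the actual Hadoop source: if a launch throws before the shell executor is assigned, a finally-block logging call can dereference null and take the whole daemon down, so the logging call needs a null guard.]

```java
// Illustrative only -- hypothetical names, not the actual Hadoop source.
// Shows the failure mode: a finally-block logging call dereferences a
// process handle that was never assigned because the launch threw first,
// turning one bad task launch into a daemon-wide abort.
public class LaunchSketch {
    static class ShellExec {
        int exitCode() { return 0; }
    }

    public static String launchTask(boolean failEarly) {
        ShellExec shExec = null;
        try {
            if (failEarly) {
                // launch fails before shExec is ever assigned
                throw new RuntimeException("launch failed");
            }
            shExec = new ShellExec();
            return "ran";
        } catch (RuntimeException e) {
            return "caught";
        } finally {
            // Without this null guard, logStatus(shExec) would throw
            // NullPointerException on the failEarly path -- the same shape
            // as the crash in the quoted TaskTracker log.
            if (shExec != null) {
                logStatus(shExec);
            }
        }
    }

    static void logStatus(ShellExec e) {
        System.out.println("exit code " + e.exitCode());
    }
}
```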
-Todd

On Mon, Dec 5, 2011 at 12:45 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:
> Hi,
>
> Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
> 8 node cluster, 64 bit CentOS
>
> We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
> jobs. When we investigate, it looks like the TaskTracker on the node being
> fetched from is not running. Looking at the logs we see what looks like a
> self-initiated shutdown:
>
> 2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
> jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
> it ran: 0
> 2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
> Throwable in JVMRunner. Aborting TaskTracker.
> java.lang.NullPointerException
>         at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
>         at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
>         at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
>         at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
> 2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
> SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
> ************************************************************/
>
> Then the reducers have the following:
>
> 2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
> java.net.ConnectException: Connection refused
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
>         at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>         at java.net.Socket.connect(Socket.java:529)
>         at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
>         at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
>         at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:306)
>         at sun.net.www.http.HttpClient.New(HttpClient.java:323)
>         at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
>         at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
>         at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
>
> 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_201112050908_0169_r_000005_0: Failed fetch #2 from
> attempt_201112050908_0169_m_000002_0
> 2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
> fetch map-output from attempt_201112050908_0169_m_000002_0 even after
> MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to
> the JobTracker
> 2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
> failed with too many fetch failures and insufficient progress! Killing task
> attempt_201112050908_0169_r_000005_0.
> 2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_201112050908_0169_r_000005_0 adding host had11.atlis1 to penalty
> box, next contact in 8 seconds
> 2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201112050908_0169_r_000005_0: Got 1 map-outputs from previous
> failures
>
> The job then fails.
>
> Several questions:
> 1. What is causing the TaskTracker to fail/exit? This is after running
> hundreds to thousands of jobs, so it's not just at start-up.
> 2. Why isn't Hadoop detecting that the reducers need something from a dead
> mapper and restarting the map task, even if it means aborting the reducers?
> 3. Why isn't the DataNode being used to fetch the blocks? It is still up
> and running when this happens, so shouldn't it know where the files are in
> HDFS?
>
> Thanks,
>
> Chris

-- 
Todd Lipcon
Software Engineer, Cloudera
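[Editor's note: until the upgrade lands, affected nodes can be found by scanning TaskTracker logs for the abort message quoted in this thread, rather than waiting for reducers to hit MAX_FETCH_RETRIES_PER_MAP. A minimal sketch - the signature string is taken from the log excerpt above; everything else is hypothetical:]

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: scan TaskTracker log lines for the abort signature seen
// in this thread, so crashed TaskTrackers can be spotted before reducers
// start failing shuffle fetches against them.
public class AbortScan {
    // Message text copied from the ERROR line in the quoted log.
    static final String SIGNATURE =
        "Caught Throwable in JVMRunner. Aborting TaskTracker.";

    public static List<String> findAborts(List<String> logLines) {
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            if (line.contains(SIGNATURE)) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```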