Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Thanks guys, I'll get with operations to do the upgrade.

Chris

On Mon, Dec 5, 2011 at 4:11 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi,

Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14, 8-node cluster, 64-bit CentOS

We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer jobs. When we investigate, it looks like the TaskTracker on the node being fetched from is not running. Looking at the logs, we see what looks like a self-initiated shutdown:

2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks it ran: 0
2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
java.lang.NullPointerException
        at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
        at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
************************************************************/

Then the reducers have the following:

2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:529)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
        at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
        at sun.net.www.http.HttpClient.New(HttpClient.java:306)
        at sun.net.www.http.HttpClient.New(HttpClient.java:323)
        at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
        at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
        at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201112050908_0169_r_05_0: Failed fetch #2 from attempt_201112050908_0169_m_02_0
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201112050908_0169_m_02_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle failed with too many fetch failures and insufficient progress! Killing task attempt_201112050908_0169_r_05_0.
2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty box, next contact in 8 seconds
2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous failures

The job then fails. Several questions:

1. What is causing the TaskTracker to fail/exit? This happens after running hundreds to thousands of jobs, so it's not just at start-up.
2. Why isn't Hadoop detecting that the reducers need something from a dead mapper and restarting the mapper, even if it means aborting the reducers?
3. Why isn't the DataNode being used to fetch the blocks? It is still up and running when this happens, so shouldn't it know where the files are in HDFS?

Thanks,
Chris
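A quick way to confirm what these logs imply is to probe the two daemons' HTTP ports directly from another node. The sketch below is just a diagnostic illustration, not part of Hadoop: the host name is taken from the logs above, and the ports assume the stock defaults (50060 for the TaskTracker's shuffle/Jetty server, 50075 for the DataNode web server); adjust both if your cluster overrides them.

// Illustrative probe only: checks whether the TaskTracker's shuffle HTTP port
// and the DataNode's HTTP port accept connections on a given host.
// Host name is from the logs above; 50060/50075 are the stock default ports
// (mapred.task.tracker.http.address / dfs.datanode.http.address) and may
// differ on a tuned cluster.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    static void probe(String host, int port, String label) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000); // 2-second timeout
            System.out.println(label + " (" + host + ":" + port + ") is accepting connections");
        } catch (IOException e) {
            System.out.println(label + " (" + host + ":" + port + ") refused/unreachable: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "had11.atlis1"; // host from the logs above
        probe(host, 50060, "TaskTracker shuffle (Jetty)");
        probe(host, 50075, "DataNode HTTP");
    }
}

If the second probe succeeds while the first is refused, the shuffle fetch failures above are explained by the TaskTracker being down rather than by anything HDFS-related.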
Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi Chris,

I'd suggest updating to a newer version of your Hadoop distro - you're hitting some bugs that were fixed last summer. In particular, you're missing the amendment patch from MAPREDUCE-2373 as well as some patches to MR which make the fetch retry behavior more aggressive.

-Todd

On Mon, Dec 5, 2011 at 12:45 PM, Chris Curtin <curtin.ch...@gmail.com> wrote:

--
Todd Lipcon
Software Engineer, Cloudera
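For readers trying to connect the stack trace to Todd's diagnosis, the sketch below is a loose, hypothetical reconstruction of that failure mode, not the CDH3u0 source: it assumes, based only on the class and method names in the trace, that the launch path can hand a null ShellCommandExecutor to a logging helper, and that the JvmRunner thread treats any Throwable during launch as fatal to the whole TaskTracker. The real code and the actual fixes in the JIRAs mentioned in this thread may differ.

// Loose, hypothetical reconstruction of the failure mode suggested by the
// stack trace above; the names mirror the trace, but the bodies are guesses,
// not the real CDH3u0 code.
class ShellCommandExecutor {
    int getExitCode() { return 0; }
}

class DefaultTaskControllerSketch {
    // If shExec is null (e.g. the child process was never started because an
    // earlier step failed), this logging helper itself throws NullPointerException.
    void logShExecStatus(ShellCommandExecutor shExec) {
        System.out.println("task exited with code " + shExec.getExitCode());
    }

    void launchTask() {
        ShellCommandExecutor shExec = null; // early-failure path: never created
        try {
            // ... launching the child JVM would normally populate shExec ...
        } finally {
            logShExecStatus(shExec); // NPE originates here, as in the trace
        }
    }
}

class JvmRunnerSketch extends Thread {
    private final DefaultTaskControllerSketch controller = new DefaultTaskControllerSketch();

    @Override
    public void run() {
        try {
            controller.launchTask();
        } catch (Throwable t) {
            // Matches the log: one bad task launch escalates to shutting down
            // the whole TaskTracker instead of just failing that task attempt.
            System.err.println("Caught Throwable in JVMRunner. Aborting TaskTracker.");
            t.printStackTrace();
            // A TaskTracker shutdown would be triggered here.
        }
    }
}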
Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Hi Chris

From the stack trace, it looks like a JVM corruption issue. It is a known issue and has been fixed in CDH3u2; I believe an upgrade would solve your issues.
https://issues.apache.org/jira/browse/MAPREDUCE-3184

Regarding your queries, I'll try to help you out a bit. In MapReduce, the data transfer between map and reduce happens over HTTP. If Jetty is down, that transfer won't happen, which means map output on one node won't be accessible to a reducer on another node. The map outputs are on the local filesystem, not on HDFS, so even if the DataNode on that machine is up, we can't get the data in the above circumstances.

Hope it helps!

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:15 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:
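To make the point about the shuffle path concrete: in this MapReduce version each reducer pulls map output over HTTP from the Jetty server embedded in the map-side TaskTracker, so once that TaskTracker dies the output is unreachable even though it still sits on the node's local disks. The sketch below only illustrates that fetch path; the /mapOutput servlet path and its query parameters reflect my reading of the 0.20-era shuffle and may not match every release exactly, and the host, port, and job/attempt IDs are placeholders rather than values from this cluster.

// Illustrative sketch of the 0.20-era shuffle fetch: a reducer asks the
// map-side TaskTracker's embedded Jetty server for a map task's output.
// Host, port, job/attempt IDs and partition below are placeholders; the
// /mapOutput servlet path and parameters are my reading of the 0.20 shuffle.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleFetchSketch {
    public static void main(String[] args) throws Exception {
        String taskTrackerHost = "had11.atlis1";  // host from the logs in this thread
        int httpPort = 50060;                     // default TaskTracker HTTP port
        String jobId = "job_201112050908_0169";   // placeholder job id
        String mapAttempt = "attempt_201112050908_0169_m_000002_0"; // placeholder attempt id
        int reducePartition = 5;                  // placeholder reduce partition

        URL url = new URL("http://" + taskTrackerHost + ":" + httpPort
                + "/mapOutput?job=" + jobId
                + "&map=" + mapAttempt
                + "&reduce=" + reducePartition);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(30000);
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // map output is streamed back over HTTP
            }
            System.out.println("Fetched " + total + " bytes of map output");
        } catch (IOException e) {
            // This is where "Connection refused" shows up when the TaskTracker
            // (and its Jetty server) is no longer running on that node.
            System.err.println("Fetch failed: " + e);
        } finally {
            conn.disconnect();
        }
    }
}

A real reducer adds the required headers, the shuffle security hash, retry/back-off, and in-memory versus on-disk merge handling on top of this; the point is only that the transfer goes through the TaskTracker's HTTP server, not through HDFS or the DataNode.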