Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-06 Thread Chris Curtin
Thanks guys, I'll get with operations to do the upgrade.

Chris


MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Chris Curtin
Hi,

Using: *Version:* 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
8-node cluster, 64-bit CentOS

We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
jobs. When we investigate it looks like the TaskTracker on the node being
fetched from is not running. Looking at the logs we see what looks like a
self-initiated shutdown:

2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks
it ran: 0
2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
Throwable in JVMRunner. Aborting TaskTracker.
java.lang.NullPointerException
        at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
        at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
        at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
************************************************************/

Then the reducers have the following:


2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.ConnectException: Connection refused
 at java.net.PlainSocketImpl.socketConnect(Native Method)
 at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
 at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
 at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
 at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
 at java.net.Socket.connect(Socket.java:529)
 at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
 at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
 at sun.net.www.http.HttpClient.init(HttpClient.java:233)
 at sun.net.www.http.HttpClient.New(HttpClient.java:306)
 at sun.net.www.http.HttpClient.New(HttpClient.java:323)
 at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
 at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
 at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201112050908_0169_r_05_0: Failed fetch #2 from
attempt_201112050908_0169_m_02_0
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to
fetch map-output from attempt_201112050908_0169_m_02_0 even after
MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting to
the JobTracker
2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle
failed with too many fetch failures and insufficient progress!Killing task
attempt_201112050908_0169_r_05_0.
2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty
box, next contact in 8 seconds
2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
failures
The job then fails.

Several questions:
1. What is causing the TaskTracker to fail/exit? This happens after running
hundreds to thousands of jobs, so it's not just at start-up.
2. Why isn't Hadoop detecting that the reducers need something from a dead
mapper and restarting the map task, even if it means aborting the reducers?
3. Why isn't the DataNode being used to fetch the blocks? It is still up
and running when this happens, so shouldn't it know where the files are in
HDFS?

Thanks,

Chris


Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Todd Lipcon
Hi Chris,

I'd suggest updating to a newer version of your Hadoop distro; you're
hitting some bugs that were fixed last summer. In particular, you're
missing the amendment patch from MAPREDUCE-2373, as well as some
patches to MR that make the fetch retry behavior more aggressive.
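
For anyone curious where the retry ceiling comes from, here is a rough,
self-contained sketch of how a 0.20-era reducer could derive its per-map
fetch-retry cap from the mapred.reduce.copy.backoff budget. The constants
and formula are my approximation of that code path, not a copy of it:

public class FetchRetryMath {
    // Assumed constants, modeled on the 0.20-era shuffle code path.
    private static final int BACKOFF_INIT = 4000;           // first retry delay, ms
    private static final int MIN_FETCH_RETRIES_PER_MAP = 2; // floor on retries

    // Count how many exponentially growing retry delays fit inside the
    // mapred.reduce.copy.backoff budget (seconds, default 300).
    static int maxFetchRetriesPerMap(int maxBackoffSecs) {
        int retries = 0;
        long total = 0;
        long delay = BACKOFF_INIT;
        while (total + delay <= maxBackoffSecs * 1000L) {
            total += delay;
            delay *= 2; // exponential backoff between fetch attempts
            retries++;
        }
        return Math.max(MIN_FETCH_RETRIES_PER_MAP, retries);
    }

    public static void main(String[] args) {
        // A 300s budget allows only a handful of retries, so a TaskTracker
        // that dies mid-shuffle fails reducers fairly quickly.
        System.out.println(maxFetchRetriesPerMap(300));
    }
}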

-Todd


-- 
Todd Lipcon
Software Engineer, Cloudera


Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-05 Thread Bejoy Ks
Hi Chris
From the stack trace, it looks like a JVM corruption issue. It is
a known issue and has been fixed in CDH3u2; I believe an upgrade would
solve your issues:
https://issues.apache.org/jira/browse/MAPREDUCE-3184
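
To make the failure mode concrete, here is a simplified sketch of the bug
pattern implied by that stack trace. This is not the actual Hadoop/CDH
source, just an illustration: if the child JVM launch fails before the
shell executor is assigned, logging its status dereferences null, and the
caller treats that Throwable as fatal for the whole TaskTracker:

class ShellExecutor {
    int getExitCode() { return 0; }
}

class DefaultTaskControllerSketch {
    private ShellExecutor shExec; // stays null if the launch fails early

    void launchTask(boolean launchFails) {
        try {
            if (launchFails) {
                throw new RuntimeException("child JVM failed to start");
            }
            shExec = new ShellExecutor();
        } catch (RuntimeException e) {
            // launch failure swallowed; control still reaches the logging call
        }
        logShExecStatus(); // NPE when shExec was never assigned
    }

    void logShExecStatus() {
        // Dereferencing shExec without a null guard is the bug class shown in
        // the trace; JvmRunner catches the NPE as a Throwable and aborts the
        // entire TaskTracker, matching the ERROR and SHUTDOWN_MSG lines above.
        System.out.println("exit code: " + shExec.getExitCode());
    }

    public static void main(String[] args) {
        new DefaultTaskControllerSketch().launchTask(true); // reproduces the NPE
    }
}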

Regarding your other queries, I'll try to help out a bit. In MapReduce,
the data transfer between map and reduce happens over HTTP. If Jetty is
down, that transfer won't happen, which means map output on one node
won't be accessible to a reducer on another node. The map outputs are on
the local filesystem (LFS), not on HDFS, so even if the DataNode on that
machine is up, we can't get at the data under these circumstances.
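
As a concrete illustration of that HTTP path, here is a minimal sketch of
what the reducer-side fetch amounts to. The mapOutput servlet URL shape
follows the 0.20-era TaskTracker; the host, port, and IDs below are
placeholders taken from the logs in this thread, so treat the details as
assumptions:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ShuffleFetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholders taken from the logs above.
        String ttHost = "had11.atlis1"; // TaskTracker host
        int httpPort = 50060;           // default TaskTracker HTTP port
        String url = "http://" + ttHost + ":" + httpPort
                + "/mapOutput?job=job_201112050908_0169"
                + "&map=attempt_201112050908_0169_m_02_0"
                + "&reduce=5";

        // If Jetty (and with it the TaskTracker) is down, this connect is
        // where "Connection refused" surfaces, exactly as in the reducer
        // stack trace. The DataNode can't serve this data: map output lives
        // on the local filesystem, not in HDFS.
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            System.out.println("fetched " + in.available() + " bytes of map output");
        }
    }
}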

Hope it helps!

Regards
Bejoy.K.S

