[ https://issues.apache.org/jira/browse/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christian Kunz updated HADOOP-2076:
-----------------------------------

With input from Owen and Devaraj, I will try to use fewer Jetty threads to reduce contention and see whether this helps.

> extensive map tasks failures because of SocketTimeoutException during statusUpdate
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-2076
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2076
>             Project: Hadoop
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>         Environment: Oct 17 #718 nightly build with patches 2033 and 2048
>            Reporter: Christian Kunz
>
> A job with 3600 tasks on a cluster of 1350 nodes (up to 3 tasks per node) shows extensive map task failures because of connection timeouts at the end of the task (the C++ application using the pipes interface completed successfully).
> More than 600 tasks failed, slowing down the job because of retries. Only a portion of the tasks fail because of the timeout issue, but they spawn other failures because retries and speculatively executed tasks cannot even get a connection and fail after just a few seconds.
> The JobTracker is running with 60 handlers. We allow up to 10 attempts for maps.
> I attach the log of a task failing because of a timeout (which includes a thread dump), and the log of one task which could not start.
>
> 2007-10-18 15:58:41,743 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2007-10-18 16:12:28,920 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>       at org.apache.hadoop.ipc.Client.call(Client.java:484)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>       at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>       at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>       at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>       at org.apache.hadoop.ipc.Client.call(Client.java:484)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>       at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>       at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>       at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO org.apache.hadoop.mapred.TaskRunner: Communication exception: java.net.SocketTimeoutException: timed out waiting for rpc response
>       at org.apache.hadoop.ipc.Client.call(Client.java:484)
>       at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
>       at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown Source)
>       at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
>       at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO org.apache.hadoop.mapred.TaskRunner: Process Thread Dump: Communication exception
> 8 active threads
> Thread 13 (Comm thread for task_200710172336_0016_m_000071_0):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 4128
>   Stack:
>     sun.management.ThreadImpl.getThreadInfo0(Native Method)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>     sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>     org.apache.hadoop.util.ReflectionUtils.printThreadInfo(ReflectionUtils.java:114)
>     org.apache.hadoop.util.ReflectionUtils.logThreadInfo(ReflectionUtils.java:162)
>     org.apache.hadoop.mapred.Task$1.run(Task.java:315)
>     java.lang.Thread.run(Thread.java:619)
> Thread 12 ([EMAIL PROTECTED]):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6403
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
>     java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
>   State: RUNNABLE
>   Blocked count: 21
>   Waited count: 2063
>   Stack:
>     java.net.SocketInputStream.socketRead0(Native Method)
>     java.net.SocketInputStream.read(SocketInputStream.java:129)
>     java.io.FilterInputStream.read(FilterInputStream.java:116)
>     org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:181)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     java.io.DataInputStream.readInt(DataInputStream.java:370)
>     org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable Connection Culler):
>   State: TIMED_WAITING
>   Blocked count: 0
>   Waited count: 6402
>   Stack:
>     java.lang.Thread.sleep(Native Method)
>     org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)
> Thread 4 (Signal Dispatcher):
>   State: RUNNABLE
>   Blocked count: 0
>   Waited count: 0
>   Stack:
> Thread 3 (Finalizer):
>   State: WAITING
>   Blocked count: 398
>   Waited count: 2270
>   Waiting on [EMAIL PROTECTED]
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>     java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>     java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> Thread 2 (Reference Handler):
>   State: WAITING
>   Blocked count: 257
>   Waited count: 2269
>   Waiting on [EMAIL PROTECTED]
>   Stack:
>     java.lang.Object.wait(Native Method)
>     java.lang.Object.wait(Object.java:485)
>     java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> Thread 1 (main):
>   State: RUNNABLE
>   Blocked count: 1
>   Waited count: 10
>   Stack:
>     java.io.FileInputStream.readBytes(Native Method)
>     java.io.FileInputStream.read(FileInputStream.java:199)
>     org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:105)
>     java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
>     java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>     java.io.DataInputStream.read(DataInputStream.java:132)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:200)
>     org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:234)
>     org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>     org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>     org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:157)
>     org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:378)
>     org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:359)
>     org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.seek(ChecksumFileSystem.java:254)
>     org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:37)
>     org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.java:1793)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1217)
>     org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1142)
> 2007-10-18 17:45:40,258 WARN org.apache.hadoop.mapred.TaskRunner: Last retry, killing task_200710172336_0016_m_000071_0
>
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /127.0.0.1:53972. Already tried 10 time(s).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.