[ https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Kanter updated YARN-1490:
--------------------------------

    Attachment: org.apache.oozie.service.TestRecoveryService_thread-dump.txt

As reported in the "Re-swizzle 2.3" email thread on the mailing lists, we saw some weird behavior in the Oozie unit tests after YARN-1490. Basically, we use a single MiniMRCluster and MiniDFSCluster across all unit tests in a module. With YARN-1490 we saw that, regardless of test order, the last few tests would time out waiting for an MR job to finish; on slower machines, the entire test suite would time out.

Through some digging, I found that we were getting a ton of "Connection refused" exceptions from the LeaseRenewer talking to the NN, and a few from the AM talking to the RM. So it sounds like there's something that happens over time... I've attached a thread dump (org.apache.oozie.service.TestRecoveryService_thread-dump.txt) taken during the test where we saw the timeout, though it's possible that the issue manifests itself earlier but isn't noticeable until then.

Here is one of the exceptions that we see in the MiniMRCluster's syslog for the container used during that test; it repeats many times:

{noformat}
2014-02-07 14:42:22,998 WARN [LeaseRenewer:test@localhost:56186] org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1380838220_1] for 2419 seconds. Will retry shortly ...
java.net.ConnectException: Call From rkanter-mbp.local/172.16.1.64 to localhost:56186 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
	at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
	at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
	at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
	at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
	at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
	at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
	at org.apache.hadoop.ipc.Client.call(Client.java:1377)
	... 16 more
{noformat}

I'm going to continue looking into why YARN-1490 is causing this behavior, but I thought I'd post this info here in case anyone has any ideas.

> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>             Fix For: 2.4.0
>
>         Attachments: YARN-1490.1.patch, YARN-1490.10.patch, YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch, org.apache.oozie.service.TestRecoveryService_thread-dump.txt
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to reconnect with old running containers, some may not want to. This should be an option.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)