[ https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Kanter updated YARN-1490:
--------------------------------

    Attachment: org.apache.oozie.service.TestRecoveryService_thread-dump.txt

As reported in the "Re-swizzle 2.3" email thread on the mailing lists, we saw some weird behavior in the Oozie unit tests after YARN-1490. Basically, we use a single MiniMRCluster and MiniDFSCluster across all unit tests in a module. With YARN-1490 we saw that, regardless of test order, the last few tests would time out waiting for an MR job to finish; on slower machines, the entire test suite would time out.

Through some digging, I found that we were getting a ton of "Connection refused" exceptions from the LeaseRenewer talking to the NN, and a few from the AM talking to the RM. So it sounds like there's something that happens over time... I've attached a thread dump (org.apache.oozie.service.TestRecoveryService_thread-dump.txt) taken during the test where we saw the timeout, though it's possible that the issue manifests itself earlier but isn't noticeable until then.

Here is one of the exceptions that we see in the MiniMRCluster's syslog for the container used during that test; it repeats many times:

{noformat}
2014-02-07 14:42:22,998 WARN [LeaseRenewer:test@localhost:56186] org.apache.hadoop.hdfs.LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_-1380838220_1] for 2419 seconds. Will retry shortly ...
java.net.ConnectException: Call From rkanter-mbp.local/172.16.1.64 to localhost:56186 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
	at sun.reflect.GeneratedConstructorAccessor17.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
	at org.apache.hadoop.ipc.Client.call(Client.java:1410)
	at org.apache.hadoop.ipc.Client.call(Client.java:1359)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy9.renewLease(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.renewLease(ClientNamenodeProtocolTranslatorPB.java:519)
	at org.apache.hadoop.hdfs.DFSClient.renewLease(DFSClient.java:773)
	at org.apache.hadoop.hdfs.LeaseRenewer.renew(LeaseRenewer.java:417)
	at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:442)
	at org.apache.hadoop.hdfs.LeaseRenewer.access$700(LeaseRenewer.java:71)
	at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:298)
	at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:735)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:601)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:696)
	at org.apache.hadoop.ipc.Client$Connection.access$2700(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1458)
	at org.apache.hadoop.ipc.Client.call(Client.java:1377)
	... 16 more
{noformat}

I'm going to continue looking into why YARN-1490 is causing this behavior, but I thought I'd post this info here in case anyone has any ideas.

> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>             Fix For: 2.4.0
>
>         Attachments: YARN-1490.1.patch, YARN-1490.10.patch, YARN-1490.11.patch, YARN-1490.11.patch, YARN-1490.12.patch, YARN-1490.2.patch, YARN-1490.3.patch, YARN-1490.4.patch, YARN-1490.5.patch, YARN-1490.6.patch, YARN-1490.7.patch, YARN-1490.8.patch, YARN-1490.9.patch, org.apache.oozie.service.TestRecoveryService_thread-dump.txt
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to reconnect with old running containers, some may not want to. This should be an option.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)