[ 
https://issues.apache.org/jira/browse/TEZ-3156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hitesh Shah updated TEZ-3156:
-----------------------------
    Attachment: TEZ-3156.1.patch

[~bikassaha] [~sseth] please review. 

This looks like an edge case where the RM is restarted without a state 
store and "forgets" the application. The client, meanwhile, is still trying 
to reach the AM and expects the RM to either point it to a new AM or report 
the application as completed. 

Introduced a new DAGClientInternal to cleanly handle AppNotFound without 
changing the public API (internal impl classes were re-using the 
public-facing DAGClient). 
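
To illustrate the intended behaviour (not the code in the patch), here is a 
minimal sketch of a status poller that stops once the RM reports the 
application as unknown. Everything in it is hypothetical: AppNotFoundException 
stands in for YARN's ApplicationNotFoundException, ReportSource stands in for 
the RM call, and StatusPoller and its return strings are made up for this 
example. 

```java
/** Hypothetical stand-in for YARN's ApplicationNotFoundException. */
class AppNotFoundException extends Exception {
    AppNotFoundException(String message) {
        super(message);
    }
}

/** Hypothetical stand-in for the RM call that returns an application's state. */
interface ReportSource {
    String fetchState(String appId) throws AppNotFoundException;
}

/** Sketch only: polls for app status and gives up when the RM has no record of the app. */
final class StatusPoller {
    private StatusPoller() {
    }

    static String waitForCompletion(ReportSource rm, String appId, int maxPolls, long sleepMs) {
        for (int i = 0; i < maxPolls; i++) {
            try {
                String state = rm.fetchState(appId);
                if (isTerminal(state)) {
                    return state;
                }
            } catch (AppNotFoundException e) {
                // The RM does not know the app, so further polling cannot succeed.
                // Treat this as a terminal condition instead of retrying.
                return "UNKNOWN_TO_RM";
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt flag for the caller and stop polling.
                Thread.currentThread().interrupt();
                return "INTERRUPTED";
            }
        }
        return "TIMED_OUT";
    }

    private static boolean isTerminal(String state) {
        return "SUCCEEDED".equals(state) || "FAILED".equals(state) || "KILLED".equals(state);
    }
}
```

The point of the sketch is that "application not found" is handled as a 
final answer rather than as a transient error, so the client stops sending 
getApplicationReport calls to an RM that can never answer them. 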

Tested manually. 

> Tez client keep trying to talk to RM if RM does not know application
> --------------------------------------------------------------------
>
>                 Key: TEZ-3156
>                 URL: https://issues.apache.org/jira/browse/TEZ-3156
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Yesha Vora
>            Assignee: Hitesh Shah
>         Attachments: TEZ-3156.1.patch
>
>
> Scenario : 
> * Set RM/NM recovery to false.
> {code}
> <property>
>   <name>yarn.resourcemanager.recovery.enabled</name>
>   <value>false</value>
> </property>
> <property>
>   <name>yarn.nodemanager.recovery.enabled</name>
>   <value>false</value>
> </property>
> {code}
> * Start Mrrsleep application (application_1456883132071_0001)
> {code}
> hadoop jar tez-tests-*.jar mrrsleep -m 1 -r 1 -mt 1000000 -rt 1000
> {code}
> * While the application is running, restart the RM.
> Since recovery is disabled and the RM is restarted, it forgets the mrrsleep 
> application. At this point, the mrrsleep application's tez-client keeps 
> trying to communicate with the RM and loads it with the exception below. 
> {code}
> 2016-03-02 02:01:24,708 INFO  ipc.Server (Server.java:run(2172)) - IPC Server 
> handler 18 on 8050, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from xx.xx.xx.xxx:36191 Call#500250 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1456883132071_0001' doesn't exist in RM.
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>       at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> 2016-03-02 02:01:24,709 INFO  ipc.Server (Server.java:run(2172)) - IPC Server 
> handler 27 on 8050, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from xx.xx.xx.xxx:36191 Call#500251 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1456883132071_0001' doesn't exist in RM.
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>       at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:415)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
