[ 
https://issues.apache.org/jira/browse/YARN-5933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15698124#comment-15698124
 ] 

Prabhu Joseph commented on YARN-5933:
-------------------------------------

Hi [~sunilg] [~gtCarrera9], below are some ways to fix this issue, assuming 
that an application which is not found in the RM at the first 
getApplicationReport call will never be in one of the APP_FINAL_STATES at a 
subsequent getApplicationReport call.

1. Once the AppState is Unknown, the appDir can be removed from ActivePath 
immediately. I am not sure why there is a wait of unknownActiveMillis before 
the app is marked as completed. If we choose to remove the appDir immediately, 
the unknownActiveMillis handling code is no longer needed.
2. If an app in the unknown state also needs to be moved to the done 
directory, the appDir can be moved immediately instead of waiting for 
unknownActiveMillis.
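
To make the two options concrete, here is a rough sketch of the decision, not 
the actual EntityGroupFSTimelineStore code. It uses java.nio.file on a local 
directory in place of the Hadoop FileSystem API so that it runs standalone. 
The class UnknownAppDirHandler, its enums, and the handle() method are all 
hypothetical names made up for illustration, and UNKNOWN stands for the case 
where the RM answered with ApplicationNotFoundException.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

/** Hypothetical illustration only; not part of Hadoop. */
final class UnknownAppDirHandler {

    /** Simplified states; UNKNOWN models "RM threw ApplicationNotFoundException". */
    enum AppState { ACTIVE, UNKNOWN, COMPLETED }

    /** Option 1 (delete) or option 2 (move to done) from the list above. */
    enum Policy { DELETE_IMMEDIATELY, MOVE_TO_DONE }

    /**
     * Applies the policy to appDir right away when the state is UNKNOWN,
     * without any unknownActiveMillis-style grace period.
     * Returns true if the directory left the active path.
     */
    static boolean handle(Path appDir, Path doneDir, AppState state, Policy policy)
            throws IOException {
        if (state != AppState.UNKNOWN || !Files.exists(appDir)) {
            return false; // other states follow the normal flow
        }
        if (policy == Policy.DELETE_IMMEDIATELY) {
            deleteRecursively(appDir);
        } else {
            Files.createDirectories(doneDir);
            Files.move(appDir, doneDir.resolve(appDir.getFileName()));
        }
        return true;
    }

    /** Deletes children before parents by walking the tree in reverse order. */
    private static void deleteRecursively(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```

With either policy the stale directory leaves the active path on the first 
scan that sees the Unknown state, so later scans would have nothing left to 
ask the RM about.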

Please share your comments.

> ATS stale entries in active directory causes ApplicationNotFoundException in 
> RM
> -------------------------------------------------------------------------------
>
>                 Key: YARN-5933
>                 URL: https://issues.apache.org/jira/browse/YARN-5933
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.3
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>
> On a secure cluster where ATS is down, a submitted Tez job fails while 
> getting the TIMELINE_DELEGATION_TOKEN with the exception below:
> {code}
> 0: jdbc:hive2://kerberos-2.openstacklocal:100> select csmallint from 
> alltypesorc group by csmallint;
> INFO  : Session is already open
> INFO  : Dag name: select csmallint from alltypesor...csmallint(Stage-1)
> INFO  : Tez session was closed. Reopening...
> ERROR : Failed to execute tez graph.
> java.lang.RuntimeException: Failed to connect to timeline server. Connection 
> retries limit exceeded. The posted timeline event may be missing
>       at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:266)
>       at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.operateDelegationToken(TimelineClientImpl.java:590)
>       at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.getDelegationToken(TimelineClientImpl.java:506)
>       at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getTimelineDelegationToken(YarnClientImpl.java:349)
>       at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.addTimelineDelegationToken(YarnClientImpl.java:330)
>       at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:250)
>       at 
> org.apache.tez.client.TezYarnClient.submitApplication(TezYarnClient.java:72)
>       at org.apache.tez.client.TezClient.start(TezClient.java:409)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:196)
>       at 
> org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.closeAndOpen(TezSessionPoolManager.java:311)
>       at org.apache.hadoop.hive.ql.exec.tez.TezTask.submit(TezTask.java:453)
>       at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:180)
>       at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>       at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:89)
>       at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1728)
>       at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1485)
>       at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1262)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1126)
>       at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1121)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:154)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation.access$100(SQLOperation.java:71)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation$1$1.run(SQLOperation.java:206)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>       at 
> org.apache.hive.service.cli.operation.SQLOperation$1.run(SQLOperation.java:218)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
> By this point the Tez YarnClient has already received an applicationId from 
> the RM. When ATS is restarted, it tries to get the application report from 
> the RM, and the RM throws ApplicationNotFoundException. ATS keeps retrying, 
> which floods the RM.
> {code}
> RM logs:
> 2016-11-23 13:53:57,345 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new 
> applicationId: 5
> 2016-11-23 14:05:04,936 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 9 on 8050, call 
> org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport 
> from 172.26.71.120:37699 Call#26 Retry#0
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1479897867169_0005' doesn't exist in RM.
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:328)
>       at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>       at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>       at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2206)
>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2202)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2200)
> {code}
> There is a stale application entry inside /ats/active directory. ATS stops 
> requesting when we remove this directory.
> [hive@kerberos-2 bin]$ hadoop fs -ls /ats/active
> drwxrwx---   - hive hadoop          0 2016-11-23 13:54 
> /ats/active/application_1479897867169_0005
> This ATS issue is exposed by Tez jobs because Tez uses the putDomain method. 
> The call chain TimelineClientImpl#putDomain() -> writeDomain() -> 
> getAppAttemptDir() -> createApplicationDir() creates an application 
> directory inside the ATS activePath. After the Tez job creates this 
> directory, it fails because it cannot connect to ATS. When ATS comes back, 
> it scans activePath every 60 seconds 
> (yarn.timeline-service.entity-group-fs-store.scan-interval-seconds) and calls 
> getApplicationReport, which leads to ApplicationNotFoundException in the RM. 
> For this negative case, we can delete the appDirectory inside activePath 
> from ATS EntityGroupFSTimelineStore#getAppState() once the RM throws 
> ApplicationNotFoundException.


