[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699873#comment-14699873 ]
Prakash Ramachandran commented on TEZ-2724:
-------------------------------------------

* If realClient.getApplicationReportInternal returns null (say, because of a temporary network issue) and we switch to the ATS client, should we switch back to getting the status via the AM once the app report is available and the app has not completed?
* Minor: the switchToTimelineClient debug log can be changed.

> Tez Client keeps on showing old status when application is finished but RM is shutdown
> --------------------------------------------------------------------------------------
>
> Key: TEZ-2724
> URL: https://issues.apache.org/jira/browse/TEZ-2724
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.4
> Reporter: Jeff Zhang
> Assignee: Jeff Zhang
> Attachments: TEZ-2724-1.patch, amrecovery_mutlipleamrestart.txt
>
>
> From the logs, it seems the IPC retry interval is set to 20 seconds and the
> IPC max retries to 45. This means the client will retry the RPC connection
> for a total of 900 (20*45) seconds. During this period, the application may
> already have completed and an RM restart may be triggered, as described in
> the JIRA description. I think RM recovery is not enabled, so even after the
> new RM is started, the original application info is lost. That means the
> client can never get the correct application report, which makes it show the
> old status forever.
> {code}
> 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
> Deleted /user/hadoopqa/Input1
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2
> 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
> {code}
> Configuration to reproduce this issue:
> * disable generic application history (yarn.timeline-service.generic-application-history.enabled)
> * disable RM recovery (yarn.resourcemanager.recovery.enabled)
> * increase the IPC retry interval and max retries (ipc.client.connect.retry.interval & ipc.client.connect.max.retries)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
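
For reference, the 900-second retry window estimated in the quoted description can be reproduced with a small standalone sketch. The class `RetryWindow` and its methods are hypothetical names used only for illustration and are not part of Tez or Hadoop. The property keys are the ones listed in the reproduction steps; the value 20000 assumes the retry interval is expressed in milliseconds, and a fixed interval between attempts is assumed as well.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: estimates how long an IPC client keeps retrying an unreachable server. */
final class RetryWindow {

    /** Total time spent retrying, in seconds, assuming a fixed interval between attempts. */
    static long totalRetrySeconds(long retryIntervalMillis, int maxRetries) {
        return (retryIntervalMillis * maxRetries) / 1000L;
    }

    /** Settings from the reproduction steps, using the values inferred from the logs. */
    static Map<String, String> reproductionSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("yarn.timeline-service.generic-application-history.enabled", "false");
        conf.put("yarn.resourcemanager.recovery.enabled", "false");
        // Assumption: the interval is given in milliseconds, so 20 s becomes 20000.
        conf.put("ipc.client.connect.retry.interval", "20000");
        conf.put("ipc.client.connect.max.retries", "45");
        return conf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = reproductionSettings();
        long interval = Long.parseLong(conf.get("ipc.client.connect.retry.interval"));
        int retries = Integer.parseInt(conf.get("ipc.client.connect.max.retries"));
        // 20000 ms * 45 retries = 900000 ms = 900 s
        System.out.println("Retry window: " + totalRetrySeconds(interval, retries) + " seconds");
    }
}
```

With these values the client can stay blocked on the old RM address for about 15 minutes, which is long enough for the application to finish and for the RM to be restarted without recovery.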