[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935177#comment-14935177 ]

Jeff Zhang commented on TEZ-2724:
---------------------------------

[~hitesh] Please help review

> Tez Client keeps on showing old status when application is finished but RM is shutdown
> ---------------------------------------------------------------------------------------
>
>                 Key: TEZ-2724
>                 URL: https://issues.apache.org/jira/browse/TEZ-2724
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.4
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>         Attachments: TEZ-2724-1.patch, TEZ-2724-2.patch, amrecovery_mutlipleamrestart.txt
>
>
> From the logs, it seems the IPC retry interval is set to 20 seconds and the
> IPC max retries to 45. This means that the client will retry the RPC
> connection for a total of 900 (20*45) seconds. During this period, the
> application may already have completed and an RM restart may have been
> triggered, as said in the JIRA description. I think RM recovery is not
> enabled, so even when the new RM is started, the original application info is
> lost. The client can therefore never get the correct application report,
> which makes it show the old status forever.
> {code}
> 15/05/07 19:13:43 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 26 time(s); maxRetries=45
> Deleted /user/hadoopqa/Input1
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -ls /user/hadoopqa/Input2
> RUNNING: call D:\hdp\hadoop-2.6.0.2.2.6.0-2782\bin\hdfs.cmd dfs -rm -r -skipTrash /user/hadoopqa/Input2
> 15/05/07 19:14:03 INFO ipc.Client: Retrying connect to server: maint22-tez12/100.79.80.19:52822. Already tried 27 time(s); maxRetries=45
> {code}
> Configuration to reproduce this issue
> * disable generic application history (yarn.timeline-service.generic-application-history.enabled)
> * disable RM recovery (yarn.resourcemanager.recovery.enabled)
> * increase the IPC retry interval and max retries (ipc.client.connect.retry.interval & ipc.client.connect.max.retries)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
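The 900-second figure in the quoted description is the retry interval multiplied by the retry count. A minimal sketch of that arithmetic follows, assuming the client waits one fixed interval before each retry and ignoring any per-attempt connect timeout. RetryWindow and total are illustrative names, not Hadoop or Tez APIs.

```java
import java.time.Duration;

/** Illustrative helper; not part of Hadoop or Tez. */
final class RetryWindow {
    private RetryWindow() {}

    /**
     * Upper bound on the time spent retrying, assuming one fixed wait of
     * retryInterval before each of maxRetries retries (an assumption of
     * this sketch; real clients may add connect timeouts on top).
     */
    static Duration total(Duration retryInterval, int maxRetries) {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        return retryInterval.multipliedBy(maxRetries);
    }

    public static void main(String[] args) {
        // Values from the issue description: 20 s interval, 45 retries.
        Duration window = total(Duration.ofSeconds(20), 45);
        System.out.println(window.getSeconds() + " seconds");
    }
}
```

With the values from the description (20 seconds, 45 retries) the window is 900 seconds, which is long enough for the application to finish and the RM to be restarted while the client is still retrying.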
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740183#comment-14740183 ]

Jeff Zhang commented on TEZ-2724:
---------------------------------

Uploaded another patch. [~hitesh] [~pramachandran] Please help review.
There are 2 scenarios:
* App is finished and RM is restarted without recovery enabled. The client keeps using cachedDagStatus.
** Solution: Throw ApplicationNotFoundException
* App is finished and RM is shut down (no restart)
** Solution: Set a timeout threshold for how long the cachedDagStatus is used.
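The second scenario above proposes a timeout threshold on the use of cachedDagStatus. The following is only a sketch of that idea under stated assumptions, not the code in the attached patch: ExpiringStatusCache, update and getIfFresh are hypothetical names, and the clock is injected so the expiry can be exercised without sleeping.

```java
import java.util.Optional;
import java.util.function.LongSupplier;

/** Hypothetical holder for a cached status; not the Tez client's actual code. */
final class ExpiringStatusCache<T> {
    private final long maxAgeMillis;
    private final LongSupplier clock; // returns the current time in milliseconds
    private T value;
    private long updatedAtMillis;

    ExpiringStatusCache(long maxAgeMillis, LongSupplier clock) {
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
    }

    /** Stores a fresh status and records when it was obtained. */
    void update(T fresh) {
        value = fresh;
        updatedAtMillis = clock.getAsLong();
    }

    /** Returns the cached status only while it is no older than the threshold. */
    Optional<T> getIfFresh() {
        if (value == null) {
            return Optional.empty();
        }
        long ageMillis = clock.getAsLong() - updatedAtMillis;
        return ageMillis <= maxAgeMillis ? Optional.of(value) : Optional.empty();
    }
}
```

Once getIfFresh() comes back empty, a caller would stop reporting the old status and surface an error instead. What the real client should do at that point is left to the patch.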
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740168#comment-14740168 ]

TezQA commented on TEZ-2724:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12755294/TEZ-2724-2.patch
  against master revision b288be7.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in :
                   org.apache.tez.test.TestFaultTolerance

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/1110//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/1110//console

This message is automatically generated.
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736453#comment-14736453 ]

Jeff Zhang commented on TEZ-2724:
---------------------------------

Steps to reproduce this issue:
* Configuration requirements:
** yarn.timeline-service.generic-application-history.enabled=false
** yarn.resourcemanager.recovery.enabled=false
** ipc.client.connect.retry.interval=5000
** ipc.client.connect.max.retries=12
* Run command: "hadoop jar tez-tests/target/tez-tests-0.8.1-SNAPSHOT.jar mrrsleep -m 5 -r 5 -mt 2 -rt 1"
* Kill the AM while the job is running
* Check the RM UI and wait for the YARN app to finish, then restart the RM
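For anyone setting this up, the four settings above can be written in the usual Hadoop site-file layout (a configuration element containing property entries, each with a name and a value). The sketch below only renders them as text. ReproConfig is an illustrative name, and putting all four keys in one block is a simplification of this example: on a real cluster the yarn.* keys would normally go in yarn-site.xml and the ipc.* keys in core-site.xml.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper that renders the repro settings as Hadoop-style XML. */
final class ReproConfig {
    private ReproConfig() {}

    /** The four settings listed in the reproduction steps, in the same order. */
    static Map<String, String> settings() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("yarn.timeline-service.generic-application-history.enabled", "false");
        props.put("yarn.resourcemanager.recovery.enabled", "false");
        props.put("ipc.client.connect.retry.interval", "5000"); // milliseconds
        props.put("ipc.client.connect.max.retries", "12");
        return props;
    }

    /** Renders properties as site XML; no escaping, since the values are plain. */
    static String toSiteXml(Map<String, String> props) {
        StringBuilder sb = new StringBuilder("<configuration>\n");
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append("  <property>\n")
              .append("    <name>").append(e.getKey()).append("</name>\n")
              .append("    <value>").append(e.getValue()).append("</value>\n")
              .append("  </property>\n");
        }
        return sb.append("</configuration>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toSiteXml(settings()));
    }
}
```

With a 5000 ms interval and 12 retries the client retries for roughly 60 seconds, which keeps the reproduction short compared with the 900 seconds seen in the original logs.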
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700228#comment-14700228 ]

Hitesh Shah commented on TEZ-2724:
----------------------------------

I think this is an edge case where either RM HA or RM recovery is not enabled.

I think the switch to using TimelineClient should only happen in the following condition: the RM either says the app finished or throws an AppNotFound exception. If the RM is down, we should just wait, or throw an error if that is what is being done today. Switching to the TimelineClient while the RM is down is probably going to be problematic, as it will not switch back to the AM after the RM comes back up (if recovery is enabled).
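The condition proposed above can be read as a small decision table. A sketch of it follows, with hypothetical names throughout (StatusSourcePolicy, RmAnswer, Action). The ASK_AM branch for a still-running app is an assumption of this example and is not something the comment states.

```java
/** Hypothetical decision table for where the client reads DAG status from. */
final class StatusSourcePolicy {
    private StatusSourcePolicy() {}

    /** What the client learned from its last attempt to contact the RM. */
    enum RmAnswer { APP_RUNNING, APP_FINISHED, APP_NOT_FOUND, RM_UNREACHABLE }

    /** What the client should do next. */
    enum Action { ASK_AM, SWITCH_TO_TIMELINE, WAIT_OR_FAIL }

    static Action decide(RmAnswer answer) {
        switch (answer) {
            case APP_FINISHED:
            case APP_NOT_FOUND:
                // Only a definite answer from the RM justifies the switch.
                return Action.SWITCH_TO_TIMELINE;
            case RM_UNREACHABLE:
                // RM is down: keep waiting or surface an error, but do not switch.
                return Action.WAIT_OR_FAIL;
            case APP_RUNNING:
                // Assumption of this sketch: a running app is still served by the AM.
                return Action.ASK_AM;
            default:
                throw new IllegalStateException("unhandled answer: " + answer);
        }
    }
}
```

Keeping RM_UNREACHABLE out of the switch-to-timeline branch is what avoids the problem described above, where the client would not return to the AM after the RM comes back up.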
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699873#comment-14699873 ]

Prakash Ramachandran commented on TEZ-2724:
-------------------------------------------

* If realClient.getApplicationReportInternal returns null (say, because of a temporary network issue) and we switch to the ATS client, should we switch back to getting status via the AM once the app report is available and the app has not completed?
* Minor: the switchToTimelineClient debug log can be changed.
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699116#comment-14699116 ]

Jeff Zhang commented on TEZ-2724:
---------------------------------

Uploaded a patch to fix it. Verified it manually. [~pramachandran] [~hitesh] Please help review.
[jira] [Commented] (TEZ-2724) Tez Client keeps on showing old status when application is finished but RM is shutdown
[ https://issues.apache.org/jira/browse/TEZ-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699107#comment-14699107 ]

TezQA commented on TEZ-2724:
----------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12750751/TEZ-2724-1.patch
  against master revision 6cb8206.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/998//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/998//console

This message is automatically generated.