[jira] [Comment Edited] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099180#comment-16099180 ]

Jian He edited comment on YARN-6031 at 7/24/17 10:15 PM:
---------------------------------------------------------

I ran into this patch while debugging the same issue and have a few questions. cc [~sunilg], [~Ying Zhang]

1. The code below catches InvalidLabelResourceRequestException and assumes that the error occurred because node labels have been disabled. But the same InvalidLabelResourceRequestException can be thrown for other reasons too, right? In that case, the following logic becomes invalid.
{code}
try {
  amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
} catch (InvalidLabelResourceRequestException e) {
  // This can happen if the application had been submitted and run
  // with Node Label enabled but recover with Node Label disabled.
  // Thus there might be node label expression in the application's
  // resource requests. If this is the case, create RmAppImpl with
  // null amReq and reject the application later with clear error
  // message. So that the application can still be tracked by RM
  // after recovery and user can see what's going on and react accordingly.
  if (isRecovery && !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("AMResourceRequest is not created for " + applicationId
          + ". NodeLabel is not enabled in cluster, but AM resource "
          + "request contains a label expression.");
    }
  } else {
    throw e;
  }
}
{code}

2. The code below transitions the app directly to failed by using a Rejected event. The attempt state is not moved to failed; will it be stuck there? I think we need to send a KILL event instead of a REJECT event.
{code}
if (labelExp != null && !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
  String message = "Failed to recover application " + appId
      + ". NodeLabel is not enabled in cluster, but AM resource request "
      + "contains a label expression.";
  LOG.warn(message);
  application.handle(
      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
  return;
}
{code}

3. Is it OK to let the app continue in this scenario? That would be less disruptive to the apps. What is the disadvantage of letting the app continue?

was (Author: jianhe):
I ran into this patch while debugging and have a few questions. cc [~sunilg], [~Ying Zhang]

1. The code below catches InvalidLabelResourceRequestException and assumes that the error occurred because node labels have been disabled. But the same InvalidLabelResourceRequestException can be thrown for other reasons too, right? In that case, the following logic becomes invalid.
{code}
try {
  amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
} catch (InvalidLabelResourceRequestException e) {
  // This can happen if the application had been submitted and run
  // with Node Label enabled but recover with Node Label disabled.
  // Thus there might be node label expression in the application's
  // resource requests. If this is the case, create RmAppImpl with
  // null amReq and reject the application later with clear error
  // message. So that the application can still be tracked by RM
  // after recovery and user can see what's going on and react accordingly.
  if (isRecovery && !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("AMResourceRequest is not created for " + applicationId
          + ". NodeLabel is not enabled in cluster, but AM resource "
          + "request contains a label expression.");
    }
  } else {
    throw e;
  }
}
{code}

2. The code below transitions the app directly to failed by using a Rejected event. The attempt state is not moved to failed; will it be stuck there? I think we need to send a KILL event instead of a REJECT event.
{code}
if (labelExp != null && !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
  String message = "Failed to recover application " + appId
      + ". NodeLabel is not enabled in cluster, but AM resource request "
      + "contains a label expression.";
  LOG.warn(message);
  application.handle(
      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
  return;
}
{code}

3. Is it OK to let the app continue in this scenario? That would be less disruptive to the apps. What is the disadvantage of letting the app continue?

> Application recovery has failed when node label feature is turned off during
> RM recovery
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
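
The concern in point 1 can be made concrete with a small sketch. It is not Hadoop code: the class, enum, and method names below are hypothetical, and NO_LABEL is assumed to be the empty string as a stand-in for RMNodeLabelsManager.NO_LABEL. The idea is to treat a validation failure as "labels were disabled" only when the AM request really carries a label expression, and to rethrow in every other case.

```java
/**
 * Hypothetical, self-contained sketch (not Hadoop source): decide how to
 * react when AM resource request validation fails while an app is recovered.
 */
class RecoveryLabelGuard {

  /** Assumed stand-in for RMNodeLabelsManager.NO_LABEL. */
  static final String NO_LABEL = "";

  /** Possible reactions to the validation failure. */
  enum Action { RETHROW, DEFER_REJECTION }

  /** True if the request names a real label, i.e. not null and not NO_LABEL. */
  static boolean hasLabelExpression(String labelExp) {
    return labelExp != null && !labelExp.equals(NO_LABEL);
  }

  /**
   * Defer rejection only when all three conditions hold: we are recovering,
   * node labels are disabled, and the AM request contains a label expression.
   * Any other failure is assumed to have a different cause and is rethrown.
   */
  static Action onValidationFailure(boolean isRecovery,
      boolean nodeLabelsEnabled, String amLabelExp) {
    if (isRecovery && !nodeLabelsEnabled && hasLabelExpression(amLabelExp)) {
      return Action.DEFER_REJECTION;
    }
    return Action.RETHROW;
  }

  public static void main(String[] args) {
    // Recovery, labels disabled, request has a label: prints DEFER_REJECTION
    System.out.println(onValidationFailure(true, false, "gpu"));
    // Recovery, labels disabled, but no label in the request: prints RETHROW
    System.out.println(onValidationFailure(true, false, NO_LABEL));
    // Fresh submission: prints RETHROW
    System.out.println(onValidationFailure(false, false, "gpu"));
  }
}
```

Compared with the quoted catch block, the only addition is the third condition; whether that is sufficient to rule out the "other reasons" depends on what else the real validation can throw.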
[jira] [Comment Edited] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099180#comment-16099180 ]

Jian He edited comment on YARN-6031 at 7/24/17 10:08 PM:
---------------------------------------------------------

I ran into this patch while debugging and have a few questions. cc [~sunilg], [~Ying Zhang]

1. The code below catches InvalidLabelResourceRequestException and assumes that the error occurred because node labels have been disabled. But the same InvalidLabelResourceRequestException can be thrown for other reasons too, right? In that case, the following logic becomes invalid.
{code}
try {
  amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
} catch (InvalidLabelResourceRequestException e) {
  // This can happen if the application had been submitted and run
  // with Node Label enabled but recover with Node Label disabled.
  // Thus there might be node label expression in the application's
  // resource requests. If this is the case, create RmAppImpl with
  // null amReq and reject the application later with clear error
  // message. So that the application can still be tracked by RM
  // after recovery and user can see what's going on and react accordingly.
  if (isRecovery && !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("AMResourceRequest is not created for " + applicationId
          + ". NodeLabel is not enabled in cluster, but AM resource "
          + "request contains a label expression.");
    }
  } else {
    throw e;
  }
}
{code}

2. The code below transitions the app directly to failed by using a Rejected event. The attempt state is not moved to failed; will it be stuck there? I think we need to send a KILL event instead of a REJECT event.
{code}
if (labelExp != null && !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
  String message = "Failed to recover application " + appId
      + ". NodeLabel is not enabled in cluster, but AM resource request "
      + "contains a label expression.";
  LOG.warn(message);
  application.handle(
      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
  return;
}
{code}

3. Is it OK to let the app continue in this scenario? That would be less disruptive to the apps. What is the disadvantage of letting the app continue?

was (Author: jianhe):
I ran into this patch while debugging and have a few questions. cc [~sunilg], [~Ying Zhang]

1. The code below catches InvalidLabelResourceRequestException and assumes that the error occurred because node labels have been disabled. But the same InvalidLabelResourceRequestException can be thrown for other reasons too, right? In that case, the following logic becomes invalid.
{code}
try {
  amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
} catch (InvalidLabelResourceRequestException e) {
  // This can happen if the application had been submitted and run
  // with Node Label enabled but recover with Node Label disabled.
  // Thus there might be node label expression in the application's
  // resource requests. If this is the case, create RmAppImpl with
  // null amReq and reject the application later with clear error
  // message. So that the application can still be tracked by RM
  // after recovery and user can see what's going on and react accordingly.
  if (isRecovery && !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("AMResourceRequest is not created for " + applicationId
          + ". NodeLabel is not enabled in cluster, but AM resource "
          + "request contains a label expression.");
    }
  } else {
    throw e;
  }
}
{code}

2. The code below transitions the app directly to failed by using a Rejected event. The attempt state is not moved to failed; will it be stuck there?
{code}
if (labelExp != null && !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
  String message = "Failed to recover application " + appId
      + ". NodeLabel is not enabled in cluster, but AM resource request "
      + "contains a label expression.";
  LOG.warn(message);
  application.handle(
      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
  return;
}
{code}

3. Is it OK to let the app continue in this scenario? That would be less disruptive to the apps. What is the disadvantage of letting the app continue?

> Application recovery has failed when node label feature is turned off during
> RM recovery
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
> Issue Type: Bug
> Compon
[jira] [Comment Edited] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery
[ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857562#comment-15857562 ]

Ying Zhang edited comment on YARN-6031 at 2/8/17 7:18 AM:
----------------------------------------------------------

I think that is a separate question. Whether or not we backport YARN-4805, the test case itself can be improved to avoid running with FairScheduler :-)

was (Author: ying zhang):
I think that is a separate question. Whether or not we backport YARN-4805, the test case itself should be improved to avoid running with FairScheduler :-)

> Application recovery has failed when node label feature is turned off during
> RM recovery
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.8.0
> Reporter: Ying Zhang
> Assignee: Ying Zhang
> Priority: Minor
> Attachments: YARN-6031.001.patch, YARN-6031.002.patch,
> YARN-6031.003.patch, YARN-6031.004.patch, YARN-6031.005.patch,
> YARN-6031.006.patch, YARN-6031.007.patch, YARN-6031-branch-2.8.001.patch
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a
> node label expression specified while node labels had been disabled.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
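
For readers following the repro steps, the rule behind the exception message in the quoted stack trace can be sketched as below. This is an illustration under stated assumptions, not the SchedulerUtils source: the class, the exception type, and the method names are hypothetical stand-ins, and only the message string is taken from the trace.

```java
/**
 * Hypothetical sketch (not Hadoop source) of the rule named in the stack
 * trace: a request with a label expression is invalid when labels are off.
 */
class LabelRequestValidator {

  /** Assumed stand-in for InvalidLabelResourceRequestException. */
  static class InvalidLabelRequestException extends Exception {
    InvalidLabelRequestException(String message) {
      super(message);
    }
  }

  /** Throws if node labels are disabled but the request names a label. */
  static void validate(boolean nodeLabelsEnabled, String labelExp)
      throws InvalidLabelRequestException {
    boolean hasLabel = labelExp != null && !labelExp.trim().isEmpty();
    if (!nodeLabelsEnabled && hasLabel) {
      throw new InvalidLabelRequestException(
          "Invalid resource request, node label not enabled but request "
              + "contains label expression");
    }
  }

  /** Convenience wrapper: true if validate() does not throw. */
  static boolean accepts(boolean nodeLabelsEnabled, String labelExp) {
    try {
      validate(nodeLabelsEnabled, labelExp);
      return true;
    } catch (InvalidLabelRequestException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Step 1 of the repro: labels enabled, job submitted with a label.
    System.out.println(accepts(true, "gpu"));   // prints true
    // Step 2 of the repro: labels disabled, same stored request is recovered.
    System.out.println(accepts(false, "gpu"));  // prints false
  }
}
```

The two calls in main mirror the two repro steps: the same stored request passes while labels are enabled and fails once they are disabled, which is why recovery of a previously valid application can fail.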