[jira] [Comment Edited] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery

2017-07-24 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099180#comment-16099180
 ] 

Jian He edited comment on YARN-6031 at 7/24/17 10:15 PM:
-

I ran into this patch while debugging the same issue and have a few questions:
cc [~sunilg], [~Ying Zhang] 
1. The code below catches InvalidLabelResourceRequestException and assumes that 
the error occurred because node labels have been disabled, but the same 
InvalidLabelResourceRequestException can be thrown for other reasons too, 
right? In that case, the following logic becomes invalid. 

{code}
try {
  amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
} catch (InvalidLabelResourceRequestException e) {
  // This can happen if the application had been submitted and run
  // with Node Label enabled but recover with Node Label disabled.
  // Thus there might be node label expression in the application's
  // resource requests. If this is the case, create RmAppImpl with
  // null amReq and reject the application later with clear error
  // message. So that the application can still be tracked by RM
  // after recovery and user can see what's going on and react accordingly.
  if (isRecovery &&
      !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("AMResourceRequest is not created for " + applicationId
          + ". NodeLabel is not enabled in cluster, but AM resource "
          + "request contains a label expression.");
    }
  } else {
    throw e;
  }
}
{code}
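
One possible way to avoid guessing the cause from the exception type alone is to carry the cause explicitly. The following is only a self-contained sketch, not YARN code: `LabelRequestException`, `Reason`, and `shouldTolerate` are hypothetical names, and I am not claiming the real InvalidLabelResourceRequestException has any such field.

```java
// Toy sketch, not YARN code: every name below is a hypothetical stand-in.
final class LabelRecoverySketch {
  enum Reason { LABELS_DISABLED, OTHER }

  // Stand-in for InvalidLabelResourceRequestException, extended with a
  // hypothetical reason field so that callers need not infer the cause.
  static final class LabelRequestException extends Exception {
    final Reason reason;

    LabelRequestException(String message, Reason reason) {
      super(message);
      this.reason = reason;
    }
  }

  // Tolerate the failure only during recovery, and only when it was
  // actually caused by node labels being disabled.
  static boolean shouldTolerate(LabelRequestException e, boolean isRecovery) {
    return isRecovery && e.reason == Reason.LABELS_DISABLED;
  }

  public static void main(String[] args) {
    LabelRequestException disabled = new LabelRequestException(
        "node label not enabled but request contains label expression",
        Reason.LABELS_DISABLED);
    LabelRequestException other = new LabelRequestException(
        "some other label validation failure", Reason.OTHER);

    System.out.println(shouldTolerate(disabled, true));
    System.out.println(shouldTolerate(other, true));
    System.out.println(shouldTolerate(disabled, false));
  }
}
```

With something along these lines, an exception thrown for an unrelated reason during recovery would still be rethrown instead of being treated as the labels-disabled case.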

2. The code below transitions the app directly to failed by using an 
APP_REJECTED event. The attempt state is not moved to failed, so won't the 
attempt be stuck there? I think we need to send a KILL event instead of the 
REJECT event.
{code}
if (labelExp != null &&
    !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
  String message = "Failed to recover application " + appId
      + ". NodeLabel is not enabled in cluster, but AM resource request "
      + "contains a label expression.";
  LOG.warn(message);
  application.handle(
      new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));
  return;
}
{code}
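
To make the concern in point 2 concrete, here is a toy model. It is not YARN's real state machine: `ToyApp`, `ToyAttempt`, and both handlers are hypothetical, and the model simply assumes what the question suspects, namely that a reject-style transition finalizes only the app while a kill-style transition also finalizes the current attempt. Whether RMAppImpl really behaves this way for a recovered app would need to be checked against the code.

```java
// Toy model, not YARN code: it only illustrates the suspected difference
// between a reject-style and a kill-style transition.
final class RejectVsKillSketch {
  enum State { RUNNING, FAILED, KILLED }

  static final class ToyAttempt {
    State state = State.RUNNING;
  }

  static final class ToyApp {
    State state = State.RUNNING;
    final ToyAttempt currentAttempt = new ToyAttempt();

    // Reject-style handling (assumed): only the app itself is finalized.
    void onRejected() {
      state = State.FAILED;
    }

    // Kill-style handling (assumed): the attempt is finalized as well.
    void onKill() {
      currentAttempt.state = State.KILLED;
      state = State.KILLED;
    }
  }

  public static void main(String[] args) {
    ToyApp rejected = new ToyApp();
    rejected.onRejected();
    System.out.println("reject: app=" + rejected.state
        + " attempt=" + rejected.currentAttempt.state);

    ToyApp killed = new ToyApp();
    killed.onKill();
    System.out.println("kill: app=" + killed.state
        + " attempt=" + killed.currentAttempt.state);
  }
}
```

In this model the rejected app ends up FAILED while its attempt is still RUNNING, which is the "stuck" situation asked about above.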

3. Is it OK to let the app continue in this scenario? That would be less 
disruptive to the apps. What is the disadvantage of letting the app continue?


> Application recovery has failed when node label feature is turned off during 
> RM recovery
> 
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
>   

[jira] [Comment Edited] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery

2017-02-07 Thread Ying Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857562#comment-15857562
 ] 

Ying Zhang edited comment on YARN-6031 at 2/8/17 7:18 AM:
--

I think that is a separate question. Whether or not we backport YARN-4805, 
the test case itself can be improved to avoid running with FairScheduler :-) 


was (Author: ying zhang):
I think that is a separate question. Whether or not we backport YARN-4805, 
the test case itself should be improved to avoid running with FairScheduler :-) 

> Application recovery has failed when node label feature is turned off during 
> RM recovery
> 
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: scheduler
>Affects Versions: 2.8.0
>Reporter: Ying Zhang
>Assignee: Ying Zhang
>Priority: Minor
> Attachments: YARN-6031.001.patch, YARN-6031.002.patch, 
> YARN-6031.003.patch, YARN-6031.004.patch, YARN-6031.005.patch, 
> YARN-6031.006.patch, YARN-6031.007.patch, YARN-6031-branch-2.8.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart the RM, configure CS properly, and run some jobs;
> Disable node labels and restart the RM; the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a 
> node label expression specified while node labels had been disabled.
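
The failure described above can be reproduced in miniature. The sketch below is a simplified stand-in for the validation step named in the stack trace, not the actual SchedulerUtils code: `validate`, `InvalidLabelRequest`, and the empty-string `NO_LABEL` constant are assumptions made for illustration, and only the exception message is taken from the trace.

```java
// Simplified stand-in for the label check, not the real SchedulerUtils code.
final class LabelValidationSketch {
  // Assumed to mirror RMNodeLabelsManager.NO_LABEL (the empty label).
  static final String NO_LABEL = "";

  // Hypothetical stand-in for InvalidLabelResourceRequestException.
  static final class InvalidLabelRequest extends RuntimeException {
    InvalidLabelRequest(String message) {
      super(message);
    }
  }

  // Reject a request that names a label while node labels are disabled.
  static void validate(String labelExpression, boolean nodeLabelsEnabled) {
    boolean hasLabel = labelExpression != null
        && !labelExpression.trim().equals(NO_LABEL);
    if (!nodeLabelsEnabled && hasLabel) {
      throw new InvalidLabelRequest("Invalid resource request, node label "
          + "not enabled but request contains label expression");
    }
  }

  public static void main(String[] args) {
    // Submitted while node labels were enabled: accepted.
    validate("gpu", true);

    // The same stored request re-validated after labels were disabled.
    try {
      validate("gpu", false);
    } catch (InvalidLabelRequest e) {
      System.out.println("recovery would fail: " + e.getMessage());
    }
  }
}
```

The point of the sketch is that the stored request is unchanged between the two calls; only the cluster setting differs, which is why the error shows up at recovery time rather than at submission time.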


