[
https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782566#comment-15782566
]
Ying Zhang commented on YARN-6031:
----------------------------------
Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the
InvalidResourceRequestException during recovery, fail the recovery only for
the affected application and print an error message, then let the rest of the
recovery continue.
[~sunilg], your suggestion also makes sense to me. With your approach, the code
change would go in the same place as in this patch, with a small modification:
in recover(), inside the for loop, if the conditions are met, skip calling
"recoverApplication" and log a message like "skip recover application ..."
instead. The difference is that your approach always checks for these
conditions, even though this is not the normal case, while the approach in the
patch only reacts when the exception happens. I'm OK with either approach since
the overhead is small.
Let's see what others think :-) [~leftnoteasy], [~bibinchundatt]
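To make the comparison concrete, here is a minimal, self-contained sketch of the two approaches. It is not the code from the patch: RecoverySketch, StoredApp, InvalidRequestException and the boolean nodeLabelsEnabled flag are hypothetical stand-ins for the RM classes, and the "conditions" are reduced to "node labels disabled and a label expression present".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model, not Hadoop code: all names here are illustrative.
final class RecoverySketch {
    /** Stand-in for InvalidResourceRequestException. */
    static class InvalidRequestException extends Exception {
        InvalidRequestException(String msg) { super(msg); }
    }

    /** Stand-in for an application loaded from the state store. */
    static class StoredApp {
        final String id;
        final String labelExpression; // null when none was specified
        StoredApp(String id, String labelExpression) {
            this.id = id;
            this.labelExpression = labelExpression;
        }
    }

    /** Stand-in for recoverApplication(): rejects a label when labels are off. */
    static void recoverApplication(StoredApp app, boolean nodeLabelsEnabled)
            throws InvalidRequestException {
        if (!nodeLabelsEnabled && app.labelExpression != null) {
            throw new InvalidRequestException(
                "node label not enabled but request contains label expression");
        }
    }

    /** Patch-style approach: react only when the exception happens. */
    static List<String> recoverCatching(List<StoredApp> apps, boolean nodeLabelsEnabled) {
        List<String> recovered = new ArrayList<>();
        for (StoredApp app : apps) {
            try {
                recoverApplication(app, nodeLabelsEnabled);
                recovered.add(app.id);
            } catch (InvalidRequestException e) {
                // Fail only this application; the loop moves on to the next one.
                System.err.println("Failed to recover application " + app.id
                    + ": " + e.getMessage());
            }
        }
        return recovered;
    }

    /** Check-first approach: test the conditions on every iteration. */
    static List<String> recoverChecking(List<StoredApp> apps, boolean nodeLabelsEnabled)
            throws InvalidRequestException {
        List<String> recovered = new ArrayList<>();
        for (StoredApp app : apps) {
            if (!nodeLabelsEnabled && app.labelExpression != null) {
                System.err.println("skip recover application " + app.id);
                continue;
            }
            recoverApplication(app, nodeLabelsEnabled);
            recovered.add(app.id);
        }
        return recovered;
    }

    public static void main(String[] args) throws InvalidRequestException {
        List<StoredApp> apps = Arrays.asList(
            new StoredApp("app_1", null), new StoredApp("app_2", "gpu"));
        System.out.println(recoverCatching(apps, false));
        System.out.println(recoverChecking(apps, false));
    }
}
```

In this model both variants recover the same set of applications; the only difference is whether the condition is evaluated on every iteration or only when the exception surfaces.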
Just to clarify the current behavior (with or without this fix): an
application submitted with a node label expression explicitly specified will
fail during recovery, while an application submitted without one will succeed,
regardless of whether the target queue has a default node label expression.
This is because, in the following code snippet, the call to
"checkQueueLabelInLabelManager", which checks whether the node label exists in
the node label manager (the node label manager has no labels at all when node
labels are disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
      Resource maximumResource, String queueName, YarnScheduler scheduler,
      boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
      throws InvalidResourceRequestException {
    ... ...
    SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
    if (!isRecovery) {
      validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);
      // calling checkQueueLabelInLabelManager
    }
{code}
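The behavior described above can be modeled with a small, self-contained sketch. This is a simplified assumption about the control flow, not the actual SchedulerUtils code: ValidationSketch, its parameters, and the use of a plain Set for the label manager are all hypothetical, and the elided part of the real method is represented only by the "labels disabled but expression present" check reported in the stack trace of this issue.

```java
import java.util.Set;

// Hypothetical model, not Hadoop code: all names here are illustrative.
final class ValidationSketch {
    /** Stand-in for InvalidResourceRequestException. */
    static class InvalidRequestException extends Exception {
        InvalidRequestException(String msg) { super(msg); }
    }

    /**
     * Returns the effective label expression for a request, or throws.
     * requestedLabel and queueDefaultLabel are null when not set.
     */
    static String normalizeAndValidate(String requestedLabel, String queueDefaultLabel,
            boolean nodeLabelsEnabled, Set<String> knownLabels, boolean isRecovery)
            throws InvalidRequestException {
        // Assumed early check: an explicit expression is rejected when node
        // labels are disabled, whether or not this is a recovery.
        if (!nodeLabelsEnabled && requestedLabel != null) {
            throw new InvalidRequestException(
                "node label not enabled but request contains label expression");
        }
        // Normalization: fall back to the queue's default expression.
        String effective = (requestedLabel != null) ? requestedLabel : queueDefaultLabel;
        // The label-manager lookup only runs outside recovery.
        if (!isRecovery && effective != null && !knownLabels.contains(effective)) {
            throw new InvalidRequestException("label " + effective + " does not exist");
        }
        return effective;
    }
}
```

Under these assumptions, a recovered request without an explicit expression passes even when the queue default names a label the (empty) label manager does not know, because the lookup is skipped; the same request would be rejected on a fresh submission.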
> Application recovery failed after disabling node label
> ------------------------------------------------------
>
> Key: YARN-6031
> URL: https://issues.apache.org/jira/browse/YARN-6031
> Project: Hadoop YARN
> Issue Type: Bug
> Components: scheduler
> Affects Versions: 2.8.0
> Reporter: Ying Zhang
> Assignee: Ying Zhang
> Priority: Minor
> Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart the RM, configure CS properly, and run some jobs;
> Disable node labels, restart the RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
> at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a
> node label expression specified while node labels had been disabled.