[jira] [Commented] (YARN-8248) Job hangs when a job requests a resource that its queue does not have

Haibo Chen (JIRA) Thu, 17 May 2018 13:33:26 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479664#comment-16479664
 ]


Haibo Chen commented on YARN-8248:
----------------------------------

Thanks [~snemeth] for the update! I have a few follow-up comments.

1)  We can do a few renames: 
testAppRejectedToQueueZeroCapacityOfResourceVcores() =>  
testAppRejectedToQueue*With*ZeroCapacityOfVcores();  
testAppRejectedToQueueZeroCapacityOfResourceMemory() =>  
testAppRejectedToQueue*With*ZeroCapacityOfMemory();  
testAppRejectedToQueueZeroCapacityOfResource()  => 
testAppRejectedToQueue*With*ZeroCapacityOfResource();  
testSchedulingRejectedToQueueZeroCapacityOfMemory*() => 
testSchedulingRejectedToQueue*With*ZeroCapacityOfMemory*();  
testSchedulingRejectedToQueueZeroCapacityOfVcores*() =>  
testSchedulingRejectedToQueue*With*ZeroCapacityOfVcores*();  
testSchedulingRejectedToQueueZeroCapacityOfResource() => 
testSchedulingRejectedToQueue*With*ZeroCapacityOfResource()

2) How is  testSchedulingRejectedToQueueZeroCapacityOfMemory1() different from  
testSchedulingRejectedToQueueZeroCapacityOfMemory2()? They are calling the same 
function with the same parameters, hence identical as far as I can see. Am I 
missing something? If they are indeed identical, we just need to keep one. 
Similarly for  testSchedulingRejectedToQueueZeroCapacityOfVcores1 and  
testSchedulingRejectedToQueueZeroCapacityOfVcores2.

3) We are expecting SchedulerInvalidResoureRequestException in  
testSchedulingRejectedToQueueZeroCapacityOfResource, but the test currently 
does not fail if no exception is thrown. We need to add Assert.fail() after 
createSchedulingRequest().
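
To illustrate the pattern in 3), here is a minimal, self-contained sketch. It does not use JUnit or the real scheduler classes: InvalidRequestException and this createSchedulingRequest(int) are hypothetical stand-ins, and the thrown AssertionError plays the role of Assert.fail().

```java
public class ExpectedExceptionSketch {
    // Hypothetical stand-in for the scheduler exception named in the comment.
    static class InvalidRequestException extends RuntimeException {
        InvalidRequestException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for createSchedulingRequest(): rejects a request
    // against a queue that has zero vcores.
    static void createSchedulingRequest(int queueVcores) {
        if (queueVcores == 0) {
            throw new InvalidRequestException("queue has 0 vcores");
        }
    }

    // Test body: passes only if the expected exception is thrown.
    static void testRejected(int queueVcores) {
        try {
            createSchedulingRequest(queueVcores);
            // Reached only when nothing was thrown. In a JUnit test this is
            // where Assert.fail() would go.
            throw new AssertionError("expected InvalidRequestException");
        } catch (InvalidRequestException e) {
            // Expected: the request was rejected.
        }
    }
}
```

Without the explicit failure after the call, the test body would finish normally when no exception is thrown, and the test would pass for the wrong reason.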

4) FairScheduler.allocate() currently tries to tolerate invalid resource 
requests: it only throws an exception right before returning the Allocation 
instance. On reflection, this is problematic, because execution stops when the 
exception is thrown and the AM never receives that allocation. All the tokens 
and promoted/demoted containers in that allocation are lost: the scheduler 
clears them right away, but the AM does not get the information. When the AM 
tries again, it won't see them because they were already cleared. I think it's 
fine to fail fast by throwing the exception right after 
validateAndFilterAsks() is called, so that the AM can retry safely without 
losing information.
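
The difference described in 4) can be sketched with a toy model. This is not the FairScheduler code: the class, its pendingTokens list, and both allocate variants are hypothetical, and IllegalArgumentException stands in for the scheduler exception.

```java
import java.util.ArrayList;
import java.util.List;

public class FailFastSketch {
    // Hypothetical scheduler-side state that is handed out once and then cleared.
    private final List<String> pendingTokens = new ArrayList<>();

    FailFastSketch(List<String> tokens) {
        pendingTokens.addAll(tokens);
    }

    // Hypothetical validation step; throws for an invalid ask.
    static void validate(boolean askIsValid) {
        if (!askIsValid) {
            throw new IllegalArgumentException("invalid resource request");
        }
    }

    // Late validation: state is drained first, then the exception is thrown,
    // so the drained tokens never reach the caller.
    List<String> allocateLate(boolean askIsValid) {
        List<String> result = new ArrayList<>(pendingTokens);
        pendingTokens.clear();
        validate(askIsValid);
        return result;
    }

    // Fail fast: validate before touching any state, so a failed call leaves
    // the tokens in place for the retry.
    List<String> allocateFailFast(boolean askIsValid) {
        validate(askIsValid);
        List<String> result = new ArrayList<>(pendingTokens);
        pendingTokens.clear();
        return result;
    }
}
```

In this model, a retry after a failed allocateLate() returns an empty list, while a retry after a failed allocateFailFast() still returns the tokens.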

5) The  validateAndFilterAsks() method is a bit too complicated. I think it 
suffices to do something along the lines of
{code:java}
// Make sure this application exists
FSAppAttempt application = getSchedulerApp(appAttemptId);
...

final Resource queueMaxShare = queue.getMaxShare();

// Reject any ask that does not fit in a queue whose max share is zero
// for some major resource.
for (ResourceRequest req : ask) {
  if (Resources.isAnyMajorResourceZero(DOMINANT_RESOURCE_CALCULATOR,
          queueMaxShare)
      && !Resources.fitsIn(req.getCapability(), queueMaxShare)) {
    throw new SchedulerInvalidResoureRequestException(String.format(
        "Resource request %s of application %s is invalid because it asks "
            + "for a resource that the queue %s does not have",
        req, appId, queue.getName()));
  }
}

...
handleContainerUpdates(application, updateRequests);
{code}
I believe FairScheduler.addApplication() performs the same if check, so we can 
probably extract it into a function and call it from both places. 
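
A rough sketch of what such a shared helper could look like, using simplified stand-ins rather than the real YARN classes: Res, isAnyResourceZero(), fitsIn(), and isInvalidFor() are all hypothetical names that only mirror the shape of the if check in 5).

```java
public class ZeroCapacityCheckSketch {
    // Hypothetical simplified resource: memory in MB and vcores only.
    static final class Res {
        final long memoryMb;
        final int vcores;

        Res(long memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    // True if the resource has zero memory or zero vcores.
    static boolean isAnyResourceZero(Res r) {
        return r.memoryMb == 0 || r.vcores == 0;
    }

    // True if every component of 'smaller' fits within 'bigger'.
    static boolean fitsIn(Res smaller, Res bigger) {
        return smaller.memoryMb <= bigger.memoryMb
            && smaller.vcores <= bigger.vcores;
    }

    // The extracted check: the queue has zero of some resource and the
    // request does not fit into the queue's max share. Both the submission
    // path and the allocate path could call this one method.
    static boolean isInvalidFor(Res request, Res queueMaxShare) {
        return isAnyResourceZero(queueMaxShare)
            && !fitsIn(request, queueMaxShare);
    }
}
```

With the values from the issue description, a request of <memory:1536, vCores:1> against a max share of 90000 mb and 0 vcores is flagged as invalid, while the same request against a queue with a non-zero vcore share is not.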

> Job hangs when a job requests a resource that its queue does not have
> ---------------------------------------------------------------------
>
>                 Key: YARN-8248
>                 URL: https://issues.apache.org/jira/browse/YARN-8248
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: YARN-8248-001.patch, YARN-8248-002.patch, 
> YARN-8248-003.patch, YARN-8248-004.patch, YARN-8248-005.patch, 
> YARN-8248-006.patch, YARN-8248-007.patch
>
>
> Job hangs when mapreduce.job.queuename is specified and the queue has 0 of 
> any resource (vcores / memory / other)
> In this scenario, the job should be immediately rejected upon submission 
> since the specified queue cannot serve the resource needs of the submitted 
> job.
>  
> Command to run:
> {code:java}
> bin/yarn jar 
> "./share/hadoop/mapreduce/hadoop-mapreduce-examples-$MY_HADOOP_VERSION.jar" 
> pi -Dmapreduce.job.queuename=sample_queue 1 1000;{code}
> fair-scheduler.xml queue config (excerpt):
>  
> {code:java}
>  <queue name="sample_queue">
>     <minResources>10000 mb,0vcores</minResources>
>     <maxResources>90000 mb,0vcores</maxResources>
>     <maxRunningApps>50</maxRunningApps>
>     <maxAMShare>-1.0f</maxAMShare>
>     <weight>2.0</weight>
>     <schedulingPolicy>fair</schedulingPolicy>
>   </queue>
> {code}
> Diagnostic message from the web UI: 
> {code:java}
> [Wed May 02 06:35:57 -0700 2018] Application is added to the scheduler and is 
> not yet activated. (Resource request: <memory:1536, vCores:1> exceeds current 
> queue or its parents maximum resource allowed).{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
