Ngone51 commented on pull request #29332:
URL: https://github.com/apache/spark/pull/29332#issuecomment-669981103


   I think that's the problem. Consider the case where a barrier stage 
requires 2 CPUs and 2 GPUs, but the cluster only has 2 CPUs and 1 GPU. In 3.0, 
since the limiting resource is cores/CPUs, the barrier stage would have a 
chance to launch tasks. However, it could only launch one task, because the 
real limiting resource is the GPU. In this case, the barrier stage fails due 
to a partial task launch, but the error message is quite confusing for users: 
it suggests disabling delay scheduling, while the real cause is insufficient 
(custom) resources. If we backport this fix to 3.0, the barrier stage would 
fail early, before it is able to launch any tasks.
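   To make the scenario concrete, here is a minimal sketch (not Spark's actual
scheduler code; the function and resource names are hypothetical) of the
limiting-resource check described above. The number of launchable slots is the
minimum over all resources of available amount divided by the per-task
requirement, and a barrier stage must fail fast if that number is below its
task count:

```python
# Hypothetical illustration of the limiting-resource calculation; this is
# not the actual Spark implementation, just the arithmetic from the example.
def max_launchable_tasks(available, per_task):
    # The tightest resource determines how many tasks can run concurrently.
    return min(available[r] // per_task[r] for r in per_task)

# Cluster from the example: 2 CPUs, 1 GPU; each barrier task needs 1 of each.
available = {"cpu": 2, "gpu": 1}
per_task = {"cpu": 1, "gpu": 1}
barrier_tasks = 2

slots = max_launchable_tasks(available, per_task)
# slots == 1 here: the GPU, not the CPU, is the real limiting resource,
# so the 2-task barrier stage cannot launch all tasks and should fail early.
can_launch_barrier = slots >= barrier_tasks
```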


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org