tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1812838085

   > > Preemption on yarn shouldn't be going against the number of failed 
executors. If it is then something has changed and we should fix that.
   
   > Yes, you are right
   
   What do you mean by this? Are you saying the Spark on YARN handling of 
preempted containers is not working properly? If a container is preempted, it 
should not show up as an executor failure. Are you seeing those preempted 
containers show up as failed?
    Or are you confirming that, yes, Spark on YARN doesn't mark preempted 
containers as failed?
   
   > What does 'this feature' point to?
   
   Sorry, I misunderstood your environment here; I thought you were running on 
k8s, but it looks like you are running on YARN. By "this feature" I mean the 
spark.yarn.max.executor.failures/spark.executor.maxNumFailures config and its 
functionality.
   
   So unless YARN preemption handling is broken (please answer the question 
above), you gave one very specific use case where a user added a bad JAR. In 
that case it seems like you just don't want spark.executor.maxNumFailures 
enabled at all: you said you don't want the app to fail, so admins can come 
fix things up without it affecting other users. If that is the case, then 
Spark should allow users to turn spark.executor.maxNumFailures off entirely, 
or I assume you could do the same thing by setting it to Int.MaxValue.
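   To be concrete, here's a hedged sketch (PySpark) of the Int.MaxValue approach. The config key follows this discussion (`spark.executor.maxNumFailures`; older releases use `spark.yarn.max.executor.failures`), so check the key for your Spark version:

   ```python
   # Sketch: effectively disabling the executor-failure limit by setting it to
   # the JVM's Int.MaxValue, as suggested above. App name is illustrative.
   from pyspark.sql import SparkSession

   JVM_INT_MAX = 2**31 - 1  # Int.MaxValue on the JVM

   spark = (
       SparkSession.builder
       .appName("long-lived-shared-app")
       .config("spark.executor.maxNumFailures", str(JVM_INT_MAX))
       .getOrCreate()
   )
   ```

   The same thing can be passed as `--conf` to spark-submit instead of being set in code.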
   
   As implemented this seems very arbitrary, and I would think it hard for a 
normal user to set and use this feature. You have it as a ratio, which I'm 
normally in favor of, but a ratio only works if max executors is set, at which 
point it is really just a hardcoded number. That number seems arbitrary, since 
it just depends on whether you get lucky and happen to have that many 
executors before some user pushes a bad JAR. I don't understand why this isn't 
tied to the minimum number of executors, as that seems more in line with the 
intent: saying you need some minimum number of executors for this application 
to run, and by the way it's OK to keep running even if launching new executors 
is failing.
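   To illustrate the point about the ratio (names here are illustrative, not the PR's actual code): the ratio only resolves to a concrete threshold when a max-executor bound exists, at which point it's equivalent to hardcoding that number.

   ```python
   # Sketch: a ratio-based failure limit is only well-defined relative to an
   # upper bound on executors (e.g. spark.dynamicAllocation.maxExecutors).
   def effective_failure_threshold(failure_ratio, max_executors):
       """Translate a failure ratio into the fixed executor count it implies."""
       if max_executors is None:
           # With no upper bound there is no total to take a ratio of,
           # so the setting has no stable meaning.
           raise ValueError("a ratio-based threshold requires max executors to be set")
       return int(failure_ratio * max_executors)

   # With maxExecutors=200, a 0.5 ratio is just a hardcoded limit of 100.
   print(effective_failure_threshold(0.5, 200))  # → 100
   ```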
   
   If there are other issues with Spark Connect and adding JARs, maybe that is 
a different conversation about isolation 
(https://issues.apache.org/jira/browse/SPARK-44146). Or maybe it needs to do a 
better job of preventing users from adding JARs with the same name.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
