Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9147#discussion_r42271853
  
    --- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
    @@ -62,10 +62,23 @@ private[spark] class ApplicationMaster(
         .asInstanceOf[YarnConfiguration]
       private val isClusterMode = args.userClass != null
     
    -  // Default to numExecutors * 2, with minimum of 3
    -  private val maxNumExecutorFailures = sparkConf.getInt("spark.yarn.max.executor.failures",
    -    sparkConf.getInt("spark.yarn.max.worker.failures",
    -      math.max(sparkConf.getInt("spark.executor.instances", 0) *  2, 3)))
    +  // Default to numExecutors * 2 (maxExecutors in the case that we are
    +  // dynamically allocating executors), with minimum of 3.
    +  private val maxNumExecutorFailures =
    +    sparkConf.getInt("spark.yarn.max.executor.failures",
    +      sparkConf.getInt("spark.yarn.max.worker.failures",
    +        math.max(
    +          3,
    +          2 * sparkConf.getInt(
    +            if (Utils.isDynamicAllocationEnabled(sparkConf))
    +              "spark.dynamicAllocation.maxExecutors"
    --- End diff --
    
    To be clear, this change does not require users to set `maxExecutors` in 
order to get sane default behavior under dynamic allocation (DA).
    
    It merely removes one class of "gotcha" that caused me some trouble this 
week: with only the standard DA params set, `maxNumExecutorFailures` here 
defaults to `3`, which does not seem sensible for apps that scale up to many 
hundreds of executors.
    
    It seems to me that the extant 
`math.max(sparkConf.getInt("spark.executor.instances", 0) *  2, 3)` expression 
is not _intentionally_ making DA apps have a limit of `3` failures, but that it 
simply wasn't taking into account the fact that `spark.executor.instances` is 
not set in DA mode.
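    
    To make the arithmetic concrete, here is a minimal sketch of the old and proposed defaults. It is not the actual `ApplicationMaster` code: a plain `Map` stands in for `SparkConf`, `dynamicAllocationEnabled` is a simplified stand-in for `Utils.isDynamicAllocationEnabled`, the explicit `spark.yarn.max.*.failures` overrides are left out, and the non-DA branch and the fallback of `0` are my assumptions, since the diff above is cut off before that point.

```scala
// Illustrative sketch only: a plain Map stands in for SparkConf.
object MaxFailuresSketch {
  type Conf = Map[String, String]

  // Simplified stand-in for SparkConf.getInt(key, default).
  private def getInt(conf: Conf, key: String, default: Int): Int =
    conf.get(key).map(_.toInt).getOrElse(default)

  // Hypothetical, simplified stand-in for Utils.isDynamicAllocationEnabled.
  private def dynamicAllocationEnabled(conf: Conf): Boolean =
    conf.get("spark.dynamicAllocation.enabled").exists(_.toBoolean)

  // Extant default: only consults spark.executor.instances.
  def oldDefault(conf: Conf): Int =
    math.max(getInt(conf, "spark.executor.instances", 0) * 2, 3)

  // Proposed default: consults maxExecutors when DA is enabled.
  // The else branch and the fallback of 0 are assumptions (diff is truncated).
  def newDefault(conf: Conf): Int = {
    val key =
      if (dynamicAllocationEnabled(conf)) "spark.dynamicAllocation.maxExecutors"
      else "spark.executor.instances"
    math.max(3, 2 * getInt(conf, key, 0))
  }

  def main(args: Array[String]): Unit = {
    // A DA app that never sets spark.executor.instances.
    val daConf = Map(
      "spark.dynamicAllocation.enabled" -> "true",
      "spark.dynamicAllocation.maxExecutors" -> "500")
    println(oldDefault(daConf)) // 3
    println(newDefault(daConf)) // 1000
  }
}
```

    In this sketch, a DA app that leaves `maxExecutors` unset still gets the floor of `3`, the same as today, which is the sense in which the change adds no new requirement.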
    
    It's true that we could also "resolve" this by declaring 
`spark.yarn.max.worker.failures` to be yet another configuration param that 
must be set to a non-default value in order to get sane DA behavior.
    
    Off the top of my head, there is already one param 
(`spark.shuffle.service.enabled=true`) whose name does not suggest that DA 
apps need to set it, and we could make `spark.yarn.max.worker.failures` a 
second.
    
    I think it would be better not to require yet another parameter for sane 
DA behavior, especially one whose name gives no hint that it matters for 
keeping DA apps from failing in unexpected ways. Instead, we should just fix 
the default value here, which clearly missed the DA case by accident.


