Github user markhamstra commented on the issue:

    https://github.com/apache/spark/pull/21589
  
    Thank you, @HyukjinKwon 
    
    There are a significant number of Spark users who use the Job Scheduler 
model with a SparkContext shared across many users and many Jobs. Promoting 
tools and patterns based upon the number of cores or executors that a 
SparkContext has access to, and thereby encouraging users to create Jobs that 
try to use all of the available cores, leads those users very much in the 
wrong direction.
    
    As much as possible, the public API should target policy that addresses 
real user problems (all users, not just a subset), and avoid targeting the 
particulars of Spark's internal implementation. A `repartition` that is 
extended to support policy or goal declarations (things along the lines of 
`repartition(availableCores)`, `repartition(availableDataNodes)`, 
`repartition(availableExecutors)`, `repartition(unreservedCores)`, etc.), 
relying upon Spark's internals (with their complete knowledge of the total 
number of cores and executors, scheduling pool shares, the number of reserved 
Task nodes sought in barrier scheduling, the number of active Jobs, Stages, 
Tasks, and Sessions, etc.) may be something that I can get behind. Exposing a 
couple of current Spark scheduler implementation details in the expectation 
that some subset of users in some subset of use cases will be able to make 
correct use of them is not. 
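    
    As a rough illustration of what such goal declarations might look like 
from user code: the names `RepartitionGoal`, `AvailableCores`, and 
`PolicyRepartition` below are hypothetical, and the resolution logic only 
approximates, with information visible through the public `SparkContext`, what 
a scheduler-aware implementation inside Spark could actually do.
    
    ```scala
    import org.apache.spark.sql.{DataFrame, SparkSession}
    
    // Hypothetical goal declarations a user could pass instead of a raw
    // partition count.
    sealed trait RepartitionGoal
    case object AvailableCores extends RepartitionGoal
    case object AvailableExecutors extends RepartitionGoal
    case object UnreservedCores extends RepartitionGoal
    
    object PolicyRepartition {
      // Resolve a goal into a concrete partition count. A real, scheduler-aware
      // implementation would also consult pool shares, barrier-scheduling
      // reservations, and the set of active Jobs/Stages/Tasks; this sketch only
      // uses what the public SparkContext exposes, as a stand-in.
      def resolve(spark: SparkSession, goal: RepartitionGoal): Int = goal match {
        case AvailableCores =>
          spark.sparkContext.defaultParallelism
        case AvailableExecutors =>
          // getExecutorInfos typically includes the driver entry, hence the - 1.
          math.max(1, spark.sparkContext.statusTracker.getExecutorInfos.length - 1)
        case UnreservedCores =>
          // Placeholder: falls back to default parallelism rather than
          // subtracting cores reserved by other pools or active Jobs.
          spark.sparkContext.defaultParallelism
      }
    
      def repartitionBy(df: DataFrame, goal: RepartitionGoal): DataFrame =
        df.repartition(resolve(df.sparkSession, goal))
    }
    
    // Usage: a Job declares "use the available cores" without hard-coding a
    // number or reading scheduler internals itself.
    // val reshaped = PolicyRepartition.repartitionBy(df, AvailableCores)
    ```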

