Maxim Gekk created SPARK-24591:
----------------------------------

             Summary: Number of cores and executors in the cluster
                 Key: SPARK-24591
                 URL: https://issues.apache.org/jira/browse/SPARK-24591
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Maxim Gekk


Need to add two new methods. The first should return the total number of CPU 
cores across all executors in the cluster. The second should return the current 
number of executors registered in the cluster. A rough sketch of the proposed 
API is shown below.
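
A minimal sketch of the proposed API surface. The method names follow the 
examples used later in this issue and are not final; whether they end up on 
SparkContext or elsewhere is open:

{code:scala}
// Proposed additions, sketched as a standalone trait; names and exact
// signatures are assumptions based on the examples in this issue.
trait ClusterInfo {
  /** Total number of CPU cores of all executors registered in the cluster. */
  def coreCount: Int

  /** Current number of executors registered in the cluster. */
  def executorsCount: Int
}
{code}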

Main motivations for adding these methods:

1. It is a best practice to manage job parallelism relative to the available 
cores, e.g., df.repartition(5 * sc.coreCount). In particular, it is an 
anti-pattern to leave a bunch of cores on a large cluster twiddling their thumbs 
and doing nothing. Users usually pass predefined constants to _repartition()_ 
and _coalesce()_, and the constant is chosen based on the current cluster size. 
If the code runs on another cluster and/or on a resized cluster, they need to 
modify the constant each time. This happens frequently when a job that normally 
runs on, say, an hour of data on a small cluster needs to run on a week of data 
on a much larger cluster. See the sketch after this item.
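
For illustration, a hypothetical before/after, assuming the proposed coreCount 
method exists on SparkContext (the factor of 5 is only a heuristic):

{code:scala}
val sc = spark.sparkContext
val df = spark.range(0, 100000000L)

// Anti-pattern: a constant tuned for one particular cluster size.
df.repartition(200)

// With the proposed method, parallelism scales with whatever cluster the job runs on.
df.repartition(5 * sc.coreCount)
{code}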

2. *spark.default.parallelism* can be used to get the total number of cores in 
the cluster, but it can be redefined by the user. The same information can be 
obtained by registering a listener, but repeating that boilerplate in every 
application is ugly; we should follow the DRY principle. A sketch of that 
workaround follows this item.
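
A minimal sketch of the listener-based workaround, using the existing 
SparkListener and ExecutorInfo developer APIs; every application that needs the 
numbers has to repeat something like this:

{code:scala}
import scala.collection.concurrent.TrieMap

import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Tracks cores per executor so both counts can be derived at any time.
class ClusterSizeListener extends SparkListener {
  private val coresByExecutor = TrieMap.empty[String, Int]

  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    coresByExecutor.put(event.executorId, event.executorInfo.totalCores)

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    coresByExecutor.remove(event.executorId)

  def coreCount: Int = coresByExecutor.values.sum
  def executorsCount: Int = coresByExecutor.size
}

// Has to be registered early (before executors come up) and repeated in every app:
// val listener = new ClusterSizeListener
// sc.addSparkListener(listener)
{code}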

3. Regarding executorsCount(): some jobs, e.g., local-node ML training, use a 
lot of parallelism. It is a common practice to distribute such jobs so that 
there is one partition per executor.
 
4. In some places users collect this information, together with other settings 
and job timings (at the application level), for analysis. For example, ML can be 
used to determine the optimal cluster size for different objectives, e.g., 
fastest throughput vs. lowest cost per unit of processing.

5. The simpler argument is that basic cluster properties should be easily 
discoverable via APIs.


