[
https://issues.apache.org/jira/browse/SPARK-18689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-18689:
---------------------------------
Labels: bulk-closed (was: )
> Support prioritized apps utilizing linux cgroups
> ------------------------------------------------
>
> Key: SPARK-18689
> URL: https://issues.apache.org/jira/browse/SPARK-18689
> Project: Spark
> Issue Type: New Feature
> Reporter: Travis Hegner
> Priority: Major
> Labels: bulk-closed
>
> I'd like to propose a new mode of operation for the app scheduling system in
> a standalone cluster. This idea would eliminate static allocation of cpu
> cores between apps, and would instead divvy up cpu resources based on the
> Linux cgroups {{cpu.shares}} attribute. This could later be expanded to expose
> more of cgroups' options (memory, cpuset, etc...) to each application running
> on a standalone cluster.
> Preliminary examination shows that this could be relatively simple to
> implement, and would open up a world of flexibility for how cpu time is
> allocated and utilized throughout a cluster.
> What I'm thinking is that when a cluster is operating in "priority" mode, a
> newly submitted application will be given a unique executor on each worker,
> optionally limited by {{spark.executor.instances}}. The executor would *not*
> be *statically* allocated any number of cores, but would have the ability to
> run the number of tasks equal to the number of cores on that host (obeying
> {{spark.task.cpus}}, of course). We could also use {{spark.executor.cores}} to
> limit the number of tasks that the executors can run simultaneously, if
> desired.
> When each worker gets a request to create its executor, it first creates a
> cgroup with the unique app id (shell examples to show how simple utilizing
> cgroups can be):
> {code}
> mkdir /sys/fs/cgroup/cpu/spark-worker/$appId
> {code}
> Then moves the executor into that cgroup:
> {code}
> echo $pid > /sys/fs/cgroup/cpu/spark-worker/$appId/tasks
> {code}
> And, finally, sets {{cpu.shares}}
> {code}
> echo $cpushares > /sys/fs/cgroup/cpu/spark-worker/$appId/cpu.shares
> {code}
> to the amount specified by an application config,
> {{spark.priority.cpushares}} or something similar. Running alone, this app
> would consume 100% of the cpu time allocated to the spark-worker.
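> The three shell steps above could be sketched on the worker side in Java. This is only a sketch: {{assignToCgroup}}, the class name, and the parameterized cgroup root are all hypothetical names; on a real worker the root would be {{/sys/fs/cgroup/cpu/spark-worker}}.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CgroupSetup {
    // Mirrors the shell steps above: create the per-app cgroup directory,
    // move the executor's pid into it, then set its cpu.shares weight.
    // cgroupRoot is a parameter so the sketch can be exercised anywhere;
    // the worker would pass /sys/fs/cgroup/cpu/spark-worker.
    public static void assignToCgroup(Path cgroupRoot, String appId,
                                      long pid, int cpuShares) throws IOException {
        Path appGroup = cgroupRoot.resolve(appId);
        Files.createDirectories(appGroup);                  // mkdir .../$appId
        Files.write(appGroup.resolve("tasks"),              // echo $pid > .../tasks
                    (pid + "\n").getBytes(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        Files.write(appGroup.resolve("cpu.shares"),         // echo $cpushares > .../cpu.shares
                    (cpuShares + "\n").getBytes());
    }
}
```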
> A second app could then come along and also be allocated an executor on every
> worker (assuming enough memory), but with double the {{cpu.shares}} weight.
> Both apps would then run side by side, with the higher-priority app receiving
> roughly 66% of the total CPU time that has been allocated to the
> spark-worker, and the first app getting the rest.
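> As a quick sanity check on those proportions ({{ShareMath}} and the share values 1024 and 2048 are hypothetical; only the shares-are-proportional rule comes from cgroups itself):

```java
public class ShareMath {
    // cpu.shares allocates CPU time proportionally: an app's fraction is
    // its shares divided by the sum of all shares in the parent cgroup.
    static double cpuFraction(int appShares, int totalShares) {
        return (double) appShares / totalShares;
    }

    public static void main(String[] args) {
        int firstApp = 1024, secondApp = 2048;   // hypothetical share values
        int total = firstApp + secondApp;
        // higher-priority app: 2048/3072, about two thirds of worker CPU time
        System.out.printf("second app: %.1f%%%n", 100 * cpuFraction(secondApp, total));
        // first app gets the remainder: 1024/3072, about one third
        System.out.printf("first app: %.1f%%%n", 100 * cpuFraction(firstApp, total));
    }
}
```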
> This approach would allow long-running, cpu-intensive tasks to consume all
> available cpu until a higher-priority app comes along, at which point the
> kernel gracefully yields to that app whatever it needs, without any
> scheduling or preemption logic needing to be written.
> Naturally, there would have to be some configuration options to set sane
> defaults and limits for the number of shares that users can request for an
> app.
> One minor drawback is that currently the {{java.lang.Process}} object used to
> launch the executor does not have a public method to access its {{pid}},
> which is required to put that executor into a cgroup. Finding the pid of the
> executor currently requires the reflection API, but takes only a handful
> of lines, from what I've found. (Java 9 has a new {{java.lang.ProcessHandle}}
> interface, which does provide a way to get the {{pid}} of the process it
> represents, but I realize that is a long way out.)
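> For illustration, a hedged sketch of that reflection trick: {{PidFinder}} and the -1 fallback are my own naming, and the private {{pid}} field is a JDK-internal detail of the Unix process classes, so the lookup can fail on other platforms or JVMs.

```java
import java.lang.reflect.Field;

public class PidFinder {
    // Best-effort pid lookup via reflection: on Unix-like systems, Java 8's
    // java.lang.UNIXProcess (and later ProcessImpl) keeps the pid in a
    // private "pid" field. Returns -1 if the field is missing or the JVM
    // forbids the reflective access.
    public static long getPid(Process process) {
        try {
            Field pidField = process.getClass().getDeclaredField("pid");
            pidField.setAccessible(true);
            return pidField.getInt(process);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1;   // unknown platform, or reflection denied
        }
    }
}
```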
> The other approaches for utilizing cgroups that I've studied, for Mesos and
> YARN, abstract {{cpu.shares}} away into some multiple of the number of
> requested "cores". I understand the convenience of that approach, but it
> still statically allocates cores, and removes much of the flexibility that
> cgroups provide.
> In my opinion, running in this mode, as opposed to FIFO, gives users a more
> intuitive experience out of the box when attempting to run multiple
> simultaneous apps, without even having to think about resource
> allocation. It also allows for an even higher hardware utilization rate than
> Spark already provides. The {{cpu.shares}} are relative only to what an admin
> has allocated to the Spark worker; beyond that, the weights compose
> naturally with the priorities set on every other process running on the
> machine. This allows finer-tuned yet simpler control over resource
> allocation when the Spark cluster shares the hardware with other software.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]