[
https://issues.apache.org/jira/browse/SPARK-20054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-20054:
---------------------------------
Labels: bulk-closed (was: )
> [Mesos] Detectability for resource starvation
> ---------------------------------------------
>
> Key: SPARK-20054
> URL: https://issues.apache.org/jira/browse/SPARK-20054
> Project: Spark
> Issue Type: Improvement
> Components: Mesos, Scheduler
> Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0
> Reporter: Kamal Gurala
> Priority: Minor
> Labels: bulk-closed
>
> We currently use Mesos 1.1.0 for our Spark cluster in coarse-grained mode. We
> recently had a production issue in which our Spark frameworks accepted
> resources from the Mesos master, so executors were started and the Spark
> driver was aware of them, but the driver didn't schedule any tasks and nothing
> happened for a long time because it hadn't met the minimum registered
> resources threshold. The cluster is usually under-provisioned because not all
> the jobs need to run at the same time. The held resources were never offered
> back to the master for re-allocation, which brought the entire cluster to a
> halt until we manually intervened.
> We use DRF for Mesos and FIFO for Spark. At any point in time there could be
> 10-15 Spark frameworks running on Mesos on the under-provisioned cluster.
> The ask is for better recoverability or detectability in a scenario where
> individual Spark frameworks hold onto resources but never launch any tasks,
> or to have these frameworks release those resources after a fixed amount of
> time.
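
The detection rule requested above could be sketched roughly as follows. This is only an illustration of the idea, not part of Spark or Mesos: `FrameworkSnapshot`, `find_starved`, the field names, and the 600-second default are all hypothetical, and the sketch assumes some external monitor can already collect these per-framework numbers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FrameworkSnapshot:
    """Hypothetical view of one Spark framework as seen by a monitor."""
    name: str
    held_cpus: float      # CPUs currently held from accepted Mesos offers
    running_tasks: int    # tasks currently running on the held executors
    idle_seconds: float   # time since the framework last launched a task


def find_starved(frameworks: List[FrameworkSnapshot],
                 max_idle_seconds: float = 600.0) -> List[str]:
    """Return names of frameworks that hold resources but have run
    no tasks for longer than max_idle_seconds (an assumed threshold)."""
    return [
        f.name
        for f in frameworks
        if f.held_cpus > 0
        and f.running_tasks == 0
        and f.idle_seconds > max_idle_seconds
    ]


# Made-up sample data: only "etl-nightly" holds CPUs while idle too long.
snapshots = [
    FrameworkSnapshot("etl-nightly", held_cpus=8.0, running_tasks=0, idle_seconds=1800.0),
    FrameworkSnapshot("reporting", held_cpus=4.0, running_tasks=3, idle_seconds=5.0),
    FrameworkSnapshot("adhoc", held_cpus=0.0, running_tasks=0, idle_seconds=3600.0),
]
print(find_starved(snapshots))
```

A monitor built on such a rule could either raise an alert (detectability) or trigger the framework to give its executors back (recoverability); which of the two is appropriate is left open here.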