[ https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17397102#comment-17397102 ]
Apache Spark commented on SPARK-25888:
--------------------------------------

User 'AngersZhuuuu' has created a pull request for this issue:
https://github.com/apache/spark/pull/33700

> Service requests for persist() blocks via external service after dynamic deallocation
> --------------------------------------------------------------------------------------
>
>                  Key: SPARK-25888
>                  URL: https://issues.apache.org/jira/browse/SPARK-25888
>              Project: Spark
>           Issue Type: New Feature
>           Components: Block Manager, Shuffle, Spark Core, YARN
>     Affects Versions: 2.3.2
>             Reporter: Adam Kennedy
>             Priority: Major
>               Labels: bulk-closed
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job execution often show terrible utilization rates (we have observed rates as low as 3-7% CPU at maximum container allocation, and even a well-policed cluster commonly sits at only 50% CPU).
>
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 users and 50,000 runs of 1,000 distinct applications per week, predominantly Spark, with a mixture of ETL, ad hoc tasks and PySpark notebook jobs (no streaming).
>
> The utilization problems appear to be due in large part to persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation (a minimal sketch of this behavior follows the quoted description below).
>
> Where an external shuffle service is present (which is typical on clusters of this type), we already solve this for the shuffle block case by offloading the IO handling of shuffle blocks to the external service, allowing dynamic deallocation to proceed.
>
> Allowing executors to transfer persist() blocks to some external "shuffle" service in a similar manner would be an enormous win for Spark multi-tenancy, as it would leave MEMORY-only cache() as the only scenario that blocks deallocation.
>
> I'm not sure I'm remembering correctly, but I seem to recall from the original external shuffle service commits that this may have been considered at the time, and that getting shuffle blocks moved to the external shuffle service was simply the first priority.
>
> With support for external persist() DISK blocks in place, we could then also handle deallocation of DISK+MEMORY blocks: the memory replica could be dropped first, degrading the block to DISK-only, and the block then transferred to the shuffle service (see the second sketch below).
>
> We have tried to resolve the persist() issue via extensive user training, but that has typically only brought the worst offenders (10% utilization) up to around 40-60% utilization, as the need for persist() is often legitimate and occurs during the middle stages of a job.
>
> In a healthy multi-tenant scenario, a large job might spool up to, say, 10,000 cores, persist() data, release executors across a long tail down to 100 cores, and then spool back up to 10,000 cores for the following stage without any impact on the persist() data.
>
> In an ideal world, if a new executor started up on a node to which blocks had been transferred, the new executor might even be able to "recapture" control of those blocks (if that would help with performance in some way).
>
> And this behavior of gradually expanding and contracting several times over the course of a job would not just improve utilization; it would also allow resources to be redistributed more easily to other jobs that start on the cluster during the long-tail periods, improving multi-tenancy and bringing us closer to optimal "envy-free" YARN scheduling.
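To make the deallocation-blocking behavior concrete, here is a minimal Scala sketch; it is mine, not from the ticket or the linked PR, and the app name, data sizes, and timeout value are arbitrary. It shows a job persisting at a DISK-backed level under dynamic allocation: shuffle blocks are served externally, but the persisted blocks pin their executors. The spark.dynamicAllocation.cachedExecutorIdleTimeout setting is the existing (blunt) knob here; by default executors holding cached blocks are held indefinitely, and lowering it releases them only at the cost of losing the cache.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object PersistHoldsExecutors {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("persist-deallocation-demo") // hypothetical app name
          // Standard dynamic allocation setup: shuffle blocks are served by
          // the external shuffle service, so executors can be released once
          // their shuffle output is no longer executor-local.
          .config("spark.dynamicAllocation.enabled", "true")
          .config("spark.shuffle.service.enabled", "true")
          // persist() blocks are NOT externalized: executors holding cached
          // blocks are held forever by default. This timeout is the current
          // workaround; releasing the executor discards its cached blocks.
          .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
          .getOrCreate()

        val sc = spark.sparkContext

        // DISK_ONLY (or MEMORY_AND_DISK) blocks live in executor-local
        // storage; an executor holding them cannot be deallocated without
        // losing them, which is what blocks the scale-down described above.
        val cached = sc.parallelize(1L to 10000000L)
          .map(i => (i % 100, i))
          .persist(StorageLevel.DISK_ONLY)

        println(cached.count())      // stage 1: materialize the cache
        // ... long tail: only executors holding blocks stay allocated ...
        println(cached.values.sum()) // later stage: reuse the cache

        spark.stop()
      }
    }

With blocks handed off to an external service as proposed, the long tail between the two stages above could shrink to the dynamic-allocation minimum instead of being pinned by the cached partitions.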
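And a second sketch for the DISK+MEMORY point, again mine rather than from the ticket: in StorageLevel terms, dropping the memory replica of a MEMORY_AND_DISK block leaves a block equivalent to DISK_ONLY, which is the state the proposal would then hand off to the external service. This only illustrates the level semantics via the public API; the actual in-place demotion happens inside the BlockManager.

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK keeps a memory replica and spills to disk as needed.
    val level = StorageLevel.MEMORY_AND_DISK
    println(s"useMemory=${level.useMemory}, useDisk=${level.useDisk}")
    // -> useMemory=true, useDisk=true

    // Dropping the memory replica leaves a level equivalent to DISK_ONLY;
    // the remaining disk file is what would be transferred to the service.
    val demoted = StorageLevel(
      useDisk = level.useDisk,
      useMemory = false,
      useOffHeap = level.useOffHeap,
      deserialized = false, // on-disk data is always stored serialized
      replication = level.replication)
    assert(demoted == StorageLevel.DISK_ONLY)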