[ https://issues.apache.org/jira/browse/SPARK-33896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xudingyu updated SPARK-33896:
-----------------------------
    Description: 
Goals: 
• Make the Spark 3.0 scheduler datasource-cache-aware in a multi-replication 
HDFS cluster
• Achieve an E2E workload performance gain when this feature is enabled

Problem Statement:
Spark's DAGScheduler currently schedules tasks according to each RDD's 
preferredLocations, which respects HDFS BlockLocation. In a multi-replication 
cluster, HDFS can return an Array[BlockLocation], and Spark chooses one of the 
BlockLocations to run a task on. However, tasks can run faster if they are 
scheduled to nodes that hold the datasource cache they need. Currently, Spark 
has no datasource cache locality provision mechanism, even when nodes in the 
cluster hold cached data.
This project aims to add a cache-locality-aware mechanism, so that the Spark 
DAGScheduler can schedule tasks to nodes with datasource cache, according to 
cache locality, in a multi-replication HDFS cluster.
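The core of the proposal can be sketched as a replica-ranking step: among the hosts HDFS reports for a block, prefer those that also hold a datasource cache. The sketch below is illustrative only; the cacheHosts lookup is a hypothetical input (there is no such Spark or HDFS API today), and the real change would live inside the DAGScheduler's preferred-location handling.

```scala
// Hypothetical sketch of cache-aware replica preference.
// replicaHosts: hosts returned by BlockLocation.getHosts for a block.
// cacheHosts:   hosts assumed (for illustration) to hold a datasource
//               cache for that block -- not an existing Spark/HDFS API.
object CacheAwareLocality {
  def preferredHosts(replicaHosts: Seq[String],
                     cacheHosts: Set[String]): Seq[String] = {
    // Split the replicas into cache-holding and plain hosts,
    // keeping HDFS's original ordering within each group.
    val (cached, uncached) = replicaHosts.partition(cacheHosts.contains)
    // Rank cache-holding replicas first; the scheduler still falls
    // back to any replica, so no locality is lost when no cache exists.
    cached ++ uncached
  }
}
```

With this ordering, a task whose block is replicated on nodes a, b, and c, with a cache only on b, would be steered to b first, while retaining a and c as fallbacks.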


> Make Spark DAGScheduler datasource cache aware when scheduling tasks
> --------------------------------------------------------------------
>
>                 Key: SPARK-33896
>                 URL: https://issues.apache.org/jira/browse/SPARK-33896
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Xudingyu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
