[ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-34601:
---------------------------
    Description: Several efforts are under way on disaggregated/remote 
shuffle services (e.g. [LinkedIn 
shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
[Facebook shuffle 
service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
 [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
remote shuffle service is not the Spark External Shuffle Service. It can be a 
third-party shuffle solution that the user enables by setting 
spark.shuffle.manager. In those systems, shuffle data is stored on a server 
other than the executor, so Spark should not mark shuffle data as lost when the 
executor is lost. We could add a Spark configuration to control this behavior. 
By default, Spark would still mark shuffle files as lost. With a 
disaggregated/remote shuffle service, users could set the configuration so that 
shuffle files are not marked as lost.  (was: Several efforts are under way on 
disaggregated/remote shuffle services (e.g. [LinkedIn 
shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
[Facebook shuffle 
service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
 [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In those 
systems, shuffle data is stored on a server other than the executor, so Spark 
should not mark shuffle data as lost when the executor is lost. We could add a 
Spark configuration to control this behavior. By default, Spark would still 
mark shuffle files as lost. With a disaggregated/remote shuffle service, users 
could set the configuration so that shuffle files are not marked as lost.)

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-34601
>                 URL: https://issues.apache.org/jira/browse/SPARK-34601
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle
>    Affects Versions: 3.2.0
>            Reporter: BoYang
>            Priority: Major
>              Labels: shuffle
>             Fix For: 3.2.0
>
>
> Several efforts are under way on disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service. It can be a 
> third-party shuffle solution that the user enables by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on a server 
> other than the executor, so Spark should not mark shuffle data as lost when 
> the executor is lost. We could add a Spark configuration to control this 
> behavior. By default, Spark would still mark shuffle files as lost. With a 
> disaggregated/remote shuffle service, users could set the configuration so 
> that shuffle files are not marked as lost.
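
To illustrate the proposed behavior, here is a minimal sketch in plain Python 
(no Spark dependency). It is a toy model, not Spark's implementation: the 
MapOutputRegistry class and the configuration key 
spark.shuffle.markFileLostOnExecutorLost are hypothetical names chosen for this 
example, since the issue proposes a configuration without naming it. The sketch 
only shows the intended decision: when an executor is lost, its shuffle outputs 
are marked lost unless the flag says they live on a remote shuffle server.

```python
from dataclasses import dataclass, field

# Hypothetical key: the issue proposes a configuration but does not name it.
MARK_LOST_KEY = "spark.shuffle.markFileLostOnExecutorLost"


@dataclass
class MapOutputRegistry:
    """Toy stand-in for the driver's record of which executor wrote each shuffle output."""

    conf: dict
    # Maps (shuffle_id, map_id) to the executor that produced the output.
    outputs: dict = field(default_factory=dict)

    def register(self, shuffle_id, map_id, executor_id):
        self.outputs[(shuffle_id, map_id)] = executor_id

    def on_executor_lost(self, executor_id):
        """Return the outputs marked lost, i.e. those that would need recomputation."""
        # Defaults to "true", matching the default described in the issue.
        if self.conf.get(MARK_LOST_KEY, "true").lower() != "true":
            # Data is held by a remote shuffle server, so keep it registered.
            return []
        lost = sorted(k for k, v in self.outputs.items() if v == executor_id)
        for key in lost:
            del self.outputs[key]
        return lost


# Default behavior: outputs written by the lost executor are marked lost.
local = MapOutputRegistry(conf={})
local.register(0, 0, "exec-1")
local.register(0, 1, "exec-2")
lost_by_default = local.on_executor_lost("exec-1")

# Remote shuffle service: the flag is off, so nothing is marked lost.
remote = MapOutputRegistry(conf={MARK_LOST_KEY: "false"})
remote.register(0, 0, "exec-1")
lost_with_remote = remote.on_executor_lost("exec-1")
```

Keeping the default at "true" means existing deployments, where shuffle files 
sit on the executor's host, would see no change; only users of a third-party 
shuffle manager would opt out.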



