[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
BoYang updated SPARK-34601:
---------------------------
    Description: 
Several efforts are under way on disaggregated/remote shuffle services (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a remote shuffle service is not the Spark External Shuffle Service. It can be a third-party shuffle solution that users enable by setting spark.shuffle.manager.

In those systems, shuffle data is stored on servers other than the executor, so Spark should not mark shuffle data as lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark would still mark shuffle files as lost. Users of a disaggregated/remote shuffle service could set the configuration so that shuffle files are not marked as lost.

  (was: Several efforts are under way on disaggregated/remote shuffle services (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]).

In those systems, shuffle data is stored on servers other than the executor, so Spark should not mark shuffle data as lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark would still mark shuffle files as lost. Users of a disaggregated/remote shuffle service could set the configuration so that shuffle files are not marked as lost.)
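To make the proposed behavior concrete, here is a minimal sketch in plain Python, not Spark code. The configuration key `spark.shuffle.markFileLostOnExecutorLost`, the `MapOutputRegistry` class, and the shuffle manager class name `com.example.RemoteShuffleManager` are all hypothetical names chosen for illustration; the issue does not name the new setting. Only `spark.shuffle.manager` comes from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical configuration key; the issue does not name one.
MARK_LOST_KEY = "spark.shuffle.markFileLostOnExecutorLost"


@dataclass
class MapOutputRegistry:
    """Toy stand-in for the driver's record of where shuffle output lives."""

    conf: dict
    # shuffle_id -> {map_id: executor_id}
    outputs: dict = field(default_factory=dict)

    def register(self, shuffle_id: int, map_id: int, executor_id: str) -> None:
        self.outputs.setdefault(shuffle_id, {})[map_id] = executor_id

    def mark_lost_enabled(self) -> bool:
        # Defaulting to "true" keeps the existing behavior:
        # losing an executor implies losing its shuffle files.
        return self.conf.get(MARK_LOST_KEY, "true").lower() == "true"

    def on_executor_lost(self, executor_id: str) -> int:
        """Drop map outputs owned by the lost executor; return how many."""
        if not self.mark_lost_enabled():
            # Remote shuffle: the data lives on other servers, so keep it.
            return 0
        removed = 0
        for maps in self.outputs.values():
            for map_id in [m for m, e in maps.items() if e == executor_id]:
                del maps[map_id]
                removed += 1
        return removed


# Default configuration: the lost executor's output is marked lost.
default_registry = MapOutputRegistry(conf={})
default_registry.register(0, 0, "exec-1")
default_registry.register(0, 1, "exec-2")
default_removed = default_registry.on_executor_lost("exec-1")

# Remote shuffle configuration (hypothetical manager class and key).
remote_registry = MapOutputRegistry(conf={
    "spark.shuffle.manager": "com.example.RemoteShuffleManager",
    MARK_LOST_KEY: "false",
})
remote_registry.register(0, 0, "exec-1")
remote_removed = remote_registry.on_executor_lost("exec-1")

print(default_removed, remote_removed)  # 1 0
```

Keeping the default at "mark lost" means existing deployments see no change, and only users who have moved shuffle data off the executors opt in.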
> Do not delete shuffle file on executor lost event when using remote shuffle service
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-34601
>                 URL: https://issues.apache.org/jira/browse/SPARK-34601
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle
>    Affects Versions: 3.2.0
>            Reporter: BoYang
>            Priority: Major
>              Labels: shuffle
>             Fix For: 3.2.0