[jira] [Resolved] (SPARK-47978) Decouple Spark Go Connect Library versioning from Spark versioning
[ https://issues.apache.org/jira/browse/SPARK-47978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang resolved SPARK-47978. Resolution: Fixed > Decouple Spark Go Connect Library versioning from Spark versioning > -- > > Key: SPARK-47978 > URL: https://issues.apache.org/jira/browse/SPARK-47978 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.1 >Reporter: BoYang >Priority: Major > Labels: pull-request-available > Fix For: 3.5.1 > > > There is a recent discussion in the Spark community about the Spark Operator version > naming convention. People prefer to use a version independent of the Spark version. > That applies to the Spark Connect Go Client as well. It is better to start from v1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
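Go's semantic import versioning makes the decoupling concrete: a module tagged v2.0.0 or higher must embed the major version in its import path, so tags that mirror Spark's 3.x releases would force import paths like .../v3 and change on every major Spark bump. Tagging the client independently, starting at v1, keeps the plain module path. A sketch of what a consumer's go.mod could look like under this scheme (the v1.0.0 tag here is hypothetical, for illustration only):

```
module example.com/my-spark-app

go 1.21

// With library versioning decoupled from Spark and starting at v1,
// no major-version suffix is needed in the module path.
require github.com/apache/spark-connect-go v1.0.0
```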
[jira] [Created] (SPARK-47678) Got fetch failed exception when new executor reused same IP address from a previously killed executor
BoYang created SPARK-47678: -- Summary: Got fetch failed exception when new executor reused same IP address from a previously killed executor Key: SPARK-47678 URL: https://issues.apache.org/jira/browse/SPARK-47678 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.4.2 Environment: This only happens on Kubernetes, where the same IP address can be reused for a new executor pod. Reporter: BoYang This is an edge case that causes Spark on Kubernetes to hit a fetch failed exception when a new executor reuses the same IP address as a previously killed executor. The new executor checks the shuffle block's IP address and compares it with its own host address. If the two IP addresses are the same, the new executor assumes the block is on its own local disk and tries to read it locally. This fails because the block is actually on the previously killed executor, which happened to have the same IP address.
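The failure mode above can be sketched as a tiny Go model (illustrative types, not Spark's actual Scala classes): a host-only locality check wrongly treats a block as local when a new pod inherits a dead executor's IP, while also comparing executor IDs gets it right.

```go
package main

import "fmt"

// BlockLocation identifies where a shuffle block lives. Field names are
// illustrative, not Spark's real data structures.
type BlockLocation struct {
	Host       string
	ExecutorID string
}

// isLocalByHostOnly models the buggy check: a block is treated as local
// whenever the host (IP) matches, even if that IP now belongs to a new pod.
func isLocalByHostOnly(block BlockLocation, myHost string) bool {
	return block.Host == myHost
}

// isLocalByHostAndExecutor also compares the executor ID, so a block written
// by a previously killed executor on the same reused IP is fetched remotely.
func isLocalByHostAndExecutor(block BlockLocation, myHost, myExecutorID string) bool {
	return block.Host == myHost && block.ExecutorID == myExecutorID
}

func main() {
	// Block written by executor "1", whose pod IP 10.0.0.5 was later reused
	// by the new executor "7".
	block := BlockLocation{Host: "10.0.0.5", ExecutorID: "1"}
	fmt.Println(isLocalByHostOnly(block, "10.0.0.5"))             // true: wrongly treated as local
	fmt.Println(isLocalByHostAndExecutor(block, "10.0.0.5", "7")) // false: correctly fetched remotely
}
```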
[jira] [Resolved] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library
[ https://issues.apache.org/jira/browse/SPARK-44681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang resolved SPARK-44681. Fix Version/s: 3.4.0 Target Version/s: 3.5.0 Resolution: Fixed > Solve issue referencing github.com/apache/spark-connect-go as Go library > > > Key: SPARK-44681 > URL: https://issues.apache.org/jira/browse/SPARK-44681 > Project: Spark > Issue Type: Sub-task > Components: Connect Contrib >Affects Versions: 3.5.0 >Reporter: BoYang >Priority: Major > Fix For: 3.4.0 > >
[jira] [Created] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library
BoYang created SPARK-44681: -- Summary: Solve issue referencing github.com/apache/spark-connect-go as Go library Key: SPARK-44681 URL: https://issues.apache.org/jira/browse/SPARK-44681 Project: Spark Issue Type: Sub-task Components: Connect Contrib Affects Versions: 3.5.0 Reporter: BoYang
[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17751186#comment-17751186 ] BoYang commented on SPARK-43351: Thanks! We can keep it as 3.5.0 now. > Support Golang in Spark Connect > --- > > Key: SPARK-43351 > URL: https://issues.apache.org/jira/browse/SPARK-43351 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: BoYang >Assignee: BoYang >Priority: Major > Fix For: 3.5.0 > > > Support Spark Connect client side in Go programming language
[jira] [Created] (SPARK-44368) Support partition operation on dataframe in Spark Connect Go Client
BoYang created SPARK-44368: -- Summary: Support partition operation on dataframe in Spark Connect Go Client Key: SPARK-44368 URL: https://issues.apache.org/jira/browse/SPARK-44368 Project: Spark Issue Type: Sub-task Components: Connect Contrib Affects Versions: 3.4.1 Reporter: BoYang Support partition operation on dataframe in Spark Connect Go Client
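A partition operation hash-assigns each row to one of N partitions. The sketch below models those semantics in self-contained Go for illustration; it is not the spark-connect-go API (where the client only builds a plan and the shuffle runs server-side).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// repartition models the semantics of a hash-based repartition(n): every row
// is assigned to exactly one of n partitions by hashing its key.
func repartition(rows []string, n int) [][]string {
	parts := make([][]string, n)
	for _, row := range rows {
		h := fnv.New32a()
		h.Write([]byte(row)) // hash the row's key (here, the row itself)
		p := int(h.Sum32() % uint32(n))
		parts[p] = append(parts[p], row)
	}
	return parts
}

func main() {
	parts := repartition([]string{"a", "b", "c", "d"}, 2)
	total := 0
	for _, p := range parts {
		total += len(p)
	}
	// Every row lands in exactly one partition, so the total is preserved.
	fmt.Println(total)
}
```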
[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720241#comment-17720241 ] BoYang commented on SPARK-43351: Yes, thanks Ruifeng for the comment and suggestion! I would like to add another item to discuss: 1.7, versioning: how to organize the Go package for different versions. > Support Golang in Spark Connect > --- > > Key: SPARK-43351 > URL: https://issues.apache.org/jira/browse/SPARK-43351 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.5.0 >Reporter: BoYang >Priority: Major > > Support Spark Connect client side in Go programming language
[jira] [Created] (SPARK-43351) Support Golang in Spark Connect
BoYang created SPARK-43351: -- Summary: Support Golang in Spark Connect Key: SPARK-43351 URL: https://issues.apache.org/jira/browse/SPARK-43351 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.4.0 Reporter: BoYang Support Spark Connect client side in Go programming language
[jira] [Updated] (SPARK-38668) Spark on Kubernetes: add separate pod watcher service to reduce pressure on K8s API server
[ https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-38668: --- Summary: Spark on Kubernetes: add separate pod watcher service to reduce pressure on K8s API server (was: Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server) > Spark on Kubernetes: add separate pod watcher service to reduce pressure on > K8s API server > -- > > Key: SPARK-38668 > URL: https://issues.apache.org/jira/browse/SPARK-38668 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: BoYang >Priority: Major > > The Spark driver listens to all pod events to manage its executor pods. This > puts pressure on the Kubernetes API server in a large cluster, because > many drivers connect to the API server and watch for pods. > > An alternative is to have a separate service listen for and watch all pod > events. Each Spark driver then connects only to that service to get pod > events, reducing the load on the Kubernetes API server.
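The proposed watcher service is essentially a fan-out broker: one watch against the API server, many per-driver subscriptions. The self-contained Go sketch below models that shape with channels; a real service would drive Publish from a Kubernetes watch (e.g. client-go informers), which is omitted here, and all names are illustrative.

```go
package main

import "fmt"

// PodEvent is a simplified pod lifecycle event (illustrative fields).
type PodEvent struct {
	AppID string // which Spark application the pod belongs to
	Pod   string
	Phase string
}

// Broker models the proposed watcher service: it would hold the single watch
// against the API server and fan events out to per-driver subscribers, so N
// drivers cost the API server one watch instead of N.
type Broker struct {
	subs map[string][]chan PodEvent
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[string][]chan PodEvent)}
}

// Subscribe registers a driver's interest in events for its application.
func (b *Broker) Subscribe(appID string) <-chan PodEvent {
	ch := make(chan PodEvent, 16)
	b.subs[appID] = append(b.subs[appID], ch)
	return ch
}

// Publish would be called by the single API-server watch loop (not shown).
func (b *Broker) Publish(ev PodEvent) {
	for _, ch := range b.subs[ev.AppID] {
		ch <- ev
	}
}

func main() {
	b := NewBroker()
	driver := b.Subscribe("spark-app-1")
	b.Publish(PodEvent{AppID: "spark-app-1", Pod: "exec-1", Phase: "Running"})
	ev := <-driver
	fmt.Println(ev.Pod, ev.Phase)
}
```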
[jira] [Updated] (SPARK-38668) Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server
[ https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-38668: --- Description: The Spark driver listens to all pod events to manage its executor pods. This puts pressure on the Kubernetes API server in a large cluster, because many drivers connect to the API server and watch for pods. An alternative is to have a separate service listen for and watch all pod events. Each Spark driver then connects only to that service to get pod events, reducing the load on the Kubernetes API server. > Spark on Kubernetes: support external pod watcher to reduce pressure on K8s > API server > -- > > Key: SPARK-38668 > URL: https://issues.apache.org/jira/browse/SPARK-38668 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: BoYang >Priority: Major > > The Spark driver listens to all pod events to manage its executor pods. This > puts pressure on the Kubernetes API server in a large cluster, because > many drivers connect to the API server and watch for pods. > > An alternative is to have a separate service listen for and watch all pod > events. Each Spark driver then connects only to that service to get pod > events, reducing the load on the Kubernetes API server.
[jira] [Created] (SPARK-38668) Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server
BoYang created SPARK-38668: -- Summary: Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server Key: SPARK-38668 URL: https://issues.apache.org/jira/browse/SPARK-38668 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.2.1 Reporter: BoYang
[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457384#comment-17457384 ] BoYang edited comment on SPARK-25299 at 12/10/21, 10:01 PM: I am working on a prototype to store shuffle file on external storage like S3: https://github.com/apache/spark/pull/34864 . Would love to hear comments. Also welcome people to collaborate on this. was (Author: bobyangbo): I am working on a prototype to store shuffle file on external storage like S3: [https://github.com/apache/spark/pull/34864.] Would love to hear comments. Also welcome people to collaborate on this. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. 
> * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. > Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457384#comment-17457384 ] BoYang commented on SPARK-25299: I am working on a prototype to store shuffle files on external storage like S3: [https://github.com/apache/spark/pull/34864]. Would love to hear comments. Also welcome people to collaborate on this. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. 
> In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. > Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]
[jira] [Resolved] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang resolved SPARK-34601. Resolution: Won't Fix > Do not delete shuffle file on executor lost event when using remote shuffle > service > --- > > Key: SPARK-34601 > URL: https://issues.apache.org/jira/browse/SPARK-34601 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: BoYang >Priority: Major > Labels: shuffle > > There is ongoing work on multiple disaggregated/remote shuffle services > (e.g. [LinkedIn > shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], > [Facebook shuffle > service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], > [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a > remote shuffle service is not the Spark External Shuffle Service; it could be > a third-party shuffle solution that users enable by setting > spark.shuffle.manager. In those systems, shuffle data is stored on servers > other than the executor, so Spark should not mark shuffle data as lost > when the executor is lost. We could add a Spark configuration to control this > behavior. By default, Spark would still mark shuffle files as lost. For a > disaggregated/remote shuffle service, people could set the configuration to not > mark shuffle files as lost.
[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296390#comment-17296390 ] BoYang commented on SPARK-34601: It looks like Spark already checks and matches the executor ID when it tries to remove map output. I will close this ticket. > Do not delete shuffle file on executor lost event when using remote shuffle > service > --- > > Key: SPARK-34601 > URL: https://issues.apache.org/jira/browse/SPARK-34601 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: BoYang >Priority: Major > Labels: shuffle > > There is ongoing work on multiple disaggregated/remote shuffle services > (e.g. [LinkedIn > shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], > [Facebook shuffle > service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], > [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a > remote shuffle service is not the Spark External Shuffle Service; it could be > a third-party shuffle solution that users enable by setting > spark.shuffle.manager. In those systems, shuffle data is stored on servers > other than the executor, so Spark should not mark shuffle data as lost > when the executor is lost. We could add a Spark configuration to control this > behavior. By default, Spark would still mark shuffle files as lost. For a > disaggregated/remote shuffle service, people could set the configuration to not > mark shuffle files as lost.
[jira] [Updated] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-34601: --- Description: There is ongoing work on multiple disaggregated/remote shuffle services (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a remote shuffle service is not the Spark External Shuffle Service; it could be a third-party shuffle solution that users enable by setting spark.shuffle.manager. In those systems, shuffle data is stored on servers other than the executor, so Spark should not mark shuffle data as lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark would still mark shuffle files as lost. For a disaggregated/remote shuffle service, people could set the configuration to not mark shuffle files as lost. (was: There are multiple work going on with disaggregated/remote shuffle service (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In those systems, shuffle data will be stored on different server other than executor. Spark should not mark shuffle data lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark still mark shuffle file lost. For disaggregated/remote shuffle service, people could set the configure to not mark shuffle file lost.)
> Do not delete shuffle file on executor lost event when using remote shuffle > service > --- > > Key: SPARK-34601 > URL: https://issues.apache.org/jira/browse/SPARK-34601 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: BoYang >Priority: Major > Labels: shuffle > Fix For: 3.2.0 > > > There is ongoing work on multiple disaggregated/remote shuffle services > (e.g. [LinkedIn > shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], > [Facebook shuffle > service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], > [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a > remote shuffle service is not the Spark External Shuffle Service; it could be > a third-party shuffle solution that users enable by setting > spark.shuffle.manager. In those systems, shuffle data is stored on servers > other than the executor, so Spark should not mark shuffle data as lost > when the executor is lost. We could add a Spark configuration to control this > behavior. By default, Spark would still mark shuffle files as lost. For a > disaggregated/remote shuffle service, people could set the configuration to not > mark shuffle files as lost.
[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
[ https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17294062#comment-17294062 ] BoYang commented on SPARK-34601: I will add a PR for this soon. > Do not delete shuffle file on executor lost event when using remote shuffle > service > --- > > Key: SPARK-34601 > URL: https://issues.apache.org/jira/browse/SPARK-34601 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: BoYang >Priority: Major > Labels: shuffle > Fix For: 3.2.0 > > > There is ongoing work on multiple disaggregated/remote shuffle services > (e.g. [LinkedIn > shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], > [Facebook shuffle > service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], > [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In > those systems, shuffle data is stored on servers other than the executor, so > Spark should not mark shuffle data as lost when the executor is lost. > We could add a Spark configuration to control this behavior. By default, > Spark would still mark shuffle files as lost. For a disaggregated/remote shuffle service, > people could set the configuration to not mark shuffle files as lost.
[jira] [Created] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service
BoYang created SPARK-34601: -- Summary: Do not delete shuffle file on executor lost event when using remote shuffle service Key: SPARK-34601 URL: https://issues.apache.org/jira/browse/SPARK-34601 Project: Spark Issue Type: New Feature Components: Shuffle Affects Versions: 3.2.0 Reporter: BoYang Fix For: 3.2.0 There is ongoing work on multiple disaggregated/remote shuffle services (e.g. [LinkedIn shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], [Facebook shuffle service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service], [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In those systems, shuffle data is stored on servers other than the executor, so Spark should not mark shuffle data as lost when the executor is lost. We could add a Spark configuration to control this behavior. By default, Spark would still mark shuffle files as lost. For a disaggregated/remote shuffle service, people could set the configuration to not mark shuffle files as lost.
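The proposed behavior can be sketched as a toy model of Spark's map-output bookkeeping, written in Go for illustration. The markLostOnExecutorLost flag stands in for the proposed (hypothetical, never merged) configuration: with a remote shuffle service it would be set to false so that losing an executor does not invalidate shuffle data stored elsewhere.

```go
package main

import "fmt"

// mapOutputTracker is a toy model of map-output bookkeeping; it is not
// Spark's actual MapOutputTracker.
type mapOutputTracker struct {
	outputs                map[string][]int // executor ID -> registered map IDs
	markLostOnExecutorLost bool             // models the proposed configuration
}

// executorLost handles an executor-lost event. With local sort-merge shuffle
// the executor's outputs died with it and must be unregistered; with a remote
// shuffle service they remain readable on the shuffle servers.
func (t *mapOutputTracker) executorLost(execID string) {
	if t.markLostOnExecutorLost {
		delete(t.outputs, execID) // local shuffle: force recomputation
	}
	// Remote shuffle: keep the outputs registered.
}

func main() {
	t := &mapOutputTracker{
		outputs:                map[string][]int{"exec-1": {0, 1, 2}},
		markLostOnExecutorLost: false, // remote shuffle service in use
	}
	t.executorLost("exec-1")
	fmt.Println(len(t.outputs["exec-1"])) // outputs preserved
}
```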
[jira] [Created] (SPARK-33114) Add metadata in MapStatus to support custom shuffle manager
BoYang created SPARK-33114: -- Summary: Add metadata in MapStatus to support custom shuffle manager Key: SPARK-33114 URL: https://issues.apache.org/jira/browse/SPARK-33114 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.1 Reporter: BoYang The current MapStatus class is tightly bound to local (sort-merge) shuffle, which uses BlockManagerId to store the shuffle data location. It cannot support other custom shuffle manager implementations. We could add "metadata" to MapStatus and allow different shuffle manager implementations to store information related to them.
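The shape of the proposal can be sketched as follows (in Go purely for illustration; Spark's real MapStatus is a Scala class keyed on BlockManagerId, and all field names here are assumptions): an opaque metadata payload lets a custom shuffle manager record its own location information, such as a remote shuffle server endpoint, without overloading BlockManagerId.

```go
package main

import "fmt"

// mapStatus sketches the proposed change: alongside the block-manager
// location used by the built-in sort-merge shuffle, carry an opaque
// metadata payload that only the installed shuffle manager interprets.
type mapStatus struct {
	BlockManagerHost string            // used by the built-in shuffle
	MapID            int64
	Metadata         map[string]string // opaque to Spark core
}

func main() {
	// A remote shuffle manager could record its server endpoint instead of
	// relying on the executor's BlockManagerId.
	s := mapStatus{
		MapID:    42,
		Metadata: map[string]string{"shuffleServer": "shuffle-svc.example:9099"},
	}
	// On the reduce side, the same shuffle manager reads its metadata back.
	fmt.Println(s.Metadata["shuffleServer"])
}
```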
[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207093#comment-17207093 ] BoYang edited comment on SPARK-25299 at 10/4/20, 5:27 AM: -- While people work on remote storage for persisting shuffle data and introduce shuffle API changes, it is better to have a reference implementation that uses remote storage for shuffle data. Such a reference implementation could demonstrate how to use the shuffle API and also make sure the API works for both local sort-merge shuffle and remote shuffle. More details in https://issues.apache.org/jira/browse/SPARK-31924: Create remote shuffle service reference implementation. Uber also open sourced Remote Shuffle Service ([https://github.com/uber/RemoteShuffleService]), which is fairly complex. It might be better to have a simplified, small version of a remote shuffle service reference implementation inside the Spark repo. was (Author: bobyangbo): While people work on remote storage for persisting shuffle data, and introduce shuffle API changes, it is better to have some reference implementation to use remote storage for shuffle data. Such reference implementation could demonstrate how to use the shuffle API and also could make sure the API works for both local sort merge shuffle and remote shuffle. More details in https://issues.apache.org/jira/browse/SPARK-31924: Create remote shuffle service reference implementation. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. 
If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. 
> Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]
[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207093#comment-17207093 ] BoYang edited comment on SPARK-25299 at 10/4/20, 5:23 AM: -- While people work on remote storage for persisting shuffle data, and introduce shuffle API changes, it is better to have some reference implementation to use remote storage for shuffle data. Such reference implementation could demonstrate how to use the shuffle API and also could make sure the API works for both local sort merge shuffle and remote shuffle. More details in https://issues.apache.org/jira/browse/SPARK-31924: Create remote shuffle service reference implementation. was (Author: bobyangbo): While people work on remote storage for persisting shuffle data, and introduce shuffle API changes, it is better to have some reference implementation to use remote storage for shuffle data. Such reference implementation could demonstrate how to use the shuffle API and also could make sure the API works for both local sort merge shuffle and remote shuffle. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. 
Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. > Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]
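The reference-implementation idea discussed in the comment above can be illustrated with a tiny storage abstraction that routes shuffle block writes and reads through a pluggable store instead of executor-local disk. The sketch below is hypothetical (the class and method names are invented here, and it is not Spark's actual shuffle API); the local filesystem stands in for remote storage, where a real implementation would target HDFS, S3, or a dedicated shuffle service:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical minimal storage abstraction for shuffle blocks; NOT Spark's
// actual shuffle API. The local filesystem stands in for remote storage.
public class RemoteShuffleStore {
    private final Path root;

    public RemoteShuffleStore(Path root) {
        try {
            this.root = Files.createDirectories(root);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience factory backed by a fresh temporary directory.
    public static RemoteShuffleStore inTempDir() {
        try {
            return new RemoteShuffleStore(Files.createTempDirectory("shuffle"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A block is keyed by shuffle id, map task id, and reduce partition id.
    private Path blockPath(int shuffleId, long mapId, int reduceId) {
        return root.resolve("shuffle_" + shuffleId + "_" + mapId + "_" + reduceId);
    }

    public void writeBlock(int shuffleId, long mapId, int reduceId, byte[] data) {
        try {
            Files.write(blockPath(shuffleId, mapId, reduceId), data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public byte[] readBlock(int shuffleId, long mapId, int reduceId) {
        try {
            return Files.readAllBytes(blockPath(shuffleId, mapId, reduceId));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RemoteShuffleStore store = RemoteShuffleStore.inTempDir();
        store.writeBlock(0, 1L, 2, "partition-data".getBytes());
        System.out.println(new String(store.readBlock(0, 1L, 2)));
        // prints partition-data
    }
}
```

Because reads go through the store rather than the local disk, an executor failure would not lose the blocks, which is exactly the shortcoming the quoted description lists first.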
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17207093#comment-17207093 ] BoYang commented on SPARK-25299: While people work on remote storage for persisting shuffle data and introduce shuffle API changes, it would be better to have a reference implementation that uses remote storage for shuffle data. Such a reference implementation could demonstrate how to use the shuffle API and also verify that the API works for both local sort-merge shuffle and remote shuffle. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle, Spark Core >Affects Versions: 3.1.0 >Reporter: Matt Cheah >Priority: Major > Labels: SPIP > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes.
We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. > Edit June 28 2019: Our SPIP is here: > [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]
[jira] [Commented] (SPARK-33037) Remove knownManagers hardcoded list
[ https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17205743#comment-17205743 ] BoYang commented on SPARK-33037: After discussion, we feel it is better to remove the knownManagers list. That makes the code cleaner and also supports users' custom shuffle manager implementations. PR: https://github.com/apache/spark/pull/29916 > Remove knownManagers hardcoded list > --- > > Key: SPARK-33037 > URL: https://issues.apache.org/jira/browse/SPARK-33037 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.7, 3.0.1 >Reporter: BoYang >Priority: Major > > Spark has a hardcoded list of known shuffle managers, which currently contains > two values. It does not include a user's custom shuffle manager set through > the Spark config "spark.shuffle.manager". > > We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager > plugin (the Uber Remote Shuffle Service implementation, > [https://github.com/uber/RemoteShuffleService]). Other users will hit the same > issue when they implement their own shuffle manager. > > We need to add the "spark.shuffle.manager" config value to the known managers list > as well. > > The known managers list is in: > common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java > {quote}private final List<String> knownManagers = Arrays.asList( > "org.apache.spark.shuffle.sort.SortShuffleManager", > "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); > {quote} > >
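The allowlist behavior being removed can be reconstructed in isolation as below (an illustrative stand-alone sketch, not the actual ExternalShuffleBlockResolver source): any shuffle manager class name outside the hardcoded list is rejected, which is exactly what breaks a custom "spark.shuffle.manager" plugin, while removing the list means simply trusting the class name the application registered. The RssShuffleManager class name used in the demo is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Stand-alone reconstruction of the allowlist behavior described above;
// not the actual ExternalShuffleBlockResolver code.
public class KnownManagersCheck {
    private static final List<String> KNOWN_MANAGERS = Arrays.asList(
        "org.apache.spark.shuffle.sort.SortShuffleManager",
        "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");

    // Old behavior: reject any shuffle manager not in the hardcoded list.
    public static boolean acceptsWithAllowlist(String managerClass) {
        return KNOWN_MANAGERS.contains(managerClass);
    }

    // Behavior after removing the list: accept whatever class name the
    // application registered, requiring only that one is present.
    public static boolean acceptsWithoutAllowlist(String managerClass) {
        return managerClass != null && !managerClass.isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical custom plugin class name, standing in for a remote
        // shuffle service implementation.
        String custom = "org.apache.spark.shuffle.RssShuffleManager";
        System.out.println(acceptsWithAllowlist(custom));    // prints false
        System.out.println(acceptsWithoutAllowlist(custom)); // prints true
    }
}
```

This is why the PR removes the list rather than appending to it: any append-based fix would break again for the next custom implementation.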
[jira] [Updated] (SPARK-33037) Remove knownManagers hardcoded list
[ https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-33037: --- Summary: Remove knownManagers hardcoded list (was: Add "spark.shuffle.manager" value to knownManagers) > Remove knownManagers hardcoded list > --- > > Key: SPARK-33037 > URL: https://issues.apache.org/jira/browse/SPARK-33037 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.7, 3.0.1 >Reporter: BoYang >Priority: Major > > Spark has a hardcoded list of known shuffle managers, which currently contains > two values. It does not include a user's custom shuffle manager set through > the Spark config "spark.shuffle.manager". > > We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager > plugin (the Uber Remote Shuffle Service implementation, > [https://github.com/uber/RemoteShuffleService]). Other users will hit the same > issue when they implement their own shuffle manager. > > We need to add the "spark.shuffle.manager" config value to the known managers list > as well. > > The known managers list is in: > common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java > {quote}private final List<String> knownManagers = Arrays.asList( > "org.apache.spark.shuffle.sort.SortShuffleManager", > "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); > {quote} > >
[jira] [Updated] (SPARK-33037) Add "spark.shuffle.manager" value to knownManagers
[ https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-33037: --- Description: Spark has a hardcoded list of known shuffle managers, which currently contains two values. It does not include a user's custom shuffle manager set through the Spark config "spark.shuffle.manager". We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager plugin (the Uber Remote Shuffle Service implementation, [https://github.com/uber/RemoteShuffleService]). Other users will hit the same issue when they implement their own shuffle manager. We need to add the "spark.shuffle.manager" config value to the known managers list as well. The known managers list is in: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java {quote}private final List<String> knownManagers = Arrays.asList( "org.apache.spark.shuffle.sort.SortShuffleManager", "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); {quote} was: Spark has a hardcoded list of known shuffle managers, which currently contains two values. It does not include a user's custom shuffle manager set through the Spark config "spark.shuffle.manager". We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager plugin (the Uber Remote Shuffle Service implementation, https://github.com/uber/RemoteShuffleService). We need to add the "spark.shuffle.manager" config value to the known managers list as well.
The known managers list is in: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java {quote}private final List<String> knownManagers = Arrays.asList( "org.apache.spark.shuffle.sort.SortShuffleManager", "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); {quote} > Add "spark.shuffle.manager" value to knownManagers > -- > > Key: SPARK-33037 > URL: https://issues.apache.org/jira/browse/SPARK-33037 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.7, 3.0.1 >Reporter: BoYang >Priority: Major > > Spark has a hardcoded list of known shuffle managers, which currently contains > two values. It does not include a user's custom shuffle manager set through > the Spark config "spark.shuffle.manager". > > We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager > plugin (the Uber Remote Shuffle Service implementation, > [https://github.com/uber/RemoteShuffleService]). Other users will hit the same > issue when they implement their own shuffle manager. > > We need to add the "spark.shuffle.manager" config value to the known managers list > as well. > > The known managers list is in: > common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java > {quote}private final List<String> knownManagers = Arrays.asList( > "org.apache.spark.shuffle.sort.SortShuffleManager", > "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); > {quote} > >
[jira] [Created] (SPARK-33037) Add "spark.shuffle.manager" value to knownManagers
BoYang created SPARK-33037: -- Summary: Add "spark.shuffle.manager" value to knownManagers Key: SPARK-33037 URL: https://issues.apache.org/jira/browse/SPARK-33037 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.1, 2.4.7 Reporter: BoYang Spark has a hardcoded list of known shuffle managers, which currently contains two values. It does not include a user's custom shuffle manager set through the Spark config "spark.shuffle.manager". We hit this issue when setting "spark.shuffle.manager" to our own shuffle manager plugin (the Uber Remote Shuffle Service implementation, https://github.com/uber/RemoteShuffleService). We need to add the "spark.shuffle.manager" config value to the known managers list as well. The known managers list is in: common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java {quote}private final List<String> knownManagers = Arrays.asList( "org.apache.spark.shuffle.sort.SortShuffleManager", "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager"); {quote}
[jira] [Comment Edited] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159550#comment-17159550 ] BoYang edited comment on SPARK-31924 at 7/16/20, 11:21 PM: --- We created a short [design doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k]. We also created a code example of a plain shuffle client/server ([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic idea. was (Author: bobyangbo): We created a short [design doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k]. We also created a code example of a plain shuffle client/server ([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design idea. > Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: BoYang >Priority: Major > > People in the [Spark Scalability & Reliability Sync > Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have > discussed remote (disaggregated) shuffle service extensively, and plan to build > a reference implementation to help demonstrate the basic design and pave the > way for a future production-grade remote shuffle service. > > There are already two pull requests to enhance the Spark shuffle metadata API to > make it easy/possible to implement a remote shuffle service ([PR > 28616|https://github.com/apache/spark/pull/28616], [PR > 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle > service reference implementation will help validate those shuffle metadata > APIs. >
[jira] [Comment Edited] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159550#comment-17159550 ] BoYang edited comment on SPARK-31924 at 7/16/20, 11:21 PM: --- We created a short [design doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k]. We also created a code example of a plain shuffle client/server ([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design idea. was (Author: bobyangbo): We created a short [design doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k]. We also created a code example of a plain shuffle client/server ([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design idea. > Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: BoYang >Priority: Major > > People in the [Spark Scalability & Reliability Sync > Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have > discussed remote (disaggregated) shuffle service extensively, and plan to build > a reference implementation to help demonstrate the basic design and pave the > way for a future production-grade remote shuffle service. > > There are already two pull requests to enhance the Spark shuffle metadata API to > make it easy/possible to implement a remote shuffle service ([PR > 28616|https://github.com/apache/spark/pull/28616], [PR > 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle > service reference implementation will help validate those shuffle metadata > APIs. >
[jira] [Commented] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17159550#comment-17159550 ] BoYang commented on SPARK-31924: We created a short [design doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k]. We also created a code example of a plain shuffle client/server ([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design idea. > Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: BoYang >Priority: Major > > People in the [Spark Scalability & Reliability Sync > Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have > discussed remote (disaggregated) shuffle service extensively, and plan to build > a reference implementation to help demonstrate the basic design and pave the > way for a future production-grade remote shuffle service. > > There are already two pull requests to enhance the Spark shuffle metadata API to > make it easy/possible to implement a remote shuffle service ([PR > 28616|https://github.com/apache/spark/pull/28616], [PR > 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle > service reference implementation will help validate those shuffle metadata > APIs. >
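A "plain shuffle client/server" like the one referenced in the comment above can be illustrated at the smallest possible scale as a single-request block server over a loopback socket. This sketch is NOT the code in the linked pull request; framing, concurrency, and error handling are all simplified away, and all names here are invented for illustration:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.ServerSocket;
import java.net.Socket;

// Toy single-request shuffle block server and client over a loopback socket.
// Illustrative only; not the implementation from the linked pull request.
public class PlainShuffleDemo {

    // Server side: serve one request by reading a block id and replying
    // with a length-prefixed byte payload.
    static void serveOne(ServerSocket server, byte[] block) throws Exception {
        try (Socket sock = server.accept();
             DataInputStream in = new DataInputStream(sock.getInputStream());
             DataOutputStream out = new DataOutputStream(sock.getOutputStream())) {
            in.readInt(); // block id; ignored in this toy version
            out.writeInt(block.length);
            out.write(block);
        }
    }

    // Client side: connect, send a block id, read the framed response.
    static byte[] fetchBlock(int port, int blockId) throws Exception {
        try (Socket sock = new Socket("127.0.0.1", port);
             DataOutputStream out = new DataOutputStream(sock.getOutputStream());
             DataInputStream in = new DataInputStream(sock.getInputStream())) {
            out.writeInt(blockId);
            out.flush();
            byte[] data = new byte[in.readInt()];
            in.readFully(data);
            return data;
        }
    }

    // Run a server thread on an ephemeral port, fetch the block, return it.
    public static byte[] roundTrip(byte[] block) {
        try (ServerSocket server = new ServerSocket(0)) {
            Thread t = new Thread(() -> {
                try { serveOne(server, block); } catch (Exception ignored) { }
            });
            t.start();
            byte[] fetched = fetchBlock(server.getLocalPort(), 42);
            t.join();
            return fetched;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(new String(roundTrip("shuffle-bytes".getBytes())));
        // prints shuffle-bytes
    }
}
```

A production implementation would add request framing for multiple blocks, a thread pool or async I/O on the server, and registration of map output locations, which is where the shuffle metadata API pull requests come in.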
[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-31924: --- Description: People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service extensively, and plan to build a reference implementation to help demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests to enhance the Spark shuffle metadata API to make it easy/possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs. was: People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service extensively, and plan to build a reference implementation to help demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests to enhance the Spark shuffle metadata API to make it easy/possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs.
> Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: BoYang >Priority: Major > Fix For: 3.0.0 > > > People in the [Spark Scalability & Reliability Sync > Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have > discussed remote (disaggregated) shuffle service extensively, and plan to build > a reference implementation to help demonstrate the basic design and pave the > way for a future production-grade remote shuffle service. > > There are already two pull requests to enhance the Spark shuffle metadata API to > make it easy/possible to implement a remote shuffle service ([PR > 28616|https://github.com/apache/spark/pull/28616], [PR > 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle > service reference implementation will help validate those shuffle metadata > APIs. >
[jira] [Created] (SPARK-31924) Create remote shuffle service reference implementation
BoYang created SPARK-31924: -- Summary: Create remote shuffle service reference implementation Key: SPARK-31924 URL: https://issues.apache.org/jira/browse/SPARK-31924 Project: Spark Issue Type: New Feature Components: Shuffle Affects Versions: 3.0.0 Reporter: BoYang Fix For: 3.0.0 People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service extensively, and plan to build a reference implementation to help demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests to enhance the Spark shuffle metadata API to make it easy/possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs.
[jira] [Commented] (SPARK-29472) Mechanism for Excluding Jars at Launch for YARN
[ https://issues.apache.org/jira/browse/SPARK-29472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953209#comment-16953209 ] BoYang commented on SPARK-29472: This is a very useful feature; it would help solve production issues caused by jar file conflicts! > Mechanism for Excluding Jars at Launch for YARN > --- > > Key: SPARK-29472 > URL: https://issues.apache.org/jira/browse/SPARK-29472 > Project: Spark > Issue Type: New Feature > Components: YARN >Affects Versions: 2.4.4 >Reporter: Abhishek Modi >Priority: Minor > > *Summary* > It would be convenient if there were an easy way to exclude jars from Spark’s > classpath at launch time. This would complement the way in which jars can be > added to the classpath using {{extraClassPath}}. > > *Context* > The Spark build contains its dependency jars in the {{/jars}} directory. > These jars become part of the executor’s classpath. By default on YARN, these > jars are packaged and distributed to containers at launch ({{spark-submit}}) > time. > > While developing Spark applications, customers sometimes need to debug using > different versions of dependencies. This can become difficult if the > dependency (eg. Parquet 1.11.0) is one that Spark already has in {{/jars}} > (eg. Parquet 1.10.1 in Spark 2.4), as the dependency included with Spark is > preferentially loaded. > > Configurations such as {{userClassPathFirst}} are available. However these > have often come with other side effects.
For example, if the customer’s build > includes Avro they will likely see {{Caused by: java.lang.LinkageError: > loader constraint violation: when resolving method > "org.apache.spark.SparkConf.registerAvroSchemas(Lscala/collection/Seq;)Lorg/apache/spark/SparkConf;" > the class loader (instance of > org/apache/spark/util/ChildFirstURLClassLoader) of the current class, > com/uber/marmaray/common/spark/SparkFactory, and the class loader (instance > of sun/misc/Launcher$AppClassLoader) for the method's defining class, > org/apache/spark/SparkConf, have different Class objects for the type > scala/collection/Seq used in the signature}}. Resolving such issues often > takes many hours. > > To deal with these sorts of issues, customers often download the Spark build, > remove the target jars and then do spark-submit. Other times, customers may > not be able to do spark-submit as it is gated behind some Spark Job Server. > In this case, customers may try downloading the build, removing the jars, and > then using configurations such as {{spark.yarn.dist.jars}} or > {{spark.yarn.dist.archives}}. Both of these options are undesirable as they > are very operationally heavy, error prone and often result in the customer’s > spark builds going out of sync with the authoritative build. > > *Solution* > I’d like to propose adding a {{spark.yarn.jars.exclusionRegex}} > configuration. Customers could provide a regex such as {{.\*parquet.\*}} and > jar files matching this regex would not be included in the driver and > executor classpath.
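The proposed filtering could behave roughly as sketched below. Note that {{spark.yarn.jars.exclusionRegex}} is only a proposal in this issue, not an existing Spark configuration, and the class and method names here are invented for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch of the proposed exclusion filter: drop any jar whose file name
// matches the configured regex before building the distributed classpath.
// Illustrative only; spark.yarn.jars.exclusionRegex does not exist in Spark.
public class JarExclusionFilter {

    public static List<String> excludeJars(List<String> jars, String exclusionRegex) {
        Pattern p = Pattern.compile(exclusionRegex);
        return jars.stream()
                   .filter(jar -> !p.matcher(jar).matches())
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> jars = List.of(
            "parquet-column-1.10.1.jar",
            "parquet-hadoop-1.10.1.jar",
            "avro-1.8.2.jar");
        // With ".*parquet.*", only non-parquet jars remain, so a user-supplied
        // Parquet 1.11.0 jar would no longer be shadowed by the bundled one.
        System.out.println(excludeJars(jars, ".*parquet.*"));
        // prints [avro-1.8.2.jar]
    }
}
```

Filtering at distribution time avoids the classloader-ordering side effects described above, since the conflicting jar simply never reaches the container classpath.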