[jira] [Resolved] (SPARK-47978) Decouple Spark Go Connect Library versioning from Spark versioning

2024-04-25 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang resolved SPARK-47978.

Resolution: Fixed

> Decouple Spark Go Connect Library versioning from Spark versioning
> --
>
> Key: SPARK-47978
> URL: https://issues.apache.org/jira/browse/SPARK-47978
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.1
>Reporter: BoYang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> There is a recent discussion in the Spark community about the version naming 
> convention for the Spark Operator. People prefer a version scheme that is 
> independent of Spark versions. That applies to the Spark Connect Go Client as 
> well, so it is better to start from v1.






[jira] [Created] (SPARK-47678) Got fetch failed exception when new executor reused same ip address from a previously killed executor

2024-04-01 Thread BoYang (Jira)
BoYang created SPARK-47678:
--

 Summary: Got fetch failed exception when new executor reused same 
ip address from a previously killed executor
 Key: SPARK-47678
 URL: https://issues.apache.org/jira/browse/SPARK-47678
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.5.1, 3.5.0, 3.4.1, 3.4.0, 3.4.2
 Environment: This only happens on Kubernetes, where the same IP address 
can be reused for a new executor pod.
Reporter: BoYang


This is an edge case that causes Spark on Kubernetes to get a fetch failed 
exception when a new executor reuses the same IP address as a previously killed 
executor.

The new executor checks the shuffle block's IP address and compares it with its 
own host address. If the two IP addresses are the same, the new executor 
assumes the block is on its own local disk and tries to read it locally. This 
fails, because the block is actually on the previously killed executor, which 
happened to have the same IP address.
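
A minimal sketch of the failure mode (illustrative Scala, not the actual Spark 
code; all names are hypothetical): deciding locality by host address alone 
mis-classifies a remote block as local once the IP is recycled, while also 
matching the executor ID would avoid it.

{code:scala}
object BlockLocality {
  case class BlockLocation(host: String, executorId: String)

  // Host-only check (simplified current behavior): a new pod that inherits a
  // dead executor's IP wrongly treats that executor's blocks as local and
  // then fails to read them from its own disk.
  def isLocalBlock(loc: BlockLocation, myHost: String): Boolean =
    loc.host == myHost

  // Safer check: executor IDs are unique per executor, unlike recycled pod IPs.
  def isLocalBlockSafe(loc: BlockLocation, myHost: String,
      myExecutorId: String): Boolean =
    loc.host == myHost && loc.executorId == myExecutorId
}
{code}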






[jira] [Resolved] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library

2023-09-26 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang resolved SPARK-44681.

   Fix Version/s: 3.4.0
Target Version/s: 3.5.0
  Resolution: Fixed

> Solve issue referencing github.com/apache/spark-connect-go as Go library
> 
>
> Key: SPARK-44681
> URL: https://issues.apache.org/jira/browse/SPARK-44681
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect Contrib
>Affects Versions: 3.5.0
>Reporter: BoYang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-44681) Solve issue referencing github.com/apache/spark-connect-go as Go library

2023-08-04 Thread BoYang (Jira)
BoYang created SPARK-44681:
--

 Summary: Solve issue referencing 
github.com/apache/spark-connect-go as Go library
 Key: SPARK-44681
 URL: https://issues.apache.org/jira/browse/SPARK-44681
 Project: Spark
  Issue Type: Sub-task
  Components: Connect Contrib
Affects Versions: 3.5.0
Reporter: BoYang









[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect

2023-08-04 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17751186#comment-17751186
 ] 

BoYang commented on SPARK-43351:


Thanks! We can keep it as 3.5.0 now.

> Support Golang in Spark Connect
> ---
>
> Key: SPARK-43351
> URL: https://issues.apache.org/jira/browse/SPARK-43351
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BoYang
>Assignee: BoYang
>Priority: Major
> Fix For: 3.5.0
>
>
> Support Spark Connect client side in Go programming language 






[jira] [Created] (SPARK-44368) Support partition operation on dataframe in Spark Connect Go Client

2023-07-10 Thread BoYang (Jira)
BoYang created SPARK-44368:
--

 Summary: Support partition operation on dataframe in Spark Connect 
Go Client
 Key: SPARK-44368
 URL: https://issues.apache.org/jira/browse/SPARK-44368
 Project: Spark
  Issue Type: Sub-task
  Components: Connect Contrib
Affects Versions: 3.4.1
Reporter: BoYang


Support partition operation on dataframe in Spark Connect Go Client






[jira] [Commented] (SPARK-43351) Support Golang in Spark Connect

2023-05-06 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17720241#comment-17720241
 ] 

BoYang commented on SPARK-43351:


Yes, thanks Ruifeng for the comment and suggestion! I would like to add another 
item to discuss:

1.7, versioning: how to organize the Go package for different versions

 

> Support Golang in Spark Connect
> ---
>
> Key: SPARK-43351
> URL: https://issues.apache.org/jira/browse/SPARK-43351
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: BoYang
>Priority: Major
>
> Support Spark Connect client side in Go programming language 






[jira] [Created] (SPARK-43351) Support Golang in Spark Connect

2023-05-02 Thread BoYang (Jira)
BoYang created SPARK-43351:
--

 Summary: Support Golang in Spark Connect
 Key: SPARK-43351
 URL: https://issues.apache.org/jira/browse/SPARK-43351
 Project: Spark
  Issue Type: New Feature
  Components: Connect
Affects Versions: 3.4.0
Reporter: BoYang


Support Spark Connect client side in Go programming language 






[jira] [Updated] (SPARK-38668) Spark on Kubernetes: add separate pod watcher service to reduce pressure on K8s API server

2022-03-27 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-38668:
---
Summary: Spark on Kubernetes: add separate pod watcher service to reduce 
pressure on K8s API server  (was: Spark on Kubernetes: support external pod 
watcher to reduce pressure on K8s API server)

> Spark on Kubernetes: add separate pod watcher service to reduce pressure on 
> K8s API server
> --
>
> Key: SPARK-38668
> URL: https://issues.apache.org/jira/browse/SPARK-38668
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: BoYang
>Priority: Major
>
> The Spark driver listens to all pod events in order to manage its executor 
> pods. In a large cluster this puts pressure on the Kubernetes API server, 
> because many drivers connect to the API server and watch for pods.
>  
> An alternative is to have a separate service listen for and watch all pod 
> events. Each Spark driver then connects only to that service to get pod 
> events. This reduces the load on the Kubernetes API server.
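
A conceptual sketch of the proposed service (plain Scala with hypothetical 
names; it deliberately avoids any real Kubernetes client API): one watcher 
process holds the single watch connection and fans pod events out to 
per-application subscribers, so drivers no longer watch the API server 
directly.

{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical event type; a real service would map this from K8s pod events.
case class PodEvent(appId: String, podName: String, phase: String)

class PodWatcherService {
  private val subscribers = TrieMap.empty[String, PodEvent => Unit]

  // Each Spark driver subscribes for the pods of its own application only.
  def subscribe(appId: String, handler: PodEvent => Unit): Unit =
    subscribers.put(appId, handler)

  // Invoked by the service's single watch connection to the K8s API server,
  // replacing one watch connection per driver.
  def onPodEvent(event: PodEvent): Unit =
    subscribers.get(event.appId).foreach(_(event))
}
{code}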






[jira] [Updated] (SPARK-38668) Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server

2022-03-27 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-38668:
---
Description: 
The Spark driver listens to all pod events in order to manage its executor 
pods. In a large cluster this puts pressure on the Kubernetes API server, 
because many drivers connect to the API server and watch for pods.

 

An alternative is to have a separate service listen for and watch all pod 
events. Each Spark driver then connects only to that service to get pod events. 
This reduces the load on the Kubernetes API server.

> Spark on Kubernetes: support external pod watcher to reduce pressure on K8s 
> API server
> --
>
> Key: SPARK-38668
> URL: https://issues.apache.org/jira/browse/SPARK-38668
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: BoYang
>Priority: Major
>
> The Spark driver listens to all pod events in order to manage its executor 
> pods. In a large cluster this puts pressure on the Kubernetes API server, 
> because many drivers connect to the API server and watch for pods.
>  
> An alternative is to have a separate service listen for and watch all pod 
> events. Each Spark driver then connects only to that service to get pod 
> events. This reduces the load on the Kubernetes API server.






[jira] [Created] (SPARK-38668) Spark on Kubernetes: support external pod watcher to reduce pressure on K8s API server

2022-03-27 Thread BoYang (Jira)
BoYang created SPARK-38668:
--

 Summary: Spark on Kubernetes: support external pod watcher to 
reduce pressure on K8s API server
 Key: SPARK-38668
 URL: https://issues.apache.org/jira/browse/SPARK-38668
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.2.1
Reporter: BoYang









[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data

2021-12-10 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457384#comment-17457384
 ] 

BoYang edited comment on SPARK-25299 at 12/10/21, 10:01 PM:


I am working on a prototype to store shuffle files on external storage like S3: 
https://github.com/apache/spark/pull/34864 . I would love to hear comments, and 
people are welcome to collaborate on this.


was (Author: bobyangbo):
I am working on a prototype to store shuffle file on external storage like S3: 
[https://github.com/apache/spark/pull/34864.] Would love to hear comments. Also 
welcome people to collaborate on this.
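
A minimal sketch of the idea in the comment above (assumptions: the Hadoop 
FileSystem API with an s3a:// URI; the class name and path layout are 
illustrative and are not the linked PR's actual code): shuffle output is 
written to remote storage instead of the executor's local disk.

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only; not the code from the linked pull request.
class RemoteShuffleWriter(rootUri: String, conf: Configuration) {
  private val fs = FileSystem.get(new URI(rootUri), conf)

  // Write one map task's shuffle output to remote storage, e.g.
  // s3a://bucket/shuffle_0/map_3.data, instead of local disk.
  def write(shuffleId: Int, mapId: Long, bytes: Array[Byte]): Path = {
    val path = new Path(s"$rootUri/shuffle_$shuffleId/map_$mapId.data")
    val out = fs.create(path, true) // overwrite supports task retries
    try out.write(bytes) finally out.close()
    path
  }
}
{code}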

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2021-12-10 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457384#comment-17457384
 ] 

BoYang commented on SPARK-25299:


I am working on a prototype to store shuffle files on external storage like S3: 
[https://github.com/apache/spark/pull/34864.] I would love to hear comments, 
and people are welcome to collaborate on this.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Resolved] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-05 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang resolved SPARK-34601.

Resolution: Won't Fix

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
>
> There is a lot of work going on around disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service; it can be a 
> third-party shuffle solution that users enable by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on a server 
> other than the executor, so Spark should not mark shuffle data as lost when 
> the executor is lost. We could add a Spark configuration to control this 
> behavior. By default Spark would still mark shuffle files as lost; for a 
> disaggregated/remote shuffle service, people could set the configuration to 
> not mark shuffle files as lost.
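
A hedged sketch of what such a flag could look like using Spark's internal 
ConfigBuilder (the config name, default, and wiring are assumptions; this 
ticket was resolved as Won't Fix, so no such config exists upstream):

{code:scala}
package org.apache.spark.internal.config

// Hypothetical config entry; the name and default are illustrative only.
// (ConfigBuilder is Spark-internal, so this would live inside Spark itself.)
object RemoteShuffleConf {
  val MARK_SHUFFLE_FILES_LOST_ON_EXECUTOR_LOST =
    ConfigBuilder("spark.shuffle.markFilesLostOnExecutorLost")
      .doc("When false, do not mark a lost executor's shuffle output as " +
        "lost; intended for remote shuffle services that store data " +
        "off-executor.")
      .booleanConf
      .createWithDefault(true) // true keeps today's behavior
}
{code}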






[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-05 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17296390#comment-17296390
 ] 

BoYang commented on SPARK-34601:


It looks like Spark already checks and matches the executor ID when it tries to 
remove map output. I will close this ticket.

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
>
> There is a lot of work going on around disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service; it can be a 
> third-party shuffle solution that users enable by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on a server 
> other than the executor, so Spark should not mark shuffle data as lost when 
> the executor is lost. We could add a Spark configuration to control this 
> behavior. By default Spark would still mark shuffle files as lost; for a 
> disaggregated/remote shuffle service, people could set the configuration to 
> not mark shuffle files as lost.






[jira] [Updated] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-03 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-34601:
---
Description: There is a lot of work going on around disaggregated/remote 
shuffle services (e.g. [LinkedIn 
shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
[Facebook shuffle 
service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
 [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
remote shuffle service is not the Spark External Shuffle Service; it can be a 
third-party shuffle solution that users enable by setting 
spark.shuffle.manager. In those systems, shuffle data is stored on a server 
other than the executor, so Spark should not mark shuffle data as lost when the 
executor is lost. We could add a Spark configuration to control this behavior. 
By default Spark would still mark shuffle files as lost; for a 
disaggregated/remote shuffle service, people could set the configuration to not 
mark shuffle files as lost.  (was: There are 
multiple work going on with disaggregated/remote shuffle service (e.g. 
[LinkedIn 
shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
[Facebook shuffle 
service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
 [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In those 
systems, shuffle data will be stored on different server other than executor. 
Spark should not mark shuffle data lost when the executor is lost. We could add 
a Spark configuration to control this behavior. By default, Spark still mark 
shuffle file lost. For disaggregated/remote shuffle service, people could set 
the configure to not mark shuffle file lost.)

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
> Fix For: 3.2.0
>
>
> There is a lot of work going on around disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). Such a 
> remote shuffle service is not the Spark External Shuffle Service; it can be a 
> third-party shuffle solution that users enable by setting 
> spark.shuffle.manager. In those systems, shuffle data is stored on a server 
> other than the executor, so Spark should not mark shuffle data as lost when 
> the executor is lost. We could add a Spark configuration to control this 
> behavior. By default Spark would still mark shuffle files as lost; for a 
> disaggregated/remote shuffle service, people could set the configuration to 
> not mark shuffle files as lost.






[jira] [Commented] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-02 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294062#comment-17294062
 ] 

BoYang commented on SPARK-34601:


I will add a PR for this soon.

> Do not delete shuffle file on executor lost event when using remote shuffle 
> service
> ---
>
> Key: SPARK-34601
> URL: https://issues.apache.org/jira/browse/SPARK-34601
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: BoYang
>Priority: Major
>  Labels: shuffle
> Fix For: 3.2.0
>
>
> There is a lot of work going on around disaggregated/remote shuffle services 
> (e.g. [LinkedIn 
> shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
> [Facebook shuffle 
> service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
>  [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In 
> those systems, shuffle data is stored on a server other than the executor, so 
> Spark should not mark shuffle data as lost when the executor is lost. We 
> could add a Spark configuration to control this behavior. By default Spark 
> would still mark shuffle files as lost; for a disaggregated/remote shuffle 
> service, people could set the configuration to not mark shuffle files as lost.






[jira] [Created] (SPARK-34601) Do not delete shuffle file on executor lost event when using remote shuffle service

2021-03-02 Thread BoYang (Jira)
BoYang created SPARK-34601:
--

 Summary: Do not delete shuffle file on executor lost event when 
using remote shuffle service
 Key: SPARK-34601
 URL: https://issues.apache.org/jira/browse/SPARK-34601
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Affects Versions: 3.2.0
Reporter: BoYang
 Fix For: 3.2.0


There is a lot of work going on around disaggregated/remote shuffle services 
(e.g. [LinkedIn 
shuffle|https://engineering.linkedin.com/blog/2020/introducing-magnet], 
[Facebook shuffle 
service|https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service],
 [Uber shuffle service|https://github.com/uber/RemoteShuffleService]). In those 
systems, shuffle data is stored on a server other than the executor, so Spark 
should not mark shuffle data as lost when the executor is lost. We could add a 
Spark configuration to control this behavior. By default Spark would still mark 
shuffle files as lost; for a disaggregated/remote shuffle service, people could 
set the configuration to not mark shuffle files as lost.






[jira] [Created] (SPARK-33114) Add metadata in MapStatus to support custom shuffle manager

2020-10-10 Thread BoYang (Jira)
BoYang created SPARK-33114:
--

 Summary: Add metadata in MapStatus to support custom shuffle 
manager
 Key: SPARK-33114
 URL: https://issues.apache.org/jira/browse/SPARK-33114
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.1
Reporter: BoYang


The current MapStatus class is tightly bound to local (sort-merge) shuffle, 
which uses BlockManagerId to store the shuffle data location. It cannot support 
other custom shuffle manager implementations.

We could add "metadata" to MapStatus and allow different shuffle manager 
implementations to store information relevant to them.
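
A rough sketch of the proposed shape (the trait and field names below are 
hypothetical, not a merged Spark API): MapStatus carries an opaque, 
serializable metadata slot that a custom shuffle manager can fill with its own 
location information.

{code:scala}
// Hypothetical extension point; names are illustrative only.
trait MapStatusMetadata extends Serializable

// Example: a remote shuffle manager storing its server endpoint instead of
// relying on BlockManagerId (which assumes data lives on an executor).
case class RemoteShuffleServerInfo(host: String, port: Int)
  extends MapStatusMetadata

trait MapStatusWithMetadata {
  // None for the built-in local sort-merge shuffle, Some(...) for custom ones.
  def metadata: Option[MapStatusMetadata]
}
{code}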






[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data

2020-10-03 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207093#comment-17207093
 ] 

BoYang edited comment on SPARK-25299 at 10/4/20, 5:27 AM:
--

While people work on remote storage for persisting shuffle data and introduce 
shuffle API changes, it is better to have a reference implementation that uses 
remote storage for shuffle data.

Such a reference implementation could demonstrate how to use the shuffle API 
and also make sure the API works for both local sort-merge shuffle and remote 
shuffle.

More details are in https://issues.apache.org/jira/browse/SPARK-31924: Create 
remote shuffle service reference implementation.

Uber also open sourced Remote Shuffle Service 
([https://github.com/uber/RemoteShuffleService]), which is pretty complicated. 
It might be better to have a simplified, small version of a remote shuffle 
service reference implementation inside the Spark repo.


was (Author: bobyangbo):
While people work on remote storage for persisting shuffle data, and introduce 
shuffle API changes, it is better to have some reference implementation to use 
remote storage for shuffle data. Such reference implementation could 
demonstrate how to use the shuffle API and also could make sure the API works 
for both local sort merge shuffle and remote shuffle.

More details in https://issues.apache.org/jira/browse/SPARK-31924: Create 
remote shuffle service reference implementation.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data

2020-10-03 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207093#comment-17207093
 ] 

BoYang edited comment on SPARK-25299 at 10/4/20, 5:23 AM:
--

While people work on remote storage for persisting shuffle data and introduce 
shuffle API changes, it is better to have a reference implementation that uses 
remote storage for shuffle data. Such a reference implementation could 
demonstrate how to use the shuffle API and also make sure the API works for 
both local sort-merge shuffle and remote shuffle.

More details are in https://issues.apache.org/jira/browse/SPARK-31924: Create 
remote shuffle service reference implementation.


was (Author: bobyangbo):
While people work on remote storage for persisting shuffle data, and introduce 
shuffle API changes, it is better to have some reference implementation to use 
remote storage for shuffle data. Such reference implementation could 
demonstrate how to use the shuffle API and also could make sure the API works 
for both local sort merge shuffle and remote shuffle.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2020-10-03 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17207093#comment-17207093
 ] 

BoYang commented on SPARK-25299:


While people work on remote storage for persisting shuffle data and introduce 
shuffle API changes, it is better to have a reference implementation that uses 
remote storage for shuffle data. Such a reference implementation could 
demonstrate how to use the shuffle API and also make sure the API works for 
both local sort-merge shuffle and remote shuffle.

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>  Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]






[jira] [Commented] (SPARK-33037) Remove knownManagers hardcoded list

2020-10-01 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205743#comment-17205743
 ] 

BoYang commented on SPARK-33037:


After discussion, we feel it is better to remove the knownManagers list. That 
makes the code cleaner and also supports users' custom shuffle manager 
implementations.

PR: https://github.com/apache/spark/pull/29916

> Remove knownManagers hardcoded list
> ---
>
> Key: SPARK-33037
> URL: https://issues.apache.org/jira/browse/SPARK-33037
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.7, 3.0.1
>Reporter: BoYang
>Priority: Major
>
> Spark has a hardcoded list of known shuffle managers, which currently has two 
> values. It does not contain a user's custom shuffle manager, which is set 
> through the Spark config "spark.shuffle.manager".
>  
> We hit an issue when setting "spark.shuffle.manager" to our own shuffle 
> manager plugin (the Uber Remote Shuffle Service implementation, 
> [https://github.com/uber/RemoteShuffleService]). Other users will hit the 
> same issue when they implement their own shuffle manager.
>  
> We need to add the "spark.shuffle.manager" config value to the known managers 
> list as well.
>  
> The known managers list is in:
> common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
> {quote}private final List<String> knownManagers = Arrays.asList(
>    "org.apache.spark.shuffle.sort.SortShuffleManager",
>    "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
> {quote}
>  
>  
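
A simplified sketch of the check being removed (illustrative Scala; the real 
code lives in ExternalShuffleBlockResolver and its exact error handling may 
differ): executor registration was rejected whenever the configured shuffle 
manager was not in the hardcoded list, which is exactly what broke custom 
managers.

{code:scala}
object KnownManagersCheck {
  // Simplified illustration of the hardcoded list; not the exact Java code.
  val knownManagers = Seq(
    "org.apache.spark.shuffle.sort.SortShuffleManager",
    "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager")

  def validateShuffleManager(shuffleManager: String): Unit = {
    // Any custom manager set via spark.shuffle.manager fails this check.
    if (!knownManagers.contains(shuffleManager)) {
      throw new UnsupportedOperationException(
        s"Unsupported shuffle manager: $shuffleManager")
    }
  }
}
{code}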






[jira] [Updated] (SPARK-33037) Remove knownManagers hardcoded list

2020-10-01 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-33037:
---
Summary: Remove knownManagers hardcoded list  (was: Add 
"spark.shuffle.manager" value to knownManagers)

> Remove knownManagers hardcoded list
> ---
>
> Key: SPARK-33037
> URL: https://issues.apache.org/jira/browse/SPARK-33037
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.7, 3.0.1
>Reporter: BoYang
>Priority: Major
>
> Spark has a hardcoded list of known shuffle managers, which currently has two 
> values. It does not contain a user's custom shuffle manager, which is set 
> through the Spark config "spark.shuffle.manager".
>  
> We hit an issue when setting "spark.shuffle.manager" to our own shuffle 
> manager plugin (the Uber Remote Shuffle Service implementation, 
> [https://github.com/uber/RemoteShuffleService]). Other users will hit the 
> same issue when they implement their own shuffle manager.
>  
> We need to add the "spark.shuffle.manager" config value to the known managers 
> list as well.
>  
> The known managers list is in:
> common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
> {quote}private final List<String> knownManagers = Arrays.asList(
>    "org.apache.spark.shuffle.sort.SortShuffleManager",
>    "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
> {quote}
>  
>  






[jira] [Updated] (SPARK-33037) Add "spark.shuffle.manager" value to knownManagers

2020-09-30 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-33037:
---
Description: 
Spark has a hardcoded list of known shuffle managers, which currently has two 
values. It does not contain a user's custom shuffle manager, which is set 
through the Spark config "spark.shuffle.manager".

 

We hit an issue when setting "spark.shuffle.manager" to our own shuffle manager 
plugin (the Uber Remote Shuffle Service implementation, 
[https://github.com/uber/RemoteShuffleService]). Other users will hit the same 
issue when they implement their own shuffle manager.

 

We need to add the "spark.shuffle.manager" config value to the known managers 
list as well.

 

The known managers list is in:

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
{quote}private final List<String> knownManagers = Arrays.asList(
   "org.apache.spark.shuffle.sort.SortShuffleManager",
   "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
{quote}
 

 

  was:
Spark has a hardcode list to contain known shuffle managers, which has two 
values now. It does not contain user's custom shuffle manager which is set 
through Spark config "spark.shuffle.manager".

 

We hit issue when set "spark.shuffle.manager" with our own shuffle manager 
plugin (Uber Remote Shuffle Service implementation, 
https://github.com/uber/RemoteShuffleService).

 

Need to add "spark.shuffle.manager" config value to the known managers list as 
well.

 

The know managers list is in code:

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
{quote}private final List<String> knownManagers = Arrays.asList(
  "org.apache.spark.shuffle.sort.SortShuffleManager",
  "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
{quote}
 

 


> Add "spark.shuffle.manager" value to knownManagers
> --
>
> Key: SPARK-33037
> URL: https://issues.apache.org/jira/browse/SPARK-33037
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.7, 3.0.1
>Reporter: BoYang
>Priority: Major
>
> Spark has a hardcoded list of known shuffle managers, which currently has two 
> values. It does not contain a user's custom shuffle manager, which is set 
> through the Spark config "spark.shuffle.manager".
>  
> We hit an issue when setting "spark.shuffle.manager" to our own shuffle 
> manager plugin (the Uber Remote Shuffle Service implementation, 
> [https://github.com/uber/RemoteShuffleService]). Other users will hit the 
> same issue when they implement their own shuffle manager.
>  
> We need to add the "spark.shuffle.manager" config value to the known managers 
> list as well.
>  
> The known managers list is in:
> common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
> {quote}private final List<String> knownManagers = Arrays.asList(
>    "org.apache.spark.shuffle.sort.SortShuffleManager",
>    "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
> {quote}
>  
>  






[jira] [Created] (SPARK-33037) Add "spark.shuffle.manager" value to knownManagers

2020-09-30 Thread BoYang (Jira)
BoYang created SPARK-33037:
--

 Summary: Add "spark.shuffle.manager" value to knownManagers
 Key: SPARK-33037
 URL: https://issues.apache.org/jira/browse/SPARK-33037
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.1, 2.4.7
Reporter: BoYang


Spark has a hardcoded list of known shuffle managers, which currently has two 
values. It does not contain a user's custom shuffle manager, which is set 
through the Spark config "spark.shuffle.manager".

 

We hit an issue when setting "spark.shuffle.manager" to our own shuffle manager 
plugin (the Uber Remote Shuffle Service implementation, 
https://github.com/uber/RemoteShuffleService).

 

We need to add the "spark.shuffle.manager" config value to the known managers 
list as well.

 

The known managers list is in:

common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java
{quote}private final List<String> knownManagers = Arrays.asList(
  "org.apache.spark.shuffle.sort.SortShuffleManager",
  "org.apache.spark.shuffle.unsafe.UnsafeShuffleManager");
{quote}
 

 






[jira] [Comment Edited] (SPARK-31924) Create remote shuffle service reference implementation

2020-07-16 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159550#comment-17159550
 ] 

BoYang edited comment on SPARK-31924 at 7/16/20, 11:21 PM:
---

We created a short [design 
doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k].

We also created a code example of a plain shuffle client/server 
([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic idea.


was (Author: bobyangbo):
We created a short [design 
doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k].

Also created code example of a plain shuffle client/server 
([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design 
idea.

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan 
> to do a reference implementation to help demonstrate the basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote 
> shuffle service reference implementation will help validate that shuffle 
> metadata API.
>  






[jira] [Comment Edited] (SPARK-31924) Create remote shuffle service reference implementation

2020-07-16 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159550#comment-17159550
 ] 

BoYang edited comment on SPARK-31924 at 7/16/20, 11:21 PM:
---

We created a short [design 
doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k].

We also created a code example of a plain shuffle client/server 
([https://github.com/boy-uber/spark/pull/3]) to demonstrate the basic design 
idea.


was (Author: bobyangbo):
We created a short [design 
doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k].

Also created [code example of a plain shuffle 
client/server|[https://github.com/boy-uber/spark/pull/3]] to demonstrate the 
basic design idea.

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan 
> to do a reference implementation to help demonstrate the basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote 
> shuffle service reference implementation will help validate that shuffle 
> metadata API.
>  






[jira] [Commented] (SPARK-31924) Create remote shuffle service reference implementation

2020-07-16 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159550#comment-17159550
 ] 

BoYang commented on SPARK-31924:


We created a short [design 
doc|https://docs.google.com/document/d/1thTeID___Dh4Ax4Ep0QJpXn2qsaIm2nxZMbqRix0J-k].

We also created a [code example of a plain shuffle 
client/server|https://github.com/boy-uber/spark/pull/3] to demonstrate the 
basic design idea.

> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: BoYang
>Priority: Major
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan 
> to do a reference implementation to help demonstrate the basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote 
> shuffle service reference implementation will help validate that shuffle 
> metadata API.
>  






[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation

2020-06-07 Thread BoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BoYang updated SPARK-31924:
---
Description: 
People in the [Spark Scalability & Reliability Sync 
Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
 have discussed remote (disaggregated) shuffle services at length, and plan to 
do a reference implementation to help demonstrate the basic design and pave the 
way for a future production-grade remote shuffle service.

 

There are already two pull requests that enhance the Spark shuffle metadata API 
to make it easy/possible to implement a remote shuffle service ([PR 
28616|https://github.com/apache/spark/pull/28616], [PR 
28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
service reference implementation will help validate that shuffle metadata API.

 

  was:
People in [Spark Scalability & Reliability Sync 
Meeting|[https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]]
 have discussed a lot about remote (disaggregated) shuffle service, and plan to 
do a reference implementation to help demonstrate some basic design and pave 
the way for a future production grade remote  shuffle service.

 

There are already two pull requests to enhance Spark shuffle metadata API to 
make it easy/possible to implement remote shuffle service ([PR 
28616|https://github.com/apache/spark/pull/28616], [PR 
28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
service reference implementation will help to validate those shuffle metadata 
API.

 


> Create remote shuffle service reference implementation
> --
>
> Key: SPARK-31924
> URL: https://issues.apache.org/jira/browse/SPARK-31924
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: BoYang
>Priority: Major
> Fix For: 3.0.0
>
>
> People in the [Spark Scalability & Reliability Sync 
> Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
>  have discussed remote (disaggregated) shuffle services at length, and plan 
> to do a reference implementation to help demonstrate the basic design and 
> pave the way for a future production-grade remote shuffle service.
>  
> There are already two pull requests that enhance the Spark shuffle metadata 
> API to make it easy/possible to implement a remote shuffle service ([PR 
> 28616|https://github.com/apache/spark/pull/28616], [PR 
> 28618|https://github.com/apache/spark/pull/28618]). Creating a remote 
> shuffle service reference implementation will help validate that shuffle 
> metadata API.
>  






[jira] [Created] (SPARK-31924) Create remote shuffle service reference implementation

2020-06-07 Thread BoYang (Jira)
BoYang created SPARK-31924:
--

 Summary: Create remote shuffle service reference implementation
 Key: SPARK-31924
 URL: https://issues.apache.org/jira/browse/SPARK-31924
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: BoYang
 Fix For: 3.0.0


People in the [Spark Scalability & Reliability Sync 
Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]
 have discussed remote (disaggregated) shuffle services at length, and plan to 
do a reference implementation to help demonstrate the basic design and pave the 
way for a future production-grade remote shuffle service.

 

There are already two pull requests that enhance the Spark shuffle metadata API 
to make it easy/possible to implement a remote shuffle service ([PR 
28616|https://github.com/apache/spark/pull/28616], [PR 
28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle 
service reference implementation will help validate that shuffle metadata API.

 






[jira] [Commented] (SPARK-29472) Mechanism for Excluding Jars at Launch for YARN

2019-10-16 Thread BoYang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16953209#comment-16953209
 ] 

BoYang commented on SPARK-29472:


This is a pretty good feature, helping to solve production issues when there 
are jar file conflicts!

> Mechanism for Excluding Jars at Launch for YARN
> ---
>
> Key: SPARK-29472
> URL: https://issues.apache.org/jira/browse/SPARK-29472
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Affects Versions: 2.4.4
>Reporter: Abhishek Modi
>Priority: Minor
>
> *Summary*
> It would be convenient if there were an easy way to exclude jars from Spark’s 
> classpath at launch time. This would complement the way in which jars can be 
> added to the classpath using {{extraClassPath}}.
>  
> *Context*
> The Spark build contains its dependency jars in the {{/jars}} directory. 
> These jars become part of the executor’s classpath. By default on YARN, these 
> jars are packaged and distributed to containers at launch ({{spark-submit}}) 
> time.
>  
> While developing Spark applications, customers sometimes need to debug using 
> different versions of dependencies. This can become difficult if the 
> dependency (eg. Parquet 1.11.0) is one that Spark already has in {{/jars}} 
> (eg. Parquet 1.10.1 in Spark 2.4), as the dependency included with Spark is 
> preferentially loaded. 
>  
> Configurations such as {{userClassPathFirst}} are available. However these 
> have often come with other side effects. For example, if the customer’s build 
> includes Avro they will likely see {{Caused by: java.lang.LinkageError: 
> loader constraint violation: when resolving method 
> "org.apache.spark.SparkConf.registerAvroSchemas(Lscala/collection/Seq;)Lorg/apache/spark/SparkConf;"
>  the class loader (instance of 
> org/apache/spark/util/ChildFirstURLClassLoader) of the current class, 
> com/uber/marmaray/common/spark/SparkFactory, and the class loader (instance 
> of sun/misc/Launcher$AppClassLoader) for the method's defining class, 
> org/apache/spark/SparkConf, have different Class objects for the type 
> scala/collection/Seq used in the signature}}. Resolving such issues often 
> takes many hours.
>  
> To deal with these sorts of issues, customers often download the Spark build, 
> remove the target jars and then do spark-submit. Other times, customers may 
> not be able to do spark-submit as it is gated behind some Spark Job Server. 
> In this case, customers may try downloading the build, removing the jars, and 
> then using configurations such as {{spark.yarn.dist.jars}} or 
> {{spark.yarn.dist.archives}}. Both of these options are undesirable as they 
> are very operationally heavy, error prone and often result in the customer’s 
> spark builds going out of sync with the authoritative build. 
>  
> *Solution*
> I’d like to propose adding a {{spark.yarn.jars.exclusionRegex}} 
> configuration. Customers could provide a regex such as {{.\*parquet.\*}} and 
> jar files matching this regex would not be included in the driver and 
> executor classpath.
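
A small sketch of the proposed behavior (the config name 
spark.yarn.jars.exclusionRegex comes from the ticket; the function and wiring 
below are hypothetical, not Spark code): filter the jars under /jars against 
the regex before distributing them to YARN containers.

{code:scala}
import java.io.File

object JarExclusion {
  // Hypothetical helper; illustrates the proposed semantics only.
  def jarsToDistribute(sparkJarsDir: File,
      exclusionRegex: Option[String]): Seq[File] = {
    val allJars = Option(sparkJarsDir.listFiles())
      .getOrElse(Array.empty[File])
      .toSeq
      .filter(_.getName.endsWith(".jar"))
    exclusionRegex match {
      // Jars whose file name matches the regex are left out of the driver
      // and executor classpath, e.g. ".*parquet.*" drops Spark's bundled
      // Parquet so a user-provided Parquet version wins.
      case Some(pattern) => allJars.filterNot(_.getName.matches(pattern))
      case None          => allJars
    }
  }
}
{code}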


