[ 
https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641989#comment-17641989
 ] 

Sean R. Owen commented on SPARK-41342:
--------------------------------------

Why not Horovod? It works with Spark and PyTorch.

> Add support for distributed deep learning framework
> ---------------------------------------------------
>
>                 Key: SPARK-41342
>                 URL: https://issues.apache.org/jira/browse/SPARK-41342
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.2
>            Reporter: Lu Wang
>            Priority: Major
>
> There is a clear trend for deep learning to move from single-machine to
> distributed training in order to scale and accelerate it. Adding support for
> a distributed DL solution on Spark will increase the power of Spark and
> greatly simplify distributed DL workloads for users.
> Currently,
> [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor]
>  provides a solution for running distributed TensorFlow on Spark clusters, but
> there is no such support for distributed PyTorch.
> We want to add a general framework that supports both DL frameworks so that we
> have a unified interface for distributed DL workloads on Spark. It can also
> take advantage of GPU scheduling on Spark and provide better resource
> management.



