[ https://issues.apache.org/jira/browse/SPARK-41342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641989#comment-17641989 ]
Sean R. Owen commented on SPARK-41342:
--------------------------------------

Why not Horovod? It works with Spark and PyTorch.

> Add support for distributed deep learning framework
> ---------------------------------------------------
>
>                 Key: SPARK-41342
>                 URL: https://issues.apache.org/jira/browse/SPARK-41342
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.2
>            Reporter: Lu Wang
>            Priority: Major
>
> There is a clear trend for deep learning to move from single-machine to
> distributed training in order to scale and accelerate it. Adding support for a
> distributed DL solution on Spark will make Spark more capable and greatly
> simplify distributed DL workloads for users.
> Currently,
> [spark-tensorflow-distributor|https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor]
> provides a way to run distributed TensorFlow on Spark clusters, but
> there is no such support for distributed PyTorch.
> We want to add a general framework that supports both DL frameworks, so that
> we have a unified interface for distributed DL workloads on Spark. It can
> also take advantage of GPU scheduling on Spark and provide better resource
> management.