[ https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062198#comment-17062198 ]
Dongjoon Hyun commented on SPARK-27941:
---------------------------------------

Hi, [~mu5358271]. Is there any update on this issue?

> Serverless Spark in the Cloud
> -----------------------------
>
>                 Key: SPARK-27941
>                 URL: https://issues.apache.org/jira/browse/SPARK-27941
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Shuheng Dai
>            Priority: Major
>
> Public cloud providers have started offering serverless container services.
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility of running Spark workloads in a serverless
> manner and removing the need to provision, maintain, and manage a cluster.
> POC: [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud
> provider or to support a large number of cloud providers natively, it would
> make sense to make some of the internal Spark components more pluggable and
> cloud-friendly so that it is easier for various cloud providers to
> integrate. For example:
> * authentication: IO and network encryption requires authentication via a
> securely shared secret, and the implementation of this is currently tied to
> the cluster manager: YARN uses the Hadoop UGI, while Kubernetes uses a
> shared file mounted on all pods. These can be decoupled so that it is
> possible to swap in an implementation backed by a public cloud. In the POC,
> this is implemented by passing around an AWS KMS-encrypted secret and
> decrypting it at each executor, which delegates authentication and
> authorization to the cloud.
> * deployment & scheduler: adding a new cluster manager and scheduler
> backend requires changing a number of places in the Spark core package and
> rebuilding the entire project.
> Having a pluggable scheduler per
> https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to
> add different scheduler backends backed by different cloud providers.
> * client-cluster communication: I am not very familiar with the network
> part of the code base, so I might be wrong on this. My understanding is
> that the code base assumes that the client and the cluster are on the same
> network and that the nodes communicate with each other via hostname/IP.
> Security best practice advises running the executors in a private,
> protected network, which may be separate from the client machine's network.
> Since we are serverless, the client needs to first launch the driver into
> the private network, and the driver in turn starts the executors,
> potentially doubling job initialization time. This can be solved by
> dropping complete serverlessness and keeping a persistent host in the
> private network, or (I do not have a POC, so I am not sure whether this
> actually works) by implementing client-cluster communication via message
> queues in the cloud to get around the network separation.
> * shuffle storage and retrieval: external shuffle on YARN relies on the
> existence of a persistent cluster that continues to serve shuffle files
> beyond the lifecycle of the executors. This assumption no longer holds in a
> serverless cluster with only transient containers. Pluggable remote shuffle
> storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it
> easier to introduce new cloud-backed shuffle implementations.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
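The secret-distribution flow in the authentication bullet can be sketched roughly as follows. This is a minimal, self-contained illustration, not code from the POC: `FakeKms` and `KmsSecretProvider` are hypothetical names, and the XOR stand-in replaces a real boto3 KMS client so the sketch runs offline. The point it shows is the shape of the flow: the driver generates the auth secret, ships only the KMS-encrypted form to executors, and each executor asks KMS to decrypt, so authorization is delegated to the cloud's IAM policy on `kms:Decrypt`.

```python
import base64
import os


class FakeKms:
    """Offline stand-in for an AWS KMS client; a real implementation would
    call the boto3 `kms` client's encrypt/decrypt operations with a key ARN.
    The XOR "cipher" below is for illustration only, NOT real encryption."""

    def __init__(self):
        self._key_material = os.urandom(32)

    def encrypt(self, plaintext: bytes) -> bytes:
        return bytes(p ^ k for p, k in zip(plaintext, self._key_material))

    def decrypt(self, ciphertext: bytes) -> bytes:
        return bytes(c ^ k for c, k in zip(ciphertext, self._key_material))


class KmsSecretProvider:
    """Hypothetical provider illustrating the POC's approach: pass around a
    KMS-encrypted secret and decrypt it at each executor."""

    def __init__(self, kms):
        self.kms = kms

    def new_wrapped_secret(self):
        # Driver side: the plaintext secret never leaves this process;
        # only the wrapped (encrypted) form is handed to executor containers,
        # e.g. via their launch environment.
        secret = os.urandom(16)
        wrapped = base64.b64encode(self.kms.encrypt(secret)).decode("ascii")
        return secret, wrapped

    def unwrap_secret(self, wrapped: str) -> bytes:
        # Executor side: only containers whose role may call kms:Decrypt can
        # recover the secret, delegating authorization to the cloud.
        return self.kms.decrypt(base64.b64decode(wrapped))


kms = FakeKms()
secret, wrapped = KmsSecretProvider(kms).new_wrapped_secret()   # on the driver
recovered = KmsSecretProvider(kms).unwrap_secret(wrapped)       # on an executor
assert recovered == secret
```

Because the ciphertext is useless without a successful KMS `Decrypt` call, no shared file mount or Hadoop UGI is needed, which is what decouples the secret exchange from the cluster manager.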