[ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27941:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> Serverless Spark in the Cloud
> -----------------------------
>
>                 Key: SPARK-27941
>                 URL: https://issues.apache.org/jira/browse/SPARK-27941
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>    Affects Versions: 3.1.0
>            Reporter: Shuheng Dai
>            Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility of running Spark workloads in a serverless 
> manner, removing the need to provision, maintain, and manage a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively, it would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud-friendly so that it is easier for various cloud providers to 
> integrate. For example:
>  * authentication: IO and network encryption require authentication via a 
> securely shared secret, and the implementation of this is currently tied to 
> the cluster manager: YARN uses the Hadoop UGI, Kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so that it is possible to swap 
> in an implementation backed by a public cloud service. In the POC, this is 
> done by passing around an AWS KMS-encrypted secret and decrypting it at each 
> executor, which delegates authentication and authorization to the cloud.
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> currently requires changing a number of places in the Spark core package and 
> rebuilding the entire project. A pluggable scheduler, per 
> https://issues.apache.org/jira/browse/SPARK-19700, would make it easier to 
> add scheduler backends backed by different cloud providers.
>  * client-cluster communication: I am not very familiar with the network part 
> of the code base, so I might be wrong on this. My understanding is that the 
> code base assumes that the client and the cluster are on the same network and 
> that the nodes communicate with each other via hostname/IP. Security best 
> practice advises running the executors in a private, protected network, 
> which may be separate from the client machine's network. In a serverless 
> setting, that means the client needs to first launch the driver into the 
> private network, and the driver in turn starts the executors, potentially 
> doubling job initialization time. This could be solved by giving up complete 
> serverlessness and keeping a persistent host in the private network, or (I do 
> not have a POC, so I am not sure this actually works) by implementing 
> client-cluster communication via message queues in the cloud to get around 
> the network separation.
>  * shuffle storage and retrieval: external shuffle on YARN relies on the 
> existence of a persistent cluster that continues to serve shuffle files 
> beyond the lifecycle of the executors. This assumption no longer holds in a 
> serverless cluster with only transient containers. Pluggable remote shuffle 
> storage, per https://issues.apache.org/jira/browse/SPARK-25299, would make it 
> easier to introduce new cloud-backed shuffle implementations.
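The authentication bullet above can be sketched as a pluggable secret-provider interface. This is an illustrative sketch, not Spark's actual API: the `SecretProvider` name and methods are hypothetical, and a base64 codec stands in for the KMS encrypt/decrypt round trip described in the POC.

```python
import abc
import base64
import os


class SecretProvider(abc.ABC):
    """Hypothetical pluggable interface for distributing the Spark auth secret,
    decoupled from the cluster manager (YARN UGI, Kubernetes mounted file, ...)."""

    @abc.abstractmethod
    def wrap(self, secret: bytes) -> bytes:
        """Protect the secret for transport to executors."""

    @abc.abstractmethod
    def unwrap(self, wrapped: bytes) -> bytes:
        """Recover the secret on the executor side."""


class Base64SecretProvider(SecretProvider):
    """Stand-in implementation; a cloud-backed provider would instead call a
    service such as AWS KMS here, delegating authorization to the cloud."""

    def wrap(self, secret: bytes) -> bytes:
        return base64.b64encode(secret)

    def unwrap(self, wrapped: bytes) -> bytes:
        return base64.b64decode(wrapped)


# Driver side: generate a fresh secret and wrap it for the executors.
provider = Base64SecretProvider()
secret = os.urandom(16)
wrapped = provider.wrap(secret)

# Executor side: unwrap and verify, as a shared-secret handshake would.
assert provider.unwrap(wrapped) == secret
```

Swapping the provider implementation is then a configuration choice rather than a cluster-manager-specific code path.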
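The deployment & scheduler bullet can be illustrated with a minimal registry-style discovery sketch, in the spirit of the SPARK-19700 pluggable-scheduler proposal. All names here (`ClusterManager`, `FargateClusterManager`, the registry) are illustrative, not Spark's real interfaces.

```python
class ClusterManager:
    """Hypothetical pluggable backend, selected by master URL scheme."""

    def can_create(self, master_url: str) -> bool:
        raise NotImplementedError


class FargateClusterManager(ClusterManager):
    """Illustrative cloud-provider backend, registered without touching core."""

    def can_create(self, master_url: str) -> bool:
        return master_url.startswith("fargate://")


# A registry stands in for service-loader style discovery.
REGISTERED = [FargateClusterManager()]


def find_cluster_manager(master_url: str) -> ClusterManager:
    matches = [cm for cm in REGISTERED if cm.can_create(master_url)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one cluster manager for {master_url}")
    return matches[0]


manager = find_cluster_manager("fargate://my-cluster")
assert isinstance(manager, FargateClusterManager)
```

The point is that a new backend only registers itself; nothing in the core dispatch logic changes.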
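The queue-based client-cluster communication idea (explicitly flagged as untested in the bullet above) can be sketched with in-memory queues standing in for cloud message queues such as SQS. The point of the sketch is only that driver and executors exchange work and results without direct network reachability.

```python
import queue

# In-memory queues stand in for cloud message queues; in the real proposal
# these would live in the cloud, outside both private networks.
to_executor = queue.Queue()
to_driver = queue.Queue()


def driver_submit(task):
    """Driver enqueues a (function, argument) task without addressing any host."""
    to_executor.put(task)


def executor_poll_and_run():
    """Executor polls the queue, runs the task, and posts the result back."""
    fn, arg = to_executor.get()
    to_driver.put(fn(arg))


driver_submit((lambda x: x * 2, 21))
executor_poll_and_run()
assert to_driver.get() == 42
```

Whether this round trip is fast enough for Spark's control plane is exactly the open question the reporter raises.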
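The shuffle bullet can likewise be sketched as shuffle blocks written to a remote object store rather than executor-local disk, in the spirit of SPARK-25299. A plain dict stands in for the store (e.g. an S3 bucket), and the class and key layout are illustrative only.

```python
class ObjectStoreShuffle:
    """Illustrative remote shuffle backend: blocks outlive the writing executor
    because they live in an external store, not in a transient container."""

    def __init__(self, store):
        self.store = store  # a real backend would hold an object-store client

    def _key(self, shuffle_id: int, map_id: int, reduce_id: int) -> str:
        return f"shuffle/{shuffle_id}/{map_id}/{reduce_id}"

    def write_block(self, shuffle_id, map_id, reduce_id, data: bytes) -> None:
        self.store[self._key(shuffle_id, map_id, reduce_id)] = data

    def read_block(self, shuffle_id, map_id, reduce_id) -> bytes:
        return self.store[self._key(shuffle_id, map_id, reduce_id)]


store = {}
shuffle = ObjectStoreShuffle(store)
shuffle.write_block(0, 1, 2, b"partition-bytes")

# A different executor (possibly started after the writer's container is gone)
# can still fetch the block, which is what transient containers require.
assert shuffle.read_block(0, 1, 2) == b"partition-bytes"
```

This removes the need for the persistent external shuffle service that YARN assumes.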



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
