[ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shuheng Dai updated SPARK-27941:
--------------------------------
    Description: 
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example:
 * authentication: IO and network encryption require authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses Hadoop UGI, while Kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in an 
implementation backed by a public cloud service. In the POC, this is 
implemented by passing around an AWS KMS-encrypted secret and decrypting it at 
each executor, which delegates authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package and rebuilding 
the entire project. Having a pluggable scheduler per 
https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
different scheduler backends backed by different cloud providers.
 * client-cluster communication: I am not very familiar with the network part 
of the code base, so I might be wrong on this. My understanding is that the 
code base assumes that the client and the cluster are on the same network and 
that the nodes communicate with each other via hostname/IP. 
 * shuffle storage and retrieval: 
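The secret-sharing approach in the authentication bullet could be hidden behind a small pluggable interface. The sketch below is illustrative only: `SecretProvider` and the concrete class names are hypothetical, not Spark APIs, and the KMS round-trip is stubbed with base64 so the example is self-contained.

```python
# Hypothetical sketch: decoupling auth-secret distribution from the cluster
# manager behind a small interface. Names are illustrative, not Spark APIs.
import base64


class SecretProvider:
    """Pluggable source for the shared secret used by IO/RPC encryption."""
    def fetch_secret(self) -> str:
        raise NotImplementedError


class StaticSecretProvider(SecretProvider):
    """YARN/Kubernetes-style: the secret arrives out of band
    (Hadoop UGI credentials, or a file mounted on all pods)."""
    def __init__(self, secret: str):
        self._secret = secret

    def fetch_secret(self) -> str:
        return self._secret


class EnvelopeSecretProvider(SecretProvider):
    """Cloud-style, as in the POC: each executor receives an encrypted blob
    and decrypts it locally. A real implementation would call KMS Decrypt;
    base64 stands in here so the sketch runs without AWS credentials."""
    def __init__(self, encrypted_blob: str):
        self._blob = encrypted_blob

    def fetch_secret(self) -> str:
        # Stand-in for a KMS Decrypt call, which would also enforce
        # cloud-side authorization on the caller's identity.
        return base64.b64decode(self._blob).decode("utf-8")


if __name__ == "__main__":
    blob = base64.b64encode(b"s3cret").decode("ascii")
    providers = [StaticSecretProvider("s3cret"), EnvelopeSecretProvider(blob)]
    # Every executor ends up with the same shared secret regardless of
    # how it was distributed.
    assert len({p.fetch_secret() for p in providers}) == 1
```

The point of the interface is that the RPC/encryption layer only ever calls `fetch_secret()`; which transport delivered the secret becomes a deployment choice rather than a property of the cluster manager.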
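Likewise, a pluggable scheduler in the spirit of SPARK-19700 could let backends register themselves under a master-URL scheme instead of being hard-coded in the core package. A minimal registry sketch, with hypothetical names throughout:

```python
# Hypothetical sketch of a pluggable scheduler-backend registry: backends
# register under a master-URL scheme rather than being wired into core.
# All names here are illustrative, not Spark's actual API.

_BACKENDS = {}


def register_backend(scheme):
    """Associate a scheduler-backend factory with a master-URL scheme."""
    def wrap(factory):
        _BACKENDS[scheme] = factory
        return factory
    return wrap


def create_backend(master_url):
    """Look up the factory for the URL's scheme and build a backend."""
    scheme = master_url.split("://", 1)[0]
    if scheme not in _BACKENDS:
        raise ValueError(f"no scheduler backend registered for scheme {scheme!r}")
    return _BACKENDS[scheme](master_url)


@register_backend("fargate")
def fargate_backend(master_url):
    # A real backend would launch executors as Fargate tasks; stubbed here.
    return {"kind": "fargate", "master": master_url}


if __name__ == "__main__":
    backend = create_backend("fargate://us-west-2")
    assert backend["kind"] == "fargate"
```

With this shape, a cloud provider ships its backend as a separate artifact that registers itself at load time, and nothing in core needs to change or be rebuilt.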

  was:
Public cloud providers have started offering serverless container services. For 
example, AWS offers Fargate [https://aws.amazon.com/fargate/]

This opens up the possibility to run Spark workloads in a serverless manner and 
remove the need to provision and maintain a cluster. POC: 
[https://github.com/mu5358271/spark-on-fargate]

While it might not make sense for Spark to favor any particular cloud provider 
or to support a large number of cloud providers natively, it would make sense 
to make some of the internal Spark components more pluggable and cloud-friendly 
so that it is easier for various cloud providers to integrate. For example:
 * authentication: IO and network encryption require authentication via 
securely sharing a secret, and the implementation of this is currently tied to 
the cluster manager: YARN uses Hadoop UGI, while Kubernetes uses a shared file 
mounted on all pods. These can be decoupled so it is possible to swap in an 
implementation backed by a public cloud service. In the POC, this is 
implemented by passing around an AWS KMS-encrypted secret and decrypting it at 
each executor, which delegates authentication and authorization to the cloud.
 * deployment & scheduler: adding a new cluster manager and scheduler backend 
requires changing a number of places in the Spark core package, and rebuilding 
the entire project. 
 * driver-executor communication: 
 * shuffle storage and retrieval: 


> Serverless Spark in the Cloud
> -----------------------------
>
>                 Key: SPARK-27941
>                 URL: https://issues.apache.org/jira/browse/SPARK-27941
>             Project: Spark
>          Issue Type: New Feature
>          Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Shuheng Dai
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
