[
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062198#comment-17062198
]
Dongjoon Hyun commented on SPARK-27941:
---------------------------------------
Hi, [~mu5358271].
Is there any update on this issue?
> Serverless Spark in the Cloud
> -----------------------------
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
> Issue Type: New Feature
> Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
> Affects Versions: 3.1.0
> Reporter: Shuheng Dai
> Priority: Major
>
> Public cloud providers have started offering serverless container services.
> For example, AWS offers Fargate ([https://aws.amazon.com/fargate/]).
> This opens up the possibility of running Spark workloads in a serverless manner
> and removes the need to provision, maintain, and manage a cluster. POC:
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud
> provider or to support a large number of cloud providers natively, it would
> make sense to make some of the internal Spark components more pluggable and
> cloud-friendly so that it is easier for various cloud providers to integrate.
> For example,
> * authentication: I/O and network encryption require authentication via a
> securely shared secret, and the implementation of this is currently tied to
> the cluster manager: YARN uses Hadoop UGI, and Kubernetes uses a shared file
> mounted on all pods. These could be decoupled so that it is possible to swap
> in an implementation backed by a public cloud service. In the POC, this is
> implemented by passing around an AWS KMS-encrypted secret and decrypting it at
> each executor, which delegates authentication and authorization to the cloud
> (see the KMS sketch after this list).
> * deployment & scheduler: adding a new cluster manager and scheduler backend
> requires changing a number of places in the Spark core package and rebuilding
> the entire project. Having a pluggable scheduler per
> https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add
> different scheduler backends backed by different cloud providers.
> * client-cluster communication: I am not very familiar with the network part
> of the code base, so I might be wrong on this. My understanding is that the
> code base assumes the client and the cluster are on the same network and that
> the nodes communicate with each other via hostname/IP. As a security best
> practice, it is advised to run the executors in a private, protected network,
> which may be separate from the client machine's network. Since we are
> serverless, the client needs to first launch the driver into the private
> network, and the driver in turn starts the executors, potentially doubling
> job initialization time. This can be solved by giving up complete
> serverlessness and keeping a persistent host in the private network, or (I do
> not have a POC, so I am not sure whether this actually works) by implementing
> client-cluster communication via message queues in the cloud to get around
> the network separation (see the queue sketch after this list).
> * shuffle storage and retrieval: external shuffle on YARN relies on the
> existence of a persistent cluster that continues to serve shuffle files
> beyond the lifecycle of the executors. This assumption no longer holds in a
> serverless cluster with only transient containers. Pluggable remote shuffle
> storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it
> easier to introduce new cloud-backed shuffle implementations (see the
> object-store sketch after this list).
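>
> KMS sketch: a minimal illustration of the secret sharing described in the
> authentication bullet, assuming the AWS SDK for Java v2. The KmsSecretBroker
> name and the key id are made up; the point is that only principals whose IAM
> role may call kms:Decrypt on the key can recover the secret, which is how
> authorization gets delegated to the cloud.
> {code:scala}
> import java.util.Base64
>
> import software.amazon.awssdk.core.SdkBytes
> import software.amazon.awssdk.services.kms.KmsClient
> import software.amazon.awssdk.services.kms.model.{DecryptRequest, EncryptRequest}
>
> object KmsSecretBroker {
>   private val kms = KmsClient.create()
>
>   // Driver side: encrypt a freshly generated auth secret under a KMS key.
>   def seal(keyId: String, secret: String): String = {
>     val resp = kms.encrypt(
>       EncryptRequest.builder()
>         .keyId(keyId)
>         .plaintext(SdkBytes.fromUtf8String(secret))
>         .build())
>     Base64.getEncoder.encodeToString(resp.ciphertextBlob().asByteArray())
>   }
>
>   // Executor side: decrypt the sealed secret; KMS resolves the key from the
>   // ciphertext and enforces the caller's IAM permissions.
>   def open(ciphertextB64: String): String = {
>     val resp = kms.decrypt(
>       DecryptRequest.builder()
>         .ciphertextBlob(SdkBytes.fromByteArray(Base64.getDecoder.decode(ciphertextB64)))
>         .build())
>     resp.plaintext().asUtf8String()
>   }
> }
> {code}
> The decrypted value could then feed the existing spark.authenticate.secret
> mechanism, leaving the rest of the RPC and I/O encryption path unchanged.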
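>
> Cluster-manager sketch: a rough illustration of the shape a pluggable backend
> could take for the deployment & scheduler bullet. Spark already discovers
> ExternalClusterManager implementations through java.util.ServiceLoader, but
> the trait is private[spark], so a plug-in has to live under the
> org.apache.spark package tree; the fargate:// scheme and the
> FargateSchedulerBackend below are hypothetical.
> {code:scala}
> // Hypothetical plug-in. ExternalClusterManager is currently private[spark],
> // so third-party code must sit under the org.apache.spark package tree --
> // exactly the friction SPARK-19700 is about removing.
> package org.apache.spark.scheduler.cloud
>
> import org.apache.spark.SparkContext
> import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
> import org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend
>
> class FargateClusterManager extends ExternalClusterManager {
>
>   // Claim master URLs of the (made-up) form "fargate://<cluster-name>".
>   override def canCreate(masterURL: String): Boolean =
>     masterURL.startsWith("fargate://")
>
>   override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
>     new TaskSchedulerImpl(sc)
>
>   override def createSchedulerBackend(
>       sc: SparkContext,
>       masterURL: String,
>       scheduler: TaskScheduler): SchedulerBackend =
>     new FargateSchedulerBackend(scheduler.asInstanceOf[TaskSchedulerImpl], sc)
>
>   override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
>     scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
> }
>
> // Placeholder backend: a real one would override start()/stop() to launch and
> // tear down executor containers through the Fargate RunTask API.
> private class FargateSchedulerBackend(scheduler: TaskSchedulerImpl, sc: SparkContext)
>   extends CoarseGrainedSchedulerBackend(scheduler, sc.env.rpcEnv)
> {code}
> Such a plug-in would be registered via a
> META-INF/services/org.apache.spark.scheduler.ExternalClusterManager file
> naming the class, which is how the built-in YARN and Kubernetes managers are
> picked up today.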
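>
> Queue sketch: an illustration of the client-cluster communication idea (not
> wired into Spark's RPC layer), assuming the AWS SDK for Java v2 and Amazon
> SQS. The QueueChannel name, queue URL, and message format are made up; the
> point is that a cloud queue can serve as the control channel so neither side
> needs inbound connectivity from the other's network.
> {code:scala}
> import scala.jdk.CollectionConverters._
>
> import software.amazon.awssdk.services.sqs.SqsClient
> import software.amazon.awssdk.services.sqs.model.{ReceiveMessageRequest, SendMessageRequest}
>
> object QueueChannel {
>   private val sqs = SqsClient.create()
>
>   // Client side (outside the private network): enqueue a control message,
>   // e.g. a serialized job submission.
>   def send(queueUrl: String, body: String): Unit =
>     sqs.sendMessage(SendMessageRequest.builder()
>       .queueUrl(queueUrl)
>       .messageBody(body)
>       .build())
>
>   // Driver side (inside the private network): long-poll for control messages.
>   // Only outbound HTTPS calls to SQS are needed, so the network separation
>   // between client and cluster is never crossed directly.
>   def poll(queueUrl: String): Seq[String] =
>     sqs.receiveMessage(ReceiveMessageRequest.builder()
>         .queueUrl(queueUrl)
>         .waitTimeSeconds(10)
>         .maxNumberOfMessages(10)
>         .build())
>       .messages().asScala.toSeq.map(_.body())
> }
> {code}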
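>
> Object-store sketch: an illustration of the storage side of a cloud-backed
> shuffle for the last bullet, assuming the AWS SDK for Java v2 and S3. This is
> not the SPARK-25299 plugin API; the class name and key layout are made up.
> Because blocks live in the object store, they outlive the executor containers
> that produced them, removing the need for a node-local shuffle service.
> {code:scala}
> import software.amazon.awssdk.core.sync.RequestBody
> import software.amazon.awssdk.services.s3.S3Client
> import software.amazon.awssdk.services.s3.model.{GetObjectRequest, PutObjectRequest}
>
> class S3ShuffleStore(s3: S3Client, bucket: String, appId: String) {
>
>   // One object per (shuffle, map task, reduce partition).
>   private def key(shuffleId: Int, mapId: Long, reduceId: Int): String =
>     s"$appId/shuffle_$shuffleId/map_$mapId/reduce_$reduceId"
>
>   // Map side: upload one shuffle partition's bytes.
>   def writeBlock(shuffleId: Int, mapId: Long, reduceId: Int, data: Array[Byte]): Unit =
>     s3.putObject(
>       PutObjectRequest.builder().bucket(bucket).key(key(shuffleId, mapId, reduceId)).build(),
>       RequestBody.fromBytes(data))
>
>   // Reduce side: fetch the partition even if the producing container is gone.
>   def readBlock(shuffleId: Int, mapId: Long, reduceId: Int): Array[Byte] =
>     s3.getObjectAsBytes(
>       GetObjectRequest.builder().bucket(bucket).key(key(shuffleId, mapId, reduceId)).build())
>       .asByteArray()
> }
> {code}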