Martin Grund created SPARK-39375:
------------------------------------

             Summary: SPIP: Spark Connect - A client and server interface for 
Apache Spark.
                 Key: SPARK-39375
                 URL: https://issues.apache.org/jira/browse/SPARK-39375
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, Spark Core, SQL
    Affects Versions: 3.0.0
            Reporter: Martin Grund


Please find the full document for discussion here: [Spark Connect SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]. Below, we reference only the introduction.
h2. What are you trying to do?

While Spark is used extensively, it was designed nearly a decade ago, which, in the age of serverless computing and ubiquitous programming language use, poses a number of limitations. Most of these stem from the tightly coupled Spark driver architecture and the fact that clusters are typically shared across users:
# {*}Lack of built-in remote connectivity{*}: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project [Apache Livy|https://livy.apache.org/].
# {*}Lack of rich developer experience{*}: the current architecture and APIs do not cater for interactive data exploration (as done with notebooks), nor do they allow for building the rich developer experiences common in modern code editors.
# {*}Stability{*}: with the current shared driver architecture, a user causing a critical exception (e.g. an OOM) brings the whole cluster down for all users.
# {*}Upgradability{*}: the current entangling of platform and client APIs (e.g. first- and third-party dependencies on the classpath) does not allow for seamless upgrades between Spark versions, and with that hinders the adoption of new features.

 

We propose to overcome these challenges by building on the DataFrame API and the underlying unresolved logical plans. The DataFrame API is widely used and makes it easy to iteratively express complex logic. We will introduce {_}Spark Connect{_}, a remote variant of the DataFrame API that separates the client from the Spark server. With Spark Connect, Spark becomes decoupled and gains built-in remote connectivity: the client SDK can be used for interactive data exploration and connects to the server for DataFrame operations.
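The mechanics can be sketched in miniature: a client-side DataFrame records each operation as a node in an unresolved logical plan and ships a serialized form of that plan to the server, which resolves and executes it. The Python sketch below is purely illustrative and uses hypothetical names (`RemoteDataFrame`, `to_request`); the actual proposal would encode plans as protobuf messages sent over gRPC rather than JSON.

```python
import json


class RemoteDataFrame:
    """Illustrative client-side handle: records operations as an
    unresolved logical plan instead of executing them locally."""

    def __init__(self, plan):
        self.plan = plan

    @classmethod
    def table(cls, name):
        # Leaf node of the plan: an unresolved table reference.
        return cls({"op": "read", "table": name})

    def filter(self, condition):
        # Each transformation wraps the previous plan as its child.
        return RemoteDataFrame(
            {"op": "filter", "condition": condition, "child": self.plan}
        )

    def select(self, *columns):
        return RemoteDataFrame(
            {"op": "project", "columns": list(columns), "child": self.plan}
        )

    def to_request(self):
        # Stand-in for serialization; Spark Connect itself would use a
        # protobuf-encoded plan sent over gRPC, not JSON.
        return json.dumps(self.plan)


# The client never touches cluster state: it only builds and ships a plan.
df = RemoteDataFrame.table("events").filter("ts > '2022-01-01'").select("user", "ts")
print(df.to_request())
```

Because the client holds only an operation tree, it needs no Spark classpath, which is exactly what enables thin, versionless clients in any language.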

 

Spark Connect will benefit Spark developers in several ways: the decoupled architecture improves stability, as clients are separated from the driver. From the Spark Connect client's perspective, Spark will be (almost) versionless, enabling seamless upgrades, as server APIs can evolve without affecting the client API. The decoupled client-server architecture can also be leveraged to build close integrations with local developer tooling. Finally, separating the client process from the Spark server process improves Spark's overall security posture by avoiding the tight coupling of the client inside the Spark runtime environment.

 

Spark Connect will strengthen Spark's position as the modern unified engine for large-scale data analytics and expand its applicability to use cases and developers we could not reach with the current setup: Spark will become ubiquitously usable, as the DataFrame API can be used from (almost) any programming language.


