[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-39375:
---------------------------------
    Description: 
Please find the full document for discussion here: [Spark Connect 
SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]. Below, we reference only the introduction.
h2. What are you trying to do?

While Spark is used extensively, it was designed nearly a decade ago, which, in 
the age of serverless computing and ubiquitous programming language use, poses 
a number of limitations. Most of these limitations stem from the tightly coupled 
Spark driver architecture and the fact that clusters are typically shared across users:
# {*}Lack of built-in remote connectivity{*}: the Spark driver runs both the client application and the scheduler, which results in a heavyweight architecture that requires proximity to the cluster. There is no built-in capability to remotely connect to a Spark cluster in languages other than SQL, so users rely on external solutions such as the inactive project [Apache Livy|https://livy.apache.org/].
# {*}Lack of rich developer experience{*}: the current architecture and APIs do not cater to interactive data exploration (as done in notebooks), nor do they allow for building out the rich developer experience common in modern code editors.
# {*}Stability{*}: with the current shared driver architecture, a single user causing a critical exception (e.g. OOM) brings the whole cluster down for all users.
# {*}Upgradability{*}: the current entanglement of platform and client APIs (e.g. first- and third-party dependencies on the classpath) does not allow for seamless upgrades between Spark versions and, with that, hinders new feature adoption.

 

We propose to overcome these challenges by building on the DataFrame API and 
the underlying unresolved logical plans. The DataFrame API is widely used and 
makes it very easy to iteratively express complex logic. We will introduce 
{_}Spark Connect{_}, a remote option of the DataFrame API that separates the 
client from the Spark server. With Spark Connect, Spark becomes decoupled, 
allowing for built-in remote connectivity: the decoupled client SDK can be used 
for interactive data exploration and connects to the server for DataFrame 
operations.
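
To make the decoupling concrete, here is a minimal sketch of what a client-side session could look like, assuming a Spark Connect server is already running (e.g. started with the {{start-connect-server.sh}} script shipped with Spark 3.4+) and a PySpark client with Connect support is installed; the host and port below are placeholders.
{code:python}
# Minimal Spark Connect client sketch (assumes a reachable Connect server).
from pyspark.sql import SparkSession

# The builder's remote() option points the thin client at the server via the
# sc:// scheme; no driver or JVM is started in this client process.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# DataFrame operations are turned into unresolved logical plans, sent to the
# server, executed there, and only the results come back to the client.
df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
df.show()
{code}
The point of the sketch is that the client is just an SDK speaking to a remote endpoint, which is what enables the interactive exploration and tooling integrations described below.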

 

Spark Connect will benefit Spark developers in several ways: the decoupled 
architecture will improve stability, as clients are separated from the driver. 
From the Spark Connect client perspective, Spark will be (almost) versionless, 
enabling seamless upgrades because server APIs can evolve without affecting the 
client API. The decoupled client-server architecture can also be leveraged to 
build close integrations with local developer tooling. Finally, separating the 
client process from the Spark server process will improve Spark’s overall 
security posture by avoiding the tight coupling of the client inside the Spark 
runtime environment.

 

Spark Connect will strengthen Spark’s position as the modern unified engine for 
large-scale data analytics and expand its applicability to use cases and 
developers we could not reach with the current setup: Spark will become 
ubiquitously usable, as the DataFrame API can be used from (almost) any 
programming language.
 
||Key||Summary||Status||Assignee||
|SPARK-41282|Feature parity: Column API in Spark Connect|REOPENED|[Ruifeng Zheng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=podongfeng]|
|SPARK-41283|Feature parity: Functions API in Spark Connect|RESOLVED|[Ruifeng Zheng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=podongfeng]|
|SPARK-41279|Feature parity: DataFrame API in Spark Connect|OPEN|[Ruifeng Zheng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=podongfeng]|
|SPARK-41281|Feature parity: SparkSession API in Spark Connect|OPEN|[Ruifeng Zheng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=podongfeng]|
|SPARK-41284|Feature parity: I/O in Spark Connect|REOPENED|[Rui Wang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=amaliujia]|
|SPARK-41289|Feature parity: Catalog API|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-41286|Build, package and infrastructure for Spark Connect|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-40451|Type annotations for Spark Connect Python client|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-40452|Developer documentation|OPEN|_Unassigned_|
|SPARK-41285|Test basework and improvement of test coverage in Spark Connect|OPEN|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-41288|Server-specific improvement, error handling and API|OPEN|[Martin Grund|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=grundprinzip-db]|
|SPARK-41305|Connect Proto Completeness|REOPENED|[Rui Wang|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=amaliujia]|
|SPARK-41531|Debugging and Stability|OPEN|_Unassigned_|
|SPARK-41625|Feature parity: Streaming support|OPEN|_Unassigned_|
|SPARK-41627|Spark Connect Server Development|OPEN|_Unassigned_|
|SPARK-41642|Deduplicate docstrings in Python Spark Connect|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-41651|Test parity: pyspark.sql.tests.test_dataframe|RESOLVED|[Sandeep Singh|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=techaddict]|
|SPARK-41652|Test parity: pyspark.sql.tests.test_functions|RESOLVED|[Sandeep Singh|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=techaddict]|
|SPARK-41661|Support for User-defined Functions in Python|RESOLVED|[Xinrong Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-41653|Test parity: enable doctests in Spark Connect|RESOLVED|[Sandeep Singh|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=techaddict]|
|SPARK-41932|Bootstrapping Spark Connect|OPEN|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-41997|Test parity: pyspark.sql.tests.test_readwriter|RESOLVED|_Unassigned_|
|SPARK-42006|Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-42018|Test parity: pyspark.sql.tests.test_types|RESOLVED|_Unassigned_|
|SPARK-42156|Support client-side retries in Spark Connect Python client|RESOLVED|[Martin Grund|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=grundprinzip-db]|
|SPARK-42264|Test Parity: pyspark.sql.tests.test_udf and pyspark.sql.tests.pandas.test_pandas_udf|RESOLVED|[Xinrong Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-42374|User-facing documentation|OPEN|[Haejoon Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|
|SPARK-42393|Support for Pandas/Arrow Functions API|RESOLVED|[Xinrong Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-42471|Distributed ML <> spark connect|OPEN|_Unassigned_|
|SPARK-42497|Support of pandas API on Spark for Spark Connect|IN PROGRESS|_Unassigned_|
|SPARK-42499|Support for Runtime SQL configuration|RESOLVED|[Takuya Ueshin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ueshin]|
|SPARK-43289|PySpark UDF supports python package dependencies|OPEN|[Weichen Xu|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=weichenxu123]|
|SPARK-43612|Python: Artifact transfer from Scala/JVM client to Server|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-43747|Implement the pyfile support in SparkSession.addArtifacts|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-43768|Implement the archive support in SparkSession.addArtifacts|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-43795|Remove parameters not used for SparkConnectPlanner|RESOLVED|[jiaan.geng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=beliefer]|
|SPARK-43829|Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset|OPEN|_Unassigned_|
|SPARK-44135|Document Spark Connect only API in PySpark|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
|SPARK-44290|Session-based files and archives in Spark Connect|RESOLVED|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=gurwls223]|
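Several items in the table above concern the Python client, e.g. User-defined Functions in Python (SPARK-41661) and session-based artifact transfer via SparkSession.addArtifacts (SPARK-43747, SPARK-43768, SPARK-44290). As a hedged sketch of how these pieces fit together from the client side (the file name and server address are placeholders, and the exact addArtifacts signature may differ between versions):
{code:python}
# Sketch only: ship a client-local Python file to the Connect server and use
# a Python UDF. Assumes a running Spark Connect server and a PySpark client
# new enough to expose addArtifacts (3.5+).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Hypothetical helper file on the client machine; uploading it as a pyfile
# makes it importable by UDFs executed on the server side.
spark.addArtifacts("extra_helpers.py", pyfile=True)

@udf("string")
def describe(value):
    # In a real use case this could import functions from extra_helpers.
    return f"row {value}"

spark.range(3).select(describe("id")).show()
{code}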


> SPIP: Spark Connect - A client and server interface for Apache Spark
> --------------------------------------------------------------------
>
>                 Key: SPARK-39375
>                 URL: https://issues.apache.org/jira/browse/SPARK-39375
>             Project: Spark
>          Issue Type: Epic
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Martin Grund
>            Priority: Critical
>              Labels: SPIP
>


