[ https://issues.apache.org/jira/browse/SPARK-39375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17688393#comment-17688393 ]

Martin Grund commented on SPARK-39375:
--------------------------------------

[~tgraves] Currently, Python UDFs are implemented exactly as they are today: 
the UDF is serialized into bytes that are sent from the Python process via 
Py4J to the driver and then to the executors, where they are deserialized and 
executed. The only difference under Spark Connect is that we no longer use 
Py4J but leverage the protocol directly. This is backward compatible and lets 
us build on the existing architecture going forward. Please keep in mind that 
today the Python process for UDF execution is started by the executor as part 
of query execution; depending on the setup, that process is either kept 
around or destroyed at the end of processing. None of this behavior has 
changed, which means that all existing applications using PySpark will simply 
continue to work.
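
As a concrete illustration (the example itself is not from this issue), the 
snippet below is a minimal sketch of that unchanged UDF path: the decorated 
function is pickled on the client and executed by the executor-managed Python 
worker, whether the bytes travel via Py4J or via the Connect protocol.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

@udf(returnType=LongType())
def plus_one(x):
    return x + 1

# The function body is serialized (pickled) on the client and shipped to the
# executors, where a Python worker deserializes and runs it during query
# execution. Under Spark Connect the same bytes travel over the Connect
# protocol instead of Py4J; nothing about the UDF itself changes.
spark.range(5).withColumn("y", plus_one("id")).show()
{code}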

Similarly, we are not changing the assumptions about which Python version has 
to be present where: as before, the Python version on the client has to match 
the Python version on the executors.

The reason we did not create a design for this is that we did not change the 
semantics, the logic, or the implementation. This is very similar to the way 
we translate the Spark Connect proto API into Catalyst plans.
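
For intuition only, here is a self-contained sketch of what that translation 
step looks like. The types below are hypothetical stand-ins, not Spark's real 
proto or Catalyst classes: each proto relation node is mapped onto a plan 
node, recursing into children.

{code:python}
# Hypothetical stand-in types; not Spark's actual proto or Catalyst classes.
from dataclasses import dataclass
from typing import Any

@dataclass
class ReadRel:              # stand-in for a Connect proto "Read" relation
    table_name: str

@dataclass
class FilterRel:            # stand-in for a Connect proto "Filter" relation
    condition: str
    input: Any

@dataclass
class UnresolvedRelation:   # stand-in for an unresolved leaf plan node
    table_name: str

@dataclass
class Filter:               # stand-in for a Filter plan node
    condition: str
    child: Any

def translate(rel):
    """Map each proto relation onto a plan node, recursing into children."""
    if isinstance(rel, ReadRel):
        return UnresolvedRelation(rel.table_name)
    if isinstance(rel, FilterRel):
        return Filter(rel.condition, translate(rel.input))
    raise ValueError(f"unsupported relation: {rel!r}")

# e.g. a "SELECT * FROM t WHERE id > 1"-shaped proto tree:
plan = translate(FilterRel("id > 1", ReadRel("t")))
{code}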



> SPIP: Spark Connect - A client and server interface for Apache Spark
> --------------------------------------------------------------------
>
>                 Key: SPARK-39375
>                 URL: https://issues.apache.org/jira/browse/SPARK-39375
>             Project: Spark
>          Issue Type: Epic
>          Components: Connect
>    Affects Versions: 3.4.0
>            Reporter: Martin Grund
>            Assignee: Martin Grund
>            Priority: Critical
>              Labels: SPIP
>
> Please find the full document for discussion here: [Spark Connect 
> SPIP|https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj]
>  Below, we have included just the introduction.
> h2. What are you trying to do?
> While Spark is used extensively, it was designed nearly a decade ago, which, 
> in the age of serverless computing and ubiquitous programming language use, 
> poses a number of limitations. Most of the limitations stem from the tightly 
> coupled Spark driver architecture and the fact that clusters are typically 
> shared across users:
> (1) *Lack of built-in remote connectivity*: the Spark driver runs both the 
> client application and the scheduler, which results in a heavyweight 
> architecture that requires proximity to the cluster. There is no built-in 
> capability to remotely connect to a Spark cluster in languages other than 
> SQL, so users rely on external solutions such as the inactive project 
> [Apache Livy|https://livy.apache.org/].
> (2) *Lack of rich developer experience*: the current architecture and APIs 
> do not cater for interactive data exploration (as done with notebooks), nor 
> do they allow for building out the rich developer experience common in 
> modern code editors.
> (3) *Stability*: with the current shared driver architecture, users causing 
> critical exceptions (e.g. OOM) bring the whole cluster down for all users.
> (4) *Upgradability*: the current entangling of platform and client APIs 
> (e.g. first- and third-party dependencies in the classpath) does not allow 
> for seamless upgrades between Spark versions (and, with that, hinders new 
> feature adoption).
>  
> We propose to overcome these challenges by building on the DataFrame API and 
> the underlying unresolved logical plans. The DataFrame API is widely used and 
> makes it very easy to iteratively express complex logic. We will introduce 
> _Spark Connect_, a remote option of the DataFrame API that separates the 
> client from the Spark server. With Spark Connect, Spark will become 
> decoupled, allowing for built-in remote connectivity: The decoupled client 
> SDK can be used to run interactive data exploration and connect to the server 
> for DataFrame operations. 
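>
> As a hedged illustration of what this decoupling enables (the endpoint 
> below is a placeholder, not part of the proposal), a thin client opens a 
> remote session and runs DataFrame operations without living inside the 
> cluster:
> {code:python}
> from pyspark.sql import SparkSession
>
> # Placeholder endpoint; "sc://" is the Spark Connect URI scheme and 15002
> # is the default Connect server port.
> spark = SparkSession.builder.remote("sc://my-cluster:15002").getOrCreate()
>
> # DataFrame operations build unresolved logical plans on the client and are
> # sent to the server for execution; only results come back.
> spark.range(10).filter("id % 2 = 0").show()
> {code}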
>  
> Spark Connect will benefit Spark developers in different ways: The decoupled 
> architecture will result in improved stability, as clients are separated from 
> the driver. From the Spark Connect client perspective, Spark will be (almost) 
> versionless, and thus enable seamless upgradability, as server APIs can 
> evolve without affecting the client API. The decoupled client-server 
> architecture can be leveraged to build close integrations with local 
> developer tooling. Finally, separating the client process from the Spark 
> server process will improve Spark’s overall security posture by avoiding the 
> tight coupling of the client inside the Spark runtime environment.
>  
> Spark Connect will strengthen Spark’s position as the modern unified engine 
> for large-scale data analytics and expand applicability to use cases and 
> developers we could not reach with the current setup: Spark will become 
> ubiquitously usable as the DataFrame API can be used with (almost) any 
> programming language.


