[ 
https://issues.apache.org/jira/browse/SPARK-47540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-47540.
----------------------------------
    Fix Version/s: 4.0.0
         Assignee: Hyukjin Kwon
       Resolution: Done

> SPIP: Pure Python Package (Spark Connect)
> -----------------------------------------
>
>                 Key: SPARK-47540
>                 URL: https://issues.apache.org/jira/browse/SPARK-47540
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect, PySpark
>    Affects Versions: 4.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Critical
>             Fix For: 4.0.0
>
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
> As part of the [Spark 
> Connect|https://spark.apache.org/docs/latest/spark-connect-overview.html] 
> development, we have introduced Scala and Python clients. While the Scala 
> client is already provided as a separate library and is available in Maven, 
> the Python client is not. This proposal aims for end users to install the 
> pure Python package for Spark Connect by using pip install pyspark-connect.
> The pure Python package contains only Python source code without jars, which 
> reduces the size of the package significantly and widens the use cases of 
> PySpark. See also [Introducing Spark Connect - The Power of Apache Spark, 
> Everywhere'|https://www.databricks.com/blog/2022/07/07/introducing-spark-connect-the-power-of-apache-spark-everywhere.html].
> *Q2. What problem is this proposal NOT designed to solve?*
> This proposal does not aim to:
> - Change the existing PySpark package, e.g., pip install pyspark is not 
> affected.
> - Implement full compatibility with classic PySpark, e.g., implementing the 
> RDD API.
> - Address how to launch the Spark Connect server; users launch the server 
> themselves.
> - Provide a local mode. Without launching a Spark Connect server, users 
> cannot use this package.
> - Change the [official release 
> channel|https://spark.apache.org/downloads.html]; only PyPI is affected.
> *Q3. How is it done today, and what are the limits of current practice?*
> Currently, we run pip install pyspark, and the package is over 300MB because 
> of the bundled jars. In addition, PySpark requires you to set up other 
> environments, such as a JDK installation.
> This is not suitable when the running environment and resources are limited, 
> for example on edge devices such as smart home devices.
> Requiring a non-Python environment is also not Python friendly.
> *Q4. What is new in your approach and why do you think it will be successful?*
> It provides a pure Python library, which eliminates other environment 
> requirements such as the JDK, reduces resource usage by decoupling the Spark 
> Driver, and reduces the package size.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Users who want to leverage Spark in a limited environment, or who want to 
> decouple the JVM running the Spark Driver in order to run Spark as a service. 
> They can simply pip install pyspark-connect, which does not require other 
> dependencies (except Python dependencies, just like other Python libraries). 
> *Q6. What are the risks?*
> Because we do not change the existing PySpark package, I do not see any major 
> risk in classic PySpark itself. We will reuse the same Python source, and 
> therefore we should make sure no Py4J is used, and no JVM access is made. 
> This requirement might confuse developers. At the very least, we should add 
> a dedicated CI job to make sure the pure Python package works.
> *Q7. How long will it take?*
> I expect around one month, including CI setup. In fact, the prototype is 
> already ready, so I expect this to be done sooner.
> *Q8. What are the mid-term and final “exams” to check for success?*
> The mid-term goal is to set up a scheduled CI job that builds the pure Python 
> library and runs all the tests against it.
> The final goal would be to properly test the end-to-end use case from pip 
> installation.
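The workflow the proposal describes in Q1 and Q2 would look roughly like the sketch below. This is a hypothetical illustration: pyspark-connect is the package name proposed in this SPIP (not yet released at the time of the issue), and sc://localhost:15002 is an assumed server address; per Q2, the Spark Connect server must already be running, since the package does not launch one.

```shell
# Install the proposed pure Python package (hypothetical until released):
# no bundled jars, so no JDK is required on the client machine.
pip install pyspark-connect

# Point the client at an already-running Spark Connect server.
# SPARK_REMOTE is the environment variable PySpark reads for the
# Spark Connect endpoint; the address here is an assumed example.
export SPARK_REMOTE="sc://localhost:15002"

# All execution happens on the server; only the Python client runs locally.
python -c "from pyspark.sql import SparkSession; SparkSession.builder.getOrCreate().range(3).show()"
```

Equivalently, the endpoint can be passed in code via SparkSession.builder.remote("sc://localhost:15002") instead of the environment variable.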



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
