[jira] [Commented] (SPARK-34849) SPIP: Support pandas API layer on PySpark

Hyukjin Kwon (Jira) Wed, 24 Mar 2021 00:31:16 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307610#comment-17307610
 ]


Hyukjin Kwon commented on SPARK-34849:
--------------------------------------

The official SPIP preparation is in progress. It will be sent to the dev 
mailing list late this week (or early next week).

> SPIP: Support pandas API layer on PySpark
> -----------------------------------------
>
>                 Key: SPARK-34849
>                 URL: https://issues.apache.org/jira/browse/SPARK-34849
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Haejoon Lee
>            Priority: Blocker
>              Labels: SPIP
>
> This is a SPIP for porting the [Koalas 
> project|https://github.com/databricks/koalas] to PySpark, which was 
> previously discussed on the dev mailing list under the same title, [[DISCUSS] Support 
> pandas API layer on 
> PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
>  
> *Q1. What are you trying to do? Articulate your objectives using absolutely 
> no jargon.*
>  # Port Koalas into PySpark to support the pandas API layer on PySpark, so that:
>  - Users can easily leverage their existing Spark cluster to scale their 
> pandas workloads.
>  - Users can plot and draw charts in PySpark.
>  - Users can easily switch between pandas APIs and PySpark APIs.
> *Q2. What problem is this proposal NOT designed to solve?*
> Some pandas APIs are explicitly unsupported. For example, {{memory_usage}} 
> in pandas will not be supported because, unlike in pandas, DataFrames in 
> Spark are not materialized in memory.
> This does not replace the existing PySpark APIs. The PySpark API has many 
> users and existing code in many projects, and many PySpark users still 
> prefer Spark’s immutable DataFrame API to the pandas API.
> *Q3. How is it done today, and what are the limits of current practice?*
> Current practice has two limitations:
>  # Many features that are very commonly used in data science are missing 
> from Apache Spark. Specifically, plotting and charting are missing, and 
> these are among the most important features that almost every data 
> scientist uses in their daily work.
>  # Data scientists tend to prefer pandas APIs, but it is very hard to convert 
> pandas code to PySpark APIs when they need to scale their workloads. This is 
> because PySpark APIs are harder to learn than pandas' and PySpark is 
> missing many features.
> *Q4. What is new in your approach and why do you think it will be successful?*
> I believe this offers a new way for both PySpark and pandas users to easily 
> scale their workloads. I think we can be successful because more and more 
> people use Python and pandas. In fact, there are already similar efforts 
> such as Dask and Modin, both of which are growing quickly.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Anyone who wants to scale their pandas workloads on their Spark cluster. It 
> will also significantly improve the usability of PySpark.
> *Q6. What are the risks?*
> Technically, I don't see many risks yet, given that:
> - Koalas has been developed separately for more than two years, and its 
> maturity and stability have greatly improved.
> - Koalas will be ported into PySpark as a separate package.
> The risk is more about putting documentation and test cases in place and 
> handling dependencies properly. For example, Koalas currently uses pytest 
> with various dependencies whereas PySpark uses the plain unittest with fewer 
> dependencies.
> In addition, Koalas' default indexing system may not be well received 
> because it can cause overhead, so applying it properly to PySpark might 
> be a challenge.
> *Q7. How long will it take?*
> Before the Spark 3.2 release.
> *Q8. What are the mid-term and final “exams” to check for success?*
> The first check for success would be to make sure that all the existing 
> Koalas APIs and tests work as they are, without affecting the existing 
> Koalas workloads on PySpark.
> The final check is to confirm, through user feedback and PySpark usage 
> statistics, that the usability and convenience we aim for have actually 
> improved.
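
On the default-index risk raised under Q6 of the quoted description: the
following is a minimal, pure-Python sketch (not Koalas or PySpark code) of why
a pandas-style sequential index can add overhead on partitioned data. The
helpers sequence_index and distributed_index, and the list-of-lists model of
partitions, are hypothetical names made up for illustration. The 33-bit row
offset is an assumption chosen to resemble the layout used by Spark's
monotonically_increasing_id.

```python
from itertools import accumulate
from typing import List

# Hypothetical stand-in for a distributed DataFrame: one inner list per partition.
Partitions = List[List[str]]


def sequence_index(partitions: Partitions) -> List[List[int]]:
    """pandas-like 0..n-1 index.

    Every partition's size must be known before any row can be labeled,
    which models the extra global pass a distributed engine would need.
    """
    sizes = [len(p) for p in partitions]            # global pass over all partitions
    offsets = [0] + list(accumulate(sizes))[:-1]    # starting label of each partition
    return [list(range(off, off + n)) for off, n in zip(offsets, sizes)]


def distributed_index(partitions: Partitions, row_bits: int = 33) -> List[List[int]]:
    """Unique, increasing, but non-contiguous index.

    Each partition is labeled from its own id and local row number only,
    so no coordination between partitions is needed.
    """
    return [
        [(pid << row_bits) + i for i in range(len(p))]
        for pid, p in enumerate(partitions)
    ]


parts = [["a", "b", "c"], ["d"], ["e", "f"]]
seq = sequence_index(parts)
dist = distributed_index(parts)
```

In this toy model, seq is [[0, 1, 2], [3], [4, 5]], which matches what a pandas
user expects, while dist is unique and increasing but jumps between
partitions. The trade-off between the two is the kind of overhead the
description appears to refer to.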



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
