[jira] [Updated] (SPARK-34849) SPIP: Support pandas API layer on PySpark

Hyukjin Kwon (Jira) Wed, 24 Mar 2021 00:27:14 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon updated SPARK-34849:
---------------------------------
    Description: 
This is a SPIP for porting the [Koalas 
project|https://github.com/databricks/koalas] to PySpark, which was previously 
discussed on the dev mailing list under the same title, [[DISCUSS] Support 
pandas API layer on 
PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
 

*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.

 # Port Koalas into PySpark to provide a pandas API layer on PySpark, so that:
 - Users can easily leverage their existing Spark cluster to scale their pandas 
workloads.
 - PySpark supports plotting and drawing charts.
 - Users can easily switch between pandas APIs and PySpark APIs.


*Q2.* What problem is this proposal NOT designed to solve?

Some pandas APIs are explicitly unsupported. For example, {{memory_usage}} 
will not be supported because, unlike in pandas, DataFrames in Spark are not 
materialized in memory.

This does not replace the existing PySpark APIs. The PySpark API has many users 
and a large body of existing code across many projects, and many PySpark users 
still prefer Spark’s immutable DataFrame API to the pandas API.


*Q3.* How is it done today, and what are the limits of current practice?

Current practice has two limitations:
 # Many features that are very commonly used in data science are missing from 
Apache Spark. In particular, plotting and drawing charts, one of the most 
important features that almost every data scientist uses in their daily work, 
is missing.
 # Data scientists tend to prefer pandas APIs, but it is very hard to convert 
their code to PySpark APIs when they need to scale their workloads. This is 
because PySpark APIs are harder to learn than pandas APIs, and PySpark lacks 
many features.


*Q4.* What is new in your approach and why do you think it will be successful?

I believe this offers a new way for both PySpark and pandas users to easily 
scale their workloads. I think we can be successful because more and more 
people use Python and pandas. In fact, there are already similar efforts, such 
as Dask and Modin, which are growing fast.


*Q5.* Who cares? If you are successful, what difference will it make?

Anyone who wants to scale their pandas workloads on their Spark cluster. It 
will also significantly improve the usability of PySpark.
 

*Q6.* What are the risks?

Technically, I don't see many risks yet, given that:
- Koalas has been developed separately for more than two years, and its 
maturity and stability have greatly improved.
- Koalas will be ported into PySpark as a separate package.

The work is more about putting documentation and test cases in place properly 
and handling dependencies properly. For example, Koalas currently uses pytest 
with various dependencies, whereas PySpark uses plain unittest with fewer 
dependencies.
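
As a minimal sketch of the test-framework gap, the snippet below writes one 
check in the plain unittest style that PySpark uses. The function 
{{scale_values}} and the test class are hypothetical, made up for illustration; 
they are not part of Koalas or PySpark. A pytest-style test is typically a bare 
function with {{assert}} statements, whereas unittest needs a {{TestCase}} 
subclass and {{assert*}} methods, so porting would mostly mean mechanical 
rewrites of this kind.

```python
import unittest


def scale_values(values, factor):
    """Hypothetical function under test: multiply each value by factor."""
    return [v * factor for v in values]


# pytest style (shown as a comment, since pytest is an extra dependency):
#
#     def test_scale_values():
#         assert scale_values([1, 2, 3], 2) == [2, 4, 6]


class ScaleValuesTest(unittest.TestCase):
    """The same check written for the standard-library unittest runner."""

    def test_scale_values(self):
        self.assertEqual(scale_values([1, 2, 3], 2), [2, 4, 6])

    def test_empty_input(self):
        self.assertEqual(scale_values([], 2), [])


def run_suite():
    """Run the tests programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScaleValuesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling {{run_suite()}} needs nothing beyond the Python standard library, which 
is the point of the comparison.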

In addition, Koalas' default index system may not be well received because it 
can cause overhead, so applying it properly to PySpark might be a challenge.
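
To illustrate where such overhead can come from, here is a toy model in plain 
Python. It is not Koalas' actual implementation. pandas expects a default index 
of 0, 1, 2, and so on, but rows in Spark are spread over partitions, so 
assigning consecutive numbers requires knowing the size of every earlier 
partition first, which costs an extra pass over the data. A cheaper alternative 
gives each partition its own ID range, which needs no coordination but yields 
non-consecutive values. The partition lists, the function names, and the 33-bit 
shift (modeled on the layout documented for Spark's 
{{monotonically_increasing_id}}) are assumptions for illustration.

```python
from itertools import accumulate


def sequential_index(partitions):
    """Assign 0, 1, 2, ... across partitions.

    Needs an extra pass to count the rows in each partition before any
    row can be numbered.
    """
    sizes = [len(p) for p in partitions]            # pass 1: count rows
    offsets = [0] + list(accumulate(sizes))[:-1]    # start index per partition
    return [
        [(offset + i, row) for i, row in enumerate(part)]  # pass 2: number rows
        for offset, part in zip(offsets, partitions)
    ]


def partition_local_index(partitions, bits=33):
    """Assign unique but non-consecutive IDs with no counting pass.

    Each partition numbers its own rows and puts its partition ID in the
    upper bits, so partitions never need to coordinate.
    """
    return [
        [((pid << bits) + i, row) for i, row in enumerate(part)]
        for pid, part in enumerate(partitions)
    ]


# Toy data: three "partitions" of rows (an assumption for illustration).
partitions = [["a", "b"], ["c"], ["d", "e"]]
seq = sequential_index(partitions)
local = partition_local_index(partitions)
```

The trade-off in this sketch is between pandas-like consecutive labels and 
avoiding the extra counting pass.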


*Q7.* How long will it take?

Before the Spark 3.2 release.


*Q8.* What are the mid-term and final “exams” to check for success?

The first check for success would be to make sure that all the existing Koalas 
APIs and tests work as they are, without affecting the existing Koalas 
workloads on PySpark.

The final check is to confirm, through user feedback and PySpark usage 
statistics, that the usability and convenience we aim for have actually 
improved.


  was:
This is a SPIP for porting [Koalas 
project|https://github.com/databricks/koalas] to PySpark, that is once 
discussed on the dev-mailing list with the same title, [[DISCUSS] Support 
pandas API layer on 
PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
 

 

*Q1.* What are you trying to do? Articulate your objectives using absolutely no 
jargon.
 # Porting Koalas into PySpark to support the pandas API layer on PySpark so 
that PySpark supports missing important features such as plotting and drawing a 
chart, and users can easily leverage their existing Spark cluster to scale 
their pandas workloads.
 # Improving interoperability between PySpark and pandas.

 

*Q2.* What problem is this proposal NOT designed to solve?

Some APIs of pandas are explicitly unsupported. For example, memory_usage in 
pandas will not be supported because DataFrames are not materialized in memory 
in Spark unlike pandas.

This does not replace the existing PySpark APIs. We understand that PySpark API 
has lots of users and existing code in many projects, and many Spark users 
prefer Spark’s immutable DataFrame API to the pandas API.

 

*Q3.* How is it done today, and what are the limits of current practice?

The current practice has 2 limits as below.
 # There are many features missing in Apache Spark that are very commonly used 
in data science. Specifically, plotting and drawing a chart is missing which is 
one of the most important features that almost every data scientist use in 
their daily work.
 # Data scientists tend to prefer pandas APIs, but it is very hard to change 
them into PySpark APIs when they need to scale their workloads. This is because 
PySpark APIs are difficult to learn compared to pandas' and there are many 
missing features in PySpark.

 

*Q4.* What is new in your approach and why do you think it will be successful?

I believe this suggests a new way for both PySpark and pandas users to easily 
scale their workloads. I think we can be successful because more and more 
people tend to use Python and pandas. In fact, there are already similar tries 
such as Dask and Modin which are all growing fast and successfully.

 

*Q5.* Who cares? If you are successful, what difference will it make?

Anyone who wants to scale their pandas workloads on PySpark. This work can 
significantly improve the usability of PySpark.

 

*Q6.* What are the risks?

Technically I don't see many risks. Koalas has grown separately for more than 
two years, and has greatly improved maturity and stability. It is just a matter 
of putting documentation and test cases in place. Koalas currently uses pytest 
with various dependencies whereas PySpark uses the plain unittest with fewer 
dependencies. In addition, Koalas' default Indexing system could not be much 
loved because it could potentially cause overhead, so applying it properly to 
PySpark is a challenge.

 

*Q7.* How long will it take?

Before the Spark 3.2 release.

 

*Q8.* What are the mid-term and final “exams” to check for success?

The first check for success would be to make sure that all the existing Koalas 
APIs and tests work as they are without any affecting the existing Koalas 
workloads on PySpark.

The last thing to confirm is to check whether the usability and convenience 
that we aim for is actually increased through user feedback and PySpark usage 
statistics.


> SPIP: Support pandas API layer on PySpark
> -----------------------------------------
>
>                 Key: SPARK-34849
>                 URL: https://issues.apache.org/jira/browse/SPARK-34849
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Haejoon Lee
>            Priority: Blocker
>              Labels: SPIP
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
