[ https://issues.apache.org/jira/browse/SPARK-34849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307610#comment-17307610 ]
Hyukjin Kwon commented on SPARK-34849:
--------------------------------------

The official SPIP preparation is in progress. It will be sent to the dev mailing list late this week (or early next week).

> SPIP: Support pandas API layer on PySpark
> -----------------------------------------
>
>                 Key: SPARK-34849
>                 URL: https://issues.apache.org/jira/browse/SPARK-34849
>             Project: Spark
>          Issue Type: Umbrella
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Haejoon Lee
>            Priority: Blocker
>              Labels: SPIP
>
> This is a SPIP for porting the [Koalas project|https://github.com/databricks/koalas] to PySpark. It was previously discussed on the dev mailing list under the same title, [[DISCUSS] Support pandas API layer on PySpark|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html].
>
> *Q1. What are you trying to do? Articulate your objectives using absolutely no jargon.*
> # Port Koalas into PySpark to support the pandas API layer on PySpark, so that:
> - Users can easily leverage their existing Spark cluster to scale their pandas workloads.
> - PySpark supports plotting and drawing charts.
> - Users can easily switch between pandas APIs and PySpark APIs.
> *Q2. What problem is this proposal NOT designed to solve?*
> Some pandas APIs are explicitly unsupported. For example, {{memory_usage}} will not be supported because, unlike in pandas, DataFrames in Spark are not materialized in memory.
> This does not replace the existing PySpark APIs. The PySpark API has many users and existing code in many projects, and there are still many PySpark users who prefer Spark’s immutable DataFrame API to the pandas API.
> *Q3. How is it done today, and what are the limits of current practice?*
> Current practice has two limitations:
> # There are many features missing in Apache Spark that are very commonly used in data science.
> Specifically, plotting and drawing charts is missing, which is one of the most important features that almost every data scientist uses in their daily work.
> # Data scientists tend to prefer pandas APIs, but converting pandas code to PySpark APIs is very hard when they need to scale their workloads. This is because PySpark APIs are difficult to learn compared to pandas' and there are many missing features in PySpark.
> *Q4. What is new in your approach and why do you think it will be successful?*
> I believe this offers a new way for both PySpark and pandas users to easily scale their workloads. I think we can be successful because more and more people use Python and pandas. In fact, there are already similar efforts, such as Dask and Modin, which are growing fast and successfully.
> *Q5. Who cares? If you are successful, what difference will it make?*
> Anyone who wants to scale their pandas workloads on their Spark cluster. It will also significantly improve the usability of PySpark.
> *Q6. What are the risks?*
> Technically I don't see many risks yet given that:
> - Koalas has grown separately for more than two years, and has greatly improved in maturity and stability.
> - Koalas will be ported into PySpark as a separate package.
> The work is more about putting documentation and test cases in place and handling dependencies properly. For example, Koalas currently uses pytest with various dependencies whereas PySpark uses the plain unittest with fewer dependencies.
> In addition, Koalas' default index system may be unpopular because it can cause overhead, so applying it properly to PySpark might be a challenge.
> *Q7. How long will it take?*
> Before the Spark 3.2 release.
> *Q8.
> What are the mid-term and final “exams” to check for success?*
> The first check for success would be to make sure that all the existing Koalas APIs and tests work as they are on PySpark, without affecting existing Koalas workloads.
> The final check is to confirm, through user feedback and PySpark usage statistics, that the usability and convenience we aim for have actually improved.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)