[ https://issues.apache.org/jira/browse/SPARK-32082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-32082:
----------------------------
    Description: 
The importance of Python and PySpark has grown rapidly in the last few years. PySpark downloads reached [more than 1.3 million _every week_|https://pypistats.org/packages/pyspark] counting PyPI _alone_. Nevertheless, PySpark is still not very Pythonic: for example, it surfaces many raw JVM error messages to users, and the API documentation is poorly written.

This epic ticket aims to improve the usability of PySpark and make it more Pythonic. More specifically, this JIRA targets the four areas below; each includes examples:
 * Being Pythonic
 ** Pandas UDF enhancements and type hints (see the first sketch after this list)
 ** Avoid dynamic function definitions, for example in {{functions.py}}, which IDEs cannot detect (see the second sketch after this list)

 * Better and easier usability in PySpark
 ** User-facing error messages and warnings
 ** Documentation
 ** User guide
 ** Better examples and API documentation, e.g. [Koalas|https://koalas.readthedocs.io/en/latest/] and [pandas|https://pandas.pydata.org/docs/]

 * Better interoperability with other Python libraries
 ** Visualization and plotting
 ** Potentially better interfaces by leveraging Arrow
 ** Compatibility with other libraries such as NumPy universal functions or pandas, possibly by leveraging Koalas (see the interoperability sketch after this list)

 * PyPI Installation
 ** PySpark with Hadoop 3 support on PyPI
 ** Better error handling
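
As a concrete illustration of the first "Being Pythonic" item, below is a minimal sketch of a pandas UDF declared with Python type hints, assuming Spark 3.0+ where this style is supported. The column name and data are illustrative only and are not taken from any ticket in this epic.

{code:python}
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# The return type goes to the decorator; the pandas types are plain Python
# type hints, so IDEs and type checkers can see them.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    # Operates on a whole pandas Series per batch instead of row by row.
    return v + 1

df = spark.range(3).toDF("value")
df.select(plus_one("value")).show()
{code}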

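The second "Being Pythonic" item is about discoverability. The purely hypothetical sketch below (not the actual {{functions.py}} code) contrasts a function injected into {{globals()}} in a loop, which IDEs and static analysis tools cannot see, with an ordinary {{def}} that they can.

{code:python}
import math

# Dynamic style: `dyn_sqrt` only exists at runtime, so autocompletion,
# type checking, and API-doc generation cannot find it.
_UNARY_FUNCTIONS = {"dyn_sqrt": math.sqrt}
for _name, _impl in _UNARY_FUNCTIONS.items():
    globals()[_name] = _impl

# Explicit style: discoverable, documentable, and type-hintable.
def explicit_sqrt(x: float) -> float:
    """Return the square root of x."""
    return math.sqrt(x)

print(explicit_sqrt(4.0))
{code}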
 
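For the interoperability bullet, the sketch below assumes a Spark version where the pandas API on Spark from SPARK-34849 (formerly Koalas) is available as {{pyspark.pandas}}. It shows the kind of compatibility targeted: applying a NumPy universal function directly to a distributed Series. The data values are made up.

{code:python}
import numpy as np
import pyspark.pandas as ps

# A small distributed Series; NumPy's ufunc protocol lets np.sqrt dispatch
# to the pandas-on-Spark Series without collecting data to the driver.
psser = ps.Series([1.0, 4.0, 9.0])
print(np.sqrt(psser))
{code}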
||Key||Summary||Status||Assignee||
|SPARK-31382|Show a better error message for different python and pip installation mistake|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31849|Improve Python exception messages to be more Pythonic|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-31851|Redesign PySpark documentation|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32017|Make Pyspark Hadoop 3.2+ Variant available in PyPI|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32084|Replace dictionary-based function definitions to proper functions in functions.py|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32085|Migrate to NumPy documentation style|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32161|Hide JVM traceback for SparkUpgradeException|{color:#006644}RESOLVED{color}|[Pralabh Kumar|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=pralabhkumar]|
|SPARK-32185|User Guide - Monitoring|{color:#006644}RESOLVED{color}|[Abhijeet Prasad|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=a7prasad]|
|SPARK-32195|Standardize warning types and messages|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32204|Binder Integration|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-32681|PySpark type hints support|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-32686|Un-deprecate inferring DataFrame schema from list of dictionaries|{color:#006644}RESOLVED{color}|[Nicholas Chammas|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=nchammas]|
|SPARK-33247|Improve examples and scenarios in docstrings|{color:#006644}RESOLVED{color}|_Unassigned_|
|SPARK-33407|Simplify the exception message from Python UDFs|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-33530|Support --archives option natively|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-34629|Python type hints improvement|{color:#006644}RESOLVED{color}|[Maciej Szymkiewicz|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=zero323]|
|SPARK-34849|SPIP: Support pandas API layer on PySpark|{color:#006644}RESOLVED{color}|[Haejoon Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|
|SPARK-34885|Port/integrate Koalas documentation into PySpark|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-35337|pandas API on Spark: Separate basic operations into data type based structures|{color:#006644}RESOLVED{color}|[Xinrong Meng|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=XinrongM]|
|SPARK-35419|Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default|{color:#006644}RESOLVED{color}|[Hyukjin Kwon|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=hyukjin.kwon]|
|SPARK-35464|pandas API on Spark: Enable mypy check "disallow_untyped_defs" for main codes.|{color:#006644}RESOLVED{color}|[Takuya Ueshin|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ueshin]|
|SPARK-35805|API auditing in Pandas API on Spark|{color:#006644}RESOLVED{color}|[Haejoon Lee|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=itholic]|


> Project Zen: Improving Python usability
> ---------------------------------------
>
>                 Key: SPARK-32082
>                 URL: https://issues.apache.org/jira/browse/SPARK-32082
>             Project: Spark
>          Issue Type: Epic
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Critical
>             Fix For: 3.4.0
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
