This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new c8f9ce8  [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions
c8f9ce8 is described below

commit c8f9ce8c515baf8df3956f99246d52a0f4cb4413
Author: Bryan Cutler <cutl...@gmail.com>
AuthorDate: Mon Feb 17 11:09:35 2020 +0900

    [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions

    ### What changes were proposed in this pull request?

    Add doc for recommended pandas and pyarrow versions.

    ### Why are the changes needed?

    The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user.

    ### Does this PR introduce any user-facing change?

    No

    ### How was this patch tested?

    NA

    Closes #27586 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834.

    Lead-authored-by: Bryan Cutler <cutl...@gmail.com>
    Co-authored-by: HyukjinKwon <gurwls...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 docs/sql-pyspark-pandas-with-arrow.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index b11758b..08303c4 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -18,9 +18,11 @@ working with Arrow-enabled data.
 
 ### Ensure PyArrow Installed
 
+To use Apache Arrow in PySpark, [the recommended version of PyArrow](#recommended-pandas-and-pyarrow-versions)
+should be installed.
 If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
 SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
-is installed and available on all cluster nodes. The current supported version is 0.8.0.
+is installed and available on all cluster nodes.
 You can install using pip or conda from the conda-forge channel.
 See PyArrow [installation](https://arrow.apache.org/docs/python/install.html) for details.
 
@@ -166,6 +168,12 @@ different than a Pandas timestamp. It is recommended to use Pandas time series f
 working with timestamps in `pandas_udf`s to get the best performance, see
 [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
 
+### Recommended Pandas and PyArrow Versions
+
+For usage with pyspark.sql, the supported versions of Pandas is 0.19.2 and PyArrow is 0.8.0. Higher
+versions may be used, however, compatibility and data correctness can not be guaranteed and should
+be verified by the user.
+
 ### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
 
 Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be
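
The version note added by this commit can be checked on a node with a short script. The following is a minimal sketch and not part of the commit: it compares installed versions against the 0.19.2 (Pandas) and 0.8.0 (PyArrow) values stated in the doc change. The helper names `version_tuple`, `compare_to_supported`, and `installed_version` are hypothetical, the version parsing is deliberately simplistic (leading integers only), and `importlib.metadata` assumes Python 3.8 or later.

```python
import re
from importlib import metadata  # standard library in Python 3.8+

# Versions named in the doc change for pyspark.sql on branch-2.4.
SUPPORTED = {"pandas": "0.19.2", "pyarrow": "0.8.0"}


def version_tuple(version):
    """Turn '0.19.2' or '0.15.1.post1' into a 3-tuple of leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(match.group()))
    # Pad short versions such as '0.8' so they compare equal to '0.8.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def compare_to_supported(name, installed):
    """Return 'missing', 'older', 'supported', or 'newer' for one package."""
    if installed is None:
        return "missing"
    have = version_tuple(installed)
    want = version_tuple(SUPPORTED[name])
    if have < want:
        return "older"
    if have == want:
        return "supported"
    # Per the doc note, higher versions may work but should be verified.
    return "newer"


def installed_version(name):
    """Look up the installed distribution version, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for pkg in SUPPORTED:
        found = installed_version(pkg)
        print(pkg, found, compare_to_supported(pkg, found))
```

Because PyArrow must be available on all cluster nodes, a check like this would have to run on each node, not only on the driver. A result of `newer` is not an error; it only flags that compatibility is left to the user to verify.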