[spark] branch branch-2.4 updated: [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions

gurwls223 Sun, 16 Feb 2020 18:11:08 -0800

This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-2.4
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/branch-2.4 by this push:
     new c8f9ce8  [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions
c8f9ce8 is described below

commit c8f9ce8c515baf8df3956f99246d52a0f4cb4413
Author: Bryan Cutler <cutl...@gmail.com>
AuthorDate: Mon Feb 17 11:09:35 2020 +0900

    [SPARK-30834][DOCS][PYTHON][2.4] Add note for recommended pandas and pyarrow versions
    
    ### What changes were proposed in this pull request?
    
    Add doc for recommended pandas and pyarrow versions.
    
    ### Why are the changes needed?
    
    The recommended versions are those that have been thoroughly tested by Spark CI. Other versions may be used at the discretion of the user.
    
    ### Does this PR introduce any user-facing change?
    
    No
    
    ### How was this patch tested?
    
    NA
    
    Closes #27586 from BryanCutler/python-doc-rec-pandas-pyarrow-SPARK-30834.
    
    Lead-authored-by: Bryan Cutler <cutl...@gmail.com>
    Co-authored-by: HyukjinKwon <gurwls...@apache.org>
    Signed-off-by: HyukjinKwon <gurwls...@apache.org>
---
 docs/sql-pyspark-pandas-with-arrow.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/sql-pyspark-pandas-with-arrow.md b/docs/sql-pyspark-pandas-with-arrow.md
index b11758b..08303c4 100644
--- a/docs/sql-pyspark-pandas-with-arrow.md
+++ b/docs/sql-pyspark-pandas-with-arrow.md
@@ -18,9 +18,11 @@ working with Arrow-enabled data.
 
 ### Ensure PyArrow Installed
 
+To use Apache Arrow in PySpark, [the recommended version of PyArrow](#recommended-pandas-and-pyarrow-versions)
+should be installed.
 If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
 SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
-is installed and available on all cluster nodes. The current supported version is 0.8.0.
+is installed and available on all cluster nodes.
 You can install using pip or conda from the conda-forge channel. See PyArrow
 [installation](https://arrow.apache.org/docs/python/install.html) for details.
 
@@ -166,6 +168,12 @@ different than a Pandas timestamp. It is recommended to use Pandas time series f
 working with timestamps in `pandas_udf`s to get the best performance, see
 [here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
 
+### Recommended Pandas and PyArrow Versions
+
+For usage with pyspark.sql, the supported versions of Pandas is 0.19.2 and PyArrow is 0.8.0. Higher
+versions may be used, however, compatibility and data correctness can not be guaranteed and should
+be verified by the user.
+
 ### Compatibiliy Setting for PyArrow >= 0.15.0 and Spark 2.3.x, 2.4.x
 
 Since Arrow 0.15.0, a change in the binary IPC format requires an environment variable to be


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
