Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r163975188

    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,154 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
     
     You may run `./bin/spark-sql --help` for a complete list of all available options.
     
    +# Usage Guide for Pandas with Arrow
    +
    +## Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This is currently most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +
    +## Ensure PyArrow is Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available in the Python environment on all cluster nodes. The currently supported
    +version is 0.8.0. You can install it using pip, or conda from the conda-forge channel. See the
    +PyArrow [installation](https://arrow.apache.org/docs/python/install.html) instructions for details.
    +
    +## How to Enable for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, it first must be enabled by setting the Spark conf
    +`spark.sql.execution.arrow.enabled` to `true`; this is disabled by default.
    +
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +import numpy as np

    --- End diff --

    Yup, sounds fine.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org