Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r163975188

    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,154 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
     
     You may run `./bin/spark-sql --help` for a complete list of all available options.
     
    +# Usage Guide for Pandas with Arrow
    +
    +## Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This is currently most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +
    +## Ensure PyArrow is Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available in the Python environment on all cluster nodes. The currently supported
    +version is 0.8.0. You can install it using pip, or conda from the conda-forge channel. See the
    +PyArrow [installation](https://arrow.apache.org/docs/python/install.html) instructions for details.
    +
    +## How to Enable for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, it first must be enabled by setting the Spark conf
    +`spark.sql.execution.arrow.enabled` to `true`; this is disabled by default.
    +
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% highlight python %}
    +
    +import numpy as np

    --- End diff --

    Yup, sounds fine.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org