Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r164345403

--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a

You may run `./bin/spark-sql --help` for a complete list of all available options.

# PySpark Usage Guide for Pandas with Apache Arrow

## Apache Arrow in Spark

Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
data between JVM and Python processes. This is currently most beneficial to Python users who
work with Pandas/NumPy data. Its usage is not automatic and might require some minor
changes to configuration or code to take full advantage and ensure compatibility. This guide will
give a high-level description of how to use Arrow in Spark and highlight any differences when
working with Arrow-enabled data.

### Ensure PyArrow Installed

If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
is installed and available on all cluster nodes. The currently supported version is 0.8.0.
You can install it using pip or conda from the conda-forge channel. See the PyArrow
[installation](https://arrow.apache.org/docs/python/install.html) instructions for details.

## Enabling for Conversion to/from Pandas

Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
To use Arrow when executing these calls, users need to first set the Spark configuration
`spark.sql.execution.arrow.enabled` to `true`. This is disabled by default.

<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example dataframe_with_arrow python/sql/arrow.py %}
</div>
</div>

Using the above optimizations with Arrow will produce the same results as when Arrow is not
enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
data types are currently supported, and an error can be raised if a column has an unsupported type;
see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
Spark will fall back to creating the DataFrame without Arrow.

## Pandas UDFs (a.k.a. Vectorized UDFs)

Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and
Pandas to work with the data. A Pandas UDF is defined using `pandas_udf` as a decorator or to wrap
the function; no additional configuration is required. Currently, there are two types of
Pandas UDF: Scalar and Group Map.

### Scalar

Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting
columns into batches and calling the function for each batch as a subset of the data, then
concatenating the results together.

The following example shows how to create a scalar Pandas UDF that computes the product of two columns.
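For readers without the rendered example at hand, a minimal sketch of such a UDF could look like
the snippet below. It assumes an existing `SparkSession` bound to `spark`; the names `multiply_func`
and `multiply` are illustrative and not taken from `python/sql/arrow.py`.

```python
import pandas as pd

from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Declare a plain Python function and wrap it as a scalar Pandas UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The wrapped function also works directly on local Pandas data
x = pd.Series([1, 2, 3])
print(multiply_func(x, x))  # 1, 4, 9

# Create a Spark DataFrame and execute the function as a vectorized UDF;
# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
df.select(multiply(col("x"), col("x"))).show()
```

The Series-in/Series-out contract is what allows Spark to hand the UDF whole column batches at a
time instead of one row per call.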
<div class="codetabs">
<div data-lang="python" markdown="1">
{% include_example scalar_pandas_udf python/sql/arrow.py %}
</div>
</div>

### Group Map

Group map Pandas UDFs are used with `groupBy().apply()`, which implements the "split-apply-combine" pattern.

--- End diff --

I can change to whatever you guys like, but I think these two section names were made to reflect the
different pandas_udf types - scalar and group map. Is that right, @icexelloss?
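For context on the group map type being discussed, a minimal sketch of the `groupBy().apply()`
pattern might look like the following. It assumes the Spark 2.3 `PandasUDFType.GROUPED_MAP` API
and an existing `SparkSession` bound to `spark`; `subtract_mean` is an illustrative name.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

# The UDF receives each group as a pandas.DataFrame and returns a pandas.DataFrame
# whose schema matches the one declared in the decorator
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupBy("id").apply(subtract_mean).show()
```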