[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

gatorsmile Fri, 26 Jan 2018 16:23:33 -0800

Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/19575#discussion_r164252424

--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your
`hive-site.xml`, `core-site.xml` a
You may run `./bin/spark-sql --help` for a complete list of all available
options.

+# PySpark Usage Guide for Pandas with Apache Arrow
+
+## Apache Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial
to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require
some minor
+changes to configuration or code to take full advantage and ensure
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight
any differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an
extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you
must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported
version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See
PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for
details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to
Pandas using the call
+`toPandas()` and when creating a Spark DataFrame from Pandas with
`createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, users need to first set the Spark
configuration
+'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example dataframe_with_arrow python/sql/arrow.py %}
+</div>
+</div>
+
+Using the above optimizations with Arrow will produce the same results as
when Arrow is not
+enabled. Note that even with Arrow, `toPandas()` results in the collection
of all records in the
+DataFrame to the driver program and should be done on a small subset of
the data. Not all Spark
+data types are currently supported and an error can be raised if a column
has an unsupported type,
+see [Supported Types](#supported-sql-arrow-types). If an error occurs
during `createDataFrame()`,
+Spark will fall back to create the DataFrame without Arrow.
+
+## Pandas UDFs (a.k.a. Vectorized UDFs)
+
+Pandas UDFs are user defined functions that are executed by Spark using
Arrow to transfer data and
+Pandas to work with the data. A Pandas UDF is defined using the keyword
`pandas_udf` as a decorator
+or to wrap the function, no additional configuration is required.
Currently, there are two types of
+Pandas UDF: Scalar and Group Map.
+
+### Scalar
+
+Scalar Pandas UDFs are used for vectorizing scalar operations. They can be
used with functions such
+as `select` and `withColumn`. The Python function should take
`pandas.Series` as inputs and return
+a `pandas.Series` of the same length. Internally, Spark will execute a
Pandas UDF by splitting
+columns into batches and calling the function for each batch as a subset
of the data, then
+concatenating the results together.
+
+The following example shows how to create a scalar Pandas UDF that
computes the product of 2 columns.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example scalar_pandas_udf python/sql/arrow.py %}
+</div>
+</div>
+
+### Group Map
+Group map Pandas UDFs are used with `groupBy().apply()` which implements
the "split-apply-combine" pattern.
+Split-apply-combine consists of three steps:
+* Split the data into groups by using `DataFrame.groupBy`.
+* Apply a function on each group. The input and output of the function are
both `pandas.DataFrame`. The
+ input data contains all the rows and columns for each group.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output
`DataFrame`.
+
+The following example shows how to use `groupby().apply()` to subtract the
mean from each value in the group.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example group_map_pandas_udf python/sql/arrow.py %}
+</div>
+</div>
+
+For detailed usage, please see
[`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf)
and

+[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply).
+
+## Usage Notes
+
+### Supported SQL-Arrow Types
+
+Currently, all Spark SQL data types are supported except `MapType`,
`ArrayType` of `TimestampType`, and
+nested `StructType`.
+
+### Setting Arrow Batch Size
+
+Data partitions in Spark are converted into Arrow record batches, which
can temporarily lead to
+high memory usage in the JVM. To avoid possible out of memory exceptions,
the size of the Arrow
+record batches can be adjusted by setting the conf
"spark.sql.execution.arrow.maxRecordsPerBatch"
--- End diff --

Based on this description, it sounds like we should not use the number of
records, right? cc @cloud-fan @taku-k too



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

Reply via email to