[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

BryanCutler Mon, 29 Jan 2018 09:52:14 -0800

Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/19575#discussion_r164509836

--- Diff: docs/sql-programming-guide.md ---
@@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your
`hive-site.xml`, `core-site.xml` a
You may run `./bin/spark-sql --help` for a complete list of all available
options.

+# PySpark Usage Guide for Pandas with Apache Arrow
+
+## Apache Arrow in Spark
+
+Apache Arrow is an in-memory columnar data format that is used in Spark to
efficiently transfer
+data between JVM and Python processes. This currently is most beneficial
to Python users that
+work with Pandas/NumPy data. Its usage is not automatic and might require
some minor
+changes to configuration or code to take full advantage and ensure
compatibility. This guide will
+give a high-level description of how to use Arrow in Spark and highlight
any differences when
+working with Arrow-enabled data.
+
+### Ensure PyArrow Installed
+
+If you install PySpark using pip, then PyArrow can be brought in as an
extra dependency of the
+SQL module with the command `pip install pyspark[sql]`. Otherwise, you
must ensure that PyArrow
+is installed and available on all cluster nodes. The current supported
version is 0.8.0.
+You can install using pip or conda from the conda-forge channel. See
PyArrow
+[installation](https://arrow.apache.org/docs/python/install.html) for
details.
+
+## Enabling for Conversion to/from Pandas
+
+Arrow is available as an optimization when converting a Spark DataFrame to
Pandas using the call
+`toPandas()` and when creating a Spark DataFrame from Pandas with
`createDataFrame(pandas_df)`.
+To use Arrow when executing these calls, users need to first set the Spark
configuration
+'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example dataframe_with_arrow python/sql/arrow.py %}
+</div>
+</div>
+
+Using the above optimizations with Arrow will produce the same results as
when Arrow is not
+enabled. Note that even with Arrow, `toPandas()` results in the collection
of all records in the
+DataFrame to the driver program and should be done on a small subset of
the data. Not all Spark
+data types are currently supported and an error can be raised if a column
has an unsupported type,
+see [Supported Types](#supported-sql-arrow-types). If an error occurs
during `createDataFrame()`,
+Spark will fall back to create the DataFrame without Arrow.
+
+## Pandas UDFs (a.k.a. Vectorized UDFs)
+
+Pandas UDFs are user defined functions that are executed by Spark using
Arrow to transfer data and
+Pandas to work with the data. A Pandas UDF is defined using the keyword
`pandas_udf` as a decorator
+or to wrap the function, no additional configuration is required.
Currently, there are two types of
+Pandas UDF: Scalar and Group Map.
+
+### Scalar
+
+Scalar Pandas UDFs are used for vectorizing scalar operations. They can be
used with functions such
+as `select` and `withColumn`. The Python function should take
`pandas.Series` as inputs and return
+a `pandas.Series` of the same length. Internally, Spark will execute a
Pandas UDF by splitting
+columns into batches and calling the function for each batch as a subset
of the data, then
+concatenating the results together.
+
+The following example shows how to create a scalar Pandas UDF that
computes the product of 2 columns.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example scalar_pandas_udf python/sql/arrow.py %}
+</div>
+</div>
+
+### Group Map
+Group map Pandas UDFs are used with `groupBy().apply()` which implements
the "split-apply-combine" pattern.
+Split-apply-combine consists of three steps:
+* Split the data into groups by using `DataFrame.groupBy`.
+* Apply a function on each group. The input and output of the function are
both `pandas.DataFrame`. The
+ input data contains all the rows and columns for each group.
+* Combine the results into a new `DataFrame`.
+
+To use `groupBy().apply()`, the user needs to define the following:
+* A Python function that defines the computation for each group.
+* A `StructType` object or a string that defines the schema of the output
`DataFrame`.
+
+The following example shows how to use `groupby().apply()` to subtract the
mean from each value in the group.
+
+<div class="codetabs">
+<div data-lang="python" markdown="1">
+{% include_example group_map_pandas_udf python/sql/arrow.py %}
+</div>
+</div>
+
+For detailed usage, please see
[`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf)
and

+[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply).
+
+## Usage Notes
+
+### Supported SQL-Arrow Types
+
+Currently, all Spark SQL data types are supported except `MapType`,
`ArrayType` of `TimestampType`, and
+nested `StructType`.
+
+### Setting Arrow Batch Size
+
+Data partitions in Spark are converted into Arrow record batches, which
can temporarily lead to
+high memory usage in the JVM. To avoid possible out of memory exceptions,
the size of the Arrow
+record batches can be adjusted by setting the conf
"spark.sql.execution.arrow.maxRecordsPerBatch"
--- End diff --

Yes, it's not an ideal approach. I'm happy to make a JIRA to followup and
look into other ways to break up the batches, but that won't be in before 2.3.
So does that mean our options here are (unless I'm not understanding
internal/external conf correctly)

1. Keep `maxRecordsPerBatch` internal and remove this doc section.
2. Externalize this conf and deprecate once a better approach is found.

I think (2) is better because if the user hits memory issues, then they can
at least find someway to adjust it



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #19575: [SPARK-22221][DOCS] Adding User Documentation for...

Reply via email to