Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19575#discussion_r164509836
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1640,6 +1640,133 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
     You may run `./bin/spark-sql --help` for a complete list of all available
     options.
     
    +# PySpark Usage Guide for Pandas with Apache Arrow
    +
    +## Apache Arrow in Spark
    +
    +Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
    +data between JVM and Python processes. This currently is most beneficial to Python users that
    +work with Pandas/NumPy data. Its usage is not automatic and might require some minor
    +changes to configuration or code to take full advantage and ensure compatibility. This guide will
    +give a high-level description of how to use Arrow in Spark and highlight any differences when
    +working with Arrow-enabled data.
    +
    +### Ensure PyArrow Installed
    +
    +If you install PySpark using pip, then PyArrow can be brought in as an extra dependency of the
    +SQL module with the command `pip install pyspark[sql]`. Otherwise, you must ensure that PyArrow
    +is installed and available on all cluster nodes. The currently supported version is 0.8.0.
    +You can install it using pip, or using conda from the conda-forge channel. See the PyArrow
    +[installation](https://arrow.apache.org/docs/python/install.html) page for details.
    +
    +## Enabling for Conversion to/from Pandas
    +
    +Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
    +`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
    +To use Arrow when executing these calls, users need to first set the Spark configuration
    +'spark.sql.execution.arrow.enabled' to 'true'. This is disabled by default.
    +
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example dataframe_with_arrow python/sql/arrow.py %}
    +</div>
    +</div>
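    +
    +As a rough sketch of what the `dataframe_with_arrow` example referenced above might look like (the actual snippet lives in `python/sql/arrow.py`; this assumes an existing `SparkSession` named `spark`):
    +
    +```python
    +import numpy as np
    +import pandas as pd
    +
    +# Enable Arrow-based columnar data transfers (disabled by default)
    +spark.conf.set("spark.sql.execution.arrow.enabled", "true")
    +
    +# Generate a Pandas DataFrame with random data
    +pdf = pd.DataFrame(np.random.rand(100, 3))
    +
    +# Create a Spark DataFrame from a Pandas DataFrame using Arrow
    +df = spark.createDataFrame(pdf)
    +
    +# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
    +result_pdf = df.select("*").toPandas()
    +```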
    +
    +Using the above optimizations with Arrow will produce the same results as when Arrow is not
    +enabled. Note that even with Arrow, `toPandas()` results in the collection of all records in the
    +DataFrame to the driver program and should be done on a small subset of the data. Not all Spark
    +data types are currently supported and an error can be raised if a column has an unsupported type;
    +see [Supported Types](#supported-sql-arrow-types). If an error occurs during `createDataFrame()`,
    +Spark will fall back to creating the DataFrame without Arrow.
    +
    +## Pandas UDFs (a.k.a. Vectorized UDFs)
    +
    +Pandas UDFs are user-defined functions that are executed by Spark using Arrow to transfer data and
    +Pandas to work with the data. A Pandas UDF is defined using `pandas_udf`, either as a decorator
    +or to wrap the function, and no additional configuration is required. Currently, there are two types of
    +Pandas UDF: Scalar and Group Map.
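    +
    +For illustration, both forms might look roughly like this (a minimal sketch; `plus_one` and `plus_one_func` are hypothetical names, and the function type defaults to scalar):
    +
    +```python
    +from pyspark.sql.functions import pandas_udf
    +from pyspark.sql.types import LongType
    +
    +# Define a Pandas UDF with the decorator form
    +@pandas_udf(LongType())
    +def plus_one(s):
    +    # s is a pandas.Series; operations are applied element-wise
    +    return s + 1
    +
    +# Or wrap an existing function instead of decorating it
    +def plus_one_func(s):
    +    return s + 1
    +
    +plus_one_udf = pandas_udf(plus_one_func, returnType=LongType())
    +```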
    +
    +### Scalar
    +
    +Scalar Pandas UDFs are used for vectorizing scalar operations. They can be used with functions such
    +as `select` and `withColumn`. The Python function should take `pandas.Series` as inputs and return
    +a `pandas.Series` of the same length. Internally, Spark will execute a Pandas UDF by splitting
    +columns into batches and calling the function for each batch as a subset of the data, then
    +concatenating the results together.
    +
    +The following example shows how to create a scalar Pandas UDF that computes the product of 2 columns.
    +
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example scalar_pandas_udf python/sql/arrow.py %}
    +</div>
    +</div>
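    +
    +A sketch along the lines of the `scalar_pandas_udf` example above (assuming an existing `SparkSession` named `spark`):
    +
    +```python
    +import pandas as pd
    +from pyspark.sql.functions import col, pandas_udf
    +from pyspark.sql.types import LongType
    +
    +# Declare the function and create the UDF
    +def multiply_func(a, b):
    +    return a * b
    +
    +multiply = pandas_udf(multiply_func, returnType=LongType())
    +
    +# The function for a pandas_udf should be able to execute with local Pandas data
    +x = pd.Series([1, 2, 3])
    +print(multiply_func(x, x))  # 1, 4, 9
    +
    +# Create a Spark DataFrame and execute the function as a vectorized UDF
    +df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    +df.select(multiply(col("x"), col("x"))).show()
    +```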
    +
    +### Group Map
    +Group map Pandas UDFs are used with `groupBy().apply()`, which implements the "split-apply-combine" pattern.
    +Split-apply-combine consists of three steps:
    +* Split the data into groups by using `DataFrame.groupBy`.
    +* Apply a function on each group. The input and output of the function are both `pandas.DataFrame`. The
    +  input data contains all the rows and columns for each group.
    +* Combine the results into a new `DataFrame`.
    +
    +To use `groupBy().apply()`, the user needs to define the following:
    +* A Python function that defines the computation for each group.
    +* A `StructType` object or a string that defines the schema of the output `DataFrame`.
    +
    +The following example shows how to use `groupby().apply()` to subtract the mean from each value in the group.
    +
    +<div class="codetabs">
    +<div data-lang="python" markdown="1">
    +{% include_example group_map_pandas_udf python/sql/arrow.py %}
    +</div>
    +</div>
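    +
    +A sketch along the lines of the `group_map_pandas_udf` example above (assuming the `PandasUDFType.GROUPED_MAP` function type name and an existing `SparkSession` named `spark`):
    +
    +```python
    +from pyspark.sql.functions import pandas_udf, PandasUDFType
    +
    +df = spark.createDataFrame(
    +    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    +    ("id", "v"))
    +
    +# The schema string "id long, v double" defines the output DataFrame
    +@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
    +def subtract_mean(pdf):
    +    # pdf is a pandas.DataFrame holding all rows and columns for one group
    +    return pdf.assign(v=pdf.v - pdf.v.mean())
    +
    +df.groupby("id").apply(subtract_mean).show()
    +```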
    +
    +For detailed usage, please see [`pyspark.sql.functions.pandas_udf`](api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf) and
    +[`pyspark.sql.GroupedData.apply`](api/python/pyspark.sql.html#pyspark.sql.GroupedData.apply).
    +
    +## Usage Notes
    +
    +### Supported SQL-Arrow Types
    +
    +Currently, all Spark SQL data types are supported except `MapType`, `ArrayType` of `TimestampType`, and
    +nested `StructType`.
    +
    +### Setting Arrow Batch Size
    +
    +Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to
    +high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow
    +record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch"
    --- End diff --
    
    Yes, it's not an ideal approach. I'm happy to make a JIRA to follow up and look into other ways to break up the batches, but that won't be in before 2.3. So does that mean our options here are (unless I'm not understanding internal/external confs correctly):
    
    1. Keep `maxRecordsPerBatch` internal and remove this doc section.
    2. Externalize this conf and deprecate once a better approach is found.
    
    I think (2) is better because if the user hits memory issues, then they can at least find some way to adjust it.


---
