Repository: spark
Updated Branches:
  refs/heads/branch-2.3 b37e76fa4 -> e56266ad7


[SPARK-24444][DOCS][PYTHON][BRANCH-2.3] Improve Pandas UDF docs to explain column assignment

## What changes were proposed in this pull request?
Added sections to pandas_udf docs, in the grouped map section, to indicate columns are assigned by position. Backported to branch-2.3.

## How was this patch tested?
NA

Author: Bryan Cutler <cutl...@gmail.com>

Closes #21478 from BryanCutler/arrow-doc-pandas_udf-column_by_pos-2_3_1-SPARK-21427.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/e56266ad
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/e56266ad
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/e56266ad

Branch: refs/heads/branch-2.3
Commit: e56266ad719488d3887fb7ea0985b3760b3ece12
Parents: b37e76f
Author: Bryan Cutler <cutl...@gmail.com>
Authored: Fri Jun 1 14:27:10 2018 +0800
Committer: hyukjinkwon <gurwls...@apache.org>
Committed: Fri Jun 1 14:27:10 2018 +0800

----------------------------------------------------------------------
 docs/sql-programming-guide.md   | 9 +++++++++
 python/pyspark/sql/functions.py | 9 ++++++++-
 2 files changed, 17 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/e56266ad/docs/sql-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 14bc5e6..461806a 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1737,6 +1737,15 @@ To use `groupBy().apply()`, the user needs to define the following:
 * A Python function that defines the computation for each group.
 * A `StructType` object or a string that defines the schema of the output `DataFrame`.
 
+The output schema will be applied to the columns of the returned `pandas.DataFrame` in order by position,
+not by name. This means that the columns in the `pandas.DataFrame` must be indexed so that their
+position matches the corresponding field in the schema.
+
+Note that when creating a new `pandas.DataFrame` using a dictionary, the actual position of the column
+can differ from the order that it was placed in the dictionary. It is recommended in this case to
+explicitly define the column order using the `columns` keyword, e.g.
+`pandas.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])`, or alternatively use an `OrderedDict`.
+
 Note that all data for a group will be loaded into memory before the function is applied. This can
 lead to out of memory exceptions, especially if the group sizes are skewed. The configuration for
 [maxRecordsPerBatch](#setting-arrow-batch-size) is not applied on groups and it is up to the user

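The positional-assignment guidance added above can be illustrated with plain pandas, independent of Spark. This is a minimal sketch: the `ids` and `data` values are made up for illustration, and it only shows that both constructions documented in the patch pin the column order.

```python
import pandas as pd
from collections import OrderedDict

# Hypothetical column data, for illustration only.
ids = [1, 1, 2]
data = [0.1, 0.2, 0.3]

# Explicitly fix the column order with the `columns` keyword ...
df1 = pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])

# ... or build from an OrderedDict, which preserves insertion order.
df2 = pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))

print(list(df1.columns))  # ['id', 'a']
print(list(df2.columns))  # ['id', 'a']
```

Either way the first column lines up with the first field of the output schema, which is what the grouped map UDF machinery matches on.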
http://git-wip-us.apache.org/repos/asf/spark/blob/e56266ad/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index cf26523..9c02982 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2216,7 +2216,8 @@ def pandas_udf(f=None, returnType=None, functionType=None):
        A grouped map UDF defines transformation: A `pandas.DataFrame` -> A `pandas.DataFrame`
        The returnType should be a :class:`StructType` describing the schema of the returned
        `pandas.DataFrame`.
-       The length of the returned `pandas.DataFrame` can be arbitrary.
+       The length of the returned `pandas.DataFrame` can be arbitrary and the columns must be
+       indexed so that their position matches the corresponding field in the schema.
 
        Grouped map UDFs are used with :meth:`pyspark.sql.GroupedData.apply`.
 
@@ -2239,6 +2240,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
        |  2| 1.1094003924504583|
        +---+-------------------+
 
+       .. note:: If returning a new `pandas.DataFrame` constructed with a dictionary, it is
+           recommended to explicitly index the columns by name to ensure the positions are correct,
+           or alternatively use an `OrderedDict`.
+           For example, `pd.DataFrame({'id': ids, 'a': data}, columns=['id', 'a'])` or
+           `pd.DataFrame(OrderedDict([('id', ids), ('a', data)]))`.
+
        .. seealso:: :meth:`pyspark.sql.GroupedData.apply`
 
     .. note:: The user-defined functions are considered deterministic by default. Due to

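To see the grouped map contract the docstring describes (a `pandas.DataFrame` in, a `pandas.DataFrame` out, columns matched to the schema by position), here is a sketch using plain pandas only. In real PySpark code the function would be decorated with `pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)` and applied via `df.groupby('id').apply(...)`; the pandas `groupby` below merely stands in for that, and the sample values are invented.

```python
import pandas as pd

# A grouped-map-style function: pandas.DataFrame in, pandas.DataFrame out.
# The returned columns are reindexed to ['id', 'v'] so their positions
# match the intended output schema, since fields are matched by position.
def subtract_mean(pdf):
    v = pdf['v']
    return pdf.assign(v=v - v.mean())[['id', 'v']]

df = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                   'v': [1.0, 2.0, 3.0, 5.0, 10.0]},
                  columns=['id', 'v'])

# Plain-pandas stand-in for df.groupby('id').apply(subtract_mean) in PySpark.
out = df.groupby('id', group_keys=False).apply(subtract_mean)
print(out)  # v becomes [-0.5, 0.5, -3.0, -1.0, 4.0]
```

Each group is centered on its own mean (group 1 has mean 1.5, group 2 has mean 6.0), mirroring the behavior documented for grouped map Pandas UDFs.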
