spark git commit: [SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf

gurwls223 Sun, 07 Oct 2018 08:20:01 -0700

Repository: spark
Updated Branches:
  refs/heads/master fba722e31 -> 3eb842969



[SPARK-25461][PYSPARK][SQL] Add document for mismatch between return type of 
Pandas.Series and return type of pandas udf

## What changes were proposed in this pull request?

For Pandas UDFs, we derive the Arrow type from the Catalyst return data type 
declared for the UDF, and we use this Arrow type to serialize the data. If the 
declared return data type doesn't match the actual type of the pandas.Series 
returned by the Pandas UDF, the Python side may return incorrect data.

Currently we don't have a reliable way to check whether the data conversion is 
safe. For now, we add a note to the documentation to warn users about this. 
When a PyArrow upgrade that lets us check this becomes available, we should add 
an option to perform the check.
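
To illustrate the kind of mismatch described above, here is a minimal, library-free sketch. It is not Spark or PyArrow code: the helper names `unchecked_to_long` and `checked_to_long` are hypothetical, and plain Python lists stand in for a pandas.Series. It assumes a UDF declared to return a long while its body actually produces floats.

```python
# Illustrative only: plain-Python stand-ins, not Spark or PyArrow APIs.

def unchecked_to_long(values):
    """Convert each value to int, silently truncating any fractional part."""
    return [int(v) for v in values]


def checked_to_long(values):
    """Convert each value to int, raising if the conversion would lose data."""
    converted = []
    for v in values:
        as_int = int(v)
        if as_int != v:
            raise ValueError(f"unsafe conversion: {v!r} is not a whole number")
        converted.append(as_int)
    return converted


# Hypothetical UDF output: floats, although the declared return type is long.
returned = [0.5, 1.0, 2.7]

# The unchecked path succeeds but quietly changes the data.
print(unchecked_to_long(returned))  # [0, 1, 2]

# The checked path reports the mismatch instead of returning altered values.
try:
    checked_to_long(returned)
except ValueError as exc:
    print(exc)
```

The unchecked path is the risk the added note warns about: no error is raised, so only inspecting the results reveals the problem.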

## How was this patch tested?

Only document change.

Closes #22610 from viirya/SPARK-25461.

Authored-by: Liang-Chi Hsieh <vii...@gmail.com>
Signed-off-by: hyukjinkwon <gurwls...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/3eb84296
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/3eb84296
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/3eb84296

Branch: refs/heads/master
Commit: 3eb842969906d6e81a137af6dc4339881df0a315
Parents: fba722e
Author: Liang-Chi Hsieh <vii...@gmail.com>
Authored: Sun Oct 7 23:18:46 2018 +0800
Committer: hyukjinkwon <gurwls...@apache.org>
Committed: Sun Oct 7 23:18:46 2018 +0800

----------------------------------------------------------------------
 python/pyspark/sql/functions.py | 6 ++++++
 1 file changed, 6 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/3eb84296/python/pyspark/sql/functions.py
----------------------------------------------------------------------
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index 7685264..be089ee 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -2948,6 +2948,12 @@ def pandas_udf(f=None, returnType=None, functionType=None):
         can fail on special rows, the workaround is to incorporate the condition into the functions.
 
     .. note:: The user-defined functions do not take keyword arguments on the calling side.
+
+    .. note:: The data type of returned `pandas.Series` from the user-defined functions should be
+        matched with defined returnType (see :meth:`types.to_arrow_type` and
+        :meth:`types.from_arrow_type`). When there is mismatch between them, Spark might do
+        conversion on returned data. The conversion is not guaranteed to be correct and results
+        should be checked for accuracy by users.
     """
     # decorator @pandas_udf(returnType, functionType)
     is_decorator = f is None or isinstance(f, (str, DataType))

