GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/22610
[WIP][SPARK-25461][PySpark][SQL] Print warning when return type of Pandas.Series mismatches the arrow return type of pandas udf

## What changes were proposed in this pull request?

For Pandas UDFs, we derive the Arrow type from the Catalyst return data type declared for the UDF, and we use that Arrow type to serialize the data. If the declared return data type doesn't match the actual dtype of the Pandas.Series returned by the UDF, there is a risk of returning incorrect data from the Python side.

This WIP work proposes to check whether the dtype of the returned Pandas.Series matches the declared return type of the Pandas UDF.

We could disallow a mismatch by throwing an exception, so that users know they might need to set the correct return type. However, it looks like the current codebase relies on such behavior. For example, there is a test `test_vectorized_udf_null_short`:

```python
data = [(None,), (2,), (3,), (4,)]
schema = StructType().add("short", ShortType())
df = self.spark.createDataFrame(data, schema)
short_f = pandas_udf(lambda x: x, ShortType())
res = df.select(short_f(col('short')))
self.assertEquals(df.collect(), res.collect())
```

So instead, for now this work just prints a warning message when such a mismatch is detected. Users can then see this message when debugging why their Pandas UDFs don't produce the expected results.

## How was this patch tested?
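The dtype comparison described above can be sketched outside Spark. This is an illustration only, not the code in this patch: `ARROW_NAME_BY_DTYPE` and `warn_on_dtype_mismatch` are hypothetical names, the mapping covers only a handful of dtypes, and a real check would presumably ask pyarrow for the Arrow type rather than use a lookup table.

```python
import warnings

# Hypothetical, deliberately small mapping from pandas dtype names to Arrow
# type names; it is an assumption for this sketch, not Spark's conversion.
ARROW_NAME_BY_DTYPE = {
    "bool": "bool",
    "int16": "int16",
    "int32": "int32",
    "int64": "int64",
    "float32": "float",
    "float64": "double",
}


def warn_on_dtype_mismatch(series, declared_arrow_name):
    """Warn (rather than raise) when the result dtype disagrees with the declared type.

    `series` is anything with a `dtype` attribute, e.g. a pandas.Series.
    Returns True when the types agree or the dtype is not in the mapping,
    and False after emitting a warning.
    """
    dtype_name = str(series.dtype)
    actual_arrow_name = ARROW_NAME_BY_DTYPE.get(dtype_name)
    if actual_arrow_name is None or actual_arrow_name == declared_arrow_name:
        return True
    warnings.warn(
        "Arrow type %s of returned Pandas.Series (dtype %s) doesn't match "
        "the declared Arrow type %s"
        % (actual_arrow_name, dtype_name, declared_arrow_name)
    )
    return False
```

With pandas installed, `warn_on_dtype_mismatch(pd.Series([1.0, 2.0]), "bool")` would emit the warning and return `False`. Warning instead of raising keeps existing UDFs running; a column containing nulls typically reaches pandas as `float64`, so a hard error would presumably break cases like `test_vectorized_udf_null_short`.
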
Manually tested by running:

```python
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BooleanType
import pandas as pd

# Assumes an active SparkSession named `spark`, e.g. in the pyspark shell.
values = [1.0] * 5 + [2.0] * 5
pdf = pd.DataFrame({'A': values})
df = spark.createDataFrame(pdf)

# Declared return type is boolean, but the UDF returns the float64 input as-is.
@pandas_udf(returnType=BooleanType())
def to_boolean(column):
    return column

df.select(['A']).withColumn('to_boolean', to_boolean('A')).show()
```

Output:

```
WARN: Arrow type double of return Pandas.Series of the user-defined function's dtype float64 doesn't match the arrow type bool of defined return type BooleanType
+---+----------+
|  A|to_boolean|
+---+----------+
|1.0|     false|
|1.0|     false|
|1.0|     false|
|1.0|     false|
|1.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
|2.0|     false|
+---+----------+
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-25461

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22610.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22610

----
commit 2fa15bda48ba64a102f114dc9119cb3c310200c4
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-09-26T09:01:40Z

    Ensure return type of Pandas.Series matches the arrow return type of pandas udf.

commit d206b7cf78f898e622f539a15e45515fcbd9e54a
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-10-02T05:29:44Z

    Print warning message instead of throwing exception.

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org