[ https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635048#comment-16635048 ]
Liang-Chi Hsieh edited comment on SPARK-25461 at 10/2/18 5:27 AM:
------------------------------------------------------------------

I've looked more at this. We don't really check whether the pandas.Series's type matches the pre-defined return type, and for this case the conversion is not correct. I was trying to add a check and throw an exception when a mismatch is detected, but it looks like we leverage such behavior in the current codebase. For example, there is a test {{test_vectorized_udf_null_short}}:

{code:python}
data = [(None,), (2,), (3,), (4,)]
schema = StructType().add("short", ShortType())
df = self.spark.createDataFrame(data, schema)
short_f = pandas_udf(lambda x: x, ShortType())
res = df.select(short_f(col('short')))
self.assertEquals(df.collect(), res.collect())
{code}

The pandas.Series is of float64, but we define the return type as ShortType, and in this case it works well. So it seems that disallowing such a conversion is not feasible.
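A minimal sketch of the warn-instead-of-throw idea, independent of Spark internals. The function and the dtype table below are illustrative assumptions for this discussion, not Spark's actual code paths:

```python
import warnings

# Illustrative mapping from declared Spark return types to the pandas dtypes
# we would accept back from a pandas_udf; these sets are assumptions, not
# Spark's real conversion rules. float64 is accepted for ShortType because
# nulls force an integer column to float64 in pandas.
EXPECTED_PANDAS_DTYPES = {
    "ShortType": {"int16", "float64"},
    "BooleanType": {"bool", "object"},
}


def check_return_dtype(spark_type, actual_dtype):
    """Emit a warning (rather than raising) when a pandas_udf result
    dtype does not match the declared Spark return type."""
    expected = EXPECTED_PANDAS_DTYPES.get(spark_type, set())
    if actual_dtype not in expected:
        warnings.warn(
            "pandas_udf declared %s but returned a Series of dtype %s; "
            "the conversion may be incorrect." % (spark_type, actual_dtype)
        )
```

Because the existing {{test_vectorized_udf_null_short}} relies on the float64-to-ShortType conversion, the check only warns: known-good mismatches stay on an allow-list while anything else surfaces a message to the user.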
For now, I think we can print a warning message when such a mismatch is detected. cc [~hyukjin.kwon] What do you think about this idea?

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column given column 'A'. When column 'A' contains None, the results look incorrect.
> Script:
>
> {code:python}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
>
> values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
>
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
>
> calculated_df = (df.select(['A'])
>                  .withColumn('potential_bad_col', gt_2('A')))
> calculated_df = calculated_df.withColumn(
>     'correct_col', (col("A") >= lit(2)) | (col("A").isNull()))
> calculated_df.show()
> {code}
>
> Output:
> {code}
> +---+-----------------+-----------+
> |  A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |1.0|            false|      false|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> |2.0|            false|       true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
>
> This problem disappears when the number of rows is small or when the input column does not contain None.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
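For reference, the null-preserving semantics the reporter's {{gt_2}} UDF intends (null in, null out; otherwise compare against 2) can be sketched in plain Python, independent of pandas and Arrow:

```python
def gt_2(column):
    """Null-preserving comparison: None stays None,
    every other value is compared against 2.0."""
    return [None if v is None else v >= 2.0 for v in column]


# For non-null inputs this agrees with the 'correct_col' expression in
# the report; the bug is that Spark's Arrow conversion does not preserve
# these semantics when the declared return type and the Series dtype differ.
result = gt_2([None, 1.0, 2.0, 3.0])  # [None, False, True, True]
```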