[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

Bryan Cutler (JIRA) Tue, 02 Oct 2018 09:06:35 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-25461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635762#comment-16635762
 ]


Bryan Cutler commented on SPARK-25461:
--------------------------------------

Thanks for looking into this [~viirya]! You are right that the above udf 
returns a float64 instead of boolean. I'm not sure what the expected cast from 
float to bool should be, but it does seem like pyarrow might be doing something 
wrong here. I'll look into it some more and raise an issue there if so.

> PySpark Pandas UDF outputs incorrect results when input columns contain None
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25461
>                 URL: https://issues.apache.org/jira/browse/SPARK-25461
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.3.1
>         Environment: I reproduced this issue by running pyspark locally on 
> mac:
> Spark version: 2.3.1 pre-built with Hadoop 2.7
> Python library versions: pyarrow==0.10.0, pandas==0.20.2
>            Reporter: Chongyuan Xiang
>            Priority: Major
>
> The following PySpark script uses a simple pandas UDF to calculate a column 
> given column 'A'. When column 'A' contains None, the results look incorrect.
> Script: 
>  
> {code:java}
> import pandas as pd
> import random
> import pyspark
> from pyspark.sql.functions import col, lit, pandas_udf
> values = [None] * 30000 + [1.0] * 170000 + [2.0] * 6000000
> random.shuffle(values)
> pdf = pd.DataFrame({'A': values})
> df = spark.createDataFrame(pdf)
> @pandas_udf(returnType=pyspark.sql.types.BooleanType())
> def gt_2(column):
>     return (column >= 2).where(column.notnull())
> calculated_df = (df.select(['A'])
>     .withColumn('potential_bad_col', gt_2('A'))
> )
> calculated_df = calculated_df.withColumn('correct_col', (col("A") >= lit(2)) 
> | (col("A").isNull()))
> calculated_df.show()
> {code}
>  
> Output:
> {code:java}
> +---+-----------------+-----------+
> | A|potential_bad_col|correct_col|
> +---+-----------------+-----------+
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |1.0| false| false|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> |2.0| false| true|
> +---+-----------------+-----------+
> only showing top 20 rows
> {code}
> This problem disappears when the number of rows is small or when the input 
> column does not contain None.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-25461) PySpark Pandas UDF outputs incorrect results when input columns contain None

Reply via email to