[ https://issues.apache.org/jira/browse/SPARK-32612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178543#comment-17178543 ]
Sean R. Owen commented on SPARK-32612:
--------------------------------------

I don't think it's correct to upgrade it to float in all cases. int overflow is simply what you have to deal with if you assert that you are using a 32-bit data type. Use long instead if you need more bytes.

> int columns produce inconsistent results on pandas UDFs
> --------------------------------------------------------
>
>                 Key: SPARK-32612
>                 URL: https://issues.apache.org/jira/browse/SPARK-32612
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> This is similar to SPARK-30187, but I personally consider this data corruption.
> If I have a simple pandas UDF
> {code}
> >>> def add(a, b):
> ...     return a + b
> >>> my_udf = pandas_udf(add, returnType=LongType())
> {code}
> And I want to process some data with it, say 32-bit ints
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> I get an integer overflow for the data, as I would expect. But as soon as I add a {{None}} to the data, even on a different row, the result I get back is totally different.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> +----------+----------+---------------+
> {code}
> The integer overflow disappears. This is because Arrow and/or pandas changes the data type to a float in order to be able to store the null value. Since the processing is then done in floating point, there is no overflow. This in and of itself is annoying but understandable, because it is dealing with a limitation in pandas.
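The upcast is easy to reproduce outside of Spark. Below is a minimal sketch of the Arrow-to-pandas conversion that happens under the hood for each batch, assuming only that pyarrow and pandas are installed and that pyarrow's default conversion options are in effect:

{code}
import pyarrow as pa

# Without nulls an Arrow int32 column stays int32 in pandas,
# so additions inside the UDF wrap around at 32 bits.
print(pa.array([1037694396, 1], type=pa.int32()).to_pandas().dtype)     # int32

# pandas has no missing-value representation for int32, so as soon as a
# batch contains a null the whole column is silently upcast to float64.
print(pa.array([1037694396, None], type=pa.int32()).to_pandas().dtype)  # float64
{code}

Because the conversion is done per column and per batch, a single null is enough to switch every value in that batch to float64.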
> Where it becomes a bug is that this happens on a per-batch basis. This means that I can have the same two rows in different parts of my data set and get different results depending on their proximity to a null value.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|     2242310244|
> |         3|         4|              4|
> +----------+----------+---------------+
> >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2')
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> For me personally, I would prefer to have all nullable integer columns upgraded to float all the time; that way it is at least consistent.
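Until the conversion behavior changes, a user-side workaround is to widen the UDF's inputs explicitly so that every batch sees the same arithmetic whether or not it contains a null. This is only a sketch, reusing the df and my_udf defined above; the cast to long follows Sean's suggestion, and the cast to double reproduces by hand the always-float behavior the reporter asks for:

{code}
from pyspark.sql.functions import col

# Option 1: do the math in 64 bits. A batch that contains a null is still
# converted to float64, but values of this size are represented exactly,
# so the overflowing row yields 2242310244 in every batch.
df.select(my_udf((col("a") - 3).cast("long"), col("b").cast("long"))).show()

# Option 2: always hand the UDF doubles, so null and non-null batches
# take the same floating-point path.
df.select(my_udf((col("a") - 3).cast("double"), col("b").cast("double"))).show()
{code}

Neither removes the underlying per-batch type switch; they only make its effect consistent for data like this.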