[ https://issues.apache.org/jira/browse/SPARK-32612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178543#comment-17178543 ]
Sean R. Owen commented on SPARK-32612:
--------------------------------------

I don't think it's correct to upgrade it to float in all cases. int overflow is simply what you have to deal with if you assert that you are using a 32-bit data type. Use long instead if you need more bytes.

> int columns produce inconsistent results on pandas UDFs
> --------------------------------------------------------
>
>                 Key: SPARK-32612
>                 URL: https://issues.apache.org/jira/browse/SPARK-32612
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.0
>            Reporter: Robert Joseph Evans
>            Priority: Major
>
> This is similar to SPARK-30187, but I personally consider this data corruption.
> If I have a simple pandas UDF
> {code}
> >>> def add(a, b):
> ...     return a + b
> >>> my_udf = pandas_udf(add, returnType=LongType())
> {code}
> And I want to process some data with it, say 32-bit ints
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> I get an integer overflow for the data, as I would expect. But as soon as I add a {{None}} to the data, even on a different row, the result I get back is totally different.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> +----------+----------+---------------+
> {code}
> The integer overflow disappears. This is because Arrow and/or pandas changes the data type to a float in order to be able to store the null value. Since the processing is then done in floating point, there is no overflow. This in and of itself is annoying but understandable, because it is dealing with a limitation in pandas.
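The upcast is easy to reproduce outside of Spark. Below is a minimal sketch of the Arrow-to-pandas conversion that happens under the hood for each batch, assuming only that pyarrow and pandas are installed and that pyarrow's default conversion options are in effect:

{code}
import pyarrow as pa

# Without nulls an Arrow int32 column stays int32 in pandas,
# so additions inside the UDF wrap around at 32 bits.
print(pa.array([1037694396, 1], type=pa.int32()).to_pandas().dtype)     # int32

# pandas has no missing-value representation for int32, so as soon as a
# batch contains a null the whole column is silently upcast to float64.
print(pa.array([1037694396, None], type=pa.int32()).to_pandas().dtype)  # float64
{code}

Because the conversion is done per column and per batch, a single null is enough to switch every value in that batch to float64.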
> Where it becomes a bug is that this happens on a per-batch basis. This means that I can have the same two rows in different parts of my data set and get different results depending on their proximity to a null value.
> {code}
> >>> df = spark.createDataFrame([(1037694399, 1204615848), (3, None), (1037694399, 1204615848), (3, 4)],
> ...     StructType([StructField("a", IntegerType()), StructField("b", IntegerType())]))
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|     2242310244|
> |         3|         4|              4|
> +----------+----------+---------------+
> >>> spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', '2')
> >>> df.select(col("a"), col("b"), my_udf(col("a") - 3, col("b"))).show()
> +----------+----------+---------------+
> |         a|         b|add((a - 3), b)|
> +----------+----------+---------------+
> |1037694399|1204615848|     2242310244|
> |         3|      null|           null|
> |1037694399|1204615848|    -2052657052|
> |         3|         4|              4|
> +----------+----------+---------------+
> {code}
> For me personally, I would prefer to have all nullable integer columns upgraded to float all the time; that way it is at least consistent.
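Until the conversion behavior changes, a user-side workaround is to widen the UDF's inputs explicitly so that every batch sees the same arithmetic whether or not it contains a null. This is only a sketch, reusing the df and my_udf defined above; the cast to long follows Sean's suggestion, and the cast to double reproduces by hand the always-float behavior the reporter asks for:

{code}
from pyspark.sql.functions import col

# Option 1: do the math in 64 bits. A batch that contains a null is still
# converted to float64, but values of this size are represented exactly,
# so the overflowing row yields 2242310244 in every batch.
df.select(my_udf((col("a") - 3).cast("long"), col("b").cast("long"))).show()

# Option 2: always hand the UDF doubles, so null and non-null batches
# take the same floating-point path.
df.select(my_udf((col("a") - 3).cast("double"), col("b").cast("double"))).show()
{code}

Neither removes the underlying per-batch type switch; they only make its effect consistent for data like this.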