Abdeali Kothari created SPARK-30187:
---------------------------------------
Summary: NULL handling in PySpark-PandasUDF
Key: SPARK-30187
URL: https://issues.apache.org/jira/browse/SPARK-30187
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.4.4
Reporter: Abdeali Kothari

The new pandas UDF has been very helpful: it simplifies writing UDFs and is more performant. But I cannot reliably use it because its NULL handling differs from normal Spark's. Here is my understanding.

In Spark, nulls/missing values are represented as follows:
 * Float: null/NaN
 * Integer: null
 * String: null
 * DateTime: null

In pandas, nulls/missing values are represented as follows:
 * Float: NaN
 * Integer: <not possible>
 * String: null
 * DateTime: NaT

When I use a Pandas UDF in Spark, there appears to be information loss: I can no longer differentiate between null and NaN, which I could do with the older Python UDF.

{code:java}
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     return x.sum()

>>> sdf = spark.createDataFrame(
...     [[1.0, 2.0], [None, 3.0], [float('nan'), 4.0]], ['a', 'b'])
>>> sdf.agg(pd_sum(sdf['a'])).collect()
[Row(pd_sum(a)=1.0)]
>>> sdf.select(F.sum(sdf['a'])).collect()
[Row(sum(a)=nan)]
{code}

If I use an integer column with NULL values, the Pandas UDF actually receives a float type:

{code:java}
>>> sdf = spark.createDataFrame([[1, 2.0], [None, 3.0]], ['a', 'b'])
>>> sdf.dtypes
[('a', 'bigint'), ('b', 'double')]
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     print(x)
...     return x.sum()

>>> sdf.agg(pd_sum(sdf['a'])).collect()
0    1.0
1    NaN
Name: _0, dtype: float64
[Row(pd_sum(a)=1)]
{code}

I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage it.
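To make the information-loss point reproducible without a Spark session, here is a pure-pandas sketch (my own illustration, not Spark code) of what the pandas UDF sees: once `[1.0, None, NaN]` lands in a float64 Series, None is coerced to NaN and the two kinds of missing value become indistinguishable, while a plain Python aggregation (as in the older Python UDF) can still tell them apart.

```python
import math
import pandas as pd

# What Spark ships to the pandas UDF: a float64 Series.
# pandas coerces None to NaN on construction, so null vs NaN is lost.
data = [1.0, None, float('nan')]
s = pd.Series(data)
print(s.dtype)             # float64
print(s.isna().tolist())   # [False, True, True] -- both look identical
print(s.sum())             # 1.0 -- pandas skips NaN by default

# A plain Python aggregation over the raw values can still distinguish them:
nulls = sum(1 for v in data if v is None)
nans = sum(1 for v in data if v is not None and math.isnan(v))
print(nulls, nans)         # 1 1
```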
Because from what I understand, anyone who wants to use pandas DataFrames as of now needs to make some assumptions, like:
 * The entire range of BigInteger will not work, because the column gets converted to float (if null values are present)
 * A float column should contain either NaN or NULL - not both
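The first assumption can be demonstrated in pandas alone (again my own sketch, not Spark behaviour): a plain int64 Series cannot hold a missing value, so pandas silently upcasts to float64, which is why the full bigint range cannot round-trip. pandas' nullable integer extension dtype ("Int64", available since pandas 0.24) would keep the column integral, though Spark's Arrow conversion does not use it here.

```python
import pandas as pd

# int64 cannot represent a missing value, so pandas upcasts to float64 --
# the same thing that happens to a nullable bigint column in a pandas UDF.
upcast = pd.Series([1, None])
print(upcast.dtype)              # float64

# pandas' nullable integer extension dtype keeps the column integral
# and represents the missing value as pd.NA instead:
nullable = pd.Series([1, None], dtype="Int64")
print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [False, True]
print(int(nullable.sum()))       # 1
```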