Abdeali Kothari created SPARK-30187:
---------------------------------------

             Summary: NULL handling in PySpark-PandasUDF
                 Key: SPARK-30187
                 URL: https://issues.apache.org/jira/browse/SPARK-30187
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.4
            Reporter: Abdeali Kothari


The new Pandas UDF has been very helpful in simplifying UDF writing, and it is more performant. But I cannot reliably use it because its NULL handling differs from that of normal Spark.

Here is my understanding ...

In Spark, null/missing values are represented as follows (a short sketch after the list illustrates this):
 * Float: null/NaN
 * Integer: null
 * String: null
 * DateTime: null
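
For example, Spark keeps null and NaN as distinct values in a float column. A minimal sketch (assuming a running SparkSession named spark, as in the snippets further down):

{code:java}
>>> from pyspark.sql import functions as F
>>> sdf = spark.createDataFrame([[1.0], [None], [float('nan')]], ['a'])
>>> sdf.select(sdf['a'].isNull().alias('is_null'),
...            F.isnan(sdf['a']).alias('is_nan')).collect()
[Row(is_null=False, is_nan=False), Row(is_null=True, is_nan=False), Row(is_null=False, is_nan=True)]
{code}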

 

In pandas, null/missing values are represented as follows (again illustrated after the list):
 * Float: NaN
 * Integer: <not possible>
 * String: null
 * DateTime: NaT
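
A quick illustration of the pandas side, using plain pandas (assuming a version without the nullable Int64 extension dtype):

{code:java}
>>> import pandas as pd
>>> pd.Series([1.0, None, float('nan')])  # None is silently coerced to NaN
0    1.0
1    NaN
2    NaN
dtype: float64
>>> pd.Series([1, 2, None])  # an integer column with a missing value is upcast to float64
0    1.0
1    2.0
2    NaN
dtype: float64
>>> pd.Series([pd.Timestamp('2019-01-01'), None])  # datetimes use NaT
0   2019-01-01
1          NaT
dtype: datetime64[ns]
{code}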

When I use a Pandas UDF in Spark, it looks to me like there is information loss: I can no longer differentiate between null and NaN, which I could do with the older Python UDF.
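
For instance, a row-at-a-time Python UDF still sees None and NaN as different values. A minimal sketch (not taken from the original job; the kind() helper is made up for illustration):

{code:java}
>>> import math
>>> from pyspark.sql import functions as F
>>> @F.udf("string")
... def kind(x):
...     if x is None:
...         return "null"
...     return "nan" if math.isnan(x) else "value"
>>> sdf = spark.createDataFrame([[1.0], [None], [float('nan')]], ['a'])
>>> sdf.select(kind(sdf['a'])).collect()
[Row(kind(a)='value'), Row(kind(a)='null'), Row(kind(a)='nan')]
{code}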

 
{code:java}
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     return x.sum()
>>> sdf = spark.createDataFrame(
    [ [1.0, 2.0],
      [None, 3.0],
      [float('nan'), 4.0]
    ], ['a', 'b'])
>>> sdf.agg(pd_sum(sdf['a'])).collect()
[Row(pd_sum(a)=1.0)]
>>> sdf.select(F.sum(sdf['a'])).collect()
[Row(sum(a)=nan)]
{code}
 

If I use an integer column with NULL values, the Pandas UDF actually receives a float type:

 
{code:java}
>>> sdf = spark.createDataFrame([ [1, 2.0], [None, 3.0] ], ['a', 'b'])

>>> sdf.dtypes
[('a', 'bigint'), ('b', 'double')]
>>> @F.pandas_udf("integer", F.PandasUDFType.GROUPED_AGG)
... def pd_sum(x):
...     print(x)
...     return x.sum()
>>> sdf.agg(pd_sum(sdf['a'])).collect()
0    1.0
1    NaN
Name: _0, dtype: float64
[Row(pd_sum(a)=1)]
{code}
 

I'm not sure whether this is something Spark should handle, but I wanted to understand whether there is a plan to manage this?

Because, from what I understand, if someone wants to use Pandas UDFs as of now, they need to make some assumptions like the following (see the workaround sketch after this list):
 * The full range of bigint cannot be used, because the column gets converted to float when null values are present
 * A float column should contain either NaN or NULL - not both
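
One possible workaround, sketched below, is to carry an explicit indicator column computed on the Spark side before the data is converted to pandas. Since isNull() is false for NaN, it cleanly separates the two cases. The 'a_is_null' column is a hypothetical helper introduced here for illustration, not something Spark provides:

{code:java}
>>> # 'a_is_null' is a hypothetical helper column: true only for real nulls,
>>> # false for NaN, so the distinction survives the conversion to pandas.
>>> sdf = spark.createDataFrame(
...     [[1.0, 2.0], [None, 3.0], [float('nan'), 4.0]], ['a', 'b'])
>>> sdf2 = sdf.withColumn('a_is_null', sdf['a'].isNull())
>>> sdf2.select('a', 'a_is_null').collect()
[Row(a=1.0, a_is_null=False), Row(a=None, a_is_null=True), Row(a=nan, a_is_null=False)]
{code}

Passing both 'a' and 'a_is_null' into the Pandas UDF then lets the function recover the null/NaN distinction, at the cost of extra columns.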