[jira] [Updated] (SPARK-41855) `createDataFrame` doesn't handle None/NaN properly

Ruifeng Zheng (Jira) Mon, 02 Jan 2023 19:53:03 -0800


     [ https://issues.apache.org/jira/browse/SPARK-41855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ruifeng Zheng updated SPARK-41855:
----------------------------------
    Description: 
{code:python}
        data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)]

        # +---+-----+
        # | id|value|
        # +---+-----+
        # |  1|  NaN|
        # |  2| 42.0|
        # |  3| null|
        # +---+-----+

        cdf = self.connect.createDataFrame(data)
        sdf = self.spark.createDataFrame(data)

        print()
        print()
        print(cdf._show_string(100, 100, False))
        print()
        print(cdf.schema)
        print()
        print(sdf._jdf.showString(100, 100, False))
        print()
        print(sdf.schema)

        self.compare_by_show(cdf, sdf)
{code}



{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+


StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+


StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

{code}



This issue occurs because `createDataFrame` does not handle None/NaN properly:

1. In the conversion from local data to pd.DataFrame, None is automatically converted to NaN.
2. Then, in the conversion from pd.DataFrame to pa.Table, NaN is always converted to null.

  was:

{code:python}
        data = [Row(id=1, value=float("NaN")), Row(id=2, value=42.0), Row(id=3, value=None)]

        # +---+-----+
        # | id|value|
        # +---+-----+
        # |  1|  NaN|
        # |  2| 42.0|
        # |  3| null|
        # +---+-----+

        cdf = self.connect.createDataFrame(data)
        sdf = self.spark.createDataFrame(data)

        print()
        print()
        print(cdf._show_string(100, 100, False))
        print()
        print(cdf.schema)
        print()
        print(sdf._jdf.showString(100, 100, False))
        print()
        print(sdf.schema)

        self.compare_by_show(cdf, sdf)
{code}



{code:java}
+---+-----+
| id|value|
+---+-----+
|  1| null|
|  2| 42.0|
|  3| null|
+---+-----+


StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

+---+-----+
| id|value|
+---+-----+
|  1|  NaN|
|  2| 42.0|
|  3| null|
+---+-----+


StructType([StructField('id', LongType(), True), StructField('value', DoubleType(), True)])

{code}



This issue occurs because `createDataFrame` does not handle None properly:

1. In the conversion from local data to pd.DataFrame, None is automatically converted to NaN.
2. Then, in the conversion from pd.DataFrame to pa.Table, NaN is always converted to null.


> `createDataFrame` doesn't handle None/NaN properly
> --------------------------------------------------
>
>                 Key: SPARK-41855
>                 URL: https://issues.apache.org/jira/browse/SPARK-41855
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Ruifeng Zheng
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

