[ 
https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-38067.
----------------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 35296
[https://github.com/apache/spark/pull/35296]

> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Assignee: Bjørn Jørgensen
>            Priority: Major
>             Fix For: 3.3.0
>
>
> If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing 
> values are written explicitly:
> {code:python}
> import pandas as pd
> import pyspark.pandas as ps
> pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
> psf = ps.from_pandas(pdf)
> psf.to_json()
> ## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
> {code}
> This behavior is consistent with Pandas:
> {code:python}
> pdf.to_json()
> ## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
> {code}
> However, if {{path}} is provided, missing values are omitted by default:
> {code:python}
> import tempfile
> path = tempfile.mktemp()
> psf.to_json(path)
> spark.read.text(path).show()
> ## +--------------------+
> ## |               value|
> ## +--------------------+
> ## |{"id":2,"value":3.0}|
> ## |            {"id":3}|
> ## |            {"id":1}|
> ## +--------------------+
> {code}
> We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API on Spark, 
> so that both cases handle missing values in the same way:
> {code:python}
> psf.to_json(path, ignoreNullFields=False)
> spark.read.text(path).show(truncate=False)
> ## +---------------------+
> ## |value                |
> ## +---------------------+
> ## |{"id":3,"value":null}|
> ## |{"id":1,"value":null}|
> ## |{"id":2,"value":3.0} |
> ## +---------------------+
> {code}
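
The difference between the two file outputs quoted above can be modelled in plain Python, without a Spark session. This is only an illustrative sketch: {{to_json_lines}} and its {{ignore_null_fields}} parameter are hypothetical names made up for this example, not part of PySpark, and the code does not reproduce how Spark's JSON writer is implemented. It only shows what dropping null fields does to each record.

```python
import json


def to_json_lines(records, ignore_null_fields=False):
    """Serialize each record as one compact JSON object per line.

    Hypothetical helper: ignore_null_fields imitates the effect of the
    ignoreNullFields option discussed above by dropping keys whose value
    is None before serializing.
    """
    lines = []
    for record in records:
        if ignore_null_fields:
            # Drop missing values entirely, so the key disappears from the output.
            record = {k: v for k, v in record.items() if v is not None}
        lines.append(json.dumps(record, separators=(",", ":")))
    return lines


records = [
    {"id": 1, "value": None},
    {"id": 2, "value": 3.0},
    {"id": 3, "value": None},
]

# Null fields dropped, as in the first file output quoted above.
omitted = to_json_lines(records, ignore_null_fields=True)
# Null fields kept, as in the output with ignoreNullFields=False.
explicit = to_json_lines(records, ignore_null_fields=False)

for line in omitted + explicit:
    print(line)
```

With null fields dropped, a reader of the file cannot tell a missing column from a missing value, which is why the explicit form matches the in-memory {{to_json()}} result more closely.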



