[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Szymkiewicz updated SPARK-38067:
---------------------------------------
Description: 
If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:

{code:python}
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
psf = ps.from_pandas(pdf)
psf.to_json()
## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
{code}

This behavior is consistent with Pandas:

{code:python}
pdf.to_json()
## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
{code}

However, if {{path}} is provided, missing values are omitted by default:

{code:python}
import tempfile

path = tempfile.mktemp()
psf.to_json(path)
spark.read.text(path).show()
## +--------------------+
## |               value|
## +--------------------+
## |{"id":2,"value":3.0}|
## |            {"id":3}|
## |            {"id":1}|
## +--------------------+
{code}

We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API, so both cases handle missing values in the same way.

{code:python}
psf.to_json(path, ignoreNullFields=False)
spark.read.text(path).show(truncate=False)
## +---------------------+
## |value                |
## +---------------------+
## |{"id":3,"value":null}|
## |{"id":1,"value":null}|
## |{"id":2,"value":3.0} |
## +---------------------+
{code}

was:
With pandas:

{code:python}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_pd = pd.DataFrame.from_dict(data)
test_pd.shape
{code}
(4, 2)

{code:python}
test_pd.to_json("testpd.json")
test_pd2 = pd.read_json("testpd.json")
test_pd2.shape
{code}
(4, 2)

The Pandas on Spark API, however, deletes any column whose values are all null.
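The column loss can be illustrated without Spark: if null-valued fields are dropped from each JSON record on write, a reader that infers the schema from the keys it observes has no way to recover an all-null column. A minimal stdlib-only sketch of that mechanism (the helper names are hypothetical, for illustration only, not Spark's actual writer):

{code:python}
import json

# Four rows where col_2 is entirely null, matching the example below.
rows = [{"col_1": v, "col_2": None} for v in [3, 2, 1, 0]]

def to_json_lines(records, ignore_null_fields):
    """Emit one JSON object per line, optionally dropping null-valued
    fields (roughly what a writer with ignoreNullFields enabled does)."""
    lines = []
    for rec in records:
        if ignore_null_fields:
            rec = {k: v for k, v in rec.items() if v is not None}
        lines.append(json.dumps(rec))
    return lines

def inferred_columns(lines):
    """Infer the schema as the union of keys seen across all records."""
    cols = set()
    for line in lines:
        cols.update(json.loads(line))
    return sorted(cols)

print(inferred_columns(to_json_lines(rows, ignore_null_fields=True)))   # ['col_1']
print(inferred_columns(to_json_lines(rows, ignore_null_fields=False)))  # ['col_1', 'col_2']
{code}

With null fields dropped, no record ever mentions {{col_2}}, so schema inference cannot see it; writing explicit nulls keeps the key present in every record.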
{code:python}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_ps = ps.DataFrame.from_dict(data)
test_ps.shape
{code}
(4, 2)

{code:python}
test_ps.to_json("testps.json")
test_ps2 = ps.read_json("testps.json/*")
test_ps2.shape
{code}
(4, 1)

We need to change this so that the Pandas on Spark API behaves like pandas. I have opened a PR for this.


> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Priority: Major


--
This message was sent by Atlassian Jira
(v8.20.1#820001)