[ https://issues.apache.org/jira/browse/SPARK-38067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Szymkiewicz updated SPARK-38067:
---------------------------------------
Description: 
If {{ps.DataFrame.to_json}} is called without the {{path}} argument, missing values are written explicitly:

{code:python}
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"id": [1, 2, 3], "value": [None, 3, None]})
psf = ps.from_pandas(pdf)
psf.to_json()
## '[{"id":1,"value":null},{"id":2,"value":3.0},{"id":3,"value":null}]'
{code}

This behavior is consistent with Pandas:

{code:python}
pdf.to_json()
## '{"id":{"0":1,"1":2,"2":3},"value":{"0":null,"1":3.0,"2":null}}'
{code}

However, if {{path}} is provided, missing values are omitted by default:

{code:python}
import tempfile

path = tempfile.mktemp()
psf.to_json(path)
spark.read.text(path).show()
## +--------------------+
## |               value|
## +--------------------+
## |{"id":2,"value":3.0}|
## |            {"id":3}|
## |            {"id":1}|
## +--------------------+
{code}

We should set {{ignoreNullFields}} to {{False}} by default in the Pandas API, so both cases handle missing values in the same way.

{code:python}
psf.to_json(path, ignoreNullFields=False)
spark.read.text(path).show(truncate=False)
## +---------------------+
## |value                |
## +---------------------+
## |{"id":3,"value":null}|
## |{"id":1,"value":null}|
## |{"id":2,"value":3.0} |
## +---------------------+
{code}

was:
With pandas:

{code:python}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_pd = pd.DataFrame.from_dict(data)
test_pd.shape
{code}
(4, 2)

{code:python}
test_pd.to_json("testpd.json")
test_pd2 = pd.read_json("testpd.json")
test_pd2.shape
{code}
(4, 2)

The Pandas on Spark API, however, deletes any column whose values are all null.
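The column loss can be illustrated without Spark: if null-valued fields are dropped from each JSON record on write, a reader that infers the schema from the keys it observes has no way to recover an all-null column. A minimal stdlib-only sketch of that mechanism (the helper names are hypothetical, for illustration only, not Spark's actual writer):

{code:python}
import json

# Four rows where col_2 is entirely null, matching the example below.
rows = [{"col_1": v, "col_2": None} for v in [3, 2, 1, 0]]

def to_json_lines(records, ignore_null_fields):
    """Emit one JSON object per line, optionally dropping null-valued
    fields (roughly what a writer with ignoreNullFields enabled does)."""
    lines = []
    for rec in records:
        if ignore_null_fields:
            rec = {k: v for k, v in rec.items() if v is not None}
        lines.append(json.dumps(rec))
    return lines

def inferred_columns(lines):
    """Infer the schema as the union of keys seen across all records."""
    cols = set()
    for line in lines:
        cols.update(json.loads(line))
    return sorted(cols)

print(inferred_columns(to_json_lines(rows, ignore_null_fields=True)))   # ['col_1']
print(inferred_columns(to_json_lines(rows, ignore_null_fields=False)))  # ['col_1', 'col_2']
{code}

With null fields dropped, no record ever mentions {{col_2}}, so schema inference cannot see it; writing explicit nulls keeps the key present in every record.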
{code:python}
data = {'col_1': [3, 2, 1, 0], 'col_2': [None, None, None, None]}
test_ps = ps.DataFrame.from_dict(data)
test_ps.shape
{code}
(4, 2)

{code:python}
test_ps.to_json("testps.json")
test_ps2 = ps.read_json("testps.json/*")
test_ps2.shape
{code}
(4, 1)

We need to change this so that the Pandas on Spark API behaves like pandas. I have opened a PR for this.


> Inconsistent missing values handling in Pandas on Spark to_json
> ---------------------------------------------------------------
>
>                 Key: SPARK-38067
>                 URL: https://issues.apache.org/jira/browse/SPARK-38067
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Bjørn Jørgensen
>            Priority: Major


--
This message was sent by Atlassian Jira
(v8.20.1#820001)