[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481037#comment-17481037 ]
Bjørn Jørgensen commented on SPARK-37981:
-----------------------------------------

{code:java}
df_json = spark.read.option("multiline", "true").json("json_null.json")

def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))

pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_json.shape())
{code}
(1, 4)

{code:java}
df_json.write.json("json_df.json")
df_json2 = spark.read.option("multiline", "true").json("json_df.json/*.json")
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_json2.shape())
{code}
(1, 3)

{code:java}
df_json.write.parquet("json_df.parquet")
df_parquet3 = spark.read.parquet("json_df.parquet/*")
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_parquet3.shape())
{code}
(1, 4)

{code:java}
pandas_df_json = pd.read_json("json_null.json")
pandas_df_json.shape
{code}
(6, 4)

{code:java}
pandas_df_json2 = pd.read_json("pandas_df_json.json")
pandas_df_json2.shape
{code}
(6, 4)

{code:java}
pandas_api = ps.read_json("pandas_df_json.json")
pandas_api.shape
{code}
(1, 4)

{code:java}
pandas_api.to_json("pandas_api_json.json")
pandas_api2 = ps.read_json("pandas_api_json.json")
pandas_api2.shape
{code}
(1, 3)

As we can see, PySpark and the pandas API on Spark delete columns whose values are all null or NaN when writing JSON.

{code:java}
df.write.json?
{code}
does not have a note about this. There is a link to the documentation, where we find that ignoreNullFields is controlled by the value of the spark.sql.jsonGenerator.ignoreNullFields configuration.

{code:java}
pandas_api.to_json?
{code}
does have a note on NaN and None, and says the user should see the PySpark docs for JSON files: "Note: NaN's and None will be converted to null and datetime objects will be converted to UNIX timestamps."

> Deletes columns with all Null as default.
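For completeness, a minimal sketch of a per-write workaround (assuming Spark 3.x, where the JSON writer accepts an ignoreNullFields option mirroring the spark.sql.jsonGenerator.ignoreNullFields configuration; the DataFrame and paths below are made up for illustration):

{code:python}
import os
import tempfile

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.master("local[1]").appName("null-cols").getOrCreate()

# Column "b" is null in every row, so a default write.json() would drop it
# from every record, and schema inference could not find it on read-back.
df = spark.createDataFrame([Row(a=1, b=None), Row(a=2, b=None)], "a INT, b STRING")

out = os.path.join(tempfile.mkdtemp(), "with_nulls.json")
df.write.option("ignoreNullFields", "false").json(out)

# The all-null column survives the round trip.
print(len(df.columns), len(spark.read.json(out).columns))
{code}

Setting spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false") before writing should have the same session-wide effect.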
> -----------------------------------------
>
>                 Key: SPARK-37981
>                 URL: https://issues.apache.org/jira/browse/SPARK-37981
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>         Attachments: json_null.json
>
> Spark 3.2.1-RC2
> During write.json, Spark deletes columns with all null values by default.
>
> Spark does have dropFieldIfAllNull set to false by default, according to
> https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:java}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>         .set("sc.setLogLevel", "error")
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
>
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
>
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
>
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> (653610, 267)
>
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
> print(d3.shape())
> (653610, 186)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
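The shrinking column count reported above can be sketched in plain Python (an illustration of the mechanism only, not Spark's actual implementation): when null fields are dropped from each serialized record, a column whose values are all null never reaches the output files, so schema inference on read-back cannot recover it.

{code:python}
import json

rows = [{"a": 1, "b": None}, {"a": 2, "b": None}]

def to_json_line(row, ignore_null_fields=True):
    # Mimics a JSON-lines writer that omits null fields from each record.
    if ignore_null_fields:
        row = {k: v for k, v in row.items() if v is not None}
    return json.dumps(row)

written = [to_json_line(r) for r in rows]
print(written)   # ['{"a": 1}', '{"a": 2}'] -- column "b" never appears

# Schema inference over the written lines can only see column "a".
inferred = sorted(set().union(*(json.loads(line).keys() for line in written)))
print(inferred)  # ['a']
{code}

Because only schema inference loses the columns, reading the files back with the original schema, e.g. spark.read.schema(d3.schema).json("d3.json/*.json"), should also restore them (as all-null columns).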