[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean R. Owen updated SPARK-37981:
---------------------------------
    Affects Version/s: 3.2.0  (was: 3.2.1)
             Priority: Major  (was: Critical)

This isn't possible to evaluate without seeing some input data.

> Deletes columns with all Null as default.
> -----------------------------------------
>
>                 Key: SPARK-37981
>                 URL: https://issues.apache.org/jira/browse/SPARK-37981
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Bjørn Jørgensen
>            Priority: Major
>
> Spark 3.2.1-RC2
> During write.json, Spark deletes columns whose values are all Null by default.
> Spark does have dropFieldIfAllNull set to false as default, according to
> https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:python}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "10000") \
>         .set("sc.setLogLevel", "error")
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
>
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
>
> # Monkey-patch a pandas-style shape() helper onto DataFrame.
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
>
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
>
> print(d3.shape())
> # (653610, 267)
>
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
>
> print(d3.shape())
> # (653610, 186)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
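[Editor's note: the shrinking column count is consistent with the JSON *writer* omitting null fields (the `ignoreNullFields` write option, which defaults to true since Spark 3.0), rather than with `dropFieldIfAllNull`, which is a *read*-side option. A column that is null in every row then never appears in the output files, so schema inference on re-read cannot recover it. The sketch below illustrates that mechanism in plain Python, without Spark, using a hypothetical two-column dataset; if this is indeed the cause, writing with `.option("ignoreNullFields", "false")` should preserve the columns.]

```python
import json

# Rows where column "b" is null in every record (hypothetical data).
rows = [
    {"a": 1, "b": None},
    {"a": 2, "b": None},
]

# Write side: drop null fields before serializing, analogous to
# Spark's JSON generator with ignoreNullFields=true (the default).
lines = [json.dumps({k: v for k, v in r.items() if v is not None})
         for r in rows]

# Read side: infer the "schema" as the union of keys seen in the data,
# as JSON schema inference does.
inferred = sorted({k for line in lines for k in json.loads(line)})
print(inferred)  # ['a'] -- the all-null column "b" is gone
```

The row count survives the round trip, but any column that was null everywhere vanishes from the inferred schema, matching the drop from 267 to 186 columns reported above.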