[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481037#comment-17481037 ] Bjørn Jørgensen commented on SPARK-37981:
-
{code:java}
df_json = spark.read.option("multiline", "true").json("json_null.json")

def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))

pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_json.shape())
{code}
(1, 4)
{code:java}
df_json.write.json("json_df.json")
df_json2 = spark.read.option("multiline", "true").json("json_df.json/*.json")
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_json2.shape())
{code}
(1, 3)
{code:java}
df_json.write.parquet("json_df.parquet")
df_parquet3 = spark.read.parquet("json_df.parquet/*")
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_parquet3.shape())
{code}
(1, 4)
{code:java}
pandas_df_json = pd.read_json("json_null.json")
pandas_df_json.shape
{code}
(6, 4)
{code:java}
pandas_df_json2 = pd.read_json("pandas_df_json.json")
pandas_df_json2.shape
{code}
(6, 4)
{code:java}
pandas_api = ps.read_json("pandas_df_json.json")
pandas_api.shape
{code}
(1, 4)
{code:java}
pandas_api.to_json("pandas_api_json.json")
pandas_api2 = ps.read_json("pandas_api_json.json")
pandas_api2.shape
{code}
(1, 3)
As we can see, PySpark and the pandas API on Spark drop columns whose values are all null or NaN when writing JSON; plain pandas does not.
{code:java}
df.write.json?
{code}
does not have a note about this. It links to the documentation, where we find that ignoreNullFields is controlled by the value of the spark.sql.jsonGenerator.ignoreNullFields configuration.
{code:java}
pandas_api.to_json?
{code}
does have a note on NaN and None, and says the user should see the PySpark docs for JSON files: "Note NaN's and None will be converted to null and datetime objects will be converted to UNIX timestamps."
> Deletes columns with all Null as default.
> -
>
> Key: SPARK-37981
> URL: https://issues.apache.org/jira/browse/SPARK-37981
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Bjørn Jørgensen
> Priority: Major
> Attachments: json_null.json
>
> Spark 3.2.1-RC2
> During write.json, Spark deletes columns with all Null as default.
>
> Spark does have dropFieldIfAllNull = false as default, according to
> https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:java}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>         .set('spark.driver.memory', '64g') \
>         .set("fs.s3a.access.key", "minio") \
>         .set("fs.s3a.secret.key", "") \
>         .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>         .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>         .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>         .set("spark.sql.repl.eagerEval.enabled", "True") \
>         .set("spark.sql.adaptive.enabled", "True") \
>         .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>         .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>         .set("sc.setLogLevel", "error")
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> (653610, 267)
>
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
> print(d3.shape())
> (653610, 186)
> {code}
--
This message was sent by Atlassian Jira (v8.20.1#820001)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
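For contrast with the Spark writer, plain pandas keeps an all-null column when serializing to JSON. A minimal, self-contained sketch (toy two-row frame standing in for the attached json_null.json, not the actual attachment):

```python
import json

import pandas as pd

# Toy frame with one all-null column, "b".
df = pd.DataFrame({"a": [1, 2], "b": [None, None]})

# pandas' to_json keeps column "b" and writes its values as JSON null,
# so a read-back sees the same number of columns.
records = json.loads(df.to_json(orient="records"))
print(records)  # [{'a': 1, 'b': None}, {'a': 2, 'b': None}]
```

This is why the pd.read_json round trips above stay at 4 columns while the Spark JSON write drops to 3.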
[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480737#comment-17480737 ] Bjørn Jørgensen commented on SPARK-37981:
-
Please note that the PR I have opened is for pandas on Spark (Koalas, the pandas-on-Spark API). Pandas on Spark has the to_json function; that is the one this PR is for.
[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480733#comment-17480733 ] Apache Spark commented on SPARK-37981:
-
User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35296
[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480293#comment-17480293 ] Bjørn Jørgensen commented on SPARK-37981:
-
[^json_null.json]
{code:java}
from pyspark import pandas as ps
import re
import numpy as np
import os
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster('local[*]')
    conf \
        .set('spark.driver.memory', '64g') \
        .set("fs.s3a.access.key", "minio") \
        .set("fs.s3a.secret.key", "") \
        .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
        .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .set("spark.hadoop.fs.s3a.path.style.access", "true") \
        .set("spark.sql.repl.eagerEval.enabled", "True") \
        .set("spark.sql.adaptive.enabled", "True") \
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
        .set("sc.setLogLevel", "error")
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("Falk", SparkConf())
df = spark.read.option("multiline", "true").json("json_null.json")

import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df.shape())
(1, 4)

df.write.json("df.json")
df = spark.read.json("df.json/*.json")
print(df.shape())
(1, 3)
{code}
[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.
[ https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480291#comment-17480291 ] Maciej Szymkiewicz commented on SPARK-37981:
-
This doesn't seem valid. {{dropFieldIfAllNull}} is a reader option. For writes, we use {{ignoreNullFields}}. So your code should be
{code}
d3.write.option("ignoreNullFields", "false").json("d3.json")
{code}