[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481037#comment-17481037
 ] 

Bjørn Jørgensen commented on SPARK-37981:
-


{code:java}
import pyspark

df_json = spark.read.option("multiline", "true").json("json_null.json")

# Monkey-patch a pandas-style .shape onto Spark DataFrames
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df_json.shape())
{code}

(1, 4)


{code:java}
# JSON round trip: write, then read back
df_json.write.json("json_df.json")

df_json2 = spark.read.option("multiline", "true").json("json_df.json/*.json")
print(df_json2.shape())

{code}

(1, 3)
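
To see exactly which column the round trip lost, here is a quick sketch using the two frames above:

{code:java}
# Columns present before the JSON round trip but missing afterwards
print(set(df_json.columns) - set(df_json2.columns))
{code}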


{code:java}
# Parquet round trip keeps all four columns, since Parquet stores the schema with the data
df_json.write.parquet("json_df.parquet")
df_parquet3 = spark.read.parquet("json_df.parquet/*")
print(df_parquet3.shape())

{code}

(1, 4)


{code:java}
pandas_df_json = pd.read_json("json_null.json")
pandas_df_json.shape

{code}

(6, 4)


{code:java}
pandas_df_json2 = pd.read_json("pandas_df_json.json")
pandas_df_json2.shape

{code}
(6, 4)


{code:java}
pandas_api = ps.read_json("pandas_df_json.json")

pandas_api.shape

{code}
(1, 4)


{code:java}
pandas_api.to_json("pandas_api_json.json")
pandas_api2 = ps.read_json("pandas_api_json.json")

pandas_api2.shape

{code}
(1, 3)

As we can see, PySpark and the pandas API on Spark drop columns whose values are 
all null or NaN by default when writing JSON. 


{code:java}
df.write.json?
{code}

Its docstring has no note about this behaviour. It does link to the documentation, 
where we find that ignoreNullFields is controlled by the value of the 
spark.sql.jsonGenerator.ignoreNullFields configuration.
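
For reference, a minimal sketch of the write-side workaround (the option name is confirmed later in this thread; the output path is only for illustration):

{code:java}
# Keep null fields when writing JSON so the all-null column survives the round trip
df_json.write.option("ignoreNullFields", "false").json("json_df_keep_nulls.json")

df_json3 = spark.read.json("json_df_keep_nulls.json/*.json")
print(df_json3.shape())  # expected (1, 4)
{code}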


{code:java}
pandas_api.to_json?
{code}
Its docstring does have a note on NaN and None, and it points the user to the 
PySpark docs for JSON files: 
"Note NaN's and None will be converted to null and datetime objects
will be converted to UNIX timestamps."
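
A sketch of the same workaround on the pandas-on-Spark side, assuming to_json forwards extra keyword options to the underlying Spark JSON writer when a path is given:

{code:java}
# Assumption: ignoreNullFields is passed through to the Spark JSON writer via **options
pandas_api.to_json("pandas_api_keep_nulls.json", ignoreNullFields=False)

pandas_api3 = ps.read_json("pandas_api_keep_nulls.json")
print(pandas_api3.shape)  # expected (1, 4) if the option is applied
{code}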


> Deletes columns with all Null as default.
> -
>
> Key: SPARK-37981
> URL: https://issues.apache.org/jira/browse/SPARK-37981
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: json_null.json
>
>
> Spark 3.2.1-RC2 
> During write.json, Spark deletes columns whose values are all null by default. 
>  
> Spark does have dropFieldIfAllNull = false as the default, according to 
> https://spark.apache.org/docs/latest/sql-data-sources-json.html
> {code:java}
> from pyspark import pandas as ps
> import re
> import numpy as np
> import os
> import pandas as pd
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
> from pyspark.sql.types import StructType, StructField, StringType, IntegerType
>
> os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
>
> def get_spark_session(app_name: str, conf: SparkConf):
>     conf.setMaster('local[*]')
>     conf \
>       .set('spark.driver.memory', '64g') \
>       .set("fs.s3a.access.key", "minio") \
>       .set("fs.s3a.secret.key", "") \
>       .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
>       .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
>       .set("spark.hadoop.fs.s3a.path.style.access", "true") \
>       .set("spark.sql.repl.eagerEval.enabled", "True") \
>       .set("spark.sql.adaptive.enabled", "True") \
>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
>       .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
>       .set("sc.setLogLevel", "error")
>     return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
>
> spark = get_spark_session("Falk", SparkConf())
>
> d3 = spark.read.option("multiline", "true").json("/home/jovyan/notebooks/falk/data/norm_test/3/*.json")
>
> import pyspark
> def sparkShape(dataFrame):
>     return (dataFrame.count(), len(dataFrame.columns))
> pyspark.sql.dataframe.DataFrame.shape = sparkShape
> print(d3.shape())
> (653610, 267)
>
> d3.write.json("d3.json")
> d3 = spark.read.json("d3.json/*.json")
> print(d3.shape())
> (653610, 186)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-23 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480737#comment-17480737
 ] 

Bjørn Jørgensen commented on SPARK-37981:
-

Please note that the PR I have opened is for pandas on Spark (the pandas API on 
Spark, formerly Koalas). 

Pandas on Spark has the to_json function; that is the one this PR is for. 








[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480733#comment-17480733
 ] 

Apache Spark commented on SPARK-37981:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35296







[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-23 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480734#comment-17480734
 ] 

Apache Spark commented on SPARK-37981:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/35296







[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480293#comment-17480293
 ] 

Bjørn Jørgensen commented on SPARK-37981:
-

 [^json_null.json] 



{code:java}
from pyspark import pandas as ps
import re
import numpy as np
import os
import pandas as pd

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, col, trim, expr
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"

def get_spark_session(app_name: str, conf: SparkConf):
    conf.setMaster('local[*]')
    conf \
      .set('spark.driver.memory', '64g') \
      .set("fs.s3a.access.key", "minio") \
      .set("fs.s3a.secret.key", "") \
      .set("fs.s3a.endpoint", "http://192.168.1.127:9000") \
      .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
      .set("spark.hadoop.fs.s3a.path.style.access", "true") \
      .set("spark.sql.repl.eagerEval.enabled", "True") \
      .set("spark.sql.adaptive.enabled", "True") \
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
      .set("spark.sql.repl.eagerEval.maxNumRows", "1") \
      .set("sc.setLogLevel", "error")
    return SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()

spark = get_spark_session("Falk", SparkConf())

df = spark.read.option("multiline", "true").json("json_null.json")

# Monkey-patch a pandas-style .shape onto Spark DataFrames
import pyspark
def sparkShape(dataFrame):
    return (dataFrame.count(), len(dataFrame.columns))
pyspark.sql.dataframe.DataFrame.shape = sparkShape
print(df.shape())

(1, 4)

# Write to JSON and read back: the all-null column is gone
df.write.json("df.json")
df = spark.read.json("df.json/*.json")
print(df.shape())

(1, 3)
{code}








[jira] [Commented] (SPARK-37981) Deletes columns with all Null as default.

2022-01-21 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480291#comment-17480291
 ] 

Maciej Szymkiewicz commented on SPARK-37981:


This doesn't seem valid.

{{dropFieldIfAllNull}} is a reader option. For writes, we use 
{{ignoreNullFields}}.

So your code should be 

{code}

d3.write.option("ignoreNullFields", "false").json("d3.json")

{code}
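
The same behaviour can also be set session-wide through the configuration key mentioned earlier in the thread, so individual writers do not need the option (a sketch):

{code:java}
# Session-wide equivalent of the per-writer option
spark.conf.set("spark.sql.jsonGenerator.ignoreNullFields", "false")
d3.write.json("d3.json")
{code}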



