[jira] [Updated] (SPARK-30006) printSchema indeterministic output
[ https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hasil Sharma updated SPARK-30006:
---------------------------------
    Description:

printSchema doesn't give consistent output in the following example.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = spark.sparkContext.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df1 = spark.createDataFrame(people)
df1.printSchema()  # printSchema() prints directly and returns None
df2 = df1.select("name", "age")
df2.printSchema()
{code}

The first call prints

{noformat}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
{noformat}

and the second prints

{noformat}
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
{noformat}

Expectation: the output should be the same because the column names are the same.

    was: the same description, with the RDD variable named {{people_1}} instead of {{people}}.
> printSchema indeterministic output
> ----------------------------------
>
>                 Key: SPARK-30006
>                 URL: https://issues.apache.org/jira/browse/SPARK-30006
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Hasil Sharma
>            Priority: Minor

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
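For context (not part of the original report): in PySpark before Spark 3.0, a `Row` built from keyword arguments sorted its field names alphabetically, which would explain why `df1`'s schema lists `age` before `name` while `select("name", "age")` preserves the requested order. A minimal pure-Python sketch of that sorting behavior, no Spark required:

```python
# Sketch of the legacy PySpark behavior: before Spark 3.0,
# Row(**kwargs) sorted its field names alphabetically, so
# Row(name=..., age=...) stored fields in the order ("age", "name").
def legacy_row_field_order(**kwargs):
    # Hypothetical helper mirroring the sort applied to keyword fields;
    # not PySpark's actual implementation.
    return tuple(sorted(kwargs))

print(legacy_row_field_order(name="Ankit", age=25))  # ('age', 'name')
```

Under that assumption the reported output is deterministic but surprising: the first schema reflects the sorted `Row` fields, the second reflects the explicit `select` order.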
[jira] [Created] (SPARK-30006) printSchema indeterministic output
Hasil Sharma created SPARK-30006:

             Summary: printSchema indeterministic output
                 Key: SPARK-30006
                 URL: https://issues.apache.org/jira/browse/SPARK-30006
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.4
            Reporter: Hasil Sharma

printSchema doesn't give consistent output in the following example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df1 = spark.createDataFrame(people_1)
df1.printSchema()  # printSchema() prints directly and returns None
df2 = df1.select("name", "age")
df2.printSchema()
```

first print outputs

```
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

second print outputs

```
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
```

Expectation: the output should be the same because the column names are the same.
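One way to pin the column order (a sketch, not the fix Spark itself applied): declare the field order explicitly instead of relying on keyword arguments. PySpark's `Row("name", "age")` factory behaves much like a stdlib `namedtuple`, preserving the declared order; the idea can be illustrated without Spark:

```python
from collections import namedtuple

# Stdlib sketch of the explicit-field-order idea; PySpark's
# Row("name", "age") row-class factory preserves declared order
# in the same way, instead of sorting keyword arguments.
Person = namedtuple("Person", ["name", "age"])
people = [Person("Ankit", 25), Person("Jalfaizy", 22)]
print(people[0]._fields)  # ('name', 'age')
```

Building the DataFrame from rows constructed this way (or from an explicit `StructType` schema) keeps `name` before `age` in both schemas.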
[jira] [Commented] (SPARK-12436) If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
[ https://issues.apache.org/jira/browse/SPARK-12436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410946#comment-15410946 ]

Hasil Sharma commented on SPARK-12436:
--------------------------------------

Is this issue solved? If not, I would like to contribute.

> If all values of a JSON field is null, JSON's inferSchema should return NullType instead of StringType
> ------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12436
>                 URL: https://issues.apache.org/jira/browse/SPARK-12436
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Reynold Xin
>              Labels: starter
>
> Right now, JSON's inferSchema returns {{StringType}} for a field that always has null values, or {{ArrayType(StringType)}} for a field that always has empty array values. Although this behavior makes writing JSON data to other data sources easy (i.e. when writing data, we do not need to remove those {{NullType}} or {{ArrayType(NullType)}} columns), it makes it hard for downstream applications to reason about the actual schema of the data, and thus makes schema merging hard. We should allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}. Also, we need to make sure that when we write data out, we remove those {{NullType}} or {{ArrayType(NullType)}} columns first.
> Besides {{NullType}} and {{ArrayType(NullType)}}, we may need to do the same thing for empty {{StructType}}s (i.e. a {{StructType}} having 0 fields).
> To finish this work, we need to complete the following sub-tasks:
> * Allow JSON's inferSchema to return {{NullType}} and {{ArrayType(NullType)}}.
> * Determine whether we need to add the operation of removing {{NullType}} and {{ArrayType(NullType)}} columns from the data that will be written out for all data sources (i.e. data sources based on our data source API and Hive tables), or whether we should just add this operation for certain data sources (e.g. Parquet). For example, we may not need this operation for Hive because Hive has VoidObjectInspector.
> * Implement the change and get it merged to Spark master.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
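To make the proposed rule concrete, here is a toy sketch (plain Python, not Spark's actual inferSchema code, and with a deliberately tiny type lattice): a field whose observed values are all null infers to NullType, while a field with any non-null value infers from the non-nulls.

```python
import json

# Toy illustration of the proposal: infer a JSON field as NullType only
# when every observed value is null; otherwise infer from the non-null
# values. Only strings and integers are handled in this sketch.
def infer_field_type(values):
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "NullType"
    return "StringType" if isinstance(non_null[0], str) else "LongType"

records = [json.loads(line) for line in ['{"a": null}', '{"a": null}']]
print(infer_field_type([r["a"] for r in records]))  # NullType
```

Under the current behavior described in the issue, the all-null case above would come back as StringType instead, which is what makes downstream schema merging hard to reason about.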