[ https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hasil Sharma updated SPARK-30006:
---------------------------------
    Description:
printSchema doesn't give a consistent output in the following example.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df1 = spark.createDataFrame(people_1)
print(df1.printSchema())
df2 = df1.select("name", "age")
print(df2.printSchema())
{code}

first print outputs
{noformat}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
{noformat}

second print outputs
{noformat}
root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
{noformat}

Expectation: The output should be the same because the column names are the same.

  was:
printSchema doesn't give a consistent output in the following example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import Row
spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
rdd = spark.sparkContext.parallelize(l)
people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
df1 = spark.createDataFrame(people_1)
print(df1.printSchema())
df2 = df1.select("name", "age")
print(df2.printSchema())
```

first print outputs
```root
 |-- age: long (nullable = true)|
 |-- name: string (nullable = true)|```

second print outputs
```root
 |-- name: string (nullable = true)|
 |-- age: long (nullable = true)|```

Expectation: The output should be the same because the column names are the same.

> printSchema indeterministic output
> ----------------------------------
>
>                 Key: SPARK-30006
>                 URL: https://issues.apache.org/jira/browse/SPARK-30006
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Hasil Sharma
>            Priority: Minor
>
> printSchema doesn't give a consistent output in following example.
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
> spark = SparkSession.builder.appName("new-session").getOrCreate()
> l = [('Ankit',25),('Jalfaizy',22),('saurabh',20),('Bala',26)]
> rdd = spark.sparkContext.parallelize(l)
> people_1 = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
> df1 = spark.createDataFrame(people_1)
> print(df1.printSchema())
> df2 = df1.select("name", "age")
> print(df2.printSchema())
> {code}
>
> first print outputs
> {noformat}
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
> {noformat}
>
> second print outputs
> {noformat}
> root
>  |-- name: string (nullable = true)
>  |-- age: long (nullable = true)
> {noformat}
> Expectation: The output should be same because the column names are same.
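
A possible explanation for the two orderings, offered as a sketch rather than a confirmed diagnosis: as far as I understand PySpark 2.4.x, a Row built from keyword arguments stores its fields sorted by name, while select() keeps the columns in the order they are requested. The plain-Python model below imitates only that ordering behaviour so it runs without a Spark installation. The helpers row_field_order and select_order are hypothetical stand-ins for illustration, not PySpark APIs.

```python
def row_field_order(**kwargs):
    """Hypothetical stand-in for Row(**kwargs) in PySpark 2.4.x.

    Assumption: fields given as keyword arguments are stored sorted by name.
    """
    return sorted(kwargs)


def select_order(*columns):
    """Hypothetical stand-in for DataFrame.select(): keeps the requested order."""
    return list(columns)


# Mirrors Row(name=x[0], age=int(x[1])) from the report.
df1_columns = row_field_order(name="Ankit", age=25)

# Mirrors df1.select("name", "age") from the report.
df2_columns = select_order("name", "age")

print(df1_columns)
print(df2_columns)

# The two schemas hold the same set of columns, only in a different order.
print(sorted(df1_columns) == sorted(df2_columns))
```

Under that assumption the two printSchema results describe the same columns in a different order, so the difference would be deterministic rather than random. Passing an explicit schema to createDataFrame, or calling select() with the desired order as in df2, should pin the column order.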