[ https://issues.apache.org/jira/browse/SPARK-30006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16981407#comment-16981407 ]
Kun edited comment on SPARK-30006 at 11/25/19 9:26 AM:
-------------------------------------------------------

In the constructor of *pyspark.sql.Row*, keyword fields are sorted by name, so the field order changes. Is the sorting by design?

{code:python}
class Row(tuple):

    def __new__(self, *args, **kwargs):
        if args and kwargs:
            raise ValueError("Can not use both args "
                             "and kwargs to create Row")
        if kwargs:
            # create row objects
            names = sorted(kwargs.keys())
            row = tuple.__new__(self, [kwargs[n] for n in names])
            row.__fields__ = names
            row.__from_dict__ = True
            return row
        else:
            # create row class or objects
            return tuple.__new__(self, args)
{code}


> printSchema indeterministic output
> ----------------------------------
>
>                 Key: SPARK-30006
>                 URL: https://issues.apache.org/jira/browse/SPARK-30006
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Hasil Sharma
>            Priority: Minor
>
> printSchema doesn't give consistent output in the following example.
>
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import Row
>
> spark = SparkSession.builder.appName("new-session").getOrCreate()
> l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
> rdd = spark.sparkContext.parallelize(l)
> people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
> df1 = spark.createDataFrame(people)
> print(df1.printSchema())
> df2 = df1.select("name", "age")
> print(df2.printSchema())
> {code}
>
> The first print outputs
> {noformat}
> root
>  |-- age: long (nullable = true)
>  |-- name: string (nullable = true)
> {noformat}
>
> The second print outputs
> {noformat}
> root
>  |-- name: string (nullable = true)
>  |-- age: long (nullable = true)
> {noformat}
>
> Expectation: the output should be the same because the column names are the same.
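A minimal sketch of the sorting described in the comment above, assuming the 2.4.x {{Row}} constructor quoted there; the {{Person}} row class is only an illustrative name, not part of the reporter's code. Declaring the fields once and filling them positionally keeps the declared order:

{code:python}
from pyspark.sql import Row

# Keyword construction sorts the fields alphabetically (2.4.x behaviour),
# so "age" ends up before "name".
r = Row(name='Ankit', age=25)
print(r.__fields__)         # ['age', 'name']
print(r)                    # Row(age=25, name='Ankit')

# Workaround sketch: declare a Row "class" with the desired field order
# and fill it positionally; the declared order is preserved.
Person = Row("name", "age")
print(Person('Ankit', 25))  # Row(name='Ankit', age=25)
{code}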
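Alternatively, the reproducer in the quoted issue can sidestep {{Row}} kwargs entirely by passing an explicit schema to {{createDataFrame}}; a sketch under that assumption, with column types chosen to match the reported output:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("new-session").getOrCreate()
l = [('Ankit', 25), ('Jalfaizy', 22), ('saurabh', 20), ('Bala', 26)]
rdd = spark.sparkContext.parallelize(l)

# An explicit schema fixes the column order, so printSchema is deterministic.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
])
df1 = spark.createDataFrame(rdd, schema)
df1.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: long (nullable = true)
{code}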