[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-20761.
----------------------------------
    Resolution: Duplicate

I am pretty sure that it is a duplicate of SPARK-15918. Please reopen this if I misunderstood.

> Union uses column order rather than schema
> ------------------------------------------
>
>                 Key: SPARK-20761
>                 URL: https://issues.apache.org/jira/browse/SPARK-20761
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Nakul Jeirath
>            Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> +---+--------+--------+----------+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")).show
> +---+--------+--------+----------+
> | id|flag_one|flag_two|flag_three|
> +---+--------+--------+----------+
> |  1|    true|   false|     false|
> |  2|   false|    true|     false|
> |  3|   false|   false|      true|
> |  1|   false|   false|      true|
> |  2|    true|   false|     false|
> |  3|   false|    true|     false|
> +---+--------+--------+----------+
> {noformat}
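
The output in the report is consistent with `union` matching columns by position, as SQL's UNION ALL does, rather than by name. A minimal sketch of one possible workaround on the affected version follows: reorder the right-hand side to the left-hand side's column names before the union. The object name `UnionByNameSketch`, the helper `unionAlignedByName`, and the local-mode session settings are assumptions made for illustration and are not part of the original report or of Spark's API. The helper assumes both sides have the same set of simple column names (no dots or duplicates).

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}

object UnionByNameSketch {
  // Hypothetical helper: select the right side's columns in the left side's
  // column order, then apply the ordinary positional union.
  def unionAlignedByName(left: DataFrame, right: DataFrame): DataFrame =
    left.union(right.select(left.columns.map(col): _*))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("union-by-name-sketch")
      .getOrCreate()

    // Same schema and rows as in the report.
    val schema = StructType(Seq(
      StructField("id", StringType, false),
      StructField("flag_one", BooleanType, false),
      StructField("flag_two", BooleanType, false),
      StructField("flag_three", BooleanType, false)
    ))
    val rowRdd = spark.sparkContext.parallelize(Seq(
      Row("1", true, false, false),
      Row("2", false, true, false),
      Row("3", false, false, true)
    ))
    val flags = spark.createDataFrame(rowRdd, schema)
    val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

    // Columns deliberately out of order with respect to the emptyData schema.
    val reordered = flags.select("id", "flag_two", "flag_three", "flag_one")

    // Columns are realigned by name before the positional union.
    unionAlignedByName(emptyData, reordered).orderBy("id").show()

    spark.stop()
  }
}
```

On Spark 2.3.0 and later, `Dataset.unionByName` offers name-based resolution directly, so a helper like this should only be needed on older versions such as the 2.1.1 release named in the report.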