[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carlos Bribiescas updated SPARK-22335:
--------------------------------------
    Description:
I see that union uses column order for a DF. This to me is "fine" since they aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually gives the wrong result. If you try to access the members by name, it will use the order. Here is a reproducible case on 2.2.0:

{code:java}
case class AB(a : String, b : String)

val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

abDf.union(baDf).show() // as the linked ticket states, it's "Not a problem"

val abDs = abDf.as[AB]
val baDs = baDf.as[AB]

abDs.union(baDs).show() // This gives the wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

abDs.union(baDs).map(_.a).show() // This gives the wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
abDs.union(baDs).rdd.take(2) // This also gives the wrong result
baDs.map(_.a).show() // However, this gives the correct result, even though the columns were out of order.
abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug IMO.

I imagine it's just converting to a typed DS lazily instead of initially. So either that conversion could be prioritized, or unioning of DFs could be done with column order taken into account. Again, this is speculation.

  was:
I see that union uses column order for a DF. This to me is "fine" since they aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually gives the wrong result. If you try to access the members by name, it will use the order. Here is a reproducible case on 2.2.0:

{code:java}
case class AB(a : String, b : String)

val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

abDf.union(baDf).show() // as this ticket states, it's "Not a problem"

val abDs = abDf.as[AB]
val baDs = baDf.as[AB]

abDs.union(baDs).show()

abDs.union(baDs).map(_.a).show() // This gives the wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
abDs.union(baDs).rdd.take(2) // This also gives the wrong result
baDs.map(_.a).show() // However, this gives the correct result, even though the columns were out of order.
abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug IMO.

I imagine it's just converting to a typed DS lazily instead of initially. So either that conversion could be prioritized, or unioning of DFs could be done with column order taken into account. Again, this is speculation.

> Union for DataSet uses column order instead of types for union
> --------------------------------------------------------------
>
>                 Key: SPARK-22335
>                 URL: https://issues.apache.org/jira/browse/SPARK-22335
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Bribiescas
>            Priority: Minor
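One possible workaround for the positional union described in this issue is to reorder the right-hand side by column name before calling union. The following is only a minimal sketch under stated assumptions, not a fix in Spark itself: `UnionAlign` and `unionAligned` are hypothetical names introduced for illustration, and the local `SparkSession` setup is assumed so that the example runs outside spark-shell.

```scala
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

case class AB(a: String, b: String)

object UnionAlign {
  // Hypothetical helper (not part of Spark): select the columns of `right`
  // by name in the column order of `left`, then apply the positional union.
  def unionAligned[T: Encoder](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
    val reordered = right.select(left.columns.map(c => right.col(c)): _*).as[T]
    left.union(reordered)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch self-contained.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-22335-sketch")
      .getOrCreate()
    import spark.implicits._

    val abDs = Seq(("aThing", "bThing")).toDF("a", "b").as[AB]
    val baDs = Seq(("bThing", "aThing")).toDF("b", "a").as[AB]

    // With the columns aligned by name first, field `a` holds "aThing" in both rows.
    unionAligned(abDs, baDs).map(_.a).show()

    spark.stop()
  }
}
```

Because the select happens by name before the union, the positional matching no longer mixes up `a` and `b`. Spark 2.3.0 and later also provide `Dataset.unionByName`, which resolves columns by name directly; it is not available in the 2.2.0 release this issue was filed against.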