[jira] [Resolved] (SPARK-22335) Union for DataSet uses column order instead of types for union

Hyukjin Kwon (JIRA) Sat, 28 Oct 2017 05:48:55 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hyukjin Kwon resolved SPARK-22335.
----------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19570
[https://github.com/apache/spark/pull/19570]

> Union for DataSet uses column order instead of types for union
> --------------------------------------------------------------
>
>                 Key: SPARK-22335
>                 URL: https://issues.apache.org/jira/browse/SPARK-22335
>             Project: Spark
>          Issue Type: Documentation
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Bribiescas
>             Fix For: 2.3.0
>
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Heres is a reproducible case. 2.2.0
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>    abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So its inconsistent and a bug IMO.  And I'm not sure that the suggested work 
> around is really fair, since I'm supposed to be getting of type `AB`.  More 
> importantly I think the issue is bigger when you consider that it happens 
> even if you read from parquet (as you would expect).  And that its 
> inconsistent when going to/from rdd.
> I imagine its just lazily converting to typed DS instead of initially.  So 
> either that typing could be prioritized to happen before the union or 
> unioning of DF could be done with column order taken into account.  Again, 
> this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Resolved] (SPARK-22335) Union for DataSet uses column order instead of types for union

Reply via email to