[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465
 ] 

Carlos Bribiescas edited comment on SPARK-22335 at 10/25/17 12:21 PM:
----------------------------------------------------------------------

[~viirya] That is definitely a workaround.  And effectively Union will 
sometimes work correctly for typed things.  Unlike in RDD, which is also typed 
but works consistently.  I guess this is just a point where two different 
concepts are at odds.  I think this is what [~dongjoon] was getting at.  
(maybe?)  

Suppose you're working with a traditional typed data structure.  It has an 
operation that works with case classes only sometimes, depending on how they 
were constructed.  For example, if you created the case class via reflection, 
then flatMap won't work correctly, for some reason that can be explained and 
documented.

I also understand that SQL does unions by column order, and that's why that's 
how union traditionally works.

So we have two things that aren't Spark, but that Spark is inspired by, and 
they do the same thing in slightly different ways.  Updating the documentation 
is definitely a good step toward making the API more usable, but ultimately I 
guess the decision in this case is to go with a more SQL-like approach rather 
than real typing.
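To make the column-order hazard concrete outside of Spark, here is a 
plain-Scala sketch.  The {{Frame}} class and its methods are hypothetical 
stand-ins, not Spark API: a positional union only stays type-correct if the 
columns are re-aligned by name first, which is what the 
{{select("a","b").as[AB]}} workaround in the issue does.

{code:java}
// Plain-Scala sketch, not Spark API: Frame is a hypothetical stand-in for a
// DataFrame, holding named columns plus positional rows.
case class AB(a: String, b: String)

case class Frame(columns: Seq[String], rows: Seq[Seq[String]]) {
  // Positional union, ignoring column names -- like Dataset.union in 2.2.0
  def union(other: Frame): Frame = Frame(columns, rows ++ other.rows)
  // Re-order columns by name -- like .select("a", "b")
  def select(names: String*): Frame = {
    val idx = names.map(columns.indexOf(_))
    Frame(names.toSeq, rows.map(row => idx.map(row)))
  }
  // Decode each row positionally into AB -- like .as[AB]
  def asAB: Seq[AB] = rows.map(r => AB(r(0), r(1)))
}

val abDf = Frame(Seq("a", "b"), Seq(Seq("aThing", "bThing")))
val baDf = Frame(Seq("b", "a"), Seq(Seq("bThing", "aThing")))

// Positional union scrambles the second row: AB("bThing", "aThing")
val wrong = abDf.union(baDf).asAB
// Re-aligning columns by name first keeps the typed view correct:
// the second row decodes as AB("aThing", "bThing")
val right = abDf.union(baDf.select("a", "b")).asAB
{code}

For what it's worth, if I remember right Spark 2.3.0 adds Dataset.unionByName 
(SPARK-21043), which resolves columns by name and would make the manual 
re-alignment unnecessary.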


was (Author: cbribiescas):
[~viirya] That is definitely a workaround.  And effectively Union will 
sometimes work correctly for typed things.  Unlike in RDD, which is also typed 
and also works consistently.

I guess this is just a point where two different concepts are at odds.  I think 
this was what [~dongjoon] was getting at.  The traditional notion of a DS/DF is 
at odds with the implied ability to work with a typed data structure. 

 Consider working with any collection.  Suppose the api for that data 
structure had operations that worked with case classes only sometimes, 
depending on how they were constructed.  For example, if you created the case 
class via reflection, then flatMap won't work correctly, for some reason that 
can be explained and documented.

Updating documentation is definitely a good step toward making the api more 
usable; it's just that my feeling is that a typed object is typed no matter 
what.  I also understand that a Dataset's implementation is strongly tied to 
row order as well.

> Union for DataSet uses column order instead of types for union
> --------------------------------------------------------------
>
>                 Key: SPARK-22335
>                 URL: https://issues.apache.org/jira/browse/SPARK-22335
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use the order. Here is a reproducible case on 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show() // Wrong result: a Dataset[AB] should be mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // Wrong result, for the same reason
>   abDs.union(baDs).rdd.take(2) // This also gives the wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // Same workaround as in the linked issue, slightly modified; seems wrong since it's supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show() // This, however, gives the correct result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives the correct result
> {code}
> So it's inconsistent, and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting back a Dataset of 
> type `AB`.  More importantly, I think the issue is bigger when you consider 
> that it happens even if you read from parquet (as you would expect).  And 
> that it's inconsistent when going to/from rdd.
> I imagine it's just lazily converting to the typed DS instead of doing so up 
> front.  So either that typing could be prioritized to happen before the 
> union, or the union of DFs could be done with column order taken into 
> account.  Again, this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
