[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

Carlos Bribiescas (JIRA) Mon, 23 Oct 2017 08:48:52 -0700

     [ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Carlos Bribiescas updated SPARK-22335:
--------------------------------------
    Description: 
I see that union matches columns by position for a DataFrame. This seems 
"fine" to me, since DataFrames aren't typed.
However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result: even if you access the members by name, the values 
come back according to column order. Here is a reproducible case on 2.2.0.

{code:java}
  // Assumes a spark-shell session, where `sc` and `spark` are predefined.
  import spark.implicits._

  case class AB(a: String, b: String)

  val abDf = sc.parallelize(List(("aThing", "bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing", "aThing"))).toDF("b", "a")

  // As the linked ticket states, this is "Not a problem".
  abDf.union(baDf).show()

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  // Wrong result: a Dataset[AB] should be mapped by type, not by column order.
  abDs.union(baDs).show()

  // Wrong result, for the same reason.
  abDs.union(baDs).map(_.a).show()

  // This also gives the wrong result.
  abDs.union(baDs).rdd.take(2)

  // However, this gives the correct result, even though the columns were
  // out of order.
  baDs.map(_.a).show()
  // This is correct too.
  abDs.map(_.a).show()

  // Same workaround as for the linked issue, slightly modified. It seems
  // wrong, though, since the Dataset is supposed to be strongly typed.
  baDs.select("a", "b").as[AB].union(abDs).show()

  // This, however, gives the correct result, which is logically inconsistent.
  baDs.rdd.toDF().as[AB].union(abDs).show()
{code}
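
To make the difference concrete without a Spark cluster, here is a toy model 
in plain Scala. It is only a sketch and not how Spark is implemented: `Frame`, 
`UnionModel` and `Example` are hypothetical names made up for this 
illustration. A frame is a list of column names plus rows of values, and 
`decode` reads the `a` and `b` fields by name, roughly what one would expect 
from `as[AB]`.

```scala
// Toy model, NOT Spark code: column names plus rows of string values.
final case class Frame(columns: Seq[String], rows: Seq[Seq[String]])

final case class AB(a: String, b: String)

object UnionModel {
  // Positional union: keep the left schema and append the right rows unchanged.
  def unionByPosition(left: Frame, right: Frame): Frame =
    Frame(left.columns, left.rows ++ right.rows)

  // Name-based union: reorder every right row into the left column order first.
  def unionByName(left: Frame, right: Frame): Frame = {
    val indexOf = right.columns.zipWithIndex.toMap
    val reordered = right.rows.map(row => left.columns.map(name => row(indexOf(name))))
    Frame(left.columns, left.rows ++ reordered)
  }

  // Build AB values by looking the fields up by column name.
  def decode(frame: Frame): Seq[AB] = {
    val a = frame.columns.indexOf("a")
    val b = frame.columns.indexOf("b")
    frame.rows.map(row => AB(row(a), row(b)))
  }
}

object Example {
  // Same data as the reproducible case above.
  val ab = Frame(Seq("a", "b"), Seq(Seq("aThing", "bThing")))
  val ba = Frame(Seq("b", "a"), Seq(Seq("bThing", "aThing")))

  // Second element is AB("bThing", "aThing"): the swapped row from the report.
  val positional = UnionModel.decode(UnionModel.unionByPosition(ab, ba))

  // Both elements are AB("aThing", "bThing").
  val byName = UnionModel.decode(UnionModel.unionByName(ab, ba))
}
```

In this model the positional union reproduces the swapped row, while the 
name-based union gives the result a typed `Dataset[AB]` user would expect.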

So it's inconsistent and, IMO, a bug.  I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting objects of type 
`AB`.  More importantly, I think the issue is bigger when you consider that it 
happens even if you read from parquet (as you would expect), and that it's 
inconsistent when going to/from an RDD.

I imagine it's just converting to the typed Dataset lazily instead of up 
front.  So either that typing could be prioritized to happen before the union, 
or the union could match columns by name instead of by position.  Again, this 
is speculation.
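
Until something like that exists, the `select(...).as[AB]` workaround can be 
wrapped in a small helper so callers don't have to spell out the column list. 
This is only a sketch of the second option, not a proposed patch: 
`NameAlignedUnion` and `unionAligned` are hypothetical names, and it assumes 
both sides have the same set of plain (non-nested, dot-free) column names. 
Releases from 2.3.0 onward also provide `Dataset.unionByName`, which covers 
the same need where an upgrade is possible.

```scala
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

object NameAlignedUnion {
  // Hypothetical helper: reorder the right side's columns to match the left
  // side's column order by name, then apply the ordinary positional union.
  // Assumes both Datasets expose the same column names.
  def unionAligned[T: Encoder](left: Dataset[T], right: Dataset[T]): Dataset[T] = {
    val aligned = right.select(left.columns.map(col): _*).as[T]
    left.union(aligned)
  }
}
```

With the Datasets from the reproducible case, 
`NameAlignedUnion.unionAligned(abDs, baDs)` should behave like 
`abDs.union(baDs.select("a", "b").as[AB])`.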


> Union for DataSet uses column order instead of types for union
> --------------------------------------------------------------
>
>                 Key: SPARK-22335
>                 URL: https://issues.apache.org/jira/browse/SPARK-22335
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Carlos Bribiescas
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
