Martin Mauch created SPARK-23677:
------------------------------------

             Summary: Selecting columns from joined DataFrames with the same 
origin yields wrong results
                 Key: SPARK-23677
                 URL: https://issues.apache.org/jira/browse/SPARK-23677
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Martin Mauch


When two DataFrames derived from the same origin DataFrame are joined and 
columns are then selected from the join result through the DataFrame 
references (e.g. small("num")), Spark can't distinguish between the columns 
and gives a wrong (or at least very surprising) result. One can work around 
this by referring to the aliased columns with expr.

Here is a minimal example:

 
{code:java}
import org.apache.spark.sql.functions.expr
import spark.implicits._

val edf = Seq(1, 2, 3, 4, 5).toDF("num")
val big = edf.where(edf("num") > 2).alias("big")
val small = edf.where(edf("num") < 4).alias("small")

// Selecting through the DataFrame references: both columns show the same values.
small.join(big, expr("big.num == (small.num + 1)"))
  .select(small("num"), big("num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  2|
// |  3|  3|
// +---+---+

// Workaround: selecting through the aliases with expr gives the expected result.
small.join(big, expr("big.num == (small.num + 1)"))
  .select(expr("small.num"), expr("big.num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  3|
// |  3|  4|
// +---+---+
{code}
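
The expr workaround can probably also be written with col and alias-qualified names, since col("small.num") refers to the column by name and alias in the same way as expr("small.num"). The following is only a sketch under that assumption, not a confirmed fix: the object name, the method name smallBigPairs, the local[1] master and the app name are illustrative and not part of the original report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical wrapper object; only the join/select logic mirrors the report.
object QualifiedColumnSketch {

  // Returns the (small.num, big.num) pairs of the join, sorted for stable comparison.
  def smallBigPairs(spark: SparkSession): Seq[(Int, Int)] = {
    import spark.implicits._

    val edf = Seq(1, 2, 3, 4, 5).toDF("num")
    val big = edf.where(edf("num") > 2).alias("big")
    val small = edf.where(edf("num") < 4).alias("small")

    // Refer to both sides through their aliases instead of small("num") / big("num").
    val joined = small.join(big, col("big.num") === col("small.num") + 1)

    joined
      .select(col("small.num"), col("big.num"))
      .collect()
      .map(row => (row.getInt(0), row.getInt(1)))
      .toSeq
      .sorted
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session is enough to try this out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-23677-sketch")
      .getOrCreate()
    try {
      smallBigPairs(spark).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

In spark-shell, where a session named spark already exists, calling QualifiedColumnSketch.smallBigPairs(spark) should be sufficient.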
 



