Martin Mauch created SPARK-23677:
------------------------------------

             Summary: Selecting columns from joined DataFrames with the same origin yields wrong results
                 Key: SPARK-23677
                 URL: https://issues.apache.org/jira/browse/SPARK-23677
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Martin Mauch

When two DataFrames derived from the same origin DataFrame are joined and columns are then selected from the join result through the DataFrame references (small("num"), big("num")), Spark can't distinguish between the columns and gives a wrong (or at least very surprising) result: both selected columns contain the same values. One can work around this by referring to the columns through their aliases using expr.

Here is a minimal example:

{code:java}
import org.apache.spark.sql.functions.expr
import spark.implicits._

val edf = Seq((1), (2), (3), (4), (5)).toDF("num")
val big = edf.where(edf("num") > 2).alias("big")
val small = edf.where(edf("num") < 4).alias("small")

// Selecting via the DataFrame references: both columns show the same values
small.join(big, expr("big.num == (small.num + 1)")).select(small("num"), big("num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  2|
// |  3|  3|
// +---+---+

// Selecting via the aliases with expr: the expected pairs come back
small.join(big, expr("big.num == (small.num + 1)")).select(expr("small.num"), expr("big.num")).show()
// +---+---+
// |num|num|
// +---+---+
// |  2|  3|
// |  3|  4|
// +---+---+
{code}
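
A possible variant of the workaround, as a sketch only: the alias-qualified names can also be passed to col instead of expr, and the two output columns can be renamed so they are no longer both called num. This assumes a spark-shell or notebook session; the output names small_num and big_num and the app name are made up for illustration and are not part of the report.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell a session already exists and getOrCreate() simply returns it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("spark-23677-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val edf = Seq(1, 2, 3, 4, 5).toDF("num")
val big = edf.where(col("num") > 2).alias("big")
val small = edf.where(col("num") < 4).alias("small")

// Refer to columns only through the aliases, never through small("num") / big("num").
val joined = small.join(big, col("big.num") === col("small.num") + 1)

// Rename the outputs (illustrative names) so later stages cannot confuse them.
val result = joined.select(
  col("small.num").as("small_num"),
  col("big.num").as("big_num")
)

result.show()
```

With the data from the example above, this should give the same pairs as the expr-based select.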