Ravindra Bajpai created SPARK-20008:
---------------------------------------

             Summary: hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
                 Key: SPARK-20008
                 URL: https://issues.apache.org/jira/browse/SPARK-20008
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.2
            Reporter: Ravindra Bajpai


hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1 instead of the expected 0.

This was not the case with Spark 1.5.2. From a usage point of view this is an API behavior change, so I consider it a bug. It may be a boundary case; I am not sure.

Workaround: for now I check that the counts are != 0 before this operation. That is not good for performance, hence this JIRA to track it.

As Young Zhang explained in reply to my mail:

Starting from Spark 2, these kinds of operations are implemented as a left anti join, instead of using RDD operations directly.

The same issue also occurs with sqlContext.

scala> spark.version
res25: String = 2.0.2

spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)

== Physical Plan ==
*HashAggregate(keys=[], functions=[], output=[])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[], output=[])
      +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
         :- Scan ExistingRDD[]
         +- BroadcastExchange IdentityBroadcastMode
            +- Scan ExistingRDD[]

This arguably means a bug. But my guess is that it is like the logic of comparing NULL = NULL (should it return true or false?), which causes this kind of confusion.
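
The workaround described above (checking for emptiness before calling except) could be sketched roughly as follows. This is only an illustration, not a fix from the Spark project: the names ExceptGuardSketch and exceptGuarded are hypothetical, the local master setting is an assumption for running it standalone, and head(1).isEmpty is swapped in for the full count() mentioned in the report because it only needs to fetch a single row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of Spark.
object ExceptGuardSketch {

  // Returns `left` unchanged when it has no rows, otherwise left.except(right).
  // head(1) fetches at most one row, so it avoids counting the whole input.
  def exceptGuarded(left: DataFrame, right: DataFrame): DataFrame =
    if (left.head(1).isEmpty) left else left.except(right)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("SPARK-20008-sketch")
      .getOrCreate()

    val empty = spark.emptyDataFrame

    // The unguarded call; the report above says this yields 1 on Spark 2.0.2.
    println(empty.except(empty).count())

    // The guarded call never reaches except() when the left side is empty.
    println(exceptGuarded(empty, empty).count())

    spark.stop()
  }
}
```

The guard only covers the case where the left side has no rows; whether other inputs are affected is not established by the report.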