Starting from Spark 2, this kind of operation is implemented as a left anti join, instead of using RDD operations directly.
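The physical plan below bears this out: EXCEPT is planned as a left anti join (keep left rows with no match on the right) followed by a de-duplicating aggregate. A minimal plain-Scala sketch of those two steps over ordinary sequences (an illustration of the semantics, not Spark's implementation):

```scala
// Hypothetical sketch: left anti join + distinct over plain Scala Seqs,
// mirroring how EXCEPT is planned (BroadcastNestedLoopJoin LeftAnti + HashAggregate).
object ExceptSketch {
  // Left anti join: keep rows from `left` that have no match in `right`.
  def leftAnti[A](left: Seq[A], right: Seq[A]): Seq[A] = {
    val rightSet = right.toSet
    left.filterNot(rightSet.contains)
  }

  // EXCEPT additionally de-duplicates the surviving rows.
  def except[A](left: Seq[A], right: Seq[A]): Seq[A] =
    leftAnti(left, right).distinct
}
```

With this model, `except` of two empty inputs is empty, which is why a count of 1 on empty DataFrames looks surprising.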
The same issue also occurs on sqlContext:

scala> spark.version
res25: String = 2.0.2

scala> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
== Physical Plan ==
*HashAggregate(keys=[], functions=[], output=[])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[], output=[])
      +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
         :- Scan ExistingRDD[]
         +- BroadcastExchange IdentityBroadcastMode
            +- Scan ExistingRDD[]

This arguably indicates a bug. My guess is that it comes down to the logic of comparing NULL = NULL (should it return true or false?), causing this kind of confusion.

Yong

________________________________
From: Ravindra <ravindra.baj...@gmail.com>
Sent: Friday, March 17, 2017 4:30 AM
To: user@spark.apache.org
Subject: Spark 2.0.2 - hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Can someone please explain why

println("Empty count " + hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())

prints

Empty count 1

This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and found this; it causes my tests to fail. Is there another way to check full equality of 2 DataFrames?

Thanks,
Ravindra.
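On the NULL = NULL guess: SQL equality is three-valued, so comparing NULL to NULL yields NULL (unknown) rather than true, and a join or filter condition only keeps a row when the predicate is literally true. A small plain-Scala sketch of that behavior, with None standing in for SQL NULL (an illustration, not Spark code):

```scala
object ThreeValued {
  // SQL-style equality: if either operand is NULL (None), the result is
  // NULL (None); otherwise it is an ordinary Boolean.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // A WHERE/JOIN condition keeps a row only when the predicate is true;
  // NULL (None) is treated the same as false for row selection.
  def keepsRow(p: Option[Boolean]): Boolean = p.contains(true)
}
```

So under SQL semantics NULL = NULL does not match, which is the kind of subtlety Yong suspects is behind the confusing count.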
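On Ravindra's last question, one common test-side alternative to except().count() == 0 is to collect both DataFrames and compare the rows as multisets (order-insensitive but duplicate-sensitive). This assumes the test data is small enough to collect; the comparison helper itself needs no Spark, so it can be sketched in plain Scala:

```scala
object DfEquality {
  // Compare two collected row sequences as multisets. In a Spark test one
  // would pass df1.collect().toSeq and df2.collect().toSeq here (hypothetical
  // usage); the helper is plain Scala.
  def sameRows[A](left: Seq[A], right: Seq[A]): Boolean = {
    def counts(xs: Seq[A]): Map[A, Int] =
      xs.groupBy(identity).map { case (k, v) => (k, v.size) }
    counts(left) == counts(right)
  }
}
```

Unlike the except-based check, two empty inputs compare equal here, and duplicate rows are counted rather than collapsed.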