Starting from Spark 2, this kind of operation is implemented as a left anti 
join instead of using RDD operations directly.
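A minimal sketch of that equivalence (assuming a SparkSession named `spark`, as in spark-shell; the column name `id` is just for illustration): `except` behaves like a distinct left anti join over all columns.

```scala
// df1 holds ids 0,1,2; df2 holds ids 0,1.
val df1 = spark.range(3).toDF("id")
val df2 = spark.range(2).toDF("id")

// Both of these produce a single row with id = 2.
df1.except(df2).show()
df1.join(df2, Seq("id"), "left_anti").distinct().show()
```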


The same issue also occurs with sqlContext.


scala> spark.version
res25: String = 2.0.2


spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)

== Physical Plan ==
*HashAggregate(keys=[], functions=[], output=[])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[], output=[])
      +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
         :- Scan ExistingRDD[]
         +- BroadcastExchange IdentityBroadcastMode
            +- Scan ExistingRDD[]


This is arguably a bug, but my guess is that it comes down to the semantics of 
comparing NULL = NULL (should it return true or false?), which causes this kind 
of confusion.
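As a test-time workaround, one common approach (a hedged sketch, assuming a SparkSession named `spark`; the helper name `sameContents` is mine, not a Spark API) is to run `except` in both directions and compare counts, since `except` alone deduplicates and is not symmetric:

```scala
import org.apache.spark.sql.DataFrame

// Returns true when both DataFrames contain the same distinct rows
// and the same total row count. Note: this does not distinguish
// differing duplicate multiplicities beyond the total count.
def sameContents(a: DataFrame, b: DataFrame): Boolean =
  a.except(b).count() == 0 &&
  b.except(a).count() == 0 &&
  a.count() == b.count()
```

On the NULL point: in SQL, `NULL = NULL` evaluates to NULL rather than true, whereas the null-safe operator `<=>` treats two NULLs as equal, which is the kind of ambiguity that can surface in set operations like `except`.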

Yong

________________________________
From: Ravindra <ravindra.baj...@gmail.com>
Sent: Friday, March 17, 2017 4:30 AM
To: user@spark.apache.org
Subject: Spark 2.0.2 - 
hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count()

Can someone please explain why

println("Empty count " + 
hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count())

prints -  Empty count 1

This was not the case in Spark 1.5.2. I am upgrading to Spark 2.0.2 and found 
this; it causes my tests to fail. Is there another way to check full equality 
of two DataFrames?

Thanks,
Ravindra.
