[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937430#comment-15937430 ]

Apache Spark commented on SPARK-20008:
--------------------------------------

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/17392

> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-20008
>                 URL: https://issues.apache.org/jira/browse/SPARK-20008
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2, 2.2.0
>            Reporter: Ravindra Bajpai
>            Assignee: Xiao Li
>            Priority: Minor
>
> hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() yields 1 against the expected 0.
> This was not the case with Spark 1.5.2. This is an API change from a usage point of view, and hence I consider it a bug. It may be a boundary case; I am not sure.
> Workaround: for now I check that the counts are != 0 before this operation. This is not good for performance, hence this JIRA to track it.
> As Young Zhang explained in reply to my mail:
> Starting from Spark 2, this kind of operation is implemented as a left anti join instead of using RDD operations directly.
> The same issue also occurs on sqlContext.
> scala> spark.version
> res25: String = 2.0.2
> spark.sqlContext.emptyDataFrame.except(spark.sqlContext.emptyDataFrame).explain(true)
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[], output=[])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[], output=[])
>       +- BroadcastNestedLoopJoin BuildRight, LeftAnti, false
>          :- Scan ExistingRDD[]
>          +- BroadcastExchange IdentityBroadcastMode
>             +- Scan ExistingRDD[]
> This arguably means a bug. But my guess is that it is like the logic of comparing NULL = NULL (should it return true or false?), causing this kind of confusion.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
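For reference, the contract being discussed here is distinct set difference: {{a.except(b)}} should return the distinct rows of {{a}} that are not in {{b}}, so empty-minus-empty must be empty. A minimal Python sketch of that contract (plain tuples stand in for Rows; this is an illustration of the expected semantics, not Spark's implementation):

```python
def except_distinct(left, right):
    """Distinct left-anti-join semantics: distinct rows of `left` absent from `right`."""
    right_set = set(right)
    seen = set()
    out = []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

print(except_distinct([], []))                # [] - empty minus empty is empty
print(except_distinct([(1,), (2,)], [(2,)]))  # [(1,)]
```

Under these semantics the count for two empty inputs is 0, which is the behavior the reporter expected.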
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936847#comment-15936847 ]

Xiao Li commented on SPARK-20008:
---------------------------------

This sounds like a general issue in Spark SQL. For example, {{spark.emptyDataFrame.distinct()}} also returns a non-empty result set.
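One way to see why even {{distinct()}} misbehaves on a zero-column frame: if distinct is implemented as "group by all columns", a frame with zero columns degenerates into a global aggregate, and a global aggregate emits one group even over empty input. A toy Python model of that failure mode (an illustration only, not Spark's actual code path):

```python
def distinct_via_groupby(rows, num_cols):
    """Model distinct as 'group by all columns, emit one row per group'."""
    groups = {}
    for row in rows:
        groups[row] = row
    if num_cols == 0 and not groups:
        # With zero grouping columns this is a *global* aggregate, which
        # always emits exactly one group, even over empty input - this is
        # where the spurious empty Row() comes from.
        return [()]
    return list(groups.values())

print(distinct_via_groupby([], 0))  # [()] - the buggy single empty row
print(distinct_via_groupby([], 1))  # []   - correct with at least one column
```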
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15936516#comment-15936516 ]

Xiao Li commented on SPARK-20008:
---------------------------------

Sure, will do.
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935620#comment-15935620 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

Thank you for your kind explanation. I think you have more insight into this issue than I do. Could you fix this?
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935603#comment-15935603 ]

Xiao Li commented on SPARK-20008:
---------------------------------

In traditional RDBMSs, users are not allowed to create a table with zero columns. Thus, the existing solution did not cover this case. Do you want to fix it, [~hyukjin.kwon]? Or do you want me to fix it?
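As an illustration of the point about traditional RDBMSs, most engines reject a zero-column table outright at the parser level, so their set-operation code never has to handle this shape. SQLite is used below only as a convenient stand-in for "traditional RDBMS":

```python
import sqlite3

con = sqlite3.connect(":memory:")
try:
    # A table with an empty column list is a syntax error in SQLite.
    con.execute("CREATE TABLE t ()")
except sqlite3.OperationalError as e:
    print("rejected:", e)
```

Spark, by contrast, happily constructs the zero-column {{emptyDataFrame}}, which is why this corner case surfaces there.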
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935540#comment-15935540 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

[~smilegator], it seems that discussion is about duplicates in the result, if I understood correctly. The problem here is that {{Set() - Set()}} should return an empty {{Set()}}, which was previously the case. However, it now seems to return {{Set(Row())}} for empty dataframes. In the current master:

{code}
scala> spark.emptyDataFrame.except(spark.emptyDataFrame).collect()
res0: Array[org.apache.spark.sql.Row] = Array([])

scala> spark.emptyDataFrame.collect()
res1: Array[org.apache.spark.sql.Row] = Array()
{code}

I thought S ∖ S = ∅, as below:

{code}
scala> spark.range(1).except(spark.range(1)).collect()
res14: Array[Long] = Array()
{code}
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15935278#comment-15935278 ]

Xiao Li commented on SPARK-20008:
---------------------------------

See the discussion at https://github.com/apache/spark/pull/12736#r61344182

The behavior of the previous EXCEPT was wrong.
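The "NULL = NULL" question raised in the issue description is settled for EXCEPT by standard SQL: set operators use distinct-value semantics, under which two NULLs are not distinct from each other, even though {{NULL = NULL}} as a predicate evaluates to unknown. A quick check, with SQLite used here only as a convenient standard-SQL engine:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# As a predicate, NULL = NULL is unknown (surfaces as None in Python)...
print(con.execute("SELECT NULL = NULL").fetchone())          # (None,)

# ...but EXCEPT compares values with distinct-value semantics, so the
# two NULL rows cancel out and the result is empty.
print(con.execute("SELECT NULL EXCEPT SELECT NULL").fetchall())  # []
```

So the confusion is understandable, but NULL comparison is not the cause of the extra row here; the empty-row aggregate is.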
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931225#comment-15931225 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

I could reproduce this in the current master with

{code}
println(spark.emptyDataFrame.except(spark.emptyDataFrame).collect().size)
{code}
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931224#comment-15931224 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

This was fine in 1.6.3 with {{ExceptExec}} too, but this small bug seems to have been introduced when it was replaced with a {{Join}}.
[jira] [Commented] (SPARK-20008) hiveContext.emptyDataFrame.except(hiveContext.emptyDataFrame).count() returns 1
[ https://issues.apache.org/jira/browse/SPARK-20008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15931221#comment-15931221 ]

Hyukjin Kwon commented on SPARK-20008:
--------------------------------------

I just took a quick look. {{BroadcastNestedLoopJoin}} looks fine with empty rows, but {{HashAggregate}} produces an iterator with a single empty row when {{groupingExpressions}} is empty, here:
https://github.com/apache/spark/blob/dd9049e0492cc70b629518fee9b3d1632374c612/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L104-L125
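The single-empty-row behavior of a no-key {{HashAggregate}} matches standard SQL: a global aggregate (no GROUP BY) always returns exactly one row, even over empty input, while a grouped aggregate returns zero rows for zero groups. The plan's {{HashAggregate(keys=[], ...)}} over a zero-column frame takes the global path, which is where the stray row appears. A sketch of the two behaviors in SQLite (used here only to illustrate the SQL semantics, not Spark internals):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x INTEGER)")  # empty table, one column

# Global aggregate: exactly one row even on empty input,
# like HashAggregate(keys=[]) in the plan above.
print(con.execute("SELECT COUNT(*) FROM t").fetchall())             # [(0,)]

# Grouped aggregate: zero groups, zero rows - the behavior the
# zero-column distinct would need to return an empty result.
print(con.execute("SELECT COUNT(*) FROM t GROUP BY x").fetchall())  # []
```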