tanelk commented on a change in pull request #30999: URL: https://github.com/apache/spark/pull/30999#discussion_r554075890
########## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##########

@@ -349,11 +349,20 @@ abstract class Optimizer(catalogManager: CatalogManager)
  */
 object EliminateDistinct extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
-    case ae: AggregateExpression if ae.isDistinct =>
-      ae.aggregateFunction match {
-        case _: Max | _: Min => ae.copy(isDistinct = false)
-        case _ => ae
-      }
+    case ae: AggregateExpression if ae.isDistinct && isDuplicateAgnostic(ae.aggregateFunction) =>
+      ae.copy(isDistinct = false)
+  }
+
+  private def isDuplicateAgnostic(af: AggregateFunction): Boolean = af match {
+    case _: Max => true
+    case _: Min => true
+    case _: BitAndAgg => true
+    case _: BitOrAgg => true
+    case _: First => true
+    case _: Last => true
+    case _: HyperLogLogPlusPlus => true

Review comment:

   There was a similar discussion before: https://github.com/apache/spark/pull/29740#discussion_r487692977

   In short, HLL++ keeps track of the maximum number of leading zeros in the hashes of its inputs, so repeatedly adding the same value does not change its result. But as @maropu said last time, perhaps we are not making any guarantees about this behaviour, so it might not be the best idea to include it here. I'm okay with removing it if you say so.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
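
   The duplicate-agnostic property of HLL++ described above can be sketched as follows. This is an illustrative toy, not Spark's `HyperLogLogPlusPlus` implementation; the object name, register count, and hash choice are all assumptions made for the example:

```scala
// Toy sketch (assumed names, not Spark internals): an HLL++-style sketch
// keeps, per register, the maximum "rank" derived from input hashes.
object HllSketch {
  private val numRegisters = 16
  val registers: Array[Int] = Array.fill(numRegisters)(0)

  def add(value: Any): Unit = {
    val h = value.hashCode()                         // stand-in for HLL's hash
    val bucket = (h >>> 28) & (numRegisters - 1)     // high bits pick a register
    val rank = Integer.numberOfTrailingZeros(h) + 1  // "rarity" of the hash
    // max() is idempotent: re-adding an already-seen value recomputes the
    // same (bucket, rank) pair and leaves the register unchanged. This is
    // exactly why DISTINCT can be dropped for such an aggregate.
    registers(bucket) = math.max(registers(bucket), rank)
  }
}
```

   Adding the same value twice leaves every register untouched, which is the sense in which the aggregate is duplicate-agnostic.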