tanelk commented on a change in pull request #30999: URL: https://github.com/apache/spark/pull/30999#discussion_r554075890
########## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ##########

@@ -349,11 +349,20 @@ abstract class Optimizer(catalogManager: CatalogManager)
  */
 object EliminateDistinct extends Rule[LogicalPlan] {
   override def apply(plan: LogicalPlan): LogicalPlan = plan transformExpressions {
-    case ae: AggregateExpression if ae.isDistinct =>
-      ae.aggregateFunction match {
-        case _: Max | _: Min => ae.copy(isDistinct = false)
-        case _ => ae
-      }
+    case ae: AggregateExpression if ae.isDistinct && isDuplicateAgnostic(ae.aggregateFunction) =>
+      ae.copy(isDistinct = false)
+  }
+
+  private def isDuplicateAgnostic(af: AggregateFunction): Boolean = af match {
+    case _: Max => true
+    case _: Min => true
+    case _: BitAndAgg => true
+    case _: BitOrAgg => true
+    case _: First => true
+    case _: Last => true
+    case _: HyperLogLogPlusPlus => true

Review comment:

   There was a similar discussion before: https://github.com/apache/spark/pull/29740#discussion_r487692977

   In short, HLL++ keeps track of the maximum number of leading zeros in the hashes of its inputs, so repeatedly adding the same value does not change its result. But as @maropu said last time, perhaps we are not making any guarantees about this behaviour, so it might not be the best idea to include it here. I'm okay with removing it if you say so.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
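
   The duplicate-agnostic property of HLL++ described above can be sketched as follows. This is an illustrative toy, not Spark's `HyperLogLogPlusPlus` implementation; the object name, register count, and hash choice are all assumptions made for the example:

```scala
// Toy sketch (assumed names, not Spark internals): an HLL++-style sketch
// keeps, per register, the maximum "rank" derived from input hashes.
object HllSketch {
  private val numRegisters = 16
  val registers: Array[Int] = Array.fill(numRegisters)(0)

  def add(value: Any): Unit = {
    val h = value.hashCode()                         // stand-in for HLL's hash
    val bucket = (h >>> 28) & (numRegisters - 1)     // high bits pick a register
    val rank = Integer.numberOfTrailingZeros(h) + 1  // "rarity" of the hash
    // max() is idempotent: re-adding an already-seen value recomputes the
    // same (bucket, rank) pair and leaves the register unchanged. This is
    // exactly why DISTINCT can be dropped for such an aggregate.
    registers(bucket) = math.max(registers(bucket), rank)
  }
}
```

   Adding the same value twice leaves every register untouched, which is the sense in which the aggregate is duplicate-agnostic.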