Hi all, I would like to introduce the idea of duplicate insensitive aggregate functions.
For such functions, the aggregation result stays the same whether or not the input is deduplicated first.

For example, given the data {1, 1, 2, 2, 3, 5, 5}, MIN produces the same result with or without deduplication. That is,

MIN({1, 1, 2, 2, 3, 5, 5}) = MIN({1, 2, 3, 5})

So MIN is a *duplicate insensitive function*.

On the other hand, SUM is not duplicate insensitive, because

SUM({1, 1, 2, 2, 3, 5, 5}) != SUM({1, 2, 3, 5})

The concept of duplicate insensitivity can help us in many optimization scenarios. For example, the current implementation of AggregateMergeRule rules out any aggregate call for which the isDistinct() method returns true. However, for duplicate insensitive functions, DISTINCT does not change the result, so the rule should still be applicable.

Could you please give your valuable feedback?

Best,
Liya Fan
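
As a concrete illustration of the definition above, here is a minimal Java sketch that compares an aggregate's result on the example data with its result on the deduplicated data. This is not Calcite code: the class `DuplicateInsensitivityDemo` and its methods are hypothetical names chosen for the example, and a check on a single input only demonstrates the property, it does not prove it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.function.Function;

public class DuplicateInsensitivityDemo {

    /**
     * Returns true if the aggregate gives the same result on the input
     * and on the input with duplicates removed (hypothetical helper).
     */
    public static boolean sameAfterDedup(
            Function<List<Integer>, Integer> agg, List<Integer> data) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        List<Integer> deduped = new ArrayList<>(new LinkedHashSet<>(data));
        return agg.apply(data).equals(agg.apply(deduped));
    }

    /** MIN over a non-empty list. */
    public static Integer min(List<Integer> xs) {
        return Collections.min(xs);
    }

    /** SUM over a list. */
    public static Integer sum(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 1, 2, 2, 3, 5, 5);
        // MIN is unchanged by deduplication; SUM is not.
        System.out.println("MIN: " + sameAfterDedup(DuplicateInsensitivityDemo::min, data));
        System.out.println("SUM: " + sameAfterDedup(DuplicateInsensitivityDemo::sum, data));
    }
}
```

On the example data the check holds for MIN and fails for SUM, which matches the two cases described above.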