cloud-fan commented on a change in pull request #27428: URL: https://github.com/apache/spark/pull/27428#discussion_r450157305
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
##########
@@ -118,7 +118,63 @@ import org.apache.spark.sql.types.IntegerType
  *     LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Third example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *   FROM
+ *     data
+ *   GROUP BY
+ *     key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [count(if (('gid = 1)) '_gen_distinct_1 else null),
+ *                 count(if (('gid = 2)) '_gen_distinct_2 else null),
+ *                 first(if (('gid = 0)) 'total else null) ignore nulls]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   Aggregate(
+ *      key = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid]
+ *      functions = [sum('value)]
+ *      output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'total])
+ *     Expand(
+ *        projections = [('key, null, null, 0, 'value),
+ *                       ('key, '_gen_distinct_1, null, 1, null),
+ *                       ('key, null, '_gen_distinct_2, 2, null)]
+ *        output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value])
+ *       Project(
+ *          projectList = ['key, if ('id > 1) 'cat1 else null, 'cat2, cast('value as bigint)]
+ *          output = ['key, '_gen_distinct_1, '_gen_distinct_2, 'value])
+ *         LocalTableScan [...]
+ * }}}
+ *
+ * The rule consists of the two phases as follows:
+ *
+ * In the first phase, expands data for the distinct aggregates where filter clauses exist:

Review comment:
Please explain "what" first, not "why".

> Guaranteed to compute filter clauses in the first aggregate locally.

This is not a "what".
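
To make the "what" of the third example concrete, here is a minimal plain-Scala sketch (no Spark dependency) that simulates the Project, Expand, and two Aggregate steps of the rewritten pseudo plan and compares the result with a direct evaluation of the query. It is an illustration only, not the rule's implementation: `Input`, `Expanded`, `Result`, `direct`, and `rewritten` are hypothetical names, and SQL nulls are modelled with `Option`.

```scala
// Sketch under stated assumptions: all names below are hypothetical and only
// mirror the pseudo logical plan quoted in the diff; this is not Spark code.
object DistinctFilterRewriteSketch {
  final case class Input(key: String, id: Int, cat1: Option[String],
                         cat2: Option[String], value: Long)
  final case class Result(cat1Cnt: Long, cat2Cnt: Long, total: Long)

  // Row shape after Expand: ('key, '_gen_distinct_1, '_gen_distinct_2, 'gid, 'value).
  // Reused after the first Aggregate, where `value` holds the partial 'total.
  final case class Expanded(key: String, d1: Option[String], d2: Option[String],
                            gid: Int, value: Option[Long])

  // Reference semantics: evaluate the original query directly per group.
  def direct(data: Seq[Input]): Map[String, Result] =
    data.groupBy(_.key).map { case (key, rows) =>
      key -> Result(
        cat1Cnt = rows.filter(_.id > 1).flatMap(_.cat1).distinct.size.toLong,
        cat2Cnt = rows.flatMap(_.cat2).distinct.size.toLong,
        total = rows.map(_.value).sum)
    }

  def rewritten(data: Seq[Input]): Map[String, Result] = {
    // Project: fold the FILTER predicate into the distinct child,
    // i.e. if ('id > 1) 'cat1 else null.
    val projected =
      data.map(r => (r.key, if (r.id > 1) r.cat1 else None, r.cat2, r.value))

    // Expand: one copy of each row per aggregate group, tagged with gid.
    val expanded = projected.flatMap { case (key, d1, d2, value) =>
      Seq(
        Expanded(key, None, None, 0, Some(value)), // gid 0: regular aggregates
        Expanded(key, d1, None, 1, None),          // gid 1: distinct cat1
        Expanded(key, None, d2, 2, None))          // gid 2: distinct cat2
    }

    // First Aggregate: grouping by (key, d1, d2, gid) de-duplicates the
    // distinct values; sum('value) is null (None) when every input is null.
    val firstAgg = expanded
      .groupBy(e => (e.key, e.d1, e.d2, e.gid))
      .map { case ((key, d1, d2, gid), rows) =>
        val values = rows.flatMap(_.value)
        Expanded(key, d1, d2, gid, if (values.isEmpty) None else Some(values.sum))
      }
      .toSeq

    // Second Aggregate: count non-null distinct values per gid and take the
    // first non-null total from the gid = 0 row.
    firstAgg.groupBy(_.key).map { case (key, rows) =>
      key -> Result(
        cat1Cnt = rows.count(e => e.gid == 1 && e.d1.isDefined).toLong,
        cat2Cnt = rows.count(e => e.gid == 2 && e.d2.isDefined).toLong,
        total = rows.collectFirst { case Expanded(_, _, _, 0, Some(t)) => t }
          .getOrElse(0L))
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(
      Input("a", 1, Some("x"), Some("p"), 10L),
      Input("a", 2, Some("x"), Some("p"), 20L),
      Input("a", 3, Some("y"), Some("q"), 30L),
      Input("a", 3, Some("y"), None, 5L),
      Input("b", 1, Some("z"), Some("r"), 7L))
    // Both evaluation strategies should agree on every group.
    assert(direct(data) == rewritten(data))
    println(rewritten(data))
  }
}
```

In this sketch the filter is evaluated once in the projection, so the later steps only need the usual null-skipping behaviour of `count`; whether the doc comment should lead with that description is the point the review comment raises.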