[GitHub] [spark] cloud-fan commented on a change in pull request #27428: [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT

GitBox Mon, 13 Jul 2020 02:18:08 -0700


cloud-fan commented on a change in pull request #27428:
URL: https://github.com/apache/spark/pull/27428#discussion_r453511010




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala
##########
@@ -102,23 +102,126 @@ import org.apache.spark.sql.types.IntegerType
  * {{{
  * Aggregate(
  *    key = ['key]
- *    functions = [count(if (('gid = 1)) 'cat1 else null),
- *                 count(if (('gid = 2)) 'cat2 else null),
+ *    functions = [count(if (('gid = 1)) '_gen_attr_1 else null),
+ *                 count(if (('gid = 2)) '_gen_attr_2 else null),
  *                 first(if (('gid = 0)) 'total else null) ignore nulls]
  *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
  *   Aggregate(
- *      key = ['key, 'cat1, 'cat2, 'gid]
- *      functions = [sum('value) with FILTER('id > 1)]
- *      output = ['key, 'cat1, 'cat2, 'gid, 'total])
+ *      key = ['key, '_gen_attr_1, '_gen_attr_2, 'gid]
+ *      functions = [sum('_gen_attr_3)]
+ *      output = ['key, '_gen_attr_1, '_gen_attr_2, 'gid, 'total])
  *     Expand(
- *        projections = [('key, null, null, 0, cast('value as bigint), 'id),
+ *        projections = [('key, null, null, 0, if ('id > 1) cast('value as bigint) else null, 'id),
  *                       ('key, 'cat1, null, 1, null, null),
  *                       ('key, null, 'cat2, 2, null, null)]
- *        output = ['key, 'cat1, 'cat2, 'gid, 'value, 'id])
+ *        output = ['key, '_gen_attr_1, '_gen_attr_2, 'gid, '_gen_attr_3, 'id])
+ *       LocalTableScan [...]
+ * }}}
+ *
+ * Third example: single distinct aggregate function with filter clauses and have
+ * not other distinct aggregate function (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count('_gen_attr_1),
+  *                  sum('_gen_attr_2)]
+ *      output = ['key, 'cat1_cnt, 'total])
+ *     Project(
+ *        projectionList = ['key, if ('id > 1) 'cat1 else null, cast('value as bigint)]
+ *        output = ['key, '_gen_attr_1, '_gen_attr_2])
  *       LocalTableScan [...]
  * }}}
  *
- * The rule does the following things here:
+ * Four example: single distinct aggregate function with filter clauses (in sql):
+ * {{{
+ *   SELECT
+ *     COUNT(DISTINCT cat1) FILTER (WHERE id > 1) as cat1_cnt,
+ *     COUNT(DISTINCT cat2) as cat2_cnt,
+ *     SUM(value) AS total
+ *  FROM
+ *    data
+ *  GROUP BY
+ *    key
+ * }}}
+ *
+ * This translates to the following (pseudo) logical plan:
+ * {{{
+ * Aggregate(
+ *    key = ['key]
+ *    functions = [COUNT(DISTINCT 'cat1) with FILTER('id > 1),
+ *                 COUNT(DISTINCT 'cat2),
+ *                 sum('value)]
+ *    output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *   LocalTableScan [...]
+ * }}}
+ *
+ * This rule rewrites this logical plan to the following (pseudo) logical plan:
+ * {{{
+ *   Aggregate(
+ *      key = ['key]
+ *      functions = [count(if (('gid = 1)) '_gen_attr_1 else null),
+ *                   count(if (('gid = 2)) '_gen_attr_2 else null),
+ *                   first(if (('gid = 0)) 'total else null) ignore nulls]
+ *      output = ['key, 'cat1_cnt, 'cat2_cnt, 'total])
+ *     Aggregate(
+ *        key = ['key, '_gen_attr_1, '_gen_attr_2, 'gid]
+ *        functions = [sum('_gen_attr_3)]
+ *        output = ['key, '_gen_attr_1, '_gen_attr_2, 'gid, 'total])
+ *       Expand(
+ *          projections = [('key, null, null, 0, cast('value as bigint)),
+ *                         ('key, if ('id > 1) 'cat1 else null, null, 1, null),
+ *                         ('key, null, 'cat2, 2, null)]
+ *          output = ['key, '_gen_attr_1, '_gen_attr_2, 'gid, '_gen_attr_3])
+ *         LocalTableScan [...]
+ * }}}
+ *
+ * The rule consists of the two phases as follows:
+ *
+ * In the first phase, if the aggregate query exists filter clauses, project the output of
+ * the child of the aggregate query:
+ * 1. Project the data. There are three aggregation groups in this query:
+ *    i. the non-distinct group;
+ *    ii. the distinct 'cat1 group;
+ *    iii. the distinct 'cat2 group with filter clause.
+ *    Because there is at least one group with filter clause (e.g. the distinct 'cat2
+ *    group with filter clause), then will project the data.
+ * 2. Avoid projections that may output the same attributes. There are three aggregation groups
+ *    in this query:
+ *    i. the non-distinct 'cat1 group;
+ *    ii. the distinct 'cat1 group;
+ *    iii. the distinct 'cat1 group with filter clause.

Review comment:
       We don't need to repeat these 3 groups.
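
For readers following the hunk, here is a minimal sketch of why the Expand-based rewrite in the fourth example should give the same answer as the original query. It is not Spark code: it models rows as plain Python dicts, the sample data is hypothetical, and the function names `direct` and `rewritten` are illustrative only. The column names (`key`, `cat1`, `cat2`, `id`, `value`) and the `gid` / `_gen_attr_*` layout are taken from the Scaladoc above.

```python
# Hypothetical sample rows; column names follow the Scaladoc example.
rows = [
    {"key": "a", "cat1": "x", "cat2": "p", "id": 1, "value": 10},
    {"key": "a", "cat1": "y", "cat2": "p", "id": 2, "value": 20},
    {"key": "a", "cat1": "y", "cat2": "q", "id": 3, "value": 30},
    {"key": "b", "cat1": "x", "cat2": "p", "id": 5, "value": 5},
    {"key": "b", "cat1": None, "cat2": "p", "id": 7, "value": 7},
]


def direct(rows):
    """Evaluate the query as written:
    COUNT(DISTINCT cat1) FILTER (WHERE id > 1), COUNT(DISTINCT cat2), SUM(value)
    grouped by key."""
    out = {}
    for key in {r["key"] for r in rows}:
        group = [r for r in rows if r["key"] == key]
        cat1 = {r["cat1"] for r in group if r["id"] > 1 and r["cat1"] is not None}
        cat2 = {r["cat2"] for r in group if r["cat2"] is not None}
        out[key] = (len(cat1), len(cat2), sum(r["value"] for r in group))
    return out


def rewritten(rows):
    """Evaluate the same query through Expand -> Aggregate -> Aggregate."""
    # Expand: one output row per aggregation group, tagged with gid.
    # Layout: (key, _gen_attr_1, _gen_attr_2, gid, _gen_attr_3)
    expanded = []
    for r in rows:
        expanded.append((r["key"], None, None, 0, r["value"]))
        # The filter is folded into the projection: rows failing it become null.
        expanded.append((r["key"], r["cat1"] if r["id"] > 1 else None, None, 1, None))
        expanded.append((r["key"], None, r["cat2"], 2, None))

    # Inner aggregate: group by (key, _gen_attr_1, _gen_attr_2, gid).
    # Grouping deduplicates the distinct columns; sum ignores nulls.
    inner = {}
    for key, a1, a2, gid, a3 in expanded:
        group_key = (key, a1, a2, gid)
        total = inner.get(group_key)
        if a3 is not None:
            total = a3 if total is None else total + a3
        inner[group_key] = total

    # Outer aggregate: group by key, pick each group's rows by gid.
    outer = {}
    for (key, a1, a2, gid), total in inner.items():
        c1, c2, tot = outer.get(key, (0, 0, None))
        if gid == 1 and a1 is not None:
            c1 += 1  # count(if (gid = 1) _gen_attr_1 else null)
        if gid == 2 and a2 is not None:
            c2 += 1  # count(if (gid = 2) _gen_attr_2 else null)
        if gid == 0 and tot is None:
            tot = total  # first(if (gid = 0) total else null) ignore nulls
        outer[key] = (c1, c2, tot)
    return outer


assert direct(rows) == rewritten(rows)
```

The key point the sketch tries to show is that the filter predicate moves into the Expand projection for the filtered distinct group only, so the non-distinct `sum` and the unfiltered distinct count are unaffected.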



