[ https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951290#comment-14951290 ]
Yin Huai commented on SPARK-9241:
---------------------------------

Yeah. When we compile the query, we can split a query with multiple distinct columns into multiple queries, each of which evaluates a single distinct aggregation. Then, we can join the results using the group by keys as the join keys. In the join, we need to use null-safe equality as the condition, so that groups whose key is NULL still match. Right now, we need another optimization to make it work efficiently.

Here is an example,
{code}
SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), c FROM t GROUP BY c
{code}
will be rewritten to
{code}
SELECT x.count_a, y.count_b, x.c
FROM
  (SELECT COUNT(DISTINCT a) count_a, c FROM t GROUP BY c) x
JOIN
  (SELECT COUNT(DISTINCT b) count_b, c FROM t GROUP BY c) y
ON coalesce(x.c, 0) = coalesce(y.c, 0) AND x.c <=> y.c
{code}

> Supporting multiple DISTINCT columns
> ------------------------------------
>
>                 Key: SPARK-9241
>                 URL: https://issues.apache.org/jira/browse/SPARK-9241
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> Right now the new aggregation code path only supports a single distinct column
> (you can use it in multiple aggregate functions in the query). We need to
> support multiple distinct columns by generating a different plan for handling
> multiple distinct columns (without changing aggregate functions).
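
The rewrite described in the comment can be checked on a small table. The following is a minimal sketch, not Spark code: it runs the example query in SQLite through Python's standard sqlite3 module. The sample rows and the helper `run` are hypothetical, and because SQLite has no `<=>` operator, its `IS` operator stands in as the null-safe equality. Both subqueries select `c` so that the join has a key to match on. The `coalesce` conjunct from the comment is left out, since it affects only how the join is executed, not its result.

```python
import sqlite3

# Hypothetical sample rows (a, b, c). One group key is NULL to show why the
# join condition has to be null-safe.
ROWS = [
    (1, 10, "x"),
    (1, 20, "x"),
    (2, 20, "x"),
    (3, 30, None),
    (3, 40, None),
    (None, 40, "y"),
]

# The query from the comment, with two distinct aggregations.
ORIGINAL = """
SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), c
FROM t GROUP BY c
"""

# One subquery per distinct aggregation, joined on the group by key.
# Assumption: SQLite's IS replaces Spark's <=> as the null-safe equality.
REWRITTEN = """
SELECT x.count_a, y.count_b, x.c
FROM (SELECT COUNT(DISTINCT a) AS count_a, c FROM t GROUP BY c) x
JOIN (SELECT COUNT(DISTINCT b) AS count_b, c FROM t GROUP BY c) y
ON x.c IS y.c
"""

# The same rewrite with plain equality, for comparison.
PLAIN_EQ = REWRITTEN.replace("x.c IS y.c", "x.c = y.c")


def run(query):
    """Load ROWS into a fresh in-memory table t and return the query result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?)", ROWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    # Sort by group key, NULL first, so results compare regardless of row order.
    return sorted(rows, key=lambda r: (r[2] is not None, r[2] or ""))


if __name__ == "__main__":
    print("original :", run(ORIGINAL))
    print("rewritten:", run(REWRITTEN))
    print("plain '=':", run(PLAIN_EQ))
```

With these rows, the original and rewritten queries return the same groups, while the plain-equality variant loses the group whose key is NULL, because `NULL = NULL` is not true in SQL.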