[jira] [Commented] (SPARK-9241) Supporting multiple DISTINCT columns

Yin Huai (JIRA) Fri, 09 Oct 2015 15:10:27 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-9241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951290#comment-14951290
 ]


Yin Huai commented on SPARK-9241:
---------------------------------

Yeah. When we compile the query, we can split a query with multiple 
distinct columns into multiple queries, each evaluating a single distinct 
aggregation. Then we can join the results, using the GROUP BY keys as the join 
keys. The join condition has to use null-safe equality, so that the group whose 
key is NULL is still matched. Right now, we would need another optimization to 
make this work efficiently.

Here is an example. The query
{code}
SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), c FROM t GROUP BY c
{code}
will be rewritten to
{code}
SELECT x.count_a, y.count_b, x.c
FROM
  (SELECT COUNT(DISTINCT a) count_a, c FROM t GROUP BY c) x JOIN
  (SELECT COUNT(DISTINCT b) count_b, c FROM t GROUP BY c) y
  ON coalesce(x.c, 0) = coalesce(y.c, 0) AND x.c <=> y.c
{code}
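
To sanity-check the rewrite outside Spark, here is a minimal sketch using Python's built-in sqlite3 module. It is not Spark code: the sample rows are made up, SQLite's `IS` operator stands in for Spark SQL's null-safe `<=>`, and the `coalesce` conjunct is left out because it is implied by the null-safe condition and so does not change the result. Each subquery also selects `c`, so that the join keys are available to the outer query.

```python
import sqlite3

# Hypothetical sample data for table t(a, b, c); one group has c = NULL.
rows = [
    (1, 10, "x"),
    (1, 20, "x"),
    (2, 20, "x"),
    (3, 30, None),
    (3, 31, None),
    (4, 40, "y"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)

# The query with two distinct aggregations, as in the example above.
ORIGINAL = """
SELECT COUNT(DISTINCT a), COUNT(DISTINCT b), c FROM t GROUP BY c
"""

# One subquery per distinct aggregation, joined on the group-by key.
# Assumption: SQLite's `IS` plays the role of Spark SQL's `<=>`.
REWRITTEN = """
SELECT x.count_a, y.count_b, x.c
FROM
  (SELECT COUNT(DISTINCT a) AS count_a, c FROM t GROUP BY c) x JOIN
  (SELECT COUNT(DISTINCT b) AS count_b, c FROM t GROUP BY c) y
  ON x.c IS y.c
"""

# Same join with plain `=`: NULL = NULL is not true, so the NULL group is lost.
PLAIN_EQ = REWRITTEN.replace("x.c IS y.c", "x.c = y.c")


def run(sql):
    # Sort so that the NULL group comes first, then groups by key.
    return sorted(
        conn.execute(sql).fetchall(),
        key=lambda r: (r[2] is not None, r[2] or ""),
    )


original = run(ORIGINAL)
rewritten = run(REWRITTEN)
plain_eq = run(PLAIN_EQ)

print(original == rewritten)          # the rewrite preserves the result
print(len(original) - len(plain_eq))  # rows dropped by the plain `=` join
```

On this data the rewritten query returns the same rows as the original, while the plain-equality variant loses the row for the NULL group, which is why the join condition needs null-safe equality.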

> Supporting multiple DISTINCT columns
> ------------------------------------
>
>                 Key: SPARK-9241
>                 URL: https://issues.apache.org/jira/browse/SPARK-9241
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Critical
>
> Right now the new aggregation code path only supports a single distinct column 
> (you can use it in multiple aggregate functions in the query). We need to 
> support multiple distinct columns by generating a different plan for handling 
> multiple distinct columns (without changing aggregate functions).



