[jira] Commented: (HIVE-287) count distinct on multiple columns does not work

John Sichi (JIRA) Thu, 08 Jul 2010 12:54:49 -0700

    [ https://issues.apache.org/jira/browse/HIVE-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886428#action_12886428 ]


John Sichi commented on HIVE-287:
---------------------------------

Regarding DISTINCT:  I agree with Arvind; the UDAF should be told whether 
DISTINCT was specified so that it can reject invocations that don't make sense.  
Once that validation passes, distinct elimination is still implemented 
generically inside Hive (upstream of the UDAF).
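
To make that split of responsibilities concrete, here is a minimal sketch. It 
is not Hive's actual API: the Aggregate interface, the CountSketch class, and 
the rule that a multi-argument COUNT requires DISTINCT are all assumptions made 
up for illustration. The aggregate only sees the DISTINCT flag at validation 
time; the engine removes duplicate argument tuples before the aggregate ever 
sees a row.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical sketch; these names do not come from the Hive code base.
final class DistinctSketch {

    /** A toy aggregate: it is told about DISTINCT but never implements it. */
    interface Aggregate {
        void validate(boolean distinct, int argCount); // may reject the call
        void iterate(List<Object> args);               // one row of arguments
        long terminate();                              // final result
    }

    /** Toy COUNT: counts rows in which no argument is null. */
    static final class CountSketch implements Aggregate {
        private long n;

        public void validate(boolean distinct, int argCount) {
            // Assumed rule, for illustration only.
            if (argCount > 1 && !distinct) {
                throw new IllegalArgumentException(
                    "COUNT with several arguments requires DISTINCT");
            }
        }

        public void iterate(List<Object> args) {
            if (!args.contains(null)) {
                n++;
            }
        }

        public long terminate() {
            return n;
        }
    }

    /** Engine side: validate first, then de-duplicate generically. */
    static long run(Aggregate agg, boolean distinct, int argCount,
                    List<List<Object>> rows) {
        agg.validate(distinct, argCount);
        Collection<List<Object>> input = rows;
        if (distinct) {
            // Distinct elimination happens here, upstream of the aggregate.
            input = new LinkedHashSet<>(rows);
        }
        for (List<Object> row : input) {
            agg.iterate(row);
        }
        return agg.terminate();
    }
}
```

With this shape, an aggregate author writes only validate/iterate/terminate, 
and every aggregate gets DISTINCT handling from the engine without extra code.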

Regarding F(*):  let's distinguish three cases.

COUNT(*):  this really means COUNT(), not COUNT(x,y,z).  This is a very 
important distinction to make from an optimizer perspective, because we want to 
be able to push down projection to avoid I/O and other processing for columns 
whose values we will never look at.

SUM(*) and similar ones:  these we should disallow.

MY_UDAF(*), or MY_UDAF(t.*):  this is similar to Pradeep's case that came up 
recently on the mailing list, and it needs to expand to MY_UDAF(x,y,z), not 
MY_UDAF().  I think the patch is currently doing MY_UDAF(), which isn't what he 
wants.
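
The three cases can be sketched roughly as follows. This is only an 
illustration of the rewrite rule, not code from the patch: the class and 
method names are hypothetical, and the set of built-ins that reject the star 
(SUM, AVG, MIN, MAX) is my assumption about what "similar ones" would cover.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; these names do not come from the Hive code base.
final class StarExpansionSketch {

    // Assumed list of built-in aggregates for which F(*) is disallowed.
    private static final Set<String> REJECT_STAR =
        new HashSet<>(Arrays.asList("SUM", "AVG", "MIN", "MAX"));

    /** Returns the argument list that F(*) is rewritten to. */
    static List<String> expandStar(String functionName,
                                   List<String> visibleColumns) {
        String name = functionName.toUpperCase(Locale.ROOT);
        if (name.equals("COUNT")) {
            // COUNT(*) means COUNT(): no column is referenced, so the
            // optimizer stays free to prune every column from the scan.
            return new ArrayList<>();
        }
        if (REJECT_STAR.contains(name)) {
            throw new IllegalArgumentException(name + "(*) is not allowed");
        }
        // Any other function: expand * to all visible columns, in order.
        return new ArrayList<>(visibleColumns);
    }
}
```

For a table with columns x, y, z, this gives COUNT(*) an empty argument list, 
rejects SUM(*), and turns MY_UDAF(*) into MY_UDAF(x, y, z).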

My recommendation is that we commit Arvind's patch as is, then create a 
follow-up JIRA issue to do what Pradeep is looking for (the expansion of * in 
the semantic analyzer) for both UDFs and UDAFs, with a special case for 
COUNT. UDAF authors will be able to decide whether or not to reject the star 
syntax, since in the common case of a UDAF expecting a limited number of 
parameters, the star won't make sense.


> count distinct on multiple columns does not work
> ------------------------------------------------
>
>                 Key: HIVE-287
>                 URL: https://issues.apache.org/jira/browse/HIVE-287
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Namit Jain
>            Assignee: Arvind Prabhakar
>         Attachments: HIVE-287-1.patch, HIVE-287-2.patch, HIVE-287-3.patch, 
> HIVE-287-4.patch, HIVE-287-5-branch-0.6.patch, HIVE-287-5-trunk.patch
>
>
> The following query does not work:
> select count(distinct col1, col2) from Tbl

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
