[jira] Commented: (HIVE-503) improvement on distinct: distinguish distinct aggregate function from distinct

Min Zhou (JIRA) Fri, 22 May 2009 20:13:10 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712344#action_12712344
 ]


Min Zhou commented on HIVE-503:
-------------------------------

Is the intermediate data you mentioned the data output by the mappers for the 
shuffle? Could you explain why the amount produced by your second approach 
is m+n? I thought it might be mn as well.

btw, I think you must also count the overhead of at least one join query when 
you finally merge your m distinct results into one table. The first approach 
does not need that.

> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou
>
> h4.distinct
> # OK
> {code:sql}
> select 
>    distinct col
> from 
>   tbl
> {code}
> # FAILED
> {code:sql}
> select 
>    distinct  col1,
>    distinct  col2
> from 
>   tbl
> {code}
> h4.distinct aggregate function
> # OK
> {code:sql}
> select 
>    count(distinct col % 10)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col1 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   sum(distinct col1 % 10),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   max(distinct substr(col1, 1, 10)),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> The keyword "distinct" often produces more than one result row, so it is 
> impossible to remove duplicates from two different columns in a single 
> mapreduce job; that is why the query marked FAILED does not work.
> The term "distinct aggregate function", written in the form 
> aggregate_function(distinct ...), is the counterpart of the term "all 
> aggregate function"; it is essentially an aggregate function. Each aggregate 
> function produces only one result, so one mapreduce job can very likely 
> handle two or more different aggregate expressions simultaneously.
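
To illustrate the point about distinct aggregate functions, here is a minimal 
sketch in Python of a single pass that evaluates sum(distinct col1 % 10) and 
count(distinct col2 % 9) together. It is not Hive's implementation; the 
function name distinct_aggregates and the sample rows are made up for 
illustration, and it assumes everything fits in one reducer's memory.

```python
def distinct_aggregates(rows):
    """Single pass over (col1, col2) rows, one set per distinct aggregate.

    Hypothetical helper for illustration only; not part of Hive.
    """
    seen_for_sum = set()    # distinct values of col1 % 10
    seen_for_count = set()  # distinct values of col2 % 9
    for col1, col2 in rows:
        seen_for_sum.add(col1 % 10)
        seen_for_count.add(col2 % 9)
    # Each aggregate collapses its own set into exactly one value,
    # so both results fit in a single output row.
    return sum(seen_for_sum), len(seen_for_count)


# Made-up sample data standing in for tbl.
rows = [(11, 9), (21, 18), (12, 10), (32, 1)]
print(distinct_aggregates(rows))
```

Because each aggregate reduces to one value, the two sets never have to be 
paired row by row, which is what makes the plain "distinct col1, distinct 
col2" form hard to answer in one job.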

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
