[ 
https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712177#action_12712177
 ] 

Ashish Thusoo commented on HIVE-503:
------------------------------------

Actually, we talked about this approach a long time back, but we were not 
sure that it would be better than running 2 map/reduce jobs. The reason is 
that this approach leads to a sort of mn data, where m is the number 
of distincts and n is the number of rows, as opposed to a sort of m+n data if we do 
this with m map/reduce jobs. Granted, we also scan the data mn times in the 
second approach, as opposed to 1 time in the first approach, but we find on our 
cluster that scan bandwidth is not an issue (mostly because we store data 
compressed), while the sort and the memory used in the reducer or the mapper 
become the issue. I think this calls for some experimentation to determine the 
value of m at which one approach becomes better than the other.
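
To make the mn figure concrete, here is a minimal sketch of the single-job approach, written in plain Python rather than as Hive code. It assumes the mapper emits one record per row per distinct expression; the function name, the row layout, and the sample data are hypothetical and only illustrate how the sorted volume grows, not how Hive actually plans the query.

```python
from itertools import groupby


def one_job_distinct_counts(rows, exprs):
    """Simulate one map/reduce job that computes count(distinct expr) for each expr."""
    # Map: emit one (expression index, value) record per row per expression,
    # so n rows and m expressions expand into m * n records.
    records = [(i, expr(row)) for row in rows for i, expr in enumerate(exprs)]
    # Shuffle: the whole expanded record set has to be sorted.
    records.sort()
    # Reduce: equal (index, value) pairs are now adjacent, so each group
    # contributes exactly one distinct value to expression i.
    counts = [0] * len(exprs)
    for (i, _value), _group in groupby(records):
        counts[i] += 1
    return counts, len(records)


# Hypothetical sample data: n = 4 rows, m = 2 distinct expressions.
rows = [
    {"col1": 3, "col2": 9},
    {"col1": 13, "col2": 18},
    {"col1": 24, "col2": 27},
    {"col1": 5, "col2": 10},
]
exprs = [lambda r: r["col1"] % 10, lambda r: r["col2"] % 9]

counts, sorted_records = one_job_distinct_counts(rows, exprs)
print(counts)          # distinct counts per expression
print(sorted_records)  # number of records that went through the sort
```

With 4 rows and 2 expressions the sort sees 8 records, which is the mn growth described above; the experiment would be to measure where that cost overtakes the cost of the extra jobs.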


> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou
>
> h4. distinct
> # OK
> {code:sql}
> select 
>    distinct col
> from 
>   tbl
> {code}
> # FAILED
> {code:sql}
> select 
>    distinct  col1,
>    distinct  col2
> from 
>   tbl
> {code}
> h4. distinct aggregate function
> # OK
> {code:sql}
> select 
>    count(distinct col % 10)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col1 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>    count(distinct col1 % 10),
>    count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   sum(distinct col1 % 10),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> # OK
> {code:sql}
> select 
>   max(distinct substr(col1, 1, 10)),
>   count(distinct col2 % 9)
> from 
>   tbl
> {code}
> The keyword "distinct" often produces more than one result row, so it is 
> impossible to remove duplicates from two different columns in only one 
> mapreduce job, and the query fails.
> The term "distinct aggregate function", with a form like 
> aggregate_function(distinct ...), is the counterpart of the term "all 
> aggregate function"; it is essentially an aggregate function. Each aggregate 
> function produces only one result, so one mapreduce job could very well 
> handle two or more different aggregate expressions simultaneously.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
