[ https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712177#action_12712177 ]
Ashish Thusoo commented on HIVE-503:
------------------------------------

Actually, we discussed this approach a long time ago, but we were not sure
that it would be better than running 2 map/reduce jobs. The reason is that
this approach leads to a sort of mn data, where m is the number of distincts
and n is the number of rows, as opposed to a sort of m+n data if we do this
with m map/reduce jobs. Granted, we also scan the data mn times in the second
approach as opposed to 1 time in the first approach, but we find in our
cluster that scan bandwidth is not an issue (mostly because we store data
compressed), whereas the sort and the memory used in the reducer or the
mapper do become the issue. I think this calls for some experimentation to
determine the value of m at which one approach becomes better than the other.

> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou
>
> h4.distinct
> # OK
> {code:sql}
> select
>   distinct col
> from
>   tbl
> {code}
> # FAILED
> {code:sql}
> select
>   distinct col1,
>   distinct col2
> from
>   tbl
> {code}
> h4.distinct aggregate function
> # OK
> {code:sql}
> select
>   count(distinct col % 10)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   count(distinct col1 % 10),
>   count(distinct col1 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   count(distinct col1 % 10),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   sum(distinct col1 % 10),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   max(distinct substr(col1, 1, 10)),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> The keyword "distinct" often produces more than one result row, so it is
> impossible to remove the duplicates of two different columns in a single
> mapreduce job, and the query fails.
> The term "distinct aggregate function", of the form
> aggregate_function(distinct ....), stands in contrast to the term "all
> aggregate function"; it is essentially an aggregate function. Each aggregate
> function produces only one result, so it is quite possible for one
> mapreduce job to handle two or more different aggregate expressions
> simultaneously.
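
To make the trade-off in the comment above easier to experiment with, here is a rough Python sketch of the two strategies for count(distinct ...). It is not Hive code: the function names single_job and one_job_per_distinct are hypothetical, and the model only counts records, ignoring compression, combiners, and reducer memory. The first function tags every value with the index of its distinct expression and sorts all the tagged records together; the second runs one scan and one sort per expression.

```python
from itertools import groupby


def single_job(rows, exprs):
    """One job: emit one (expr_index, value) record per row per expression."""
    records = [(i, f(row)) for row in rows for i, f in enumerate(exprs)]
    records.sort()  # a single sort over all m * n tagged records
    counts = [0] * len(exprs)
    for (i, _value), _duplicates in groupby(records):
        counts[i] += 1  # each distinct (expr_index, value) pair counts once
    return counts, len(records)


def one_job_per_distinct(rows, exprs):
    """m jobs: each one scans all n rows and sorts n values."""
    counts, largest_sort, rows_scanned = [], 0, 0
    for f in exprs:
        values = sorted(f(row) for row in rows)  # this job's own sort
        rows_scanned += len(rows)
        largest_sort = max(largest_sort, len(values))
        counts.append(sum(1 for _ in groupby(values)))
    return counts, largest_sort, rows_scanned


# Toy stand-in for: select count(distinct col1 % 10), count(distinct col2 % 9)
rows = [{"col1": i, "col2": i * 3} for i in range(100)]
exprs = [lambda r: r["col1"] % 10, lambda r: r["col2"] % 9]

print(single_job(rows, exprs))            # ([10, 3], 200)
print(one_job_per_distinct(rows, exprs))  # ([10, 3], 100, 200)
```

Both functions return the same distinct counts. In this toy model the single job sorts 200 records at once, while each per-expression job sorts only 100 at the price of scanning 200 rows in total. Varying the number of expressions and rows gives a first feel for the value of m at which one approach overtakes the other, although only a measurement on a real cluster would settle it.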