[ https://issues.apache.org/jira/browse/HIVE-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12712344#action_12712344 ]
Min Zhou commented on HIVE-503:
-------------------------------

Is the data you mentioned the intermediate data output by the mappers for shuffling? Could you explain why the amount produced by your second approach is m+n? I thought it might be mn as well.

By the way, I think you must also count the overhead of at least one join query when you finally merge your m distinct results into one table. The first approach does not need that.

> improvement on distinct: distinguish distinct aggregate function from distinct
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-503
>                 URL: https://issues.apache.org/jira/browse/HIVE-503
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Min Zhou
>
> h4.distinct
> # OK
> {code:sql}
> select
>   distinct col
> from
>   tbl
> {code}
> # FAILED
> {code:sql}
> select
>   distinct col1,
>   distinct col2
> from
>   tbl
> {code}
> h4.distinct aggregate function
> # OK
> {code:sql}
> select
>   count(distinct col % 10)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   count(distinct col1 % 10),
>   count(distinct col1 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   count(distinct col1 % 10),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   sum(distinct col1 % 10),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> # OK
> {code:sql}
> select
>   max(distinct substr(col1, 1, 10)),
>   count(distinct col2 % 9)
> from
>   tbl
> {code}
> The keyword "distinct" often produces more than one result, so it is impossible
> to remove the duplicates of two different columns in a single MapReduce job;
> that is why the query fails.
> The term "distinct aggregate function", of the form
> aggregate_function(distinct ....), is the counterpart of the term "all
> aggregate function"; it is essentially an aggregate function. Each aggregate
> function produces only one result, so it is quite possible for one MapReduce
> job to handle two or more different aggregate expressions simultaneously.
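
The point made in the issue description, that several distinct aggregates can share one pass because each collapses to a single value, can be illustrated with a small sketch. This is not Hive's implementation: the function name `distinct_aggregates` and the sample rows are hypothetical, and the code runs in a single process, so it ignores how a real MapReduce plan would combine the per-expression state across mappers and reducers.

```python
from typing import Iterable, Tuple


def distinct_aggregates(rows: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Hypothetical single-pass equivalent of:

        select sum(distinct col1 % 10), count(distinct col2 % 9) from tbl
    """
    seen_col1 = set()  # distinct values of col1 % 10
    seen_col2 = set()  # distinct values of col2 % 9
    for col1, col2 in rows:
        # Each expression keeps its own set of seen values, so the two
        # aggregates do not interfere with each other.
        seen_col1.add(col1 % 10)
        seen_col2.add(col2 % 9)
    # Each aggregate reduces its set to exactly one value.
    return sum(seen_col1), len(seen_col2)


if __name__ == "__main__":
    sample_rows = [(11, 9), (21, 18), (5, 4), (15, 13)]  # made-up data
    print(distinct_aggregates(sample_rows))  # prints (6, 2)
```

By contrast, `select distinct col1, distinct col2` would have to return two independent lists of values rather than one value per expression, which is why it does not fit the same single-row output shape.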