[ 
https://issues.apache.org/jira/browse/HIVE-6247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890075#comment-13890075
 ] 

Gunther Hagleitner commented on HIVE-6247:
------------------------------------------

Dug into this a little bit. I think the idea makes good sense, but the 
description of the MR behavior is not correct. At least, I wasn't able to make 
MR use anything other than a single reducer for the query cited. You can, 
though, rewrite the query using a subquery to get the result you want.
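
For illustration, one possible shape of that rewrite is sketched below. This 
is an assumption about the form, not a plan I have verified: the alias t is 
made up, and the inner GROUP BY is just one way to do the de-duplication so 
that it can be spread over several reducers before the final count.

{code}
-- Sketch only: de-duplicate in the subquery, then count the distinct values.
-- count(ss_net_profit) skips the NULL group, matching count(distinct ...).
select count(ss_net_profit) from (
  select ss_net_profit
  from store_sales ss join store s on ss.ss_store_sk = s.s_store_sk
  group by ss_net_profit
) t;
{code}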

There are two more flags to consider (when rewriting):

a) hive.optimize.reducededuplication.min.reducer

If this is set to 1, you will have a single reducer regardless of the rewrite.

b) hive.fetch.task.aggr

If this one is true, the final count will happen on the client. This is more 
important in MR than in Tez (because the final count would otherwise start a 
new job in MR; in Tez it's just another stage in the DAG).
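
As a rough sketch, the two flags could be set in the session before running 
the rewritten query. The values here are illustrative assumptions, not 
recommendations: anything above 1 for the first flag avoids the 
single-reducer case described in (a), and the second flag is shown enabled 
only to match the case described in (b).

{code}
-- Illustrative values only; adjust for your cluster.
-- Anything greater than 1 keeps reduce deduplication from forcing one reducer.
set hive.optimize.reducededuplication.min.reducer=4;
-- When true, the final count is done on the client instead of in another stage.
set hive.fetch.task.aggr=true;
{code}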

> select count(distinct) should be MRR in Tez
> -------------------------------------------
>
>                 Key: HIVE-6247
>                 URL: https://issues.apache.org/jira/browse/HIVE-6247
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 0.13.0
>            Reporter: Gopal V
>            Assignee: Gunther Hagleitner
>
> The MR query plan for "select count(distinct) " fires off multiple reducers, 
> with a local work task to perform final aggregation.
> The Tez version fires off exactly 1 reducer for the entire data-set which 
> chokes and dies/slows down massively.
> To reproduce on a TPC-DS database (meaningless query)
> {code}
> select count(distinct ss_net_profit) from store_sales ss join store s on 
> ss.ss_store_sk = s.s_store_sk;
> {code}
> This spins up Map 1, Map 2 (for the dim table + fact table) & Reducer 1 which 
> is always "0/1".



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
