Vighnesh Avadhani created HIVE-3640:
---------------------------------------

             Summary: Reducer allocation is incorrect if enforce bucketing and mapred.reduce.tasks are both set
                 Key: HIVE-3640
                 URL: https://issues.apache.org/jira/browse/HIVE-3640
             Project: Hive
          Issue Type: Bug
            Reporter: Vighnesh Avadhani
            Assignee: Vighnesh Avadhani
            Priority: Minor


When I enforce bucketing and fix the number of reducers via mapred.reduce.tasks, 
Hive ignores my setting and instead takes the largest value <= 
hive.exec.reducers.max that is also an even divisor of num_buckets. In other 
words, if I set 1024 buckets and set mapred.reduce.tasks=1024, I'll get... 256 
reducers. If I set 1997 buckets and set mapred.reduce.tasks=1997, I'll get... 
1 reducer.
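
For illustration, here is a minimal sketch of the allocation logic as described 
above; it is not the actual Hive source, and the class name and the max-reducers 
value of 300 (standing in for hive.exec.reducers.max) are hypothetical. With 1997 
buckets the only even divisor at or below any reasonable cap is 1, which matches 
the single-reducer outcome.

{code:java}
// Minimal sketch of the reducer-selection behavior described above:
// take the largest value <= hive.exec.reducers.max that evenly divides
// the number of buckets, ignoring mapred.reduce.tasks entirely.
public class BucketReducerSketch {
    static int pickReducers(int numBuckets, int maxReducers) {
        // Walk down from the cap looking for an even divisor of numBuckets.
        for (int r = Math.min(numBuckets, maxReducers); r >= 1; r--) {
            if (numBuckets % r == 0) {
                return r;
            }
        }
        return 1; // unreachable: 1 always divides numBuckets
    }

    public static void main(String[] args) {
        int maxReducers = 300; // hypothetical hive.exec.reducers.max
        System.out.println(pickReducers(1024, maxReducers)); // prints 256
        System.out.println(pickReducers(1997, maxReducers)); // prints 1 (1997 is prime)
    }
}
{code}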

This is completely unexpected behavior, and it gets far worse when the data 
inputs get large. In the latter case the bucketing job will almost certainly 
fail, because we will most likely try to push several TB of input through a 
single reducer. We also drastically reduce the effectiveness of bucketing, 
since the buckets themselves will be larger.

If the user sets mapred.reduce.tasks in a query that inserts into a bucketed 
table, we should either accept that value or raise an exception if it is 
invalid relative to the number of buckets. We should absolutely NOT override 
the user's direction and fall back on automatically allocating reducers based 
on obscure logic dictated by a completely different setting.
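
A minimal sketch of the proposed behavior follows; the helper is hypothetical 
(not an existing Hive API), and treating "invalid" as "does not evenly divide 
the bucket count" is an assumption about what the validity check should be.

{code:java}
// Sketch of the proposed behavior (hypothetical helper, not an existing Hive API):
// honor an explicit mapred.reduce.tasks and fail fast when it is incompatible
// with the declared bucket count. The divisibility requirement is an assumption.
public class ReducerValidationSketch {
    static int resolveReducers(int numBuckets, Integer userReduceTasks) {
        if (userReduceTasks == null || userReduceTasks <= 0) {
            // No explicit setting: illustrative default of one reducer per bucket.
            return numBuckets;
        }
        if (numBuckets % userReduceTasks != 0) {
            throw new IllegalArgumentException(
                "mapred.reduce.tasks=" + userReduceTasks
                    + " is not compatible with " + numBuckets + " buckets");
        }
        return userReduceTasks; // accept the user's explicit choice
    }
}
{code}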

I have yet to encounter a single person who expected this behavior the first 
time they saw it, so it's clearly a bug.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
