okumin opened a new pull request, #4432:
URL: https://github.com/apache/hive/pull/4432

   …of Tez
   
   ### What changes were proposed in this pull request?
   
   This PR adds a new threshold to enable/disable auto reducer parallelism 
based on the potential reduction.
   https://issues.apache.org/jira/browse/HIVE-23831
   
   ### Why are the changes needed?
   
   We introduced a heuristics to disable auto reducer parallelism in 
[HIVE-14200](https://issues.apache.org/jira/browse/HIVE-14200) when `{# of 
tasks} * {hive.tez.min.partition.factor}` is smaller than 1.0. It takes effect 
when the estimated number is smaller than 4 by default because the default 
`hive.tez.min.partition.factor` is 0.25.
   
   For example, if the estimated number is equal to 4, Tez tunes the final 
number between 1 and 8 because the minimum number can be 4 * 0.25 = 1.0, and 
the maximum number can be 4 * 2.0 = 8 as the default 
`hive.tez.max.partition.factor` is 2.0.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=150
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/8  
   INFO  : Map 1: 1/1   Reducer 2: 2/2
   ```
   
   If the estimated number is 3, Tez doesn't apply auto-reducer parallelism 
because `{# of tasks} * {hive.tez.min.partition.factor}` = `3 * 0.25` is 
smaller than 1.0.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=200
   ...
   INFO  : Map 1: 1/1   Reducer 2: 0(+1)/6      
   INFO  : Map 1: 1/1   Reducer 2: 6/6
   ```
   
   I understand we introduced 
[HIVE-14200](https://issues.apache.org/jira/browse/HIVE-14200) so that Hive on 
Tez can avoid overhead when the possible reduction is small. However, I know 
`{# of tasks} * {hive.tez.min.partition.factor} < 1.0` is a little overfitting 
the default values of `hive.tez.min.partition.factor` and 
`hive.tez.max.partition.factor`. For example, when we configure 
`hive.tez.min.partition.factor=0.05`, auto reducer parallelism is disabled when 
# of tasks is 19 or less. It means Hive on Tez misses a chance to tune 
parallelism between 1 and 38 in the worst case. I expect the case is worth 
trying auto reducer parallelism.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=30 --hiveconf 
hive.tez.min.partition.factor=0.05
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/34 
   INFO  : Map 1: 1/1   Reducer 2: 1(+1)/34     
   INFO  : Map 1: 1/1   Reducer 2: 13(+0)/34
   ```
   
   I guess a similar issue can happen when we give big 
`hive.tez.max.partition.factor`.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes if a user already configures `hive.tez.min.partition.factor` or 
`hive.tez.max.partition.factor`. If not, the behavior doesn't change.
   
   ### Is the change a dependency upgrade?
   
   No.
   
   ### How was this patch tested?
   
   I tested it on my local machine.
   
   No behaviors change with default parameters.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=150
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/8  
   INFO  : Map 1: 1/1   Reducer 2: 0(+1)/2
   ...
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=200
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/6  
   INFO  : Map 1: 1/1   Reducer 2: 0(+1)/6
   ```
   
   Auto reducer parallelism is enabled even when 
`hive.tez.min.partition.factor` is small.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=30 --hiveconf 
hive.tez.min.partition.factor=0.05
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/34 
   INFO  : Map 1: 1/1   Reducer 2: 0(+1)/12
   ```
   
   The threshold is tunable.
   
   ```
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=200 --hiveconf 
hive.tez.auto.reducer.parallelism.threshold=6
   ...
   INFO  : Map 1: 0(+1)/1       Reducer 2: 0/6  
   INFO  : Map 1: 1/1   Reducer 2: 2/2
   ...
   $ beeline -e 'SELECT item_id, sum(price) FROM orders GROUP BY item_id' 
--hiveconf hive.server2.in.place.progress=false --hiveconf 
hive.tez.auto.reducer.parallelism=true --hiveconf 
hive.exec.reducers.bytes.per.reducer=300 --hiveconf 
hive.tez.auto.reducer.parallelism.threshold=6
   ...
   INFO  : Map 1: 0/1   Reducer 2: 0/4  
   INFO  : Map 1: 1/1   Reducer 2: 4/4
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to