Gunther Hagleitner created HIVE-7158:
----------------------------------------

             Summary: Use Tez auto-parallelism in Hive
                 Key: HIVE-7158
                 URL: https://issues.apache.org/jira/browse/HIVE-7158
             Project: Hive
          Issue Type: Bug
            Reporter: Gunther Hagleitner
            Assignee: Gunther Hagleitner


Tez can optionally sample data from a fraction of the tasks of a vertex and use
that information to choose the number of downstream tasks for any given
scatter-gather edge.

Hive estimates the number of reducers by looking at stats and estimates for each
operator in the operator pipeline leading up to the reducer. However, if this
estimate turns out to be too large, Tez can rein in the resources used to
run the reducer stage.

It does so by combining partitions of the upstream vertex. It cannot, however, 
add reducers at this stage.
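
To make the combining step concrete, here is a minimal sketch in Python. It is
not Tez's actual algorithm; the function names, the bytes_per_task target, and
the even grouping of adjacent partitions are all assumptions made for
illustration. It only shows why the task count can shrink but never grow past
the number of upstream partitions.

```python
import math


def choose_task_count(sampled_bytes, sampled_fraction, bytes_per_task,
                      num_partitions, min_tasks):
    """Pick a downstream task count from a sample (hypothetical helper).

    The sample is extrapolated to a total size, divided by a desired amount
    of data per task, and clamped so that it never drops below min_tasks and
    never exceeds the number of partitions already produced upstream.
    """
    estimated_total = sampled_bytes / sampled_fraction
    wanted = math.ceil(estimated_total / bytes_per_task)
    return min(num_partitions, max(min_tasks, wanted))


def group_partitions(num_partitions, num_tasks):
    """Assign contiguous partitions to at most num_tasks tasks."""
    per_task = math.ceil(num_partitions / num_tasks)
    return [
        list(range(start, min(start + per_task, num_partitions)))
        for start in range(0, num_partitions, per_task)
    ]
```

For example, group_partitions(10, 4) hands partitions 0-2, 3-5, 6-8 and 9 to
four tasks. Because each task reads whole partitions, the count can only go
down from the number of partitions that were written.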

I'm proposing to let users specify whether they want to use auto-parallelism or
not. If they do, there will be scaling factors that determine the max and min
number of reducers Tez can choose from. We will then partition by max reducers,
letting Tez sample and rein in the count down to the specified min.
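
As a rough illustration of how the scaling factors could translate into bounds,
the sketch below derives the min and max reducer counts from Hive's estimate.
The function name, parameter names, and the rounding choices are assumptions,
not the implemented behavior.

```python
import math


def reducer_bounds(estimated_reducers, min_factor, max_factor):
    """Return (min_reducers, max_reducers) around Hive's estimate.

    Hypothetical helper: the data would be partitioned by max_reducers, and
    Tez would be allowed to shrink the count down to min_reducers.
    """
    if estimated_reducers < 1:
        raise ValueError("estimated_reducers must be at least 1")
    if not 0 < min_factor <= max_factor:
        raise ValueError("expected 0 < min_factor <= max_factor")
    max_reducers = max(1, math.ceil(estimated_reducers * max_factor))
    min_reducers = max(1, math.floor(estimated_reducers * min_factor))
    return min(min_reducers, max_reducers), max_reducers
```

With an estimate of 100 reducers and factors of 0.25 and 2.0, this gives a
range of 25 to 200: the upstream vertex writes 200 partitions, and Tez may
combine them into as few as 25 reducer tasks.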



--
This message was sent by Atlassian JIRA
(v6.2#6252)
