Gopal V created HIVE-17896:
------------------------------

             Summary: TopN: Create a standalone vectorizable TopN operator
                 Key: HIVE-17896
                 URL: https://issues.apache.org/jira/browse/HIVE-17896
             Project: Hive
          Issue Type: New Feature
          Components: Operators
    Affects Versions: 3.0.0
            Reporter: Gopal V


For TPC-DS Query27, the TopN operation is delayed by the group-by: the 
group-by operator buffers up all the rows, and 99% of them are only discarded 
later, in the TopN hash within the ReduceSink operator.

The RS TopN is very restrictive, as it only supports filtering on the 
shuffle keys. It would be better to do this filtering before the vectors are 
broken into rows and the isRepeating properties are lost.

Adding a TopN operator to the physical operator tree allows the following 
rewrite.

GBY->RS(Top=1)

can become 

TopN(1)->GBY->RS(Top=1)

This way, the TopN can remove rows before they are buffered into the GBY and 
consume memory.
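
As a rough illustration of the filtering idea (not Hive code, and not the 
proposed operator itself), here is a minimal row-at-a-time sketch. It assumes 
a single int key sorted ascending; the class name TopNKeyFilterSketch and its 
methods are hypothetical. Because a GBY follows, the sketch tracks the N 
smallest distinct keys seen so far and forwards every row whose key is 
currently among them, so no row of a surviving group is lost. A vectorized 
version would presumably apply the same test per batch rather than per row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: a streaming pre-filter placed in front of a group-by.
// It is conservative: it may forward rows that the final TopN later drops,
// but it never drops a row whose key could end up in the top N.
class TopNKeyFilterSketch {
    private final int n;
    // The N smallest distinct keys observed so far (ascending order assumed).
    private final TreeSet<Integer> topKeys = new TreeSet<>();

    TopNKeyFilterSketch(int n) {
        this.n = n;
    }

    // Returns true if a row with this key may still belong to the top N keys.
    boolean canForward(int key) {
        if (n <= 0) {
            return false;
        }
        if (topKeys.contains(key)) {
            return true;               // same group as a key already kept
        }
        if (topKeys.size() < n) {
            topKeys.add(key);          // still room for another distinct key
            return true;
        }
        if (key < topKeys.last()) {
            topKeys.pollLast();        // evict the current worst key
            topKeys.add(key);
            return true;
        }
        return false;                  // worse than all N kept keys: drop early
    }

    // Convenience helper: runs the filter over an array of keys.
    static List<Integer> filter(int n, int[] keys) {
        TopNKeyFilterSketch f = new TopNKeyFilterSketch(n);
        List<Integer> out = new ArrayList<>();
        for (int k : keys) {
            if (f.canForward(k)) {
                out.add(k);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] keys = {5, 3, 9, 3, 1, 7, 1, 5};
        System.out.println(filter(1, keys)); // prints [5, 3, 3, 1, 1]
    }
}
```

In this sketch 3 of the 8 input rows never reach the GBY. The rows for keys 
5 and 3 still get through because they arrived before a better key was seen, 
which is why the RS(Top=1) stays in the plan to do the final trimming.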

Here's the equivalent implementation in Presto:

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/TopNOperator.java#L35



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
