[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-17 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: HIVE-17896.3.patch

> TopN: Create a standalone vectorizable TopN operator
> 
>
> Key: HIVE-17896
> URL: https://issues.apache.org/jira/browse/HIVE-17896
> Project: Hive
>  Issue Type: New Feature
>  Components: Operators
>Affects Versions: 3.0.0
>Reporter: Gopal V
>Assignee: Teddy Choi
> Attachments: HIVE-17896.1.patch, HIVE-17896.3.patch
>
>
> For TPC-DS Query27, the TopN operation is delayed by the group-by: the 
> group-by operator buffers up all the rows before 99% of them are discarded 
> in the TopN Hash within the ReduceSink Operator.
> The RS TopN operator is very restrictive as it only supports doing the 
> filtering on the shuffle keys, but it is better to do this before breaking 
> the vectors into rows and losing the isRepeating properties.
> Adding a TopN operator in the physical operator tree allows the following to 
> happen.
> GBY->RS(Top=1)
> can become 
> TopN(1)->GBY->RS(Top=1)
> This way, the TopN operator can remove rows before they are buffered into the 
> GBY and consume memory.
> Here's the equivalent implementation in Presto
> https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/TopNOperator.java#L35
> Adding this as a sub-feature of GroupBy prevents further optimizations if the 
> GBY is on keys "a,b,c" and the TopN is on just "a".
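
The idea in the description above (dropping rows ahead of the GBY when their sort key can no longer reach the final top N) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patches: the class name TopNKeyFilterSketch and its methods are hypothetical, keys are plain strings sorted ascending, and rows are handled one at a time rather than as vectorized batches. Rows whose key equals an already kept key are always forwarded, because the group-by may group on more columns ("a,b,c") than the TopN key ("a").

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/**
 * Hypothetical sketch of a TopN key filter placed before a group-by.
 * It tracks the N smallest distinct keys seen so far and rejects rows
 * whose key is larger than all of them. It never rejects a row that
 * could still belong to the final top N, so a downstream RS(Top=N)
 * is still needed to produce the exact result.
 */
class TopNKeyFilterSketch {
    private final int topN;
    // The N smallest distinct keys observed so far, in ascending order.
    private final TreeSet<String> smallestKeys = new TreeSet<>();

    TopNKeyFilterSketch(int topN) {
        if (topN <= 0) {
            throw new IllegalArgumentException("topN must be positive");
        }
        this.topN = topN;
    }

    /** Returns true if a row with this key should be forwarded downstream. */
    boolean canForward(String key) {
        if (smallestKeys.contains(key)) {
            return true; // same key as a kept one: the GBY needs every such row
        }
        if (smallestKeys.size() < topN) {
            smallestKeys.add(key); // fewer than N distinct keys seen so far
            return true;
        }
        if (key.compareTo(smallestKeys.last()) < 0) {
            smallestKeys.pollLast(); // evict the current largest kept key
            smallestKeys.add(key);
            return true;
        }
        return false; // at least N smaller distinct keys exist already
    }

    /** Convenience helper: returns the keys of the rows that pass the filter. */
    static List<String> filter(List<String> keys, int topN) {
        TopNKeyFilterSketch sketch = new TopNKeyFilterSketch(topN);
        List<String> forwarded = new ArrayList<>();
        for (String key : keys) {
            if (sketch.canForward(key)) {
                forwarded.add(key);
            }
        }
        return forwarded;
    }
}
```

With topN = 2 and the key sequence c, a, b, a, d, c, the helper forwards c, a, b, a and drops the trailing d and c. Rows forwarded early (the first c here) may later turn out to be unnecessary; the filter only reduces what the GBY has to buffer and leaves the exact TopN to the ReduceSink.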



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-17 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: (was: HIVE-17896.3.patch)

> Attachments: HIVE-17896.1.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-17 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: HIVE-17896.3.patch

> Attachments: HIVE-17896.1.patch, HIVE-17896.3.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-17 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: (was: HIVE-17896.2.patch)

> Attachments: HIVE-17896.1.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-14 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: HIVE-17896.2.patch

HIVE-17896.2.patch is for collecting performance numbers. It sets the default 
value of hive.optimize.topn.key to true; the final patch will default it to false.

> Attachments: HIVE-17896.1.patch, HIVE-17896.2.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-12-14 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Status: Patch Available  (was: Open)

> Attachments: HIVE-17896.1.patch, HIVE-17896.2.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-11-28 Thread Teddy Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teddy Choi updated HIVE-17896:
--
Attachment: HIVE-17896.1.patch

HIVE-17896.1.patch is a work in progress. Context handling still needs to be 
added to VectorTopNKeyOperator.

> Attachments: HIVE-17896.1.patch


[jira] [Updated] (HIVE-17896) TopN: Create a standalone vectorizable TopN operator

2017-10-24 Thread Gopal V (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V updated HIVE-17896:
---
Description: 
For TPC-DS Query27, the TopN operation is delayed by the group-by: the 
group-by operator buffers up all the rows before 99% of them are discarded 
in the TopN Hash within the ReduceSink Operator.

The RS TopN operator is very restrictive as it only supports doing the 
filtering on the shuffle keys, but it is better to do this before breaking the 
vectors into rows and losing the isRepeating properties.

Adding a TopN operator in the physical operator tree allows the following to 
happen.

GBY->RS(Top=1)

can become 

TopN(1)->GBY->RS(Top=1)

This way, the TopN operator can remove rows before they are buffered into the 
GBY and consume memory.

Here's the equivalent implementation in Presto

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/TopNOperator.java#L35

Adding this as a sub-feature of GroupBy prevents further optimizations if the 
GBY is on keys "a,b,c" and the TopN is on just "a".

  was:
For TPC-DS Query27, the TopN operation is delayed by the group-by: the 
group-by operator buffers up all the rows before 99% of them are discarded 
in the TopN Hash within the ReduceSink Operator.

The RS TopN operator is very restrictive as it only supports doing the 
filtering on the shuffle keys, but it is better to do this before breaking the 
vectors into rows and losing the isRepeating properties.

Adding a TopN operator in the physical operator tree allows the following to 
happen.

GBY->RS(Top=1)

can become 

TopN(1)->GBY->RS(Top=1)

This way, the TopN operator can remove rows before they are buffered into the 
GBY and consume memory.

Here's the equivalent implementation in Presto

https://github.com/prestodb/presto/blob/master/presto-main/src/main/java/com/facebook/presto/operator/TopNOperator.java#L35

