[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2011-02-02 Thread Prajakta Kalmegh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989983#comment-12989983
 ] 

Prajakta Kalmegh commented on HIVE-1694:


Hi John,

We have the code ready for review. You can view it at 
.
Please find attached the diff 'HIVE-1694.1.patch' for the changes. We have 
taken the diff from the github hive repo  
on 30th Jan 2011. The last commit on github apache/hive before we took a diff 
was 
.

The rewrite must be enabled explicitly by setting the 
'hive.optimize.gbyusingindex' flag to true, as done in the 
'ql_rewrite_gbtoidx.q' test case. We have added the 'ql_rewrite_gbtoidx.q' 
file under ql/src/test/queries/clientpositive.


> Accelerate query execution using indexes
> 
>
> Key: HIVE-1694
> URL: https://issues.apache.org/jira/browse/HIVE-1694
> Project: Hive
>  Issue Type: New Feature
>  Components: Indexing, Query Processor
>Affects Versions: 0.7.0
>Reporter: Nikhil Deshpande
>Assignee: Nikhil Deshpande
> Attachments: HIVE-1694_2010-10-28.diff, demo_q1.hql, demo_q2.hql
>
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue 
> tracks supporting indexes in Hive compiler & execution engine for SELECT 
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query 
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for 
> certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of 
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose 
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold 
> the information about index based plans & operator implementations for above 
> mentioned cases. 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2011-01-18 Thread Prajakta Kalmegh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983184#action_12983184
 ] 

Prajakta Kalmegh commented on HIVE-1694:


Hey John,

We are maintaining the latest code on GitHub, on the 'prj' branch. You can 
access it here:


We have created a 'RewriteParseContextGenerator' class in the 
org.apache.hadoop.hive.ql.optimizer package to return a ParseContext instance 
(which contains all the information on the operator tree) when given a string 
command (query) as input. Since we only need the operator tree for our 
execution, we have created this basic utility class for our code.

The rest of the code is being reviewed internally by the team right now. We 
expect the review to be complete by the end of this week. We will update you 
once the code is ready for your review.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2011-01-17 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12982999#action_12982999
 ] 

John Sichi commented on HIVE-1694:
--

Is there an updated tree for this?  I checked github but didn't see it.  
HIVE-1644 needs support for compiling internally-generated SQL into operators, 
so if you have that working, I'd like to point the Harvey Mudd folks at it when 
I talk to them tomorrow.





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2011-01-03 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976882#action_12976882
 ] 

John Sichi commented on HIVE-1694:
--

Thanks Prajakta.  Let us know once you have new code ready to review.  
HIVE-1644 is going to need the internal SQL support too, so I'd like to make 
sure that as much as possible is reusable there.





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-12-31 Thread Prajakta Kalmegh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12976380#action_12976380
 ] 

Prajakta Kalmegh commented on HIVE-1694:


Thanks to both of you for your comments on our proposed design. Since the last 
post, we have been working on the code changes as per your comments. The 
progress has been in the following areas:
1) Removed the requirement that our optimizer run first. It can now be used 
like any other optimizer by adding it to the "transformations" list.
2) Implemented changes to restructure the operator DAG plan for group-by 
queries.
3) Removed the optimization's dependency on the QB (query block). Earlier, it 
read data from the QB to check whether the optimization could be applied 
before proceeding with the rewrite. (See the canApply() method in the original 
rewrite code.)
4) Regarding issue #3 (from my original post), as per John's suggestion, the 
operator row schemas/resolvers are now modified wherever applicable.
5) Completed testing of the new implementation for simple group-by cases. The 
code that appends a sub-query to the original DAG is implemented separately 
as of now and still needs to be integrated into our optimization.

The only issue still pending after this implementation relates to John's post 
of Nov 1st stating "...we store only the distinct block offsets, not the 
distinct row offsets.". We plan to work on this once the current 
implementation is tested end-to-end. You can expect an update on this in a 
couple of weeks.




[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-12-10 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970245#action_12970245
 ] 

Namit Jain commented on HIVE-1694:
--

I think having a mechanism which lets us issue "internal" or "recursive" sql is 
better in the long term.
That is something we will need anyway for future optimizations.

We can create a thin API around SemanticAnalyzer (analyze etc.), which is 
indirectly present in Driver.
Another implementation of that API can be the internal API, say RecursiveDriver.
In a recursive context, you are only allowed to invoke RecursiveDriver. 
External Clients (CliDriver, HiveServer etc.) invoke Driver directly.
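
The split between an external and an internal entry point could look roughly like the sketch below. It is only an illustration in Python, not Hive code: `analyze`, `CompileContext`, `index_rewrite` and the table name `idx_table` are hypothetical, and the toy "plan" is a dict rather than an operator DAG. `Driver` and `RecursiveDriver` are borrowed from the comment above as names only.

```python
def analyze(sql):
    """Stand-in for semantic analysis; returns a toy plan, not a real operator DAG."""
    return {"sql": sql, "subplans": []}


class CompileContext:
    """Hypothetical shared state recording whether a compilation is in progress."""

    def __init__(self):
        self.compiling = False


class Driver:
    """External entry point (what a CLI or server client would call)."""

    def __init__(self, ctx):
        self.ctx = ctx

    def compile(self, sql, optimizer=None):
        if self.ctx.compiling:
            # Inside a compilation only the recursive entry point is allowed.
            raise RuntimeError("recursive context: use RecursiveDriver")
        self.ctx.compiling = True
        try:
            plan = analyze(sql)
            if optimizer is not None:
                optimizer(plan, RecursiveDriver(self.ctx))
            return plan
        finally:
            self.ctx.compiling = False


class RecursiveDriver:
    """Internal entry point for SQL generated during another compilation."""

    def __init__(self, ctx):
        self.ctx = ctx

    def compile(self, sql):
        if not self.ctx.compiling:
            raise RuntimeError("RecursiveDriver is only valid during a compilation")
        return analyze(sql)


def index_rewrite(plan, recursive_driver):
    """Hypothetical optimization that compiles internal SQL and attaches the result."""
    plan["subplans"].append(recursive_driver.compile("SELECT key FROM idx_table"))
```

Calling `Driver(CompileContext()).compile(query, optimizer=index_rewrite)` would then return a plan carrying one internally compiled sub-plan, while an attempt to call `Driver.compile` from inside the optimizer raises an error.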

As John said, definitely keep your optimizations pluggable. Currently, they are 
invoked as rule-based, 
but should be flexible enough to be invoked based on some costs in the future.




[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-12-09 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12970058#action_12970058
 ] 

John Sichi commented on HIVE-1694:
--

I talked to Namit, and he thinks there should be no relevant dependencies on 
the QB once we start on optimization, so letting it get out of sync with the 
operator DAG may not be an issue.  (I scanned the code in optimizer, and it 
seems like a few dependencies have crept in, but only for special cases like 
ANALYZE.)

For issue #1, you are proposing what I'll call the "internal SQL" approach, 
which is to construct an internal SQL expression (either in string or ASTNode 
form) and then partially analyze that (via SemanticAnalyzer), producing an 
operator DAG to be spliced into the main one.  For this approach, we would need 
to figure out how to make the relevant phases of SemanticAnalyzer modularized 
and invocable.

Alternately, the "direct construction" approach would be to attempt to 
construct the new operator subgraph directly via custom code targeted to the 
specific patterns you generate, and then splice that in.

I'm not sure which approach is better; Namit, any opinions?  The internal SQL 
approach definitely seems the most appropriate for the WHERE clause work being 
done by the Harvey Mudd team, since it produces a self-contained job to be run 
to produce the temp table containing the filtered block list.  But for GROUP 
BY, the direct construction approach may be cleaner.

For issue #2, it seems like this would happen automatically for the internal 
SQL approach (but this could also pollute the SemanticAnalyzer state to some 
extent).  The direct construction approach is the opposite:  it avoids 
polluting SemanticAnalyzer, but still might require modularizing some 
SemanticAnalyzer calls, e.g. for generating and registering the necessary 
aliases for index tables.

Regarding issue #3, that's already true for other optimizations such as 
projection pushdown (ColumnPruner), which modifies operator row 
schemas/resolvers; see for example ColumnPrunerProcFactory.pruneJoinOperator.  
So there shouldn't be anything new here.

Regarding the need to run your transformation first, it would be best to avoid 
this since a more advanced optimizer may want freedom in reordering 
transformations.  So instead of relying on information from the QB, analyze the 
relevant operator subgraph to decide whether your transformation is applicable. 
 This is the approach we expect to require for cost-based optimization.
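
As a rough illustration of deciding applicability from the operator subgraph instead of the QB, the Python sketch below matches a GBY-RS-GBY pattern directly on a toy operator tree. Everything in it is hypothetical (`Op`, `IndexGroupByRewrite`, `ColumnPrunerStub`, `optimize`); it is not Hive's optimizer API, and the rewrite itself is left as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy operator node; `kind` is a short label such as "TS", "GBY" or "RS"."""
    kind: str
    children: list = field(default_factory=list)


def chain(*kinds):
    """Build a linear operator tree from a sequence of labels and return its root."""
    ops = [Op(k) for k in kinds]
    for parent, child in zip(ops, ops[1:]):
        parent.children.append(child)
    return ops[0]


def walk(op):
    """Yield every operator reachable from `op`, parents before children."""
    yield op
    for child in op.children:
        yield from walk(child)


def has_group_by_pattern(root):
    """True if some GBY feeds an RS that in turn feeds another GBY."""
    for op in walk(root):
        if op.kind != "GBY":
            continue
        for rs in op.children:
            if rs.kind == "RS" and any(g.kind == "GBY" for g in rs.children):
                return True
    return False


class IndexGroupByRewrite:
    name = "index-gby"

    def can_apply(self, root):
        # The decision uses only the operator tree, never the query block.
        return has_group_by_pattern(root)

    def transform(self, root):
        return root  # placeholder: the actual rewrite is omitted in this sketch


class ColumnPrunerStub:
    name = "column-pruner"

    def can_apply(self, root):
        return True

    def transform(self, root):
        return root  # placeholder


def optimize(root, transformations):
    """Run each applicable transformation in the given order."""
    applied = []
    for t in transformations:
        if t.can_apply(root):
            root = t.transform(root)
            applied.append(t.name)
    return root, applied
```

Because `can_apply` inspects only the tree it is handed, the transformation can sit anywhere in the list, provided earlier transformations leave the pattern in place.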

Also, note that from a lineage perspective, it makes more sense for lineage to 
be derived prior to index transformation rather than subsequently.  Someone 
examining the lineage associated with an ETL job would typically be more 
interested in the logical source table from which it pulls (rather than from a 
physical index).





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-12-07 Thread Prajakta Kalmegh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969186#action_12969186
 ] 

Prajakta Kalmegh commented on HIVE-1694:


Hi,

I am Prajakta from Persistent Systems Ltd. and am working on the changes that 
John and Namit have suggested above along with Nikhil and Prafulla.
This is a design note about implementation of above review comments.

We're implementing the following changes as a single transformation in 
optimizer:
(1) Table replacement: involves modification of some internal members of 
TableScanOperator.
(2) Group-by removal: involves removing some operators (GBY-RS-GBY), where the 
GBY is done on both the mapper and reducer sides, and resetting the correct 
parent and child operators within the DAG.
(3) Sub-query insertion: involves creation of new DAG for sub-query and 
attaching it to the original DAG at an appropriate place.
(4) Projection modification: involves steps similar to (3).
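
Step (2) above can be pictured with the following Python sketch, which removes a GBY-RS-GBY chain from a toy linear plan and reconnects the neighbouring operators. The `Operator` class, the helper names and the plan shape TS-SEL-GBY-RS-GBY-SEL-FS are assumptions made for illustration; they are not Hive's operator classes.

```python
class Operator:
    """Toy stand-in for a plan operator; not Hive's actual Operator class."""

    def __init__(self, name):
        self.name = name
        self.parents = []
        self.children = []


def connect(parent, child):
    parent.children.append(child)
    child.parents.append(parent)


def splice_out(op):
    """Remove a single-parent, single-child operator and reconnect its neighbours."""
    (parent,) = op.parents
    (child,) = op.children
    parent.children[parent.children.index(op)] = child
    child.parents[child.parents.index(op)] = parent
    op.parents, op.children = [], []


def remove_group_by(start):
    """Remove the GBY-RS-GBY chain beginning at `start`; return False if it does not match."""
    ops_in_chain = [start]
    for _ in range(2):
        if len(ops_in_chain[-1].children) != 1:
            return False
        ops_in_chain.append(ops_in_chain[-1].children[0])
    if [op.name for op in ops_in_chain] != ["GBY", "RS", "GBY"]:
        return False
    if any(len(op.parents) != 1 or len(op.children) != 1 for op in ops_in_chain):
        return False
    for op in ops_in_chain:
        splice_out(op)
    return True


def names_from(op):
    """Follow the first child from `op` to a leaf and collect operator names."""
    names = [op.name]
    while op.children:
        op = op.children[0]
        names.append(op.name)
    return names


# Assumed plan shape for a simple group-by query with map-side aggregation.
ops = [Operator(n) for n in ("TS", "SEL", "GBY", "RS", "GBY", "SEL", "FS")]
for upstream, downstream in zip(ops, ops[1:]):
    connect(upstream, downstream)

removed = remove_group_by(ops[2])
```

After the call, following the plan from the table scan yields TS, SEL, SEL, FS. Table replacement (1) would additionally point the TS operator at the index table.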

We have implemented the above changes as a proof of concept. In this 
implementation, we have invoked this rule as the first transformation in the 
optimizer code so that lineage information is computed later as part of the 
Generator transformation. Another reason that we have applied this as the first 
transformation is that, as of now, the implementation uses the query block (QB) 
information to decide if the transformation can be applied for the input query 
(similar to the canApplyThisRule() method in the original rewrite code). 
Finally, to make the changes (3) and (4), we are modifying the operator DAG. 
However, we are not modifying the original query block (QB). Hence, this leaves 
the QB and the operator DAG out of sync.

The major issues in this implementation approach are:
1. The changes listed above require either modification of the operator DAG 
(in case of 2) or creation of a new operator DAG (in case of 3 and 4). Creating 
a new DAG requires some hacks in the SemanticAnalyzer code (as in the 
replaceViewReferenceWithDefinition() method, which uses ParseDriver() to do 
the same). However, the relevant methods (like genBodyPlan(...), 
genSelectPlan(...) etc.) are private, which makes it all the more difficult 
to implement changes (3) and (4).
2. Creating a new DAG requires creating all associated data structures, such 
as the QB and ASTNode, because this information is necessary to generate the 
operator DAG plan for the sub-queries. This approach would be very similar to 
what we are already doing in the rewrite, i.e. creating a new QB and ASTNode.
3. Since we are creating a new DAG and appending it to the enclosing query DAG, 
we also need to modify the row schemas and row resolvers for the operators.

One question that needs to be settled before finalizing the above approach is 
whether the cost-based optimizer, which is to be implemented in the future, 
will work on the query block or on the operator DAG. If it works on the 
operator DAG, then the implementation changes we listed here will have to be 
made. However, if the cost-based optimizer is to work on the query block, 
then we feel that the initial query rewrite engine code, which worked after 
semantic analysis but before plan generation, can be made to work with the 
cost-based optimizer. Your comments on the cost-based optimizer would be a 
valuable input.



[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-11-01 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927077#action_12927077
 ] 

John Sichi commented on HIVE-1694:
--

+1 to what Namit said.  Doing the rewrites at the relational algebra Operator 
level (similar to the way optimizer transformations such as predicate pushdown 
already work) will have two big advantages:

* more general (syntax-independent)

* much easier to maintain (as you noted in your presentation, the 
SemanticAnalyzer data structures can be very difficult to analyze and 
manipulate, whereas the Operator trees are a lot cleaner)

BTW, thanks for the very clear explanation of the work you've done so far.





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-11-01 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927073#action_12927073
 ] 

Namit Jain commented on HIVE-1694:
--

Had an offline discussion with John just now - I think he is also giving 
similar comments, so I will keep it very brief.

One concern is that all the optimizations should be at the operator level; 
that is, this should be treated as just another optimization.
As you correctly mentioned in your presentation, Hive does not currently 
support a cost-based optimizer, and all the optimizations will need to be 
consolidated in one place to help move to that model.

We are also thinking about moving the group-by skew handling into the 
optimizer (instead of the current approach in SemanticAnalyzer). Once all the 
optimizations are in a central place, it will be much easier to move to 
costing.

The Harvey Mudd folks currently are not looking at Group By optimizations for 
indexing, so, this will be extremely
useful for the whole community.





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-11-01 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927070#action_12927070
 ] 

John Sichi commented on HIVE-1694:
--

Hey guys, I haven't gone through all the code yet, but reading through the 
slides just now, there's one problem I should point out with using the existing 
compact indexes for aggregate rewrite.

Namely, we store only the distinct block offsets, not the distinct row offsets. 
 So, if the same key appears more than once within the same block, you'll get 
the wrong answer for COUNT.  One way to address this would be to compute the 
COUNT per index entry at the time we are building the index, and then SUM that 
later for aggregation.  But currently the compact index does not store that, so 
we would need to add it as a new index type.
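
The COUNT problem can be reproduced with a small Python model. The data, the key values and the structure names are made up for illustration, and the compact index is simplified to a mapping from key to the set of distinct block offsets.

```python
from collections import Counter, defaultdict

# Hypothetical base table: one (key, block_offset) pair per row.
rows = [
    ("1996-01-01", 0),
    ("1996-01-01", 0),    # same key in the same block as the previous row
    ("1996-01-01", 128),
    ("1996-02-15", 128),
]

# Correct answer: COUNT(*) ... GROUP BY key, computed on the base table.
true_counts = Counter(key for key, _ in rows)

# Compact-index-like structure: key -> distinct block offsets only.
compact_index = defaultdict(set)
for key, offset in rows:
    compact_index[key].add(offset)

# Naive rewrite: count index offsets per key. Rows that share a block collapse
# into one offset, so the result undercounts.
naive_counts = {key: len(offsets) for key, offsets in compact_index.items()}

# Suggested fix: store a row count per (key, block) entry when the index is
# built, then SUM those counts per key at query time.
counting_index = Counter((key, offset) for key, offset in rows)
summed_counts = Counter()
for (key, _offset), n in counting_index.items():
    summed_counts[key] += n
```

Here `naive_counts` gives 2 for "1996-01-01" where the base table has 3 rows, while `summed_counts` matches `true_counts`.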

One smaller item is that for the DISTINCT rewrite (slide 10), you still need to 
keep a DISTINCT on the rewritten query since the same l_shipdate may be 
repeated in the index table if it appears in multiple buckets.
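
A minimal Python model of the DISTINCT case, assuming (hypothetically) that the index table holds one entry per key per bucket; the bucket names and dates are invented for the example.

```python
# Hypothetical index table rows: (l_shipdate, bucket). The same date can have
# one entry for each bucket in which it occurs.
index_rows = [
    ("1996-01-01", "bucket_0"),
    ("1996-01-01", "bucket_1"),
    ("1996-02-15", "bucket_1"),
]

# Rewritten query without DISTINCT: a plain projection of the key column.
without_distinct = [shipdate for shipdate, _bucket in index_rows]

# Rewritten query that keeps DISTINCT on the index table.
with_distinct = sorted(set(without_distinct))
```

Without DISTINCT the date "1996-01-01" comes back twice; keeping DISTINCT returns each date once.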


> Accelerate query execution using indexes
> 
>
> Key: HIVE-1694
> URL: https://issues.apache.org/jira/browse/HIVE-1694
> Project: Hive
>  Issue Type: New Feature
>  Components: Indexing, Query Processor
>Affects Versions: 0.7.0
>Reporter: Nikhil Deshpande
> Attachments: demo_q1.hql, demo_q2.hql, HIVE-1694_2010-10-28.diff
>
>
> The index building patch (Hive-417) is checked into trunk, this JIRA issue 
> tracks supporting indexes in Hive compiler & execution engine for SELECT 
> queries.
> This is in ref. to John's comment at
> https://issues.apache.org/jira/browse/HIVE-417?focusedCommentId=12884869&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12884869
> on creating separate JIRA issue for tracking index usage in optimizer & query 
> execution.
> The aim of this effort is to use indexes to accelerate query execution (for 
> certain class of queries). E.g.
> - Filters and range scans (already being worked on by He Yongqiang as part of 
> HIVE-417?)
> - Joins (index based joins)
> - Group By, Order By and other misc cases
> The proposal is multi-step:
> 1. Building index based operators, compiler and execution engine changes
> 2. Optimizer enhancements (e.g. cost-based optimizer to compare and choose 
> between index scans, full table scans etc.)
> This JIRA initially focuses on the first step. This JIRA is expected to hold 
> the information about index based plans & operator implementations for above 
> mentioned cases. 




[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-10-30 Thread Prafulla Tekawade (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926675#action_12926675
 ] 

Prafulla Tekawade commented on HIVE-1694:
-

Here is a presentation that we (Nikhil and I) gave at Persistent Systems Ltd. 
regarding these changes:
http://www.slideshare.net/NikhilDeshpande/indexed-hive





[jira] Commented: (HIVE-1694) Accelerate query execution using indexes

2010-10-11 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919961#action_12919961
 ] 

John Sichi commented on HIVE-1694:
--

Note that we have a team of students from Harvey Mudd planning to work on this 
one.

