[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-30 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894221#action_12894221
 ] 

He Yongqiang commented on HIVE-417:
---

For the MySQL metastore upgrade, please refer to
http://wiki.apache.org/hadoop/Hive/IndexDev#Metastore_Upgrades


 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Fix For: 0.7.0

 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch, 
 hive-indexing-8-thrift-metastore-remodel.patch, hive-indexing.3.patch, 
 hive-indexing.5.thrift.patch, hive.indexing.11.patch, hive.indexing.12.patch, 
 hive.indexing.13.patch, idx2.png, indexing_with_ql_rewrites_trunk_953221.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893522#action_12893522
 ] 

John Sichi commented on HIVE-417:
-

Since another patch is needed, here are the review comments I mentioned above.

* Javadoc for Hive.createIndex needs parameters fixed

* Javadoc for HiveIndexHandler.analyzeIndexDefinition:  remove storageDesc]

* In HiveUtils.getIndexHandler:  the message should be "Error in loading index 
handler" rather than "Error in loading storage handler"

* GenericUDAFCollectSet @Description :  "with no duplication elements" should 
be "with duplicate elements eliminated"

* DDLSemanticAnalyzer.analyzeCreateIndex:  "hanlder" is misspelled

* Property AbstractIndexHandler.INDEX_COLS_KEY is never used; get rid of it?

* For HiveIndex.INDEX_TABLE_CREATETIME property name, spell out 
lastModifiedTime instead of lmt



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-29 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893890#action_12893890
 ] 

John Sichi commented on HIVE-417:
-

OK, testing lucky patch 13...


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-28 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1289#action_1289
 ] 

John Sichi commented on HIVE-417:
-

Thanks Yongqiang.  Looking at it now.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-28 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893402#action_12893402
 ] 

John Sichi commented on HIVE-417:
-

+1.  Will commit when tests pass.  I noticed a number of trivial issues (like 
Javadoc mismatches) which I'll put in a followup.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-28 Thread Joydeep Sen Sarma (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893455#action_12893455
 ] 

Joydeep Sen Sarma commented on HIVE-417:


I am waiting for a commit on HIVE-1408. That's probably gonna collide.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-28 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893461#action_12893461
 ] 

John Sichi commented on HIVE-417:
-

Thanks Joydeep.  Yeah, this one has tons of plan diffs due to the virtual 
columns.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-28 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893488#action_12893488
 ] 

John Sichi commented on HIVE-417:
-

Yongqiang, I passed tests on Hadoop 0.20, but Ning has committed HIVE-1408, 
which conflicts, so you'll need to rebase against that and then I'll try again.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892865#action_12892865
 ] 

John Sichi commented on HIVE-417:
-

Regarding the mkset function:  can we rename this to collect_array to hint that 
it is a UDAF?  The @Description should also make this clear.

Collect is the standard SQL name for this aggregate function, but the standard 
version returns a multiset rather than an array, so let's call it collect_array 
to be specific.

Also, it will need its own independent unit tests (open a followup JIRA issue 
for this).
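
To make the intended semantics concrete, here is a minimal sketch of an aggregate that collects a group's values into an array, assuming (as the old name mkset suggests) that duplicates are dropped. It is plain Java and does not use Hive's GenericUDAF API; the class CollectArraySketch and its methods iterate, merge, and terminate are hypothetical names, and skipping NULL inputs is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not Hive's GenericUDAF interface.
class CollectArraySketch {
    // Insertion-ordered set, so duplicates are dropped as values arrive.
    private final Set<String> values = new LinkedHashSet<>();

    // Called once per input row in the group. Assumption: NULLs are skipped.
    void iterate(String value) {
        if (value != null) {
            values.add(value);
        }
    }

    // Combines a partial aggregation computed elsewhere (e.g. by another task).
    void merge(CollectArraySketch other) {
        values.addAll(other.values);
    }

    // Returns the final result as an array-like list rather than a multiset.
    List<String> terminate() {
        return new ArrayList<>(values);
    }
}
```

The merge step is what makes this an aggregate (UDAF) rather than a per-row function: partial results from different tasks have to be combined before the final array is produced.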



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread Ashish Thusoo (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892932#action_12892932
 ] 

Ashish Thusoo commented on HIVE-417:


Started looking at this. One initial question I had - why is virtualcolumn 
class in the serde2 package?


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892937#action_12892937
 ] 

John Sichi commented on HIVE-417:
-

Another followup needed:  REBUILD should be propagating lineage and read/write 
info from the reentrant INSERT statement up to the top-level statement so that 
hooks get called with the right information.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread Ashish Thusoo (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892939#action_12892939
 ] 

Ashish Thusoo commented on HIVE-417:


Also, how is the file name populated? That is not done through the IOContext?


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892946#action_12892946
 ] 

He Yongqiang commented on HIVE-417:
---

@Ashish

> why is virtualcolumn class in the serde2 package?

I will move it to the ql.io package. I put it in the serde2 package only because 
I thought it might be needed by the serde layer. Since the code is almost done 
and the class is not accessed by serde, it makes sense to move it to ql.

> how is the file name populated

The filename and block offset are both populated by the record reader. The 
filename is populated by looking at the split path when we construct the record 
reader. The offset is generated at runtime by the record reader.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-27 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892947#action_12892947
 ] 

He Yongqiang commented on HIVE-417:
---

IOContext is just a container; HiveContextAwareRecordReader is responsible for 
filling it with actual values.
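
The container/filler split described here can be sketched roughly as below. This is an illustration, not the code in the patch: IoContextSketch and ContextAwareReaderSketch are hypothetical stand-ins for IOContext and HiveContextAwareRecordReader, and the field names are assumptions.

```java
// Hypothetical stand-in for a per-thread I/O context; not Hive's actual IOContext.
class IoContextSketch {
    // Private thread-local: callers must go through get().
    private static final ThreadLocal<IoContextSketch> THREAD_LOCAL =
        ThreadLocal.withInitial(IoContextSketch::new);

    private String inputFile;
    private long currentBlockStart;

    static IoContextSketch get() {
        return THREAD_LOCAL.get();
    }

    String getInputFile() {
        return inputFile;
    }

    void setInputFile(String inputFile) {
        this.inputFile = inputFile;
    }

    long getCurrentBlockStart() {
        return currentBlockStart;
    }

    void setCurrentBlockStart(long currentBlockStart) {
        this.currentBlockStart = currentBlockStart;
    }
}

// Hypothetical reader that fills the context; the container itself stays passive.
class ContextAwareReaderSketch {
    // The file name is taken from the split path when the reader is constructed.
    ContextAwareReaderSketch(String splitPath) {
        IoContextSketch.get().setInputFile(splitPath);
    }

    // The block offset is updated at runtime as the reader advances.
    void advanceTo(long blockOffset) {
        IoContextSketch.get().setCurrentBlockStart(blockOffset);
    }
}
```

Any operator running on the same thread can then read the file name and offset through IoContextSketch.get() without knowing which reader produced them.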


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-26 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892413#action_12892413
 ] 

John Sichi commented on HIVE-417:
-

Yongqiang, I looked at hive.indexing.10.patch, but I don't see the virtual 
columns in there?


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-26 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892594#action_12892594
 ] 

John Sichi commented on HIVE-417:
-

First pass of review comments on latest patch (I'll probably have more 
tomorrow).

* INDEX_NAME precision in the metastore should be 128 characters (not 767), 
following convention for other identifiers

* I don't think we need INDEX_TABLE_NAME at all in the metastore; it should 
only be used during CREATE INDEX and then forgotten

* Move HiveIndexInputFormat and HiveIndexResult to package 
org.apache.hadoop.hive.ql.index.compact, and add Compact in their names (I'd 
still prefer to move this entire package out to a new subproj, but I guess we 
can skip that part now since most of the code went away with the virtual column 
approach); rename property hive.exec.index_file to hive.index.compact.file

* Support WITH DEFERRED REBUILD, and require this to be specified for now to 
avoid confusion (per discussion in design meeting)

* when generating reentrant INSERT, need to quote identifiers such as 
table/column names (use HiveUtils.unparseIdentifier), and may need extra 
escaping for special characters in getPartKVPairStringArray (I'm not 
sure--check with Paul)

* thread_local should be private (and named threadLocal); go through public 
IOContext.get() instead; likewise use public getter/setter methods on IOContext 
instead of accessing its data members directly

* need ORDER BY in virtual_column.q

* remove extra semicolon in other ORDER BY's, and make sure they cover a unique 
key in all cases

* don't need TYPE and UPDATE as keywords in grammar
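
As an illustration of the identifier-quoting comment above, here is a minimal sketch of building a statement with every table and column name quoted. The helper quoteIdentifier is a hypothetical stand-in, not HiveUtils.unparseIdentifier; doubling embedded backticks is an assumption of this sketch, and the statement shape is deliberately simplified compared with what REBUILD would really generate.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration; names and escaping rules are assumptions.
class ReentrantInsertSketch {
    // Wraps an identifier in backticks so keywords and odd characters survive parsing.
    static String quoteIdentifier(String identifier) {
        return "`" + identifier.replace("`", "``") + "`";
    }

    // Builds a simplified INSERT that never embeds a raw identifier.
    static String buildInsert(String indexTable, String baseTable, List<String> columns) {
        String columnList = columns.stream()
            .map(ReentrantInsertSketch::quoteIdentifier)
            .collect(Collectors.joining(", "));
        return "INSERT OVERWRITE TABLE " + quoteIdentifier(indexTable)
            + " SELECT " + columnList
            + " FROM " + quoteIdentifier(baseTable);
    }
}
```

Without the quoting, a column named after a keyword (such as order) would make the generated statement fail to parse even though the user's original CREATE INDEX was valid.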


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-26 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892613#action_12892613
 ] 

John Sichi commented on HIVE-417:
-

Whoops, forgot two leftovers from a private diff review:

* metastore/if/hive_metastore.thrift:102 instead of including the full 
indexTable structure inside the Index structure, can we omit it but then pass 
it as an additional parameter to add_index?

* 
ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java:86 
Move generic partition analysis out into Hive, since it will be the same for 
all plugins.

We can talk more about these tomorrow if it's not clear.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-22 Thread Ning Zhang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891274#action_12891274
]

Ning Zhang commented on HIVE-417:
-

Based on some internal discussions, here are some comments about the design doc:

1) The staleness (inconsistency) between the index and the base table should be
addressed more precisely.
Since the current implementation allows the user to query the index table
directly, we should guarantee that the index is consistent with the base table
at query time. This means that at the query START time, the index was built
completely from the data stored in the base table. The current design does
not satisfy this criterion: it only records the last_modification_time
(LMT) of the base table and of the index table, and checks whether the latter is
larger than the former. The following sequence breaks that check:

timestamp0: last update of partition P1
timestamp1: start create index on partition P1
timestamp2: start insert overwrite P1
timestamp3: finish insert overwrite P1
timestamp4: finish index creation on P1
timestamp5: query on P1

The LMTs of the index and the base table are timestamp4 and timestamp3
respectively, so the optimizer will conclude that the index is consistent with
the base table. However, the index was built from data that is already stale by
timestamp5, so the index should not be used.

Instead of recording the LMT of the index table, we probably should record the
LMT of the base table in the index metadata at the beginning of the index
creation. In the above example, the timestamp recorded in the index metadata
should be timestamp0. This means the index was created based on the base table
at timestamp0. At query time, we should check timestamp0 against timestamp3,
which correctly concludes that the index is stale.

BTW, all the timestamps should come from some centralized clock such as the
DFS directory update time (from the namenode).
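
The difference between the two checks can be sketched as follows. This is an illustration of the argument in point 1, not code from the patch; the class and method names are hypothetical, and timestamps are modeled as plain long values with timestampN represented by N.

```java
// Hypothetical illustration of the two freshness checks discussed in point 1.
class IndexFreshnessSketch {
    // Check in the current design: compare the index table's own LMT with the
    // base table's LMT. This is fooled when the base table is overwritten
    // while the index build is still running.
    static boolean looksFreshByIndexLmt(long indexLmt, long baseTableLmt) {
        return indexLmt > baseTableLmt;
    }

    // Proposed check: record the base table's LMT when the index build starts,
    // and treat the index as fresh only if the base table has not changed since.
    static boolean isFreshBySnapshot(long baseLmtAtBuildStart, long currentBaseLmt) {
        return currentBaseLmt <= baseLmtAtBuildStart;
    }
}
```

With the timeline above, looksFreshByIndexLmt(4, 3) accepts the index even though it was built from the timestamp0 data, while isFreshBySnapshot(0, 3) rejects it.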

2) The above consistency problem is not limited to the case of DEFERRED
REBUILD. Even if the index rebuild starts right after INSERT OVERWRITE,
there is still a time window in which the index is stale (before the index
creation is complete). So we need the same mechanism to detect stale indexes.

3) I think lock-based concurrency may not be the best choice either. If the
index creation takes a long time, it delays the availability of the base table.
If we have the optimizer, we should always query against the base tables, and
let the optimizer figure out whether an index is available and fresh. So if
an index creation is not finished, we can just use the base table; otherwise we
can use the index if its cost is lower.

4) Another case is when the index creation has finished and a query is using
the index, and then a DML statement runs on the base table and finishes before
the query finishes. Here we only guarantee snapshot consistency (results
consistent with the data at the beginning of the query, not after the query).

5) If we have the mechanism to check consistency of the index, then the index
rebuild command could just return if the index is consistent. We can also
allow a force option in case we need to compensate for bad metadata.
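
Points 3 and 5 can be sketched together as below: the optimizer falls back to the base table whenever the index is unfinished or stale, and a rebuild is a no-op for a consistent index unless forced. All names here are hypothetical, the cost comparison is reduced to two numbers, and freshness is modeled as "the base table's LMT has not moved past the LMT recorded when the index build started", which is an assumption carried over from point 1.

```java
// Hypothetical illustration of points 3 and 5; not code from the patch.
class IndexPlanningSketch {
    enum Source { BASE_TABLE, INDEX }

    enum RebuildAction { SKIPPED, REBUILT }

    // Fresh means the base table has not changed since the index build started.
    static boolean isFresh(long baseLmtAtBuildStart, long currentBaseLmt) {
        return currentBaseLmt <= baseLmtAtBuildStart;
    }

    // Point 3: no locking; use the index only if it is built, fresh, and cheaper.
    static Source chooseSource(boolean indexBuilt, long baseLmtAtBuildStart,
                               long currentBaseLmt, double indexCost, double baseCost) {
        boolean usable = indexBuilt && isFresh(baseLmtAtBuildStart, currentBaseLmt);
        return (usable && indexCost < baseCost) ? Source.INDEX : Source.BASE_TABLE;
    }

    // Point 5: rebuild returns early for a consistent index unless force is set.
    static RebuildAction rebuild(long baseLmtAtBuildStart, long currentBaseLmt, boolean force) {
        if (isFresh(baseLmtAtBuildStart, currentBaseLmt) && !force) {
            return RebuildAction.SKIPPED;
        }
        // A real implementation would run the index build here.
        return RebuildAction.REBUILT;
    }
}
```

The force flag exists only to recover from bad metadata: it bypasses the freshness test rather than changing how freshness is computed.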


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-19 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890137#action_12890137
 ] 

John Sichi commented on HIVE-417:
-

Preliminary draft of design doc is here

http://wiki.apache.org/hadoop/Hive/IndexDev

Yongqiang and I are still working out some of the details.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-16 Thread John Sichi (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889345#action_12889345
]

John Sichi commented on HIVE-417:
-

Here are some preliminary comments on the metastore work. We can move on to
the plugin design next week and start getting all of this into a doc.

* We should support a property on the index which controls the name of the
index table, and only generate an index table name automatically in the case
where the user doesn't supply the property. For this, we'll need to add
property key/values to the grammar (IDXPROPERTIES like TBLPROPERTIES and
SERDEPROPERTIES?).

* The grammar supports control over the tableFileFormat for the index table;
what about other attributes such as row format, location, and TBLPROPERTIES?
Some of these may be dictated by the index implementation, but it may be useful
to override in some cases (same as tableFileFormat).

* Is the partitioning for the index independent of the partitioning for the
table? Don't we need to allow control over this in the grammar?

* I think we should track the status of the index (when was the last time it
was rebuilt, if ever) so that we know whether it is fresh with respect to the
base table data. How should we model this in such a way that it takes
per-partition indexing into account?

* Some metastore followups to be logged separately: COMMENT clause on index
definition; DESCRIBE INDEX; SHOW INDEXES; dealing with base table columns being
dropped/renamed out from under the index

* For generating the index table structure, we'll need to move that to plugin
(rather than in Hive.java), since each index will need a different table
structure (or no table structure at all).

* Test queries: remember to add ORDER BY for determinism. Also, I'm not sure
whether it is safe to use /tmp in the local file system (it may not exist, e.g.
on Windows). I used it in hbase_bulk.m, but that uses a mini HDFS cluster (not
the local file system).

* Dropping a table with an index on it currently gives the exception below (in
Derby; I didn't test MySQL yet). Same for attempting to drop an index table
directly (instead of dropping the index). The second case should either fail
with a meaningful exception, or implicitly drop the index definition as a
trigger from dropping the table.

hive> create table t1(i int);
OK
hive> create index q type compact on table t1(i);
OK
hive> drop table t1;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown
flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of
foreign key constraint 'INDEXS_FK3' for key (12). The statement has been
rolled back.
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask

hive> create table t5(i int);
OK
hive> create index r type compact on table t5(i);
OK
hive> drop table default__t5_r__;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown
flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of
foreign key constraint 'INDEXS_FK2' for key (17). The statement has been
rolled back.
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask
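
The first review bullet in this comment (a user-supplied index table name, with automatic generation as the fallback) can be sketched as follows. The property key index.table.name is a hypothetical placeholder, and the generated-name pattern is only inferred from the default__t5_r__ name in the session above; neither is confirmed by the patch.

```java
import java.util.Map;

// Hypothetical illustration; the property key and name pattern are assumptions.
class IndexTableNameSketch {
    // Placeholder key; the real property name was still undecided in this thread.
    static final String NAME_PROPERTY = "index.table.name";

    // Uses the user's name when supplied, otherwise generates one.
    static String resolve(String database, String baseTable, String indexName,
                          Map<String, String> indexProperties) {
        String supplied = indexProperties.get(NAME_PROPERTY);
        if (supplied != null && !supplied.isEmpty()) {
            return supplied;
        }
        // Pattern inferred from "default__t5_r__": db__table_index__
        return database + "__" + baseTable + "_" + indexName + "__";
    }
}
```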


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-16 Thread He Yongqiang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889376#action_12889376
]

He Yongqiang commented on HIVE-417:
---

Thanks for the detailed comments.

bq. We should support a property on the index which controls the name of the
index table, and only generate an index table name automatically in the case
where the user doesn't supply the property.

Will add this in the following patch.

bq. For this, we'll need to add property key/values to the grammar (IDXPROPERTIES
like TBLPROPERTIES and SERDEPROPERTIES?).

Let's do it in a followup JIRA.

bq. The grammar supports control over the tableFileFormat for the index table;
what about other attributes such as row format, location, and TBLPROPERTIES?
Some of these may be dictated by the index implementation, but it may be
useful to override in some cases (same as tableFileFormat).

We can add this when we see the requirement. For now we can leave this out.

bq. I think we should track the status of the index (when was the last time it
was rebuilt, if ever) so that we know whether it is fresh with respect to the
base table data. How should we model this in such a way that it takes
per-partition indexing into account?

I think this is the same as the key/value property above, no?

bq. Test queries: remember to add ORDER BY for determinism.

Will add this in the following patch.

bq. Also, I'm not sure whether it is safe to use /tmp in the local file system
(it may not exist, e.g. on Windows). I used it in hbase_bulk.m, but that uses
a mini HDFS cluster (not the local file system).

I think it should be OK because it's not the local /tmp; it's the mini HDFS /tmp.

bq. Dropping a table with an index on it currently gives the exception below (in
Derby; I didn't test MySQL yet). Same for attempting to drop an index table
directly (instead of dropping the index). The second case should either fail
with a meaningful exception, or implicitly drop the index definition as a
trigger from dropping the table.

Actually, this was reported by Prafulla offline. Will fix this in the following
patch. For the second case, I am planning to report an error.
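
As a rough illustration of the "report an error" option for the second case, the check could look something like the sketch below. This is not the patch's code: the class and method names are hypothetical, and the only detail taken from this thread is the index table naming seen in the transcript (index r on t5 is stored in default__t5_r__).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Hive code: reject a direct drop of an index table.
final class DropTableCheckSketch {
    // Maps an index table name to the base table it indexes.
    private final Map<String, String> indexTableToBase = new HashMap<>();

    // Naming follows the transcript: index r on t5 is stored in default__t5_r__.
    void addIndex(String baseTable, String indexName) {
        indexTableToBase.put("default__" + baseTable + "_" + indexName + "__", baseTable);
    }

    // Throws a descriptive error instead of letting the datastore raise a
    // foreign key violation.
    void checkDroppable(String table) {
        String base = indexTableToBase.get(table);
        if (base != null) {
            throw new IllegalStateException(table + " is an index table on " + base
                + "; drop the index instead");
        }
    }
}
```

The sketch only models the second case (dropping the index table directly). Dropping a base table that still has indexes would need its own handling, for example dropping the index definitions first.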


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-15 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888946#action_12888946
 ] 

John Sichi commented on HIVE-417:
-

Whoops, relationships connecting TBLS/SDS and IDXS/SDS got lost; will attach 
another diagram which fixes that.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-09 Thread Jeff Hammerbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886634#action_12886634
 ] 

Jeff Hammerbacher commented on HIVE-417:


Hey,

Any chance you guys could post a more detailed design document for 
full-fledged index support? I'm quite curious to read up on it.

Thanks,
Jeff


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-09 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886764#action_12886764
 ] 

John Sichi commented on HIVE-417:
-

@Jeff:  Yes, we'll put it up on the wiki, similar to how we did for storage 
handler + HBase.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread Prafulla Tekawade (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886308#action_12886308
 ] 

Prafulla Tekawade commented on HIVE-417:


Hi Yongqiang,
I am facing a problem creating SUMMARY indexes.
The index is not built by the update index command.
The COMPACT SUMMARY index works fine. Is there a problem with
creation of the SUMMARY index table?



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886380#action_12886380
 ] 

He Yongqiang commented on HIVE-417:
---

I think the SUMMARY index's mapper code is commented out in the uploaded patch.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-08 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886587#action_12886587
 ] 

John Sichi commented on HIVE-417:
-

Based on discussion with Yongqiang, we've decided to go for Full-fledged index 
support.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-02 Thread Ashish Thusoo (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884685#action_12884685
]

Ashish Thusoo commented on HIVE-417:

Looked at the code and have some questions...

Can you explain how the metastore object model is laid out? It seems that the
table names of the index are stored in key value properties of the table that
the index is created on. Is that correct? Would it be better to put a key
reference from the index table to the base table instead (similar to what is
done for partitions)?

Also, how would this be used to query the table? Can you give an example?

Is the idea here to select from the index and then pass the offsets to another
query to look up the table? An example or a test which shows the query on the
base table would be useful.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-02 Thread John Sichi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884869#action_12884869
 ] 

John Sichi commented on HIVE-417:
-

Had a chat with Ashish and Yongqiang offline, and came up with three 
alternatives.

1)  Shortest path to checkin:  Treat current code as prototype and move it 
into contrib, providing a utility for creating/updating the index, and keeping 
changes to core classes to a minimum.  As Yongqiang pointed out, this makes it 
harder to follow up with automatic use of the index due to the lack of 
metadata.  If we do this, we should create a new JIRA issue for its limited 
scope.

2) Full-fledged index support:  change the JDO metamodel to add support for 
indexes as first class objects, and come up with a pluggable index 
creation+access design framework which can encompass a variety of index types 
likely to be needed in the future.  Code from this patch would become the first 
such index implementation provided.  If we do this, we should continue on in 
this truly epic JIRA issue.

3) Rework as materialized view:  keep the JDO metamodel as is (adding a new 
table type for MATERIALIZED_VIEW) but change the DDL to CREATE MATERIALIZED 
VIEW AS SELECT ... and then come up with the system functions needed (e.g. for 
accessing file offsets) in order to be able to express the index construction 
as SQL.  We would then execute view materialization in a fashion similar to 
CREATE TABLE AS SELECT.  This approach best reflects the way the current code 
models an index as an ordinary table, but requires some other changes (e.g. 
CTAS + dynamic partitioning, something we want anyway).  If we do this, we 
should create a new JIRA issue since it's a different feature from the user POV.

We're aiming to reach a decision next week; input is welcome on whether these 
alternatives make sense (and on others we should consider).

Since this JIRA issue is already so overloaded, we would also like to treat the 
following two items as separate followup JIRA issues rather than trying to 
address it all at once:

* rewrite framework
* automatic usage of index or materialized view by optimizer
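
To make alternative 2 a little more concrete, here is a minimal sketch of an index as a first-class metadata object with a pluggable handler per index type. All type names below are hypothetical and are not the classes from the patch; the index table naming is borrowed from the default__t5_r__ example elsewhere in this thread.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical types for illustration; none of these are Hive classes.

/** An index as a first-class metadata object that references its base table. */
final class IndexDefSketch {
    final String indexName;
    final String baseTable;
    final List<String> columns;
    final String handlerType;

    IndexDefSketch(String indexName, String baseTable, List<String> columns,
                   String handlerType) {
        this.indexName = indexName;
        this.baseTable = baseTable;
        this.columns = columns;
        this.handlerType = handlerType;
    }
}

/** Plug-in point: each index type supplies its own handler. */
interface IndexHandlerSketch {
    String indexTableName(IndexDefSketch def);
}

final class CompactHandlerSketch implements IndexHandlerSketch {
    @Override
    public String indexTableName(IndexDefSketch def) {
        return "default__" + def.baseTable + "_" + def.indexName + "__";
    }
}

/** Looks up the handler registered for an index definition's type. */
final class IndexRegistrySketch {
    private final Map<String, IndexHandlerSketch> handlers = new HashMap<>();

    void register(String type, IndexHandlerSketch handler) {
        handlers.put(type, handler);
    }

    IndexHandlerSketch handlerFor(IndexDefSketch def) {
        IndexHandlerSketch handler = handlers.get(def.handlerType);
        if (handler == null) {
            throw new IllegalArgumentException(
                "No index handler for type " + def.handlerType);
        }
        return handler;
    }
}
```

Because the definition carries a direct reference to its base table, the metadata needed for later automatic use of the index would be available without reading key/value properties off the base table.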



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-01 Thread Namit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884404#action_12884404
 ] 

Namit Jain commented on HIVE-417:
-

A few higher-level comments:

1. Populate the index at create index.
2. Instead of proposing a new syntax, why don't we use 'alter index INDEX_NAME 
ON TABLE_NAME REBUILD;'?
3. Since the code is in a prototype stage, can we move the index code to 
contrib?



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-01 Thread Jeff Hammerbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884431#action_12884431
 ] 

Jeff Hammerbacher commented on HIVE-417:


bq. 3. Since the code is in a prototype stage, can we move the index code to 
contrib ?

It's been the experience of other Hadoop-related projects that contrib gets 
messy. It has proven effective to either keep experimental features in mainline 
trunk or to put them up on github.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-07-01 Thread Namit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884434#action_12884434
 ] 

Namit Jain commented on HIVE-417:
-

if (work.getReducer() != null) {
  work.getReducer().jobClose(job, success, feedBack);
}

if (IndexBuilderBaseReducer.class.isAssignableFrom(
    this.getReducerClass())) {
  this.closeIndexBuilder(job, success);
}


Instead of the above code in ExecDriver,  
IndexBuilderBaseReducer/CompactSumReducer should have a jobClose - no code
change needed in ExecDriver.

I would still vote for the index code to be in contrib, it will take some time 
to clean it up - then it should be moved to the mainline.
Till then, it is usable, but in a prototype state.

What we should aim for is minimum changes in ql/. and put all changes in 
contrib for now. As they become stable, we can pull them
in - even the DDLSemanticAnalyzer should be factored in contrib
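
A minimal sketch of the suggestion, assuming a reducer base class with an overridable jobClose hook. The types here are simplified, hypothetical stand-ins and not the real ExecDriver or reducer classes; the point is only that the driver calls one method and the index-specific cleanup lives in the subclass.

```java
// Hypothetical stand-ins for the classes named in the comment; not the Hive API.
abstract class ReducerSketch {
    // Called once by the driver when the job finishes.
    void jobClose(boolean success, StringBuilder log) {
        log.append("base-close;");
    }
}

class IndexBuilderReducerSketch extends ReducerSketch {
    @Override
    void jobClose(boolean success, StringBuilder log) {
        super.jobClose(success, log);
        // Index-specific cleanup lives here, not in the driver.
        log.append(success ? "index-commit;" : "index-rollback;");
    }
}

final class DriverSketch {
    // The driver needs no knowledge of index reducers: no class check.
    static String finish(ReducerSketch reducer, boolean success) {
        StringBuilder log = new StringBuilder();
        if (reducer != null) {
            reducer.jobClose(success, log);
        }
        return log.toString();
    }
}
```

With this shape, adding another reducer type that needs cleanup at job close would not require touching the driver.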


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-30 Thread Namit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884067#action_12884067
 ] 

Namit Jain commented on HIVE-417:
-

Looking at the patch (not yet in detail) seems to suggest the following:

1. The index file can only be a text file.
2. PROJECTION index is not used - I mean, to start, can we just get the basic 
COMPACT+SUMMARY and only support that.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-30 Thread Namit Jain (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884110#action_12884110
 ] 

Namit Jain commented on HIVE-417:
-

DDLSemanticAnalyzer.java

if (outputFormat == null) {
  outputFormat = RCFileOutputFormat.class;
}


Use the default output format; don't hardcode RCFileOutputFormat.
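
One way to read this, sketched with plain java.util.Properties rather than Hive's configuration classes: fall back to a configured default only when the statement did not name a format. The key name and the "TextFile" fallback below are assumptions for the sketch; check the actual property and default in your Hive version.

```java
import java.util.Properties;

// Hypothetical sketch; uses java.util.Properties instead of Hive's config classes.
final class OutputFormatDefaultSketch {
    // Assumed configuration key for this sketch.
    static final String DEFAULT_FORMAT_KEY = "hive.default.fileformat";

    // Returns the requested format if given, else the configured default.
    static String resolve(String requested, Properties conf) {
        if (requested != null) {
            return requested;
        }
        // "TextFile" is only the fallback used by this sketch.
        return conf.getProperty(DEFAULT_FORMAT_KEY, "TextFile");
    }
}
```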



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Prafulla Tekawade (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877049#action_12877049
]

Prafulla Tekawade commented on HIVE-417:

I was thinking of adding something called a query rewrite module.
It would be a rule-based query rewrite system that rewrites
the query into a semantically equivalent query which is
more optimized and/or uses indexes (not just for scans, but
for other query operators, e.g. GroupBy etc.)

Eg.

select distinct c1
from t1;

If we have a dense index ('compact summary index' in this
Hive indexing patch) on c1, this query can be replaced with a query
on the index table itself.

select idx_key
from t1_cmpct_sum_idx;

Similar query transformation can happen for other queries.

Module will be placed just before optimizer and will help optimizer.
Module structure looks like below.

[Query parser]
[Query rewrites] -- new phase
[Query optimization]
[Query execution planner]
[Query execution engine]

The rewrite module is 'generic', not just for the above indexing case,
but for other cases too, e.g. OR predicates to union (for efficiency?),
outer join to union of anti semi joins, moving 'order by' out of a union
subquery, etc.

The aim is to implement a very simple, light-weight rewrite support,
implement the indexing related rewrites (above rewrite does not
even need a new run-time map-red operator) and integrate indexing
support quickly and cleanly. As noted above, this rewrite phase
is rule-based (and not cost-based), sort of early optimization.

Let me know what you think. I'll start with reading your patch.
This would cover most of TODO 1;
TODO 2 and 3 will have to be looked into.
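
To illustrate the shape of a single rewrite rule, here is a toy sketch that works on the query string for the "select distinct" example above. A real rule would presumably operate on the analyzed query tree with metastore metadata, not on text; the class name and the regex-based matching are simplifications assumed for the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy rule; a real rewrite would work on the query tree, not text.
final class DistinctToIndexRewriteSketch {
    // Matches only the simplest form: select distinct <col> from <table>
    private static final Pattern DISTINCT = Pattern.compile(
        "^\\s*select\\s+distinct\\s+(\\w+)\\s+from\\s+(\\w+)\\s*;?\\s*$",
        Pattern.CASE_INSENSITIVE);

    // Key "table.column" -> name of the compact summary index table.
    private final Map<String, String> indexTables = new HashMap<>();

    void addIndex(String table, String column, String indexTable) {
        indexTables.put(table + "." + column, indexTable);
    }

    // Returns the rewritten query, or the input unchanged if the rule does not apply.
    String rewrite(String query) {
        Matcher m = DISTINCT.matcher(query);
        if (!m.matches()) {
            return query;
        }
        String indexTable = indexTables.get(m.group(2) + "." + m.group(1));
        if (indexTable == null) {
            return query;
        }
        return "select idx_key from " + indexTable;
    }
}
```

The rule is purely rule-based: it fires whenever a matching index is registered and makes no cost comparison between the original and rewritten query.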


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877144#action_12877144
 ] 

He Yongqiang commented on HIVE-417:
---

Plan sounds perfectly good to me!


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Ashish Thusoo (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877236#action_12877236
 ] 

Ashish Thusoo commented on HIVE-417:


A couple of comments on this:

A complication that arises from doing a rewrite just after parse is that you 
lose the ability to report back errors that correspond to the original query. 
Also, the metadata that you need to do the rewrite is only available after 
phase 1 of semantic analysis. So in my opinion the rewrite should be done 
after semantic analysis but before plan generation. Is that what you had in 
mind?

so something like...

[Query parser]
[Query semantic analysis]
[Query optimization]
...



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-09 Thread Prafulla Tekawade (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877295#action_12877295
 ] 

Prafulla Tekawade commented on HIVE-417:


Yes Ashish,
That's what I had in mind.

Rewrite system would need metadata, and hence it should be invoked 
after semantic analysis phase which would make metadata available.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-08 Thread Prafulla Tekawade (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876676#action_12876676
 ] 

Prafulla Tekawade commented on HIVE-417:


He Yongqiang,
Have you started working on this one?
If not, I am interested in taking a look at it.
The patch link hive-417－2009-07-18.patch is not working; can you share the 
latest patch here?


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-08 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876717#action_12876717
 ] 

He Yongqiang commented on HIVE-417:
---

Cool. Yes, I do have a newer patch for this JIRA. I will clean it up and post it.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-06-08 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876827#action_12876827
 ] 

He Yongqiang commented on HIVE-417:
---

I forgot to add the line "set hive.exec.compress.output=false;" to the above 
snippet, before selecting from the index table.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2010-02-18 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835569#action_12835569
 ] 

He Yongqiang commented on HIVE-417:
---

Talked with Prasad about this issue today.
I may not be able to finish this in the coming one or two months. I am now 
spending most of my time working on some other issues. I am sorry about that.
If anyone wants this feature in, please feel free to take over from me, and I 
will provide all the help I can. If no one picks it up, I can finish it after 
finishing the issues at hand.
Thanks.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-22 Thread Joydeep Sen Sarma (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758135#action_12758135
 ] 

Joydeep Sen Sarma commented on HIVE-417:


MDC also maintains metadata separately - at least based on their paper 
(http://www.research.ibm.com/compsci/project_spotlight/datamgmt/SIGMOD2003.pdf)



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-22 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758139#action_12758139
 ] 

Prasad Chakka commented on HIVE-417:


Yes, they do, but they don't use it for table scans, which are done if the 
query selectivity is greater than 10% (or some such). They use the index for 
index scans and in joins. I wrote the table scan code :)


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-22 Thread Schubert Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758556#action_12758556
 ] 

Schubert Zhang commented on HIVE-417:
-

{quote}
Prasad Chakka added a comment - 15/Apr/09 11:25 AM
Another way of doing it is to create a file format that contains index along 
with data... but i think that would take lot more time. 
{quote}

We are trying to store data in sorted and block-indexed files (such as HFile or 
TFile). Then I think we can know the startKey and lastKey of each file and each 
block. This block index (block summary) is just for the primary key. 
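
As a sketch of how such a block summary could be used for a range lookup: keep only the blocks whose [startKey, lastKey] range overlaps the requested key range. The classes below are hypothetical and do not reflect the HFile or TFile APIs; keys are compared as plain strings for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the HFile or TFile API.
final class BlockSummarySketch {
    final int blockId;
    final String startKey; // smallest key stored in the block
    final String lastKey;  // largest key stored in the block

    BlockSummarySketch(int blockId, String startKey, String lastKey) {
        this.blockId = blockId;
        this.startKey = startKey;
        this.lastKey = lastKey;
    }
}

final class BlockPrunerSketch {
    // Returns ids of blocks whose key range overlaps the inclusive range [lo, hi].
    static List<Integer> blocksToScan(List<BlockSummarySketch> blocks,
                                      String lo, String hi) {
        List<Integer> result = new ArrayList<>();
        for (BlockSummarySketch b : blocks) {
            boolean overlaps =
                b.startKey.compareTo(hi) <= 0 && b.lastKey.compareTo(lo) >= 0;
            if (overlaps) {
                result.add(b.blockId);
            }
        }
        return result;
    }
}
```

Since the files are sorted on the primary key, this kind of pruning only helps predicates on that key; other columns would still need a separate index.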


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Jeff Hammerbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758067#action_12758067
 ] 

Jeff Hammerbacher commented on HIVE-417:


Another type of index worth knowing about: the negative index/storage index 
from Exadata, described at 
http://blogs.oracle.com/datawarehousing/2009/09/500gbsec_and_database_machine.html.

We get some negative indexing for free with partitions, but this may be 
useful for more distinctive scans over columns for which we have not 
partitioned.

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758084#action_12758084
 ] 

Prasad Chakka commented on HIVE-417:


@jeff, i think this is more suitable for storing along with the data, where 
blocks of data can be skipped while scanning rows. i think columnar storage might 
already be doing this. 

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Jeff Hammerbacher (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758094#action_12758094
 ] 

Jeff Hammerbacher commented on HIVE-417:


Yeah, I think so as well. Did my comment make it seem like I thought otherwise?

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758097#action_12758097
 ] 

Prasad Chakka commented on HIVE-417:


there can be a summary index here as well (every SequenceFile block will have 
min and max column values in the index). i thought you were hinting at that.
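
The summary index described above can be sketched as follows. This is a minimal illustration, not the patch's code: it assumes each block is a non-empty list of integer column values, and the names `build_summary_index` and `blocks_to_scan` are hypothetical.

```python
from typing import List, Tuple


def build_summary_index(blocks: List[List[int]]) -> List[Tuple[int, int]]:
    """Return one (min, max) pair per block of column values."""
    return [(min(block), max(block)) for block in blocks]


def blocks_to_scan(summary: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Return indices of blocks whose [min, max] range overlaps [lo, hi]."""
    return [
        i for i, (bmin, bmax) in enumerate(summary)
        if bmax >= lo and bmin <= hi
    ]


# Three blocks of a column that happens to be roughly clustered.
blocks = [[1, 4, 7], [8, 9, 15], [20, 22, 31]]
summary = build_summary_index(blocks)

# A range query for values in [9, 21] can skip block 0 entirely.
print(blocks_to_scan(summary, 9, 21))
```

The skipping only pays off when values are clustered within blocks; with randomly distributed values every block's range tends to overlap every query.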

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Joydeep Sen Sarma (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758116#action_12758116
 ] 

Joydeep Sen Sarma commented on HIVE-417:


are there any references on this technique?

someone had earlier suggested this (apparently from reading Netezza 
documentation) - but i don't understand when it would work. why would a (fairly 
large) sequencefile block contain only a limited range of values (assuming the 
metadata stores a min-max range)? most cases i can imagine in our dataset would 
either have low cardinality columns (so most values would be present) or, for 
large cardinality ones, the distribution would be random (relative to the 
primary sort key) and the range would seem ineffective.

unless there are columns that are closely related to how the data is 
sorted/partitioned (perhaps some product ids are limited to a specific range of 
time - but the partitioning is on time and not product id - and even that 
sounds dubious).

a bloom filter would seem much more plausible at allowing good filtering. even 
then i don't understand why this sort of metadata should be kept along with the 
block and not separately (much more flexible - can be added on demand), as this 
jira is headed towards.
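
For illustration, a per-block bloom filter might look like the sketch below. The class name, the bit-array size, and the use of SHA-256 as the hash family are all assumptions made for the example; nothing here comes from the attached patches.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value: str):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# One filter per block, kept separately from the data (here: a dict).
block_values = {0: ["p1", "p2"], 1: ["p3"]}
filters = {}
for block_id, values in block_values.items():
    bf = BloomFilter()
    for v in values:
        bf.add(v)
    filters[block_id] = bf

# Blocks whose filter says "no" can be skipped for an equality lookup.
candidates = [b for b, bf in filters.items() if bf.might_contain("p3")]
```

Because the filters live in their own structure, they could be built on demand without rewriting the data blocks, which is the flexibility argued for above.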

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-09-21 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758127#action_12758127
 ] 

Prasad Chakka commented on HIVE-417:


i don't think it makes much sense unless there is some clustering or sorting 
property. if there is clustering and sorting and the selectivity of a query is 
much higher than 10% then storing this metadata along with the data makes sense 
instead of a separate block. the 10% threshold may be larger for Hive but the 
point still stands. in the OLAP case data seldom changes and the size of this 
kind of metadata is much smaller than the data itself so the overhead of 
storing this data is negligible.

something similar to this is done in DB2 Multi-Dimensional Clustering where 
whole blocks (disk blocks) are skipped if the key value doesn't fit the query.

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-25 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735271#action_12735271
 ] 

He Yongqiang commented on HIVE-417:
---

Created HIVE-678 to add support for building the index.
see https://issues.apache.org/jira/browse/HIVE-678

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-22 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734203#action_12734203
 ] 

Prasad Chakka commented on HIVE-417:


1) Are you worried about the sort phase of the reducer or the IndexBuilder's 
reducer code? I don't think the former will be a problem. The latter can 
be avoided by writing multiple rows for a key if the number of offsets exceeds a 
certain limit. The reducer can then flush the offsets periodically to disk, thus 
avoiding OutOfMemory exceptions in the reducer.

2) What are the other options for the index output format?
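
The idea in point 1 of writing several index rows for one key can be sketched like this. It is a simplified stand-in for reducer logic, not Hive code; `emit_index_rows` and `max_per_row` are hypothetical names.

```python
from typing import Iterable, Iterator, List, Tuple


def emit_index_rows(
    key: str, offsets: Iterable[int], max_per_row: int
) -> Iterator[Tuple[str, List[int]]]:
    """Yield (key, offsets) rows, never buffering more than max_per_row offsets."""
    buffer: List[int] = []
    for offset in offsets:
        buffer.append(offset)
        if len(buffer) >= max_per_row:
            yield key, buffer  # flush a full row
            buffer = []
    if buffer:
        yield key, buffer  # flush the remainder


# Five offsets for one key, at most two offsets per emitted row.
rows = list(emit_index_rows("k", range(0, 50, 10), max_per_row=2))
print(rows)
```

Memory use is bounded by `max_per_row` rather than by the number of occurrences of the key. Note that this alone does not order offsets across rows unless they already arrive sorted.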

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-22 Thread He Yongqiang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734405#action_12734405
]

He Yongqiang commented on HIVE-417:
---

1) For a given key, we are using a sorted set for each bucket to store
positions at the reducer. I am worried that one sorted set for each bucket may
cause an out-of-memory problem.
As you commented earlier: in the List<bucketname, List<offset>> column, offsets
are sorted.
Think about one extreme situation: one file contains a single value a million
times. So at the reducer we are storing a million positions in a sorted set.

"So reducer can flush the offsets periodically to disk thus avoiding
OutOfMemory exceptions in reducer."
If we do this, how can we guarantee they are sorted? I mean that offsets after
this flush are greater than offsets in the previous flush.

2) "What are the other options for the index output format?"
I think there are no other options. We need to discard the key part, and I think
in Hive only IgnoreKeyTextOutputFormat does that. Of course all of Hive's
custom HiveOutputFormat implementations can discard the key part, but they
cannot be specified in the map-reduce jobconf, since they do not extend
OutputFormat.

Implement Indexing in Hive
--

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-22 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734408#action_12734408
 ] 

Prasad Chakka commented on HIVE-417:


well the number of offsets can't exceed the number of SequenceFile blocks since 
we can only index the SequenceFile block offsets. So the problem is not as dire 
as it could be. Also, if that many rows (i.e. more than 10% of rows in a 
traditional RDBMS, but maybe more in the Hadoop case) have the same key, then 
the index may not be efficient after all since it is better to read the whole 
table anyway.
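
The rule of thumb above can be written as a small decision function. The 10% default is the figure quoted in the comment for a traditional RDBMS; the function name and return values are made up for this sketch, and a real optimizer would use a cost model rather than a single threshold.

```python
def choose_access_path(matching_rows: int, total_rows: int,
                       threshold: float = 0.10) -> str:
    """Pick a full scan when the predicate matches too large a share of rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    return "table_scan" if selectivity > threshold else "index_scan"


print(choose_access_path(5, 100))   # few matching rows: use the index
print(choose_access_path(50, 100))  # half the table matches: scan it all
```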

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-22 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734419#action_12734419
 ] 

Prasad Chakka commented on HIVE-417:


what i am trying to say is that for such frequent keys indexing may not be of 
much help, so maybe we can relax the 'sort' property? i don't think there is 
another easy way out other than doing a disk-based sort. check whether you can 
reuse any of the hadoop sorting code. Or can we piggyback this sorting on top of 
the hadoop reduce sort phase somehow?
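
A disk-based sort of the kind mentioned here usually means sorting bounded runs and then merging them. The sketch below keeps the runs as in-memory lists to stay self-contained; in a real job each run would be spilled to a file. The function names are hypothetical, and this is not Hadoop's own sorting code.

```python
import heapq
from typing import Iterable, Iterator, List


def sorted_runs(offsets: Iterable[int], run_size: int) -> Iterator[List[int]]:
    """Split offsets into runs of at most run_size items, each sorted."""
    run: List[int] = []
    for offset in offsets:
        run.append(offset)
        if len(run) == run_size:
            yield sorted(run)
            run = []
    if run:
        yield sorted(run)


def merge_runs(runs: List[List[int]]) -> Iterator[int]:
    """Lazily merge already-sorted runs into one globally sorted stream."""
    return heapq.merge(*runs)


runs = list(sorted_runs([90, 10, 50, 30, 70, 20], run_size=2))
merged = list(merge_runs(runs))
print(merged)
```

Only one run at a time is held while sorting, and the merge step reads the runs sequentially, so the offsets written for a key come out in global order.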

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch, hive-417－2009-07-18.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-07-08 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728900#action_12728900
 ] 

Prasad Chakka commented on HIVE-417:


One thing that isn't mentioned is that in the List<bucketname, List<offset>> 
column, offsets are sorted. 

Another thing missing is, when an index on a partition is built then a new 
partition will be created for that index table (similar to that of creating a 
partition for a regular table).

We can distinguish index tables and regular tables by having a table parameter.

We can skip partition-specific indexes in the first phase if it reduces the 
amount of work, and assume indexes defined on a table can be created on all 
partitions.

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: Yongqiang He
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-06-29 Thread schubert zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725103#action_12725103
 ] 

schubert zhang commented on HIVE-417:
-

Prasad,

Thanks for your comments. Now, I understand your comments.

Yes, in one of our projects, we sorted the data table and built a sparse index 
that records the block keys and file offsets. Then we load the index files 
into HBase to serve queries. It works fine.



 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: Yongqiang He
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-06-22 Thread Prasad Chakka (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722773#action_12722773
]

Prasad Chakka commented on HIVE-417:

Schubert,

We can run another map-reduce job that scans the index and builds out a
results file sorted by the index key. This file can be read sequentially to
determine which input table HDFS blocks are to be fed to the actual job for the
query.

Another way is to build a sparse index on the index. But if the table itself is
sorted, we can build the sparse index (a la MapFile) directly and use it.
@Facebook, the use case we have doesn't have this sorting property but I can
envision this being useful for primary indexes where the index sort order and
the table sort order are the same.
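
A sparse index over sorted data can be sketched as below: keep every Nth key with its position, binary-search the sparse entries, then scan at most one interval. This only mirrors the general MapFile idea; the function names, the `interval` parameter, and the assumption of unique keys are illustrative choices.

```python
import bisect
from typing import List, Tuple


def build_sparse_index(sorted_keys: List[str], interval: int) -> List[Tuple[str, int]]:
    """Keep every interval-th key together with its position."""
    return [(sorted_keys[i], i) for i in range(0, len(sorted_keys), interval)]


def lookup(sorted_keys: List[str], sparse: List[Tuple[str, int]],
           interval: int, key: str) -> int:
    """Return the position of key in sorted_keys, or -1 if absent."""
    sparse_keys = [k for k, _ in sparse]
    slot = bisect.bisect_right(sparse_keys, key) - 1
    if slot < 0:
        return -1  # key sorts before the first indexed key
    start = sparse[slot][1]
    # Scan only the one interval that can contain the key.
    for pos in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return -1


keys = ["a", "c", "e", "g", "i", "k", "m"]
sparse = build_sparse_index(keys, interval=3)
print(lookup(keys, sparse, 3, "i"))
```

The sparse index is small enough to hold in memory even when the sorted data is not, which is why it suits the primary-index case described above.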

Can you think of any other ways? Of course, we can process index files using
HBase or TokyoCabinet but that requires another system to be set up and
administered, and both systems need to be available for index processing. But in
some cases these solutions also work. The indexing scheme described above
should play well with HBase and TokyoCabinet since the index is a file with rows
containing a key and position parameters. In Hadoop we can store that in a
SequenceFile or maybe a TFile, but if they have to be stored in external systems,
we can plug in a custom SerDe and change the default location of these to a
location where the external systems can access these files.

Implement Indexing in Hive
--

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-06-02 Thread Seymour Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715539#action_12715539
 ] 

Seymour Zhang commented on HIVE-417:



are we going to have one index file per hdfs file?

Can we also support exporting these index files as a table to some other 
storage system like HBase or Tokyo Cabinet, i.e. can these separate index files, 
one for each HDFS file, be expressed as a single table in Hive?

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-06-01 Thread Joydeep Sen Sarma (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715153#action_12715153
]

Joydeep Sen Sarma commented on HIVE-417:

- are we going to have one index file per hdfs file? (or one per partition?)

related question is how this is going to interact with sampling? (i think
currently the sampling predicate is optimized out for bucketed tables -
although not terribly sure).

i would love to see the api to invoke the index.
- ideally we would like to plug in different indexing schemes - as well with
map-side joins - the hashmap storing the smaller table can be seen as an index
on this table. It would seem that one should be able to replace a map-side join
based on tables loaded into jdbm with tables with indices proposed here (and
thereby do joins based on indices almost trivially).
- we should enable people to be able to plug in their own indices (since it's
quite likely that over time there will be multiple indexing efforts on hadoop
files).
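
To make the pluggability point concrete, here is one possible shape for such an interface, with a hash index standing in for the map-side join hashmap. Every name here (`PluggableIndex`, `HashIndex`, `index_join`) is hypothetical; the thread states that the real API had not been designed at this point.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Any, Iterator, List, Sequence, Tuple

Row = Tuple[str, Any]  # (join key, payload)


class PluggableIndex(ABC):
    """Hypothetical contract that any indexing scheme would implement."""

    @abstractmethod
    def build(self, rows: Sequence[Row]) -> None: ...

    @abstractmethod
    def lookup(self, key: str) -> List[int]: ...


class HashIndex(PluggableIndex):
    """Key -> row positions, like the hashmap used for a map-side join."""

    def __init__(self) -> None:
        self._positions = defaultdict(list)

    def build(self, rows: Sequence[Row]) -> None:
        for pos, (key, _) in enumerate(rows):
            self._positions[key].append(pos)

    def lookup(self, key: str) -> List[int]:
        return list(self._positions.get(key, []))


def index_join(big: Sequence[Row], small: Sequence[Row],
               index: PluggableIndex) -> Iterator[Tuple[str, Any, Any]]:
    """Join big against small using whichever index implementation is passed in."""
    index.build(small)
    for key, left in big:
        for pos in index.lookup(key):
            yield key, left, small[pos][1]


small = [("a", 1), ("b", 2), ("a", 3)]
big = [("a", "x"), ("c", "y")]
joined = list(index_join(big, small, HashIndex()))
```

Because `index_join` depends only on the abstract interface, a file-offset index could replace the hash index without changing the join code.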

Implement Indexing in Hive
--

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-06-01 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715345#action_12715345
 ] 

He Yongqiang commented on HIVE-417:
---

Joydeep, thanks for the concern.
"are we going to have one index file per hdfs file?"
yeah.
"i would love to see the api to invoke the index"
currently it is not settled. I will try to give it in the next week.
"enable people to be able to plug in their own indices"
I think if we have a well-designed adaptable api, then this can be addressed.
"we would like to plug in different indexing schemes"
yes. i have proposed several schemes in previous posts. Can you give me some 
schemes, so i can compare and make a better design?

BTW, I will try to write a proposal next week. I have an important English 
exam this weekend. Sorry for the delay.


 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-30 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714730#action_12714730
 ] 

He Yongqiang commented on HIVE-417:
---

Thanks for the suggestions, Seymour.
I have also thought about what you said: directly fetching the data instead of 
initializing a new mr job. I will try to include this, but it may be done in the 
second phase (the optimization phase).

"I'd like to treat these rows of same col values as a block and only use a 
single index entry for this block"
in the design, we indeed only use one index entry. And not only for contiguous 
values: we use the same index entry for all rows with the same col value.

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-29 Thread Seymour Zhang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714492#action_12714492
]

Seymour Zhang commented on HIVE-417:

Hello Prasad and Yongqiang, Thank you very much for this great effort.

One suggestion: since we've already done the indexing with MapReduce, for some
queries based on the generated indexes, can we just omit the time-consuming
MapReduce phase at query time? We've already got all of the files/offsets, so
we can go to these specific file offsets directly to get the relevant rows of
the table. This would greatly expedite the query process.
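
The direct-fetch idea can be sketched with a plain local file standing in for an HDFS file. The newline-delimited row format and the `fetch_rows` helper are assumptions for the example; a SequenceFile would need block-level offsets instead of row offsets.

```python
import os
import tempfile
from typing import Iterable, List


def fetch_rows(path: str, offsets: Iterable[int]) -> List[str]:
    """Seek straight to each byte offset and read one row, skipping the rest."""
    rows = []
    with open(path, "rb") as f:
        for offset in sorted(offsets):  # ascending order keeps seeks forward-only
            f.seek(offset)
            rows.append(f.readline().rstrip(b"\n").decode())
    return rows


# Write a tiny table and remember where each row starts.
lines = ["a,1", "b,2", "c,3"]
offsets = []
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    for line in lines:
        offsets.append(tmp.tell())
        tmp.write(line.encode() + b"\n")
    path = tmp.name

# The "index" says rows 0 and 2 match; only those are read.
result = fetch_rows(path, [offsets[2], offsets[0]])
os.remove(path)
print(result)
```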

This would be helpful for the following case in one of my usages of Hive.
With Hive, I've already sharded (by date) and bucketed (by column hashing) my
log data into hierarchical files. I've also sorted each file by the hashing
cols. As I may have many rows with the same column values but different
timestamps, to minimize index size, I'd like to treat these rows of same col
values as a block and only use a single index entry for this block. This will
greatly reduce the index size of my data, while remaining very useful for my
queries on those cols.
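
One way to picture a single index entry per run of equal values is the sketch below. It assumes the rows are already sorted by the indexed column and represented as (offset, key) pairs; the entry layout (key, first offset, row count) is an illustrative choice, not something specified in the thread.

```python
from itertools import groupby
from typing import List, Tuple


def build_run_index(rows: List[Tuple[int, str]]) -> List[Tuple[str, int, int]]:
    """rows: (offset, key) pairs sorted by key. One entry per run of equal keys."""
    index = []
    for key, run in groupby(rows, key=lambda row: row[1]):
        run_rows = list(run)
        first_offset = run_rows[0][0]
        index.append((key, first_offset, len(run_rows)))
    return index


rows = [(0, "u1"), (20, "u1"), (40, "u1"), (60, "u2"), (80, "u3"), (100, "u3")]
run_index = build_run_index(rows)
print(run_index)
```

Six rows collapse to three entries here; the saving grows with the number of repeated values per key.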

Implement Indexing in Hive
--

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-28 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714306#action_12714306
 ] 

Prasad Chakka commented on HIVE-417:


the plan looks good. i am not sure we need to create a sparse index on the dense 
index in phase 1. In most cases the size of the dense index will be small enough 
that an additional mr job for processing the sparse index will be unnecessary. 
if the sparse index is not necessary then there is no need for the dense index 
to be sorted.

since the dense index is scanned completely while processing the query, we can 
use the index if any predicate column exists in index definition.
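
The difference between this rule and the usual leftmost-prefix rule of B-tree-style composite indexes can be shown with two small predicates. Both function names are hypothetical, and the checks are deliberately simplified to column membership.

```python
from typing import Sequence


def usable_by_leftmost_prefix(index_cols: Sequence[str],
                              predicate_cols: Sequence[str]) -> bool:
    """Sorted composite index: the predicate must constrain the first column."""
    return bool(index_cols) and index_cols[0] in predicate_cols


def usable_by_full_scan(index_cols: Sequence[str],
                        predicate_cols: Sequence[str]) -> bool:
    """Fully scanned dense index: any predicate column in the index will do."""
    return any(col in index_cols for col in predicate_cols)


index_cols = ["col1", "col2", "col3"]
print(usable_by_leftmost_prefix(index_cols, ["col3"]))  # False
print(usable_by_full_scan(index_cols, ["col3"]))        # True
```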


 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-26 Thread He Yongqiang (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713003#action_12713003
]

He Yongqiang commented on HIVE-417:
---

Checked how MySQL handles indexes and found that MySQL also cannot use an index
to handle the situations in my earlier post:
{quote}
but, we can not use it for queries like:
4) select * from table1 where col234 and col33
5) select * from table1 where col2 =34
6) select * from table1 where col3 45
{quote}

And now a basic idea for our index design, just like Prasad commented in a
previous post:
1) index structure
use an mr job to create the index. the input is a file with all columns, and the
mapper outputs kv pairs, where the key is (indexed col1, indexed col2, ...) and
the value is the offset.
And we define a comparator for (indexed col1, indexed col2, ...) to let the
shuffle phase sort all mappers' output. And in the reducer, we combine kv-pairs
into (indexed col1, indexed col2, ...) -> list_of_offsets.
This is a dense sorted index; we then create a sparse index on the dense index.
And we also collect column data distribution information (histogram) while
doing this.
2)
we consider using the index for a query only when the query involves the columns
of the leftmost part of the index.
We also need to consider index merge when a query involves two indexes, and a
cost estimation to decide whether using the index will decrease query time
(this is the work that needs to be done in the optimizer).

But as a first step, we can finish part 1 and the hive ql part, then consider
part 2 (the optimizer part). After part 1 is finished, i will examine part 2 in
more detail.
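
Part 1 can be simulated in a few lines to show the data flow. This is a single-process sketch, not a MapReduce job: `sorted()` stands in for the shuffle, and sorting on (key, offset) together is a simplification, since a real shuffle orders keys only. The row layout and function names are assumptions.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, Iterator, List, Tuple

Key = Tuple[str, int]


def map_phase(rows: Iterable[Tuple[int, str, int]]) -> Iterator[Tuple[Key, int]]:
    """rows are (offset, col1, col2); emit ((col1, col2), offset) pairs."""
    for offset, col1, col2 in rows:
        yield (col1, col2), offset


def reduce_phase(pairs: Iterable[Tuple[Key, int]]) -> Iterator[Tuple[Key, List[int]]]:
    """Sort pairs (the shuffle stand-in), then collect offsets per key."""
    shuffled = sorted(pairs)
    for key, group in groupby(shuffled, key=itemgetter(0)):
        yield key, [offset for _, offset in group]


rows = [(0, "us", 1), (40, "cn", 2), (80, "us", 1), (120, "cn", 1)]
dense_index = list(reduce_phase(map_phase(rows)))
for entry in dense_index:
    print(entry)
```

The output is sorted by the composite key, which is the property a sparse index built on top of it would rely on.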

Implement Indexing in Hive
--

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-22 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712352#action_12712352
 ] 

He Yongqiang commented on HIVE-417:
---

Thanks a lot, Prasad. I will put questions on the jira. and will start working 
on it after we set the design. Looking forward to working on it. 

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
 Attachments: hive-417.proto.patch


 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-19 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710741#action_12710741
 ] 

Prasad Chakka commented on HIVE-417:


Yes, mostly the block/pos size will be small but I don't think we can assume 
that since there will be enough cases where it will not be true. 

We can explore the other approaches later on. Different indexes will be useful 
in different scenarios. 

i will try to post some code this week. 

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang

 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-18 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710494#action_12710494
 ] 

Prasad Chakka commented on HIVE-417:


The above index is not a hash index since you can't do range queries on hash 
index and lookups are constant time. not sure what to call this except that it 
is a view (simple projection) of the base table with offsets into the base 
table.

on sparse index, i meant you can create a sparse index on top of the index i 
described above. but this can be done later.

 And in most cases, the block/pos list's size will only be 1

that is not the case if the index is on a non-primary key column. and i think, 
mostly this is the case where indexes will be used in data warehouses.

 Implement Indexing in Hive
 --

 Key: HIVE-417
 URL: https://issues.apache.org/jira/browse/HIVE-417
 Project: Hadoop Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 0.2.0, 0.3.0, 0.3.1, 0.4.0
Reporter: Prasad Chakka
Assignee: He Yongqiang

 Implement indexing on Hive so that lookup and range queries are efficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-18 Thread He Yongqiang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710591#action_12710591
 ] 

He Yongqiang commented on HIVE-417:
---

> that is not the case if the index is on a non-primary key column. and i think, 
> mostly this is the case where indexes will be used in data warehouses.

Yes. If the index is built on one column, the block/pos list's size will be 
large. But if it is built on many columns, I think the block/pos list's size 
will be small.
Anyway, we can build this index as the first step.
After this is finished, we can try other kinds of index, like:
1) sort-based index
2) Lucene index
3) block-scope B+Tree, R-tree, or other advanced index data structures.

Prasad, you said you already wrote some code. Would you please attach it?


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-17 Thread Prasad Chakka (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710204#action_12710204
]

Prasad Chakka commented on HIVE-417:

1)
The question you raised applies only to B+Tree indexes. The index that I
defined above is not really a traditional database index but a kind of summary
table (or view), and any lookup/range query on the table requires reading the
whole index. So you can apply all predicates as long as the columns referenced
in the predicates exist in the index. So we should be able to use an index on
(col1, col2, col3) for all the queries above. Sorting order has no impact here
since the whole index is read into memory anyway.

Since this index can be created in sorted order, we can create a sparse index
(similar to the non-leaf nodes of a B+-Tree) if the index itself is too big
(i.e., index sizes are orders of magnitude larger than the HDFS block size).
But this can be done as a later optimization.
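
As a rough illustration of the sparse-index idea, the following Python sketch keeps only the first key of each block of a sorted dense index and uses a binary search to pick the one block to read. The data, BLOCK_SIZE, and the lookup helper are assumptions made for the example, and it assumes all entries for a key sit in a single block.

```python
from bisect import bisect_right

# Hypothetical dense index, sorted by key: (key, byte_offset) pairs.
dense_index = [(k, k * 100) for k in range(0, 20, 2)]  # keys 0, 2, ..., 18
BLOCK_SIZE = 4  # entries per index block (illustrative value)

# Split the dense index into blocks; the sparse index holds each block's first key.
blocks = [dense_index[i:i + BLOCK_SIZE]
          for i in range(0, len(dense_index), BLOCK_SIZE)]
sparse_index = [block[0][0] for block in blocks]

def lookup(key):
    """Find the single block that could hold `key`, then scan only that block."""
    block_no = bisect_right(sparse_index, key) - 1
    if block_no < 0:
        return []  # key is smaller than every indexed key
    return [offset for k, offset in blocks[block_no] if k == key]
```

With this layout a lookup touches one block of the dense index instead of the whole index, which is the role the non-leaf nodes of a B+-Tree play.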

2)
With the design above, indexes on joins will come free since predicate pushdown
will push the 'user.name=user_name' to above the join and only index filtered
rows participate in join.

But creating indexes on the joined output may increase the index size so as to
decrease the overall effectiveness. But with sparse indexes this problem might
be mitigated so we can support this kind of join indexes along with support for
sparse indexes.

3)
Yes, for some aggregation queries it may make sense to read the index (since it
is a summary table as well). Aggregations or any queries that involve only
columns from the index can operate only on the index and not the main table.
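
A small Python sketch of the index-only case, under the assumption that each index row carries one offset per matching base-table row. The column names, the sample rows, and both helpers are hypothetical.

```python
# Hypothetical index rows: (col1, col2, offsets), one offset per base-table row.
index_rows = [
    ("a", 1, [0, 40]),
    ("a", 2, [80]),
    ("b", 1, [120, 160, 200]),
]
INDEX_COLUMNS = {"col1", "col2"}

def covered_by_index(referenced_columns):
    """A query can run on the index alone if it references only index columns."""
    return set(referenced_columns) <= INDEX_COLUMNS

def count_by_col1(index_rows):
    """SELECT col1, COUNT(*) ... GROUP BY col1, answered without the base table."""
    counts = {}
    for col1, _col2, offsets in index_rows:
        counts[col1] = counts.get(col1, 0) + len(offsets)
    return counts
```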

4)
I also looked at it and am not sure how it fits into Hive. Katta is more like a
distributed index server.


[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-05-15 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709899#action_12709899
 ] 

Prasad Chakka commented on HIVE-417:


Here is a very rough outline of how this can be done (the prototype code has 
the creation and execution parts but not the HiveQL-related stuff).

hive indexing:
The goal of hive indexing is to speed up lookup queries on certain columns of 
a table. Currently, queries with predicates like 'WHERE tab1.col1 = 10' have 
to load the complete table/partition and process all the rows. If there is an 
index on col1, then only a small portion of the file needs to be loaded.

command to create index:

create index tab_idx1 on tab1 (col1, ...);

If the base table is partitioned then the index is also partitioned. Indexes 
can be created on base tables whose file format supports getPos() and possibly 
seek() (or equivalent methods).

format of index:
The index is also a hive table with the following columns:
col_1...col_k -- key cols. base table columns on which this index is defined
list<offset>  -- positions of rows which contain these keys

An offset is a combination of the following:
file_name   -- relative path of the file in which this row is contained 
(relative to the partition/table location)
byte_offset -- byte offset of the row in the file. The row can be found at this 
byte offset or, for Block Compressed Sequence Files, in the block starting at 
this byte offset.
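
The index row layout described above could be modelled roughly as follows. This is only a Python sketch of the shape of the data; the class names Offset and IndexRow and the sample file names are invented for the example and are not Hive types.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Offset:
    file_name: str    # path relative to the partition/table location
    byte_offset: int  # start of the row, or of the block that contains it

@dataclass
class IndexRow:
    key: Tuple             # values of col_1...col_k
    offsets: List[Offset]  # every base-table position holding this key

# One index row for key col1 = 10, found in two files of the partition.
row = IndexRow(
    key=(10,),
    offsets=[Offset("part-00000", 0), Offset("part-00001", 4096)],
)
```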

when to create index:
Traditionally, databases update the index when the table is loaded. Hive 
doesn't process rows while loading tables using the 'LOAD DATA INPATH' command, 
and building the index may slow down the actual loading for 
'INSERT ... SELECT ... FROM ...' type statements. So users should have the 
option of initializing the index during 'INSERT ... SELECT ...' or 
initializing it separately. Another command like 
'update index tab_idx partition ..' can be provided.

how to create index:
The index can be created using the following hive command augmented with 
'offset':
'select col_1...col_k, offset from tab1'

offset can be provided as a built-in function whose value is derived in 
HiveInputRecordReader, which will in turn use the specific FileFormat's Reader 
getPos() method and 'map.input.file' for the file name (or take it from the 
tableDesc or partitionDesc).
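
The index-building scan can be sketched in Python as below. This is an illustration only: it assumes a newline-delimited, comma-separated text file, uses the stream's tell() as a stand-in for the reader's getPos(), and the function name build_index and the sample data are hypothetical.

```python
import io
from collections import defaultdict

def build_index(data, file_name, key_column):
    """Scan the file once and map each key to (file_name, byte_offset) pairs."""
    index = defaultdict(list)
    reader = io.BytesIO(data)
    while True:
        pos = reader.tell()        # plays the role of getPos()
        line = reader.readline()
        if not line:
            break                  # end of file
        fields = line.rstrip(b"\n").decode("utf-8").split(",")
        index[fields[key_column]].append((file_name, pos))
    return dict(index)

# Three rows; the key is the first field.
data = b"10,alice\n20,bob\n10,carol\n"
index = build_index(data, "part-00000", key_column=0)
```

The position is captured before each row is read, so the stored offset points at the start of the row, which is what a later seek needs.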

Algorithm For using index:
1) Hive QL needs to determine whether a particular query can use any existing 
indexes. This can be determined by examining the predicate tree. After 
predicate pushdown, all those predicates which can use index are in the child 
operator of a TableScanOperator. This predicate tree needs to be examined. If 
this contains any subset of columns of an index then that index can be used. 
Until stats are available, it is not possible to guess whether using index is 
beneficial. This needs to be fleshed out more to check both 'AND' and 'OR' 
predicates.

2) For each of the qualified indexes, a map/reduce job can be created using the 
predicates determined in step 1. The output of this job should have the 
following information
file_name   -- fully qualified file name that contains the data
byte_offset -- position of row

3) If there is more than one qualified index then the outputs of step 2 need 
to be combined depending on whether the predicates on these indexes have 'AND' 
or 'OR' between them.

4) Modify the original plan to use only those FileSplits that appear in the 
output of step3. This reduces the number of mappers spawned by JobTracker.

5) Modify the original plan to use HiveIndexRecordReader instead of regular 
record reader. Output of step3 (which is sorted) is available to the 
HiveIndexRecordReader. It can skip to these locations instead of reading every 
record in the input of the Mapper.
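
Steps 3 to 5 above can be sketched in Python as follows. This is a simplified model, not the prototype: index hits are plain sets of (file_name, byte_offset) pairs, whole files stand in for FileSplits, an in-memory byte stream stands in for the data file, and all function names and sample values are hypothetical.

```python
import io

def combine(left, right, op):
    """Step 3: merge the hits of two indexes according to the connecting operator."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    raise ValueError("unsupported operator: " + op)

def files_to_scan(hits):
    """Step 4: only files with at least one hit need to be given to a mapper."""
    return sorted({file_name for file_name, _ in hits})

def read_rows(data, offsets):
    """Step 5: seek to each offset in sorted order instead of reading every row."""
    reader = io.BytesIO(data)
    rows = []
    for offset in sorted(offsets):
        reader.seek(offset)
        rows.append(reader.readline().rstrip(b"\n").decode("utf-8"))
    return rows

# Hypothetical outputs of step 2 for two indexes on the same table.
hits_col1 = {("part-00000", 0), ("part-00000", 16), ("part-00001", 0)}
hits_col2 = {("part-00000", 16), ("part-00002", 8)}

both = combine(hits_col1, hits_col2, "AND")
either = combine(hits_col1, hits_col2, "OR")

# Contents of part-00000 in this example.
data = b"10,alice\n20,bob\n10,carol\n"
matched = read_rows(data, [off for name, off in both if name == "part-00000"])
```

In this example the 'AND' case leaves a single file and a single row to read, while the 'OR' case still has to visit all three files.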




[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-04-25 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702769#action_12702769
 ] 

Prasad Chakka commented on HIVE-417:


HIVE-1230 has changed the interface for RecordReader and it no longer has a 
getPos() method. The older interfaces are deprecated. I used this method in the 
prototype to get the current position while creating the index and also while 
reading the actual data file. Even the SequenceFileRecordReader does not have 
this method. 

Without getPos() and seek() methods on RecordReader it becomes tough to 
implement any kind of generic indexing.



[jira] Commented: (HIVE-417) Implement Indexing in Hive

2009-04-15 Thread Prasad Chakka (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699307#action_12699307
 ] 

Prasad Chakka commented on HIVE-417:


Another way of doing it is to create a file format that contains the index 
along with the data, but I think that would take a lot more time.

