[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894221#action_12894221 ]

He Yongqiang commented on HIVE-417:

For the MySQL metastore upgrade, please refer to http://wiki.apache.org/hadoop/Hive/IndexDev#Metastore_Upgrades

Implement Indexing in Hive
--------------------------
Key: HIVE-417
URL: https://issues.apache.org/jira/browse/HIVE-417
Project: Hadoop Hive
Issue Type: New Feature
Components: Metastore, Query Processor
Affects Versions: 0.3.0, 0.3.1, 0.4.0, 0.6.0
Reporter: Prasad Chakka
Assignee: He Yongqiang
Fix For: 0.7.0
Attachments: hive-417.proto.patch, hive-417-2009-07-18.patch, hive-indexing-8-thrift-metastore-remodel.patch, hive-indexing.3.patch, hive-indexing.5.thrift.patch, hive.indexing.11.patch, hive.indexing.12.patch, hive.indexing.13.patch, idx2.png, indexing_with_ql_rewrites_trunk_953221.patch

Implement indexing on Hive so that lookup and range queries are efficient.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893522#action_12893522 ]

John Sichi commented on HIVE-417:

Since another patch is needed, here are the review comments I mentioned above.

* Javadoc for Hive.createIndex needs parameters fixed
* Javadoc for HiveIndexHandler.analyzeIndexDefinition: remove storageDesc]
* In HiveUtils.getIndexHandler: the message should be "Error in loading index handler" rather than "Error in loading storage handler"
* GenericUDAFCollectSet @Description: "with no duplication elements" should be "with duplicate elements eliminated"
* DDLSemanticAnalyzer.analyzeCreateIndex: "hanlder" is misspelled
* Property AbstractIndexHandler.INDEX_COLS_KEY is never used; get rid of it?
* For the HiveIndex.INDEX_TABLE_CREATETIME property name, spell out lastModifiedTime instead of lmt
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893890#action_12893890 ]

John Sichi commented on HIVE-417:

OK, testing lucky patch 13...
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1289#action_1289 ]

John Sichi commented on HIVE-417:

Thanks Yongqiang. Looking at it now.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893402#action_12893402 ]

John Sichi commented on HIVE-417:

+1. Will commit when tests pass. I noticed a number of trivial issues (like Javadoc mismatches) which I'll put in a followup.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893455#action_12893455 ]

Joydeep Sen Sarma commented on HIVE-417:

I am waiting for a commit on HIVE-1408. That's probably going to collide.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893461#action_12893461 ]

John Sichi commented on HIVE-417:

Thanks Joydeep. Yeah, this one has tons of plan diffs due to the virtual columns.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893488#action_12893488 ]

John Sichi commented on HIVE-417:

Yongqiang, I passed tests on Hadoop 0.20, but Ning has committed HIVE-1408, which conflicts, so you'll need to rebase against that and then I'll try again.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892865#action_12892865 ]

John Sichi commented on HIVE-417:

Regarding the mkset function: can we rename this to collect_array to hint that it is a UDAF? The @Description should also make this clear. Collect is the standard SQL name for this aggregate function, but the standard version returns a multiset rather than an array, so let's call it collect_array to be specific. Also, it will need its own independent unit tests (open a followup JIRA issue for this).
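To make the aggregate under discussion concrete, here is a minimal Java sketch of a collect-style aggregation that gathers values into an array-like result with duplicates eliminated. The class name CollectArraySketch and its method names are hypothetical. This is not the UDAF code from the patch, only the iterate/merge/terminate shape such an aggregate might follow, and skipping nulls is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a collect-style aggregate; not Hive's actual UDAF API.
final class CollectArraySketch<T> {
    // LinkedHashSet drops duplicates while keeping first-seen order.
    private final Set<T> seen = new LinkedHashSet<>();

    // Called once per input row; nulls are skipped (an assumption of this sketch).
    void iterate(T value) {
        if (value != null) {
            seen.add(value);
        }
    }

    // Folds in a partial result computed elsewhere, for example by another task.
    void merge(CollectArraySketch<T> other) {
        seen.addAll(other.seen);
    }

    // Produces the final array-like result as a list.
    List<T> terminate() {
        return new ArrayList<>(seen);
    }
}
```

Returning a list rather than a set mirrors the point made in the comment: the result is an array, not a SQL multiset.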
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892932#action_12892932 ]

Ashish Thusoo commented on HIVE-417:

Started looking at this. One initial question I had - why is virtualcolumn class in the serde2 package?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892937#action_12892937 ]

John Sichi commented on HIVE-417:

Another followup needed: REBUILD should be propagating lineage and read/write info from the reentrant INSERT statement up to the top-level statement so that hooks get called with the right information.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892939#action_12892939 ]

Ashish Thusoo commented on HIVE-417:

Also, how is the file name populated? That is not done through the IOContext?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892946#action_12892946 ]

He Yongqiang commented on HIVE-417:

@Ashish

> why is virtualcolumn class in the serde2 package?

Will move it to the ql.io package. I put it in the serde2 package only because I thought it might be needed by the serde layer. Since the code is almost done and it is not accessed by serde, it makes sense to move it to ql.

> how is the file name populated

The file name and block offset are both populated by the record reader. The file name is populated by looking at the split path when we construct the record reader. The offset is generated at runtime by the record reader.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892947#action_12892947 ]

He Yongqiang commented on HIVE-417:

IOContext is just a container; HiveContextAwareRecordReader is responsible for filling it with actual values.
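The division of labour described here (a passive container plus a record reader that fills it) can be sketched roughly as follows. ReadContextSketch and ContextFillingReaderSketch are hypothetical names standing in for the roles of IOContext and HiveContextAwareRecordReader; none of this is the patch's code. The sketch also assumes newline-delimited records, one byte per character, and a per-record offset, which may not match how Hive computes block offsets.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical container: holds values but never computes them.
final class ReadContextSketch {
    private String inputFile;
    private long blockOffset;

    String getInputFile() { return inputFile; }
    void setInputFile(String inputFile) { this.inputFile = inputFile; }
    long getBlockOffset() { return blockOffset; }
    void setBlockOffset(long blockOffset) { this.blockOffset = blockOffset; }
}

// Hypothetical reader wrapper: the only component that writes to the container.
final class ContextFillingReaderSketch {
    private final Iterator<String> records;
    private final ReadContextSketch context;
    private long nextOffset = 0L;

    ContextFillingReaderSketch(String splitPath, List<String> records, ReadContextSketch context) {
        this.records = records.iterator();
        this.context = context;
        // The file name is known as soon as the reader is constructed from its split.
        context.setInputFile(splitPath);
    }

    // Returns the next record, or null at end of input, updating the offset as it goes.
    String next() {
        if (!records.hasNext()) {
            return null;
        }
        String record = records.next();
        context.setBlockOffset(nextOffset);
        nextOffset += record.length() + 1; // +1 for the assumed newline delimiter
        return record;
    }
}
```

A virtual column evaluator would then only read from the container, which keeps it independent of any particular input format.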
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892413#action_12892413 ]

John Sichi commented on HIVE-417:

Yongqiang, I looked at hive.indexing.10.patch, but I don't see the virtual columns in there?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892594#action_12892594 ]

John Sichi commented on HIVE-417:

First pass of review comments on latest patch (I'll probably have more tomorrow).

* INDEX_NAME precision in the metastore should be 128 characters (not 767), following convention for other identifiers
* I don't think we need INDEX_TABLE_NAME at all in the metastore; it should only be used during CREATE INDEX and then forgotten
* Move HiveIndexInputFormat and HiveIndexResult to package org.apache.hadoop.hive.ql.index.compact, and add Compact in their names (I'd still prefer to move this entire package out to a new subproj, but I guess we can skip that part now since most of the code went away with the virtual column approach); rename property hive.exec.index_file to hive.index.compact.file
* Support WITH DEFERRED REBUILD, and require this to be specified for now to avoid confusion (per discussion in design meeting)
* when generating reentrant INSERT, need to quote identifiers such as table/column names (use HiveUtils.unparseIdentifier), and may need extra escaping for special characters in getPartKVPairStringArray (I'm not sure; check with Paul)
* thread_local should be private (and named threadLocal); go through public IOContext.get() instead; likewise use public getter/setter methods on IOContext instead of accessing its data members directly
* need ORDER BY in virtual_column.q
* remove extra semicolon in other ORDER BY's, and make sure they cover a unique key in all cases
* don't need TYPE and UPDATE as keywords in grammar
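The encapsulation point in the review (a private thread-local reached only through a public get(), with getters and setters instead of direct field access) might look like the following Java sketch. ThreadContextSketch and its currentOffset field are hypothetical illustrations, not the IOContext class from the patch.

```java
// Hypothetical per-thread context holder illustrating the suggested encapsulation.
final class ThreadContextSketch {
    // Private, so callers cannot replace or bypass the per-thread instance.
    private static final ThreadLocal<ThreadContextSketch> threadLocal =
            ThreadLocal.withInitial(ThreadContextSketch::new);

    private long currentOffset;

    private ThreadContextSketch() {
    }

    // Single public entry point; each thread lazily gets its own instance.
    static ThreadContextSketch get() {
        return threadLocal.get();
    }

    long getCurrentOffset() {
        return currentOffset;
    }

    void setCurrentOffset(long currentOffset) {
        this.currentOffset = currentOffset;
    }
}
```

Because every access goes through get() and the accessors, the storage strategy can change later without touching callers.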
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892613#action_12892613 ]

John Sichi commented on HIVE-417:

Whoops, forgot two leftover from a private diff review:

* metastore/if/hive_metastore.thrift:102
  Instead of including the full indexTable structure inside the Index structure, can we omit it but then pass it as an additional parameter to add_index?
* ql/src/java/org/apache/hadoop/hive/ql/index/compact/CompactIndexHandler.java:86
  Move generic partition analysis out into Hive, since it will be the same for all plugins.

We can talk more about these tomorrow if it's not clear.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891274#action_12891274 ]

Ning Zhang commented on HIVE-417:

Based on some internal discussions, below are some comments about the design doc:

1) The staleness (inconsistency) between the index and the base table should be addressed more precisely. Since the current implementation allows the user to query the index table directly, we should guarantee that the index is consistent with the base table at query time. This means that at the query START time, the index was built completely from the data stored in the base table. The current design does not satisfy this criterion, in that it only records the last_modification_time (LMT) of the base table and the index table, and checks whether the latter is larger than the former. This breaks in the following example:

timestamp0: last update of partition P1
timestamp1: start create index on partition P1
timestamp2: start insert overwrite P1
timestamp3: finish insert overwrite P1
timestamp4: finish index creation on P1
timestamp5: query on P1

The LMTs of the index and the base table are timestamp4 and timestamp3 respectively, so the optimizer will conclude that the index is consistent with the base table. However, the index was built from data that is already stale at timestamp5, so the index should not be used.

Instead of recording the LMT of the index table, we probably should record the LMT of the base table in the index metadata at the beginning of the index creation. In the above example, the timestamp recorded in the index metadata should be timestamp0. This means the index was created based on the base table at timestamp0. At query time, we should check timestamp0 against timestamp3, which correctly concludes that the index is stale. BTW, all the timestamps should come from some centralized clock such as the DFS directory update time (from the namenode).

2) The above consistency problem is not only present in the case of DEFERRED REBUILD. Even if the index rebuild starts right away after INSERT OVERWRITE, there is still a time window in which the index is stale (before the index creation is complete). So we need the same mechanism to detect stale indexes.

3) I think lock-based concurrency may not be the best choice either. If the index creation takes a long time, it defers the availability of the base table. If we have the optimizer, we should always query against the base tables, and let the optimizer figure out whether an index is available and fresh. So if an index creation is not finished, we can just use the base table; otherwise we can use the index if its cost is lower.

4) Another case is when the index creation has finished and the query is using the index, and then a DML happens on the base table and finishes before the query finishes. Here we only guarantee snapshot consistency (results consistent with the data at the beginning of the query, not after the query).

5) If we have the mechanism to check consistency of the index, then the index rebuild command could just return if the index is consistent. We can also allow a force option in case we need to compensate for bad metadata.
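The difference between the two freshness checks in point 1, and the early-return rebuild in point 5, can be sketched in Java as below. IndexFreshnessSketch and its method names are hypothetical, and timestamps are plain long values assumed to come from one centralized clock. This illustrates the proposal only; it is not code from any patch.

```java
// Hypothetical helper contrasting the two staleness checks discussed above.
final class IndexFreshnessSketch {
    private IndexFreshnessSketch() {
    }

    // Check attributed to the current design: index LMT is later than base table LMT.
    static boolean naiveIsFresh(long indexLmt, long baseLmt) {
        return indexLmt > baseLmt;
    }

    // Proposed check: the index metadata stores the base table's LMT as observed
    // when the build started; the index is fresh only if the base table has not
    // been modified since that recorded time.
    static boolean proposedIsFresh(long baseLmtAtBuildStart, long currentBaseLmt) {
        return currentBaseLmt <= baseLmtAtBuildStart;
    }

    // Point 5: a rebuild can return immediately when the index is fresh,
    // unless the caller forces it (for example, to repair bad metadata).
    static boolean shouldRebuild(long baseLmtAtBuildStart, long currentBaseLmt, boolean force) {
        return force || !proposedIsFresh(baseLmtAtBuildStart, currentBaseLmt);
    }
}
```

With the timeline above, the naive check compares timestamp4 against timestamp3 and accepts the index, while the proposed check compares timestamp0 against timestamp3 and rejects it.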
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890137#action_12890137 ]

John Sichi commented on HIVE-417:

Preliminary draft of design doc is here:

http://wiki.apache.org/hadoop/Hive/IndexDev

Yongqiang and I are still working out some of the details.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889345#action_12889345 ]

John Sichi commented on HIVE-417:

Here are some preliminary comments on the metastore work. We can move on to the plugin design next week and start getting all of this into a doc.

* We should support a property on the index which controls the name of the index table, and only generate an index table name automatically in the case where the user doesn't supply the property. For this, we'll need to add property key/values to the grammar (IDXPROPERTIES like TBLPROPERTIES and SERDEPROPERTIES?).
* The grammar supports control over the tableFileFormat for the index table; what about other attributes such as row format, location, and TBLPROPERTIES? Some of these may be dictated by the index implementation, but it may be useful to override in some cases (same as tableFileFormat).
* Is the partitioning for the index independent of the partitioning for the table? Don't we need to allow control over this in the grammar?
* I think we should track the status of the index (when was the last time it was rebuilt, if ever) so that we know whether it is fresh with respect to the base table data. How should we model this in such a way that it takes per-partition indexing into account?
* Some metastore followups to be logged separately: COMMENT clause on index definition; DESCRIBE INDEX; SHOW INDEXES; dealing with base table columns being dropped/renamed out from under the index
* For generating the index table structure, we'll need to move that to the plugin (rather than in Hive.java), since each index will need a different table structure (or no table structure at all).
* Test queries: remember to add ORDER BY for determinism. Also, I'm not sure whether it is safe to use /tmp in the local file system (it may not exist, e.g. on Windows). I used it in hbase_bulk.m, but that uses a mini HDFS cluster (not the local file system).
* Dropping a table with an index on it currently gives the exception below (in Derby; I didn't test MySQL yet). Same for attempting to drop an index table directly (instead of dropping the index). The second case should either fail with a meaningful exception, or implicitly drop the index definition as a trigger from dropping the table.

hive> create table t1(i int);
OK
hive> create index q type compact on table t1(i);
OK
hive> drop table t1;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'INDEXS_FK3' for key (12). The statement has been rolled back.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

hive> create table t5(i int);
OK
hive> create index r type compact on table t5(i);
OK
hive> drop table default__t5_r__;
FAILED: Error in metadata: javax.jdo.JDODataStoreException: Exception thrown flushing changes to datastore
NestedThrowables:
java.sql.BatchUpdateException: DELETE on table 'TBLS' caused a violation of foreign key constraint 'INDEXS_FK2' for key (17). The statement has been rolled back.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
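The first suggestion in this comment (use a user-supplied index table name when a property provides one, otherwise generate a name) can be sketched in Java as follows. IndexTableNameSketch and both method names are hypothetical. The generated pattern is an assumption inferred from the single name default__t5_r__ in the session above (index r on table t5 in the default database), and the real naming logic may differ.

```java
// Hypothetical helper for choosing an index table name.
final class IndexTableNameSketch {
    private IndexTableNameSketch() {
    }

    // Assumed pattern: <db>__<table>_<index>__, inferred from default__t5_r__.
    static String generatedName(String db, String table, String index) {
        return db + "__" + table + "_" + index + "__";
    }

    // Prefer the user-supplied name; fall back to the generated one when absent.
    static String resolve(String userSuppliedName, String db, String table, String index) {
        if (userSuppliedName != null && !userSuppliedName.isEmpty()) {
            return userSuppliedName;
        }
        return generatedName(db, table, index);
    }
}
```

Keeping the fallback in one place would also make it straightforward for a plugin to skip table creation entirely when an index type needs no table.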
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12889376#action_12889376 ]

He Yongqiang commented on HIVE-417:

Thanks for the detailed comments.

> We should support a property on the index which controls the name of the index table, and only generate an index table name automatically in the case where the user doesn't supply the property.

Will add this in the following patch.

> For this, we'll need to add property key/values to the grammar (IDXPROPERTIES like TBLPROPERTIES and SERDEPROPERTIES?).

Let's do it in a followup JIRA.

> The grammar supports control over the tableFileFormat for the index table; what about other attributes such as row format, location, and TBLPROPERTIES? Some of these may be dictated by the index implementation, but it may be useful to override in some cases (same as tableFileFormat).

We can add this when we see the requirement. For now we can leave this out.

> I think we should track the status of the index (when was the last time it was rebuilt, if ever) so that we know whether it is fresh with respect to the base table data. How should we model this in such a way that it takes per-partition indexing into account?

I think it's the same as the key/value property one, no?

> Test queries: remember to add ORDER BY for determinism.

Will add this in the following patch.

> Also, I'm not sure whether it is safe to use /tmp in the local file system (it may not exist, e.g. on Windows). I used it in hbase_bulk.m, but that uses a mini HDFS cluster (not the local file system).

I think it should be OK because it's not the local /tmp; it's the mini HDFS /tmp.

> Dropping a table with an index on it currently gives the exception below (in Derby; I didn't test MySQL yet). Same for attempting to drop an index table directly (instead of dropping the index). The second case should either fail with a meaningful exception, or implicitly drop the index definition as a trigger from dropping the table.

Actually this was reported by Prafulla offline. Will add this in the following patch. For the second case, I am planning to report an error.
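The two drop cases discussed above can be sketched with a toy in-memory catalog. This is only an illustration under assumptions: `ToyMetastore` and `MetastoreError` are hypothetical names, not Hive's metastore API, and cascading the drop of a base table to its indexes is one possible fix for the first case rather than the behavior the patch is confirmed to implement. The index table naming follows the `default__t5_r__` name seen in the reported session.

```python
class MetastoreError(Exception):
    """Raised when a DDL operation would leave index metadata dangling."""


class ToyMetastore:
    """Hypothetical in-memory stand-in for the metastore; not Hive's API."""

    def __init__(self):
        self.tables = set()
        self.indexes = {}  # index name -> (base table, index table)

    def create_table(self, name):
        self.tables.add(name)

    def create_index(self, index, base):
        # Naming mirrors the index table name seen in the thread.
        index_table = f"default__{base}_{index}__"
        self.tables.add(index_table)
        self.indexes[index] = (base, index_table)
        return index_table

    def drop_table(self, name):
        if name not in self.tables:
            raise MetastoreError(f"Table {name} does not exist")
        # Second case: refuse to drop an index table directly.
        for index, (base, index_table) in self.indexes.items():
            if name == index_table:
                raise MetastoreError(
                    f"{name} is the index table of index {index} on {base}; "
                    "drop the index instead"
                )
        # First case (assumed fix): drop dependent indexes, then the table.
        dependents = [i for i, (base, _) in self.indexes.items() if base == name]
        for index in dependents:
            _, index_table = self.indexes.pop(index)
            self.tables.discard(index_table)
        self.tables.discard(name)
```

Checking the index map before touching the table is what turns the foreign key violation into a readable error message.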
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888946#action_12888946 ]

John Sichi commented on HIVE-417:

Whoops, relationships connecting TBLS/SDS and IDXS/SDS got lost; will attach another diagram which fixes that.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886634#action_12886634 ]

Jeff Hammerbacher commented on HIVE-417:

Hey,

Any chance you guys could post a more detailed design document for full-fledged index support? I'm quite curious to read up on it.

Thanks,
Jeff
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886764#action_12886764 ]

John Sichi commented on HIVE-417:

@Jeff: Yes, we'll put it up on the wiki, similar to how we did for storage handler + HBase.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886308#action_12886308 ]

Prafulla Tekawade commented on HIVE-417:

Hi Yongqiang,

I am having a problem creating SUMMARY indexes. This index is not built by the update index command. The COMPACT SUMMARY index works fine. Is there a problem with the creation of the SUMMARY index table?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886380#action_12886380 ]

He Yongqiang commented on HIVE-417:

I think the SUMMARY index's mapper code is commented out in the uploaded patch.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12886587#action_12886587 ]

John Sichi commented on HIVE-417:

Based on discussion with Yongqiang, we've decided to go for Full-fledged index support.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884685#action_12884685 ]

Ashish Thusoo commented on HIVE-417:

Looked at the code and have some questions...

Can you explain how the metastore object model is laid out? It seems that the table names of the index are stored in key/value properties of the table that the index is created on. Is that correct? Would it be better to put a key reference from the index table to the base table instead (similar to what is done for partitions)?

Also, how would this be used to query the table? Can you give an example? Is the idea here to select from the index and then pass the offsets to another query to look up the table? An example or a test which shows the query on the base table would be useful.
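To make the question about passing offsets concrete, here is a minimal sketch of a two-step lookup: first consult an index that maps each key to byte offsets, then seek directly to those offsets in the base data. It is an assumption-laden toy, not the patch's implementation: the function names are hypothetical, the base table is a single in-memory byte string of newline-terminated rows, and the real index would also have to record which file each offset belongs to.

```python
import io


def build_offset_index(data: bytes, key_of):
    """Step 0: map each key to the byte offsets of the rows that carry it."""
    index = {}
    offset = 0
    for line in io.BytesIO(data):  # yields rows including the trailing newline
        index.setdefault(key_of(line), []).append(offset)
        offset += len(line)
    return index


def lookup(data: bytes, index, key):
    """Steps 1 and 2: read offsets from the index, then seek instead of scanning."""
    stream = io.BytesIO(data)
    rows = []
    for offset in index.get(key, []):
        stream.seek(offset)
        rows.append(stream.readline().rstrip(b"\n"))
    return rows
```

With `data = b"1,a\n2,b\n1,c\n"` and the first comma-separated field as key, the index holds offsets 0 and 8 for key `1`, and `lookup` reads only those two rows.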
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884869#action_12884869 ]

John Sichi commented on HIVE-417:

Had a chat with Ashish and Yongqiang offline, and came up with three alternatives.

1) Shortest path to checkin: Treat current code as prototype and move it into contrib, providing a utility for creating/updating the index, and keeping changes to core classes to a minimum. As Yongqiang pointed out, this makes it harder to follow up with automatic use of the index due to the lack of metadata. If we do this, we should create a new JIRA issue for its limited scope.

2) Full-fledged index support: change the JDO metamodel to add support for indexes as first class objects, and come up with a pluggable index creation+access design framework which can encompass a variety of index types likely to be needed in the future. Code from this patch would become the first such index implementation provided. If we do this, we should continue on in this truly epic JIRA issue.

3) Rework as materialized view: keep the JDO metamodel as is (adding a new table type for MATERIALIZED_VIEW) but change the DDL to CREATE MATERIALIZED VIEW AS SELECT ... and then come up with the system functions needed (e.g. for accessing file offsets) in order to be able to express the index construction as SQL. We would then execute view materialization in a fashion similar to CREATE TABLE AS SELECT. This approach best reflects the way the current code models an index as an ordinary table, but requires some other changes (e.g. CTAS + dynamic partitioning, something we want anyway). If we do this, we should create a new JIRA issue since it's a different feature from the user POV.

We're aiming to reach a decision next week; input is welcome on whether these alternatives make sense (and on others we should consider).

Since this JIRA issue is already so overloaded, we would also like to treat the following two items as separate followup JIRA issues rather than trying to address it all at once:

* rewrite framework
* automatic usage of index or materialized view by optimizer
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884404#action_12884404 ]

Namit Jain commented on HIVE-417:

A few higher level comments:

1. Populate the index at create index.
2. Instead of proposing a new syntax, why don't we use 'alter index INDEX_NAME ON TABLE_NAME REBUILD;'?
3. Since the code is in a prototype stage, can we move the index code to contrib?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884431#action_12884431 ]

Jeff Hammerbacher commented on HIVE-417:

bq. 3. Since the code is in a prototype stage, can we move the index code to contrib ?

It's been the experience of other Hadoop-related projects that contrib gets messy. It has proven effective to either keep experimental features in mainline trunk or to put them up on github.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884434#action_12884434 ]

Namit Jain commented on HIVE-417:

        if (work.getReducer() != null) {
          work.getReducer().jobClose(job, success, feedBack);
        }
        if (IndexBuilderBaseReducer.class.isAssignableFrom(this.getReducerClass())) {
          this.closeIndexBuilder(job, success);
        }
      }

Instead of the above code in ExecDriver, IndexBuilderBaseReducer/CompactSumReducer should have a jobClose - no code change needed in ExecDriver.

I would still vote for the index code to be in contrib; it will take some time to clean it up - then it should be moved to the mainline. Till then, it is usable, but in a prototype state. What we should aim for is minimum changes in ql/, and put all changes in contrib for now. As they become stable, we can pull them in - even the DDLSemanticAnalyzer changes should be factored into contrib.
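The suggestion above amounts to replacing a type check in the driver with an overridable hook on the reducer. A minimal sketch of that design, with hypothetical class and function names (`Reducer`, `IndexBuilderReducer`, `driver_close`) standing in for the Hive classes, and returned strings standing in for the real cleanup work:

```python
class Reducer:
    """Base reducer: the default job_close hook does nothing."""

    def job_close(self, job, success):
        return []


class IndexBuilderReducer(Reducer):
    """Index-specific cleanup lives with the index code, not in the driver."""

    def job_close(self, job, success):
        action = "finalize" if success else "discard"
        return [f"{action} index output for {job}"]


def driver_close(reducer, job, success):
    """Driver-side close: no isinstance/isAssignableFrom branch is needed."""
    if reducer is None:
        return []
    return reducer.job_close(job, success)
```

The driver stays unchanged when new reducer types are added, which is the point of keeping changes in ql/ to a minimum.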
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884067#action_12884067 ]

Namit Jain commented on HIVE-417:

Looking at the patch (not yet in detail) seems to suggest the following:

1. The index file can only be a text file.
2. PROJECTION index is not used - I mean, to start, can we just get the basic COMPACT+SUMMARY and only support that.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884110#action_12884110 ]

Namit Jain commented on HIVE-417:

DDLSemanticAnalyzer.java

    if (outputFormat == null) {
      outputFormat = RCFileOutputFormat.class;
    }

Use the default - don't hardcode.
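One way to read "use the default" is to fall back to a configured default file format when the DDL does not name one. The sketch below assumes the configuration is a plain dictionary; the key name is modeled on Hive's `hive.default.fileformat` setting, and `resolve_output_format` is a hypothetical helper, not a function from the patch.

```python
# Assumption: key name modeled on Hive's hive.default.fileformat property.
DEFAULT_FORMAT_KEY = "hive.default.fileformat"


def resolve_output_format(requested, conf, fallback="TextFile"):
    """Prefer the format given in the DDL, else the configured default."""
    if requested is not None:
        return requested
    return conf.get(DEFAULT_FORMAT_KEY, fallback)
```

An explicit format in the statement still wins; only the unspecified case defers to configuration.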
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877049#action_12877049 ]

Prafulla Tekawade commented on HIVE-417:

I was thinking of adding something called a query rewrite module. It would be a rule-based query rewrite system that rewrites a query into a semantically equivalent query which is more optimized and/or uses indexes (not just for scans, but for other query operators, e.g. GroupBy).

E.g.

    select distinct c1 from t1;

If we have a dense index ('compact summary index' in this Hive indexing patch) on c1, this query can be replaced with a query on the index table itself:

    select idx_key from t1_cmpct_sum_idx;

Similar query transformations can happen for other queries. The module will be placed just before the optimizer and will help the optimizer. The module structure looks like below.

    [Query parser]
    [Query rewrites]  -- new phase
    [Query optimization]
    [Query execution planner]
    [Query execution engine]

The rewrite module is 'generic', not just for the above indexing case, but for other cases too, e.g. OR predicates to union (for efficiency?), outer join to union of anti semi joins, moving 'order by' out of a union subquery, etc.

The aim is to implement a very simple, light-weight rewrite support, implement the indexing related rewrites (the above rewrite does not even need a new run-time map-red operator) and integrate indexing support quickly and cleanly. As noted above, this rewrite phase is rule-based (and not cost-based), a sort of early optimization.

Let me know what you think. I'll start with reading your patch. This would cover most of TODO 1; TODOs 2 and 3 will have to be looked into.
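A single rule of the kind proposed above can be sketched as follows. This is a deliberately naive illustration: it matches the query text with a regular expression, whereas a real rewrite phase would work on the analyzed query tree, and the `indexes` catalog mapping `(table, column)` to an index table name is a hypothetical stand-in for metastore metadata. Only the example query and the `idx_key` / `t1_cmpct_sum_idx` names come from the comment.

```python
import re

# Matches only the simplest form: select distinct <col> from <table>;
DISTINCT_RE = re.compile(
    r"^\s*select\s+distinct\s+(\w+)\s+from\s+(\w+)\s*;?\s*$", re.IGNORECASE
)


def rewrite_distinct(query, indexes):
    """Rewrite a DISTINCT scan to an index-table scan when an index exists.

    indexes maps (table, column) -> index table name (hypothetical catalog).
    Queries that do not match the rule are returned unchanged.
    """
    match = DISTINCT_RE.match(query)
    if not match:
        return query
    column, table = match.group(1), match.group(2)
    index_table = indexes.get((table, column))
    if index_table is None:
        return query
    return f"select idx_key from {index_table};"
```

Because the rule is purely pattern-based and never consults costs, it fits the "rule-based, not cost-based" framing: a rule either applies or leaves the query alone.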
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877144#action_12877144 ]

He Yongqiang commented on HIVE-417:

Plan sounds perfectly good to me!
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877236#action_12877236 ]

Ashish Thusoo commented on HIVE-417:

A couple of comments on this:

A complication of doing a rewrite just after parse is that you lose the ability to report back errors that correspond to the original query. Also, the metadata that you need to do the rewrite is only available after phase 1 of semantic analysis. So in my opinion the rewrite should be done after semantic analysis but before plan generation. Is that what you had in mind... so something like...

    [Query parser]
    [Query semantic analysis]
    [Query optimization]
    ...
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12877295#action_12877295 ]

Prafulla Tekawade commented on HIVE-417:

Yes Ashish, that's what I had in mind. The rewrite system would need metadata, and hence it should be invoked after the semantic analysis phase, which would make the metadata available.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876676#action_12876676 ]

Prafulla Tekawade commented on HIVE-417:

He Yongqiang,

Have you started working on this one? If not, I am interested in taking a look at it. The patch link hive-417-2009-07-18.patch is not working; can you share the latest patch here?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876717#action_12876717 ]

He Yongqiang commented on HIVE-417:

Cool. Yes, I do have a newer patch for this JIRA. I will clean it up and post it.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876827#action_12876827 ]

He Yongqiang commented on HIVE-417:

I forgot to add this line

    set hive.exec.compress.output=false;

in the above snippet before selecting from the index table.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12835569#action_12835569 ]

He Yongqiang commented on HIVE-417:

I talked with Prasad about this issue today. I may not be able to finish this in the coming one or two months, since I am now spending most of my time working on some other issues. I am sorry about that. If anyone wants this feature in, please feel free to take over from me, and I will provide all the help that I can. If no one picks it up, I can finish it after finishing the issues at hand.

Thanks.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758135#action_12758135 ]

Joydeep Sen Sarma commented on HIVE-417:

MDC also maintains metadata separately - at least based on their paper (http://www.research.ibm.com/compsci/project_spotlight/datamgmt/SIGMOD2003.pdf)
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758139#action_12758139 ]

Prasad Chakka commented on HIVE-417:

Yes they do, but they don't use it for table scans, which are done if the query selectivity is greater than 10% (or some such). They use the index for index scans and in joins. I wrote the table scan code :)
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758556#action_12758556 ]

Schubert Zhang commented on HIVE-417:

{quote}
Prasad Chakka added a comment - 15/Apr/09 11:25 AM
Another way of doing it is to create a file format that contains index along with data... but i think that would take lot more time.
{quote}

We are trying to store data in sorted and block-indexed files (such as HFile or TFile). Then I think we can know the startKey and lastKey of each file and each block. This block index (block summary) is just for the primary key.
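The block summary described above supports a binary search over blocks when the file is sorted by primary key. The sketch below assumes the summary is a list of `(start_key, last_key)` pairs in sorted, non-overlapping order; `find_block` is a hypothetical helper and does not reflect the HFile or TFile on-disk index layout.

```python
from bisect import bisect_left


def find_block(blocks, key):
    """Return the index of the block whose key range covers key, else None.

    blocks: list of (start_key, last_key), sorted and non-overlapping.
    """
    last_keys = [last for _, last in blocks]
    # First block whose last key is >= key is the only candidate.
    i = bisect_left(last_keys, key)
    if i < len(blocks) and blocks[i][0] <= key:
        return i
    return None
```

A lookup touches one candidate block at most, but only for the sort key; other columns get no benefit from this summary, which matches the "just for the primary key" caveat.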
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758067#action_12758067 ]

Jeff Hammerbacher commented on HIVE-417:

Another type of index worth knowing about: the negative index/storage index from Exadata, described at http://blogs.oracle.com/datawarehousing/2009/09/500gbsec_and_database_machine.html. We get some negative indexing for free with partitions, but this may be useful for more distinctive scans over columns for which we have not partitioned.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758084#action_12758084 ]

Prasad Chakka commented on HIVE-417:

@jeff, I think this is more suitable for storing along with the data, where blocks of data can be skipped while scanning rows. I think columnar storage might already be doing this.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758094#action_12758094 ] Jeff Hammerbacher commented on HIVE-417: Yeah, I think so as well. Did my comment make it seem like I thought otherwise?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758097#action_12758097 ] Prasad Chakka commented on HIVE-417: There can be a summary index here as well (every SequenceFile block will have min/max column values in the index). I thought you were hinting at that.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758116#action_12758116 ] Joydeep Sen Sarma commented on HIVE-417: Are there any references on this technique? Someone had earlier suggested this (apparently from reading Netezza documentation), but I don't understand when it would work. Why would a (fairly large) SequenceFile block contain only a limited range of values (assuming the metadata stores a min-max range)? In most cases I can imagine in our dataset, columns would either have low cardinality (so most values would be present) or, for high-cardinality ones, the distribution would be random (relative to the primary sort key), and the range would seem ineffective. The exception would be columns that are closely related to how the data is sorted/partitioned (perhaps some product ids are limited to a specific range of time, while the partitioning is on time and not product id; even that sounds dubious). A bloom filter would seem much more plausible at allowing good filtering. Even then, I don't understand why this sort of metadata should be kept along with the block and not separately (much more flexible, since it can be added on demand), as this jira is headed towards.
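To make the bloom-filter suggestion in the comment above concrete, here is a minimal Python sketch of a per-block filter kept separately from the data. It is not Hive code: the class name `BlockBloomFilter`, the helper `blocks_to_read`, and the sizing defaults are all hypothetical, chosen only for illustration.

```python
import hashlib


class BlockBloomFilter:
    """Tiny Bloom filter summarising the values of one column in one block."""

    def __init__(self, num_bits=4096, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent", so the block can be skipped.
        # True may be a false positive, so the block still has to be read.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def blocks_to_read(block_filters, value):
    """Return the indices of blocks that cannot be ruled out for `value`."""
    return [i for i, f in enumerate(block_filters) if f.might_contain(value)]
```

Unlike a min-max range, such a filter does not depend on the values in a block being clustered, which is the property questioned above; the price is a small false-positive rate and no support for range predicates.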
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12758127#action_12758127 ] Prasad Chakka commented on HIVE-417: I don't think it makes much sense unless there is some clustering or sorting property. If there is clustering and sorting, and the selectivity of a query is much higher than 10%, then storing this metadata along with the data makes sense instead of in a separate block. The 10% threshold may be larger for Hive, but the point still stands. In the OLAP case data seldom changes, and the size of this kind of metadata is much smaller than the data itself, so the overhead of storing it is negligible. Something similar is done in DB2 Multi-Dimensional Clustering, where whole blocks (disk blocks) are skipped if the key value doesn't fit the query.
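The disagreement above about min/max block summaries can be illustrated with a small Python experiment. This is only a sketch under assumed numbers (10,000 values, blocks of 100 rows); the function names are hypothetical and nothing here reflects actual SequenceFile block sizes.

```python
import random


def block_ranges(values, block_size):
    """Return one (min, max) summary per fixed-size block of values."""
    return [(min(values[i:i + block_size]), max(values[i:i + block_size]))
            for i in range(0, len(values), block_size)]


def skippable_blocks(ranges, lo, hi):
    """Count blocks whose [min, max] cannot overlap the query range [lo, hi]."""
    return sum(1 for bmin, bmax in ranges if bmax < lo or bmin > hi)


column = list(range(10_000))            # clustered: file is sorted on this column
shuffled = column[:]
random.Random(0).shuffle(shuffled)      # unclustered: values spread over all blocks

# Equality lookup for the value 5000 against both layouts.
sorted_skips = skippable_blocks(block_ranges(column, 100), 5000, 5000)
random_skips = skippable_blocks(block_ranges(shuffled, 100), 5000, 5000)
```

With the sorted layout all blocks but one can be skipped; with the shuffled layout nearly every block's range covers the whole domain, so the summary prunes almost nothing. That matches the point that the technique only pays off when there is a clustering or sorting property.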
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12735271#action_12735271 ] He Yongqiang commented on HIVE-417: Created HIVE-678 to add support for building indexes; see https://issues.apache.org/jira/browse/HIVE-678
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734203#action_12734203 ] Prasad Chakka commented on HIVE-417:
1) Are you worried about the sort phase of the reducer or the IndexBuilder's reducer code? I don't think the former will be a problem. The latter can be avoided by writing multiple rows for a key if the number of offsets exceeds a certain limit. So the reducer can flush the offsets to disk periodically, thus avoiding OutOfMemory exceptions in the reducer.
2) What are the other options for the index output format?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734405#action_12734405 ] He Yongqiang commented on HIVE-417:
1) For a given key, we are using a sorted set for each bucket to store positions at the reducer. I am worried that one sorted set for each bucket may cause an out-of-memory problem. As you commented earlier, in the List<bucketname, List<offset>> column, offsets are sorted. Think about one extreme situation: one file contains a single value a million times. Then at the reducer we are storing a million positions in a sorted set.
> So reducer can flush the offsets periodically to disk thus avoiding OutOfMemory exceptions in reducer.
If we do this, how can we guarantee they are sorted? I mean, that the offsets after a flush are greater than the offsets in the previous flush.
2)
> What are the other options for the index output format?
I think there are no other options. We need to discard the key part, and I think in Hive only IgnoreKeyTextOutputFormat does that. Of course, all of Hive's custom HiveOutputFormat implementations can discard the key part, but they cannot be specified in the map-reduce jobconf, since they do not extend OutputFormat.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734408#action_12734408 ] Prasad Chakka commented on HIVE-417: Well, the number of offsets can't exceed the number of SequenceFile blocks, since we can only index the SequenceFile block offsets. So the problem is not as dire as it could be. Also, if that many rows (i.e. more than 10% of rows in a traditional RDBMS, but maybe more in the Hadoop case) have the same key, then the index may not be efficient after all, since it is better to read the whole table anyway.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12734419#action_12734419 ] Prasad Chakka commented on HIVE-417: What I am trying to say is that for such frequent keys indexing may not be of much help, so maybe we can relax the 'sort' property? I don't think there is another easy way out other than doing a disk-based sort. Check whether you can reuse any of the Hadoop sorting code. Or can we piggyback this sorting on top of the Hadoop reduce sort phase somehow?
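One possible reading of the "piggyback on the reduce sort phase" idea, combined with the earlier suggestion of writing multiple rows per key, is sketched below in plain Python. This is a simulation, not Hadoop or Hive code: `shuffle_sort` stands in for the framework's sort, and the assumption (not stated in the thread) is that the offset is made part of the sort key so the reducer sees offsets already in order.

```python
from itertools import groupby


def shuffle_sort(mapper_output):
    """Stand-in for the shuffle: order records by (key, file, offset)."""
    return sorted(mapper_output)


def reduce_in_chunks(sorted_records, max_offsets_per_row):
    """Emit (key, file, offsets) rows, flushing once a row is full.

    Because input arrives ordered by (key, file, offset), every flushed
    row continues where the previous one stopped, and no sorted set of
    all offsets has to be held in memory.
    """
    for (key, file_name), group in groupby(sorted_records, key=lambda r: r[:2]):
        chunk = []
        for _, _, offset in group:
            chunk.append(offset)
            if len(chunk) == max_offsets_per_row:
                yield key, file_name, chunk
                chunk = []
        if chunk:
            yield key, file_name, chunk


records = [("a", "f1", 300), ("b", "f1", 40), ("a", "f1", 100),
           ("a", "f2", 7), ("a", "f1", 200)]
rows = list(reduce_in_chunks(shuffle_sort(records), max_offsets_per_row=2))
```

Here key "a" in file "f1" produces two index rows, [100, 200] followed by [300], so offsets in a later flush are always greater than those in the previous one, which is the guarantee asked about above.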
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12728900#action_12728900 ] Prasad Chakka commented on HIVE-417: One thing that isn't mentioned is that in the List<bucketname, List<offset>> column, offsets are sorted. Another thing missing is that when an index on a partition is built, a new partition will be created for that index table (similar to creating a partition for a regular table). We can distinguish index tables from regular tables by having a table parameter. We can skip partition-specific indexes in the first phase if that reduces the amount of work, and assume indexes defined on a table can be created on all partitions.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12725103#action_12725103 ] schubert zhang commented on HIVE-417: Prasad, Thanks for your comments. Now I understand them. Yes, in one of our projects, we sorted the data table and built a sparse index which records the block keys and file offsets. Then we loaded the index files into HBase to serve queries. It works fine.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12722773#action_12722773 ] Prasad Chakka commented on HIVE-417: Schubert, We can run another map-reduce job that scans the index and builds a results file sorted by the index key. This file can be read sequentially to determine which HDFS blocks of the input table should be fed to the actual job for the query. Another way is to build a sparse index on the index. But if the table itself is sorted, we can build the sparse index (a la MapFile) directly and use it. At Facebook, the use case we have doesn't have this sorting property, but I can envision this being useful for primary indexes where the index sort order and the table sort order are the same. Can you think of any other ways? Of course, we can process index files using HBase or TokyoCabinet, but that requires another system to be set up and administered, and both systems need to be available for index processing. But in some cases these solutions also work. The indexing scheme described above should play well with HBase and TokyoCabinet, since the index is a file with rows containing a key and position parameters. In Hadoop we can store that in a SequenceFile or maybe a TFile, but if the rows have to be stored in external systems, we can plug in a custom SerDe and change the default location of these to a location where the external systems can access these files.
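The "sparse index on a sorted table (a la MapFile)" idea mentioned above can be sketched in a few lines of Python. This is an illustration only: list positions stand in for file byte offsets, and `build_sparse_index`, `lookup` and the interval `every_n` are hypothetical names, not the MapFile API.

```python
import bisect


def build_sparse_index(sorted_rows, every_n):
    """Keep the key and position of every n-th row of a sorted file."""
    return [(key, pos) for pos, (key, _) in enumerate(sorted_rows)
            if pos % every_n == 0]


def lookup(sorted_rows, sparse_index, key):
    """Find all rows for `key`: jump via the sparse index, then scan forward."""
    keys = [k for k, _ in sparse_index]
    # Last sparse entry whose key is strictly below the target, so that
    # runs of duplicate keys spanning several entries are not missed.
    slot = bisect.bisect_left(keys, key) - 1
    start = sparse_index[slot][1] if slot >= 0 else 0
    matches = []
    for row_key, value in sorted_rows[start:]:
        if row_key > key:
            break
        if row_key == key:
            matches.append(value)
    return matches


table = [(k, f"row-{k}") for k in range(0, 1000, 5)]   # sorted on the key
sparse = build_sparse_index(table, every_n=16)
```

The sparse index holds one entry per 16 rows, so a lookup reads at most a short run of rows after the jump instead of the whole file. The scheme relies entirely on the table being sorted on the indexed key, which is the precondition noted in the comment.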
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715539#action_12715539 ] Seymour Zhang commented on HIVE-417: Are we going to have one index file per HDFS file? Can we also support exporting these index files as a table to some other storage system like HBase or Tokyo Cabinet, i.e. can these separate index files for each HDFS file be expressed as a single table in Hive?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715153#action_12715153 ] Joydeep Sen Sarma commented on HIVE-417:
- Are we going to have one index file per HDFS file (or one per partition)? A related question is how this is going to interact with sampling (I think the sampling predicate is currently optimized out for bucketed tables, although I'm not terribly sure). I would love to see the API to invoke the index.
- Ideally we would like to plug in different indexing schemes. The same goes for map-side joins: the hashmap storing the smaller table can be seen as an index on this table. It would seem that one should be able to replace a map-side join based on tables loaded into jdbm with tables with the indices proposed here (and thereby do joins based on indices almost trivially).
- We should enable people to plug in their own indices (since it's quite likely that over time there will be multiple indexing efforts on Hadoop files).
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12715345#action_12715345 ] He Yongqiang commented on HIVE-417: Joydeep, Thanks for the concern.
> are we going to have one index file per hdfs file?
Yeah.
> i would love to see the api to invoke the index
Currently it is not settled. I will try to provide it next week.
> enable people to be able to plug in their own indices
I think if we have a well-designed, adaptable API, then this can be addressed.
> we would like to plug in different indexing schemes
Yes. I have proposed several schemes in previous posts. Can you give me some schemes, so I can compare them and make a better design?
BTW, I will try to write a proposal next week. I have an important English exam this weekend. Sorry for the delay.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714730#action_12714730 ] He Yongqiang commented on HIVE-417: Thanks for the suggestions, Seymour. I have also thought about what you said: directly fetching the data instead of initializing a new MR job. I will try to include this, but it may be done in the second phase (the optimization phase).
> I'd like to treat these rows of same col values as a block and only use a single index entry for this block
In the design, we indeed use only one index entry. And not only for contiguous values: we use the same index entry for all rows with the same col value.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714492#action_12714492 ] Seymour Zhang commented on HIVE-417: Hello Prasad and Yongqiang, Thank you very much for this great effort. One of my suggestions would be this: since we've done the indexing with MapReduce, for some queries based on the generated indexes, can we just omit the time-consuming MapReduce phase at query time? We already have all of the files/offsets, so we can go to these specific file offsets directly to get the relevant rows of the table. This would greatly expedite the query process. It would be helpful for the following case in one of my usages of Hive. With Hive, I've already sharded (by date) and bucketed (by column hashing) my log data into hierarchical files. I've also sorted each file by the hashing columns. As I may have many rows with the same column values but different timestamps, to minimize index size I'd like to treat these rows of same col values as a block and only use a single index entry for this block. This will greatly reduce the index size for my data, but will still be very useful for my queries on those columns.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12714306#action_12714306 ] Prasad Chakka commented on HIVE-417: The plan looks good. I am not sure we need to create a sparse index on the dense index in phase 1. In most cases the dense index will be small enough that an additional MR job for processing the sparse index will be unnecessary. If the sparse index is not necessary, then there is no need for the dense index to be sorted. Since the dense index is scanned completely while processing the query, we can use the index if any predicate column exists in the index definition.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12713003#action_12713003 ] He Yongqiang commented on HIVE-417: I checked how MySQL handles indexes and found that MySQL cannot use an index for the situations in my earlier post either:
> but, we can not use it for queries like:
> 4) select * from table1 where col234 and col33
> 5) select * from table1 where col2 =34
> 6) select * from table1 where col3 45
And now a basic idea for our index design, just as Prasad commented in a previous post:
1) Index structure: use an MR job to create the index. The input is a file with all columns, and the mapper outputs kv pairs, where the key is (indexed col1, indexed col2, ...) and the value is the offset. We define a comparator for (indexed col1, indexed col2, ...) to let the shuffle phase sort all mappers' output. In the reducer, we combine the kv pairs into (indexed col1, indexed col2, ...) list_of_offsets. This is a dense sorted index; we then create a sparse index on the dense index. We also collect column data distribution information (a histogram) while doing this.
2) We consider using an index for a query only when the query involves the columns of the leftmost part of the index. We also need to consider index merge when two indexes are involved, and a cost estimation to decide whether using the index will decrease query time (this is the work that needs to be done in the optimizer).
As a first step, we can finish part 1 and the Hive QL part, then consider part 2 (the optimizer part). After part 1 is finished, I will examine part 2 in more detail.
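The dense sorted index and the leftmost-prefix rule from the design comment above can be sketched as follows. This is a single-process Python illustration, not the MR job itself; `build_dense_index`, `usable_prefix` and `lookup_col1` are hypothetical names, and a two-column key is assumed for brevity.

```python
import bisect


def build_dense_index(rows):
    """rows: iterable of ((col1, col2), offset). Returns a sorted list of
    ((col1, col2), [offsets]) entries, one per distinct key."""
    grouped = {}
    for key, offset in rows:
        grouped.setdefault(key, []).append(offset)
    return [(key, sorted(offsets)) for key, offsets in sorted(grouped.items())]


def usable_prefix(index_cols, predicate_cols):
    """Longest leading run of index columns that the query constrains.
    An empty result means a sorted index cannot narrow the search."""
    prefix = []
    for col in index_cols:
        if col not in predicate_cols:
            break
        prefix.append(col)
    return prefix


def lookup_col1(dense_index, col1_value):
    """Equality lookup on the leading column via binary search."""
    keys = [key for key, _ in dense_index]
    # (v,) sorts before every (v, x), so this finds the first key with col1 >= v.
    start = bisect.bisect_left(keys, (col1_value,))
    offsets = []
    for key, offs in dense_index[start:]:
        if key[0] != col1_value:
            break
        offsets.extend(offs)
    return offsets
```

Because entries are ordered by col1 first, all keys sharing a col1 value are adjacent and a binary search can find them. A predicate on col2 alone gives an empty usable prefix, which is why the quoted queries on col2 and col3 cannot benefit from the sort order.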
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12712352#action_12712352 ] He Yongqiang commented on HIVE-417: Thanks a lot, Prasad. I will put questions on the jira, and will start working on it after we settle the design. Looking forward to working on it.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710741#action_12710741 ] Prasad Chakka commented on HIVE-417: Yes, mostly the block/pos size will be small, but I don't think we can assume that, since there will be enough cases where it will not be true. We can explore the other approaches later on; different indexes will be useful in different scenarios. I will try to post some code this week.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710494#action_12710494 ] Prasad Chakka commented on HIVE-417: The above index is not a hash index: with a hash index you can't do range queries, and lookups are constant time. I'm not sure what to call this, except that it is a view (a simple projection) of the base table with offsets into the base table. On the sparse index, I meant that you can create a sparse index on top of the index I described above, but this can be done later.
> And in most cases, the block/pos list's size will only be 1
That is not the case if the index is on a non-primary-key column, and I think that is mostly the case where indexes will be used in data warehouses.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710591#action_12710591 ] He Yongqiang commented on HIVE-417:
> that is not the case if the index is on a non-primary key column. and i think, mostly this is the case where indexes will be used in data warehouses.
Yes. If the index is built on one column, the block/pos list's size will be large. But if it is built on many columns, I think the block/pos list's size will be small. Anyway, we can build this index as the first step. After this is finished, we can try other kinds of index, like: 1) sort-based index 2) Lucene index 3) block-scope B+Tree or R-tree or other advanced index data structures. Prasad, you said you already wrote some code; would you please attach it?
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710204#action_12710204 ] Prasad Chakka commented on HIVE-417:
1) The question you raised applies only to B+Tree indexes. The index that I defined above is not really a traditional database index but a kind of summary table (or view), and any lookup/range query on the table requires reading the whole index. So you can apply all predicates as long as the columns referenced in the predicates exist in the index. So we should be able to use an index on (col1, col2, col3) for all the queries above. Sort order has no impact here, since the whole index is read into memory anyway. Since this index can be created in sorted order, we can create a sparse index (similar to the non-leaf nodes of a B+-Tree) if the index itself is too big (i.e., index sizes are an order of magnitude larger than the HDFS block size). But this can be done as a later optimization.
2) With the design above, indexes on joins will come for free, since predicate pushdown will apply the 'user.name=user_name' predicate before the join, so only index-filtered rows participate in the join. But creating indexes on the joined output may increase the index size so much that it decreases the overall effectiveness. With sparse indexes this problem might be mitigated, so we can support this kind of join index along with support for sparse indexes.
3) Yes, for some aggregation queries it may make sense to read the index (since it is a summary table as well). Aggregations, or any queries that involve only columns from the index, can operate only on the index and not the main table.
4) I also looked at it and am not sure how it fits into Hive. Katta is more like a distributed index server.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12709899#action_12709899 ]
Prasad Chakka commented on HIVE-417:

Here is a very rough outline of how this can be done (the prototype code has the creation and execution parts but not the HiveQL-related parts).

hive indexing:
The goal of Hive indexing is to speed up lookup queries on certain columns of a table. Currently, queries with predicates like 'WHERE tab1.col1 = 10' have to load the complete table/partition and process all the rows. If there is an index on col1, then only a small portion of the file needs to be loaded.

command to create index:
create index tab_idx1 on tab1 (col1, ...);
If the base table is partitioned, then the index is also partitioned. Indexes can be created on base tables whose file format supports getPos() and possibly seek() (or equivalent methods).

format of index:
The index is also a Hive table, with the following columns:
  col_1...col_k -- key cols: the base table columns on which this index is defined
  list<offset> -- positions of the rows that contain these keys
An offset is a combination of the following:
  file_name -- relative path of the file in which the row is contained (relative to the partition/table location)
  byte_offset -- byte offset of the row in the file. The row can be found at this byte offset or, for Block Compressed Sequence Files, in the block starting at this byte offset.

when to create index:
Traditionally, databases update the index when the table is loaded. Hive doesn't process rows while loading tables using the 'LOAD DATA INPATH' command. Updating the index may also slow down the actual loading for 'INSERT ... SELECT ... FROM ...' type statements. So users should have the option of having the index initialized during 'INSERT ... SELECT ...' or initialized separately. Another command like 'update index tab_idx partition ..' can be provided.
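To make the index layout above concrete, here is a small sketch in Python. It is not the prototype code; `build_index`, the row tuples, and the file names are hypothetical, and a real implementation would run as a Hive query over the base table rather than in memory. It groups rows by their key columns and collects a list of (file_name, byte_offset) pairs for each key.

```python
from collections import defaultdict

def build_index(rows, key_cols):
    """Build {key tuple: [(file_name, byte_offset), ...]} from scanned rows.

    `rows` yields (record, file_name, byte_offset), where `record` is a dict
    of column name to value. All names here are hypothetical.
    """
    index = defaultdict(list)
    for record, file_name, byte_offset in rows:
        key = tuple(record[col] for col in key_cols)
        index[key].append((file_name, byte_offset))
    return dict(index)

# Hypothetical scan output: each row with the file and position it came from.
rows = [
    ({"col1": 10, "col2": "a"}, "part-00000", 0),
    ({"col1": 20, "col2": "b"}, "part-00000", 57),
    ({"col1": 10, "col2": "c"}, "part-00001", 12),
]
index = build_index(rows, ["col1"])
```

Each dictionary entry corresponds to one row of the index table: the key columns plus the list of offsets.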
how to create index:
The index can be created using the following Hive command augmented with 'offset':
  'select col_1...col_k, offset from tab1'
offset can be provided as a built-in function, derived in HiveInputRecordReader, which will in turn use the specific FileFormat's Reader getPos() method and 'map.input.file' for the file name (or get it from the tableDesc or partitionDesc).

Algorithm for using index:
1) Hive QL needs to determine whether a particular query can use any existing indexes. This can be determined by examining the predicate tree. After predicate pushdown, all the predicates that can use an index are in the child operator of a TableScanOperator. This predicate tree needs to be examined. If it contains any subset of the columns of an index, then that index can be used. Until stats are available, it is not possible to guess whether using the index is beneficial. This needs to be fleshed out more to check both 'AND' and 'OR' predicates.
2) For each of the qualified indexes, a map/reduce job can be created using the predicates determined in step 1. The output of this job should have the following information:
  file_name -- fully qualified name of the file that contains the data
  byte_offset -- position of the row
3) If there is more than one qualified index, then the outputs of step 2 need to be combined depending on whether the predicates on these indexes have 'AND' or 'OR' between them.
4) Modify the original plan to use only those FileSplits that appear in the output of step 3. This reduces the number of mappers spawned by the JobTracker.
5) Modify the original plan to use HiveIndexRecordReader instead of the regular record reader. The output of step 3 (which is sorted) is available to the HiveIndexRecordReader. It can skip to these locations instead of reading every record in the input of the Mapper.
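Steps 3 and 4 above can be sketched as follows. This is a rough Python illustration under stated assumptions, not the planned Hive implementation: `combine`, `offsets_by_file`, and the sample file names are hypothetical, and the per-index results are assumed to be plain lists of (file_name, byte_offset) pairs. 'AND' is modeled as set intersection and 'OR' as set union.

```python
from collections import defaultdict

def combine(results, operator):
    """Combine per-index results, each a list of (file_name, byte_offset).

    operator is "AND" (intersection) or "OR" (union). The result is sorted
    so a reader can skip forward through each file.
    """
    sets = [set(r) for r in results]
    if not sets:
        return []
    if operator == "AND":
        merged = set.intersection(*sets)
    else:
        merged = set.union(*sets)
    return sorted(merged)

def offsets_by_file(locations):
    """Group sorted locations by file; only these files need a mapper."""
    grouped = defaultdict(list)
    for file_name, byte_offset in locations:
        grouped[file_name].append(byte_offset)
    return dict(grouped)

# Hypothetical outputs of step 2 for two qualified indexes.
idx1_hits = [("f1", 0), ("f1", 100), ("f2", 40)]
idx2_hits = [("f1", 100), ("f3", 8)]

and_plan = offsets_by_file(combine([idx1_hits, idx2_hits], "AND"))
or_plan = offsets_by_file(combine([idx1_hits, idx2_hits], "OR"))
```

When offsets point at blocks rather than individual rows, the combined list is only a superset of the matching rows, so the original predicates would still have to be applied to the records that are read.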
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702769#action_12702769 ]
Prasad Chakka commented on HIVE-417:

HIVE-1230 has changed the interface for RecordReader, and it no longer has a getPos() method. The older interfaces are deprecated. I used this method in the prototype to get the current position while creating the index and also while reading the actual data file. Even the SequenceFileRecordReader does not have this method. Without getPos() and seek() methods on RecordReader, it becomes tough to implement any kind of generic indexing.
[jira] Commented: (HIVE-417) Implement Indexing in Hive
[ https://issues.apache.org/jira/browse/HIVE-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12699307#action_12699307 ]
Prasad Chakka commented on HIVE-417:

Another way of doing it is to create a file format that contains the index along with the data, but I think that would take a lot more time.