Hi, I am new to Hive and am trying to set up an index on a Hive table to improve query performance. I am presently using the CDH 4.2 Hadoop distribution, which ships with Hive 0.10, so, from what I have read, table index support should be available.
What I am seeing, though, is that when I create a simple test table, the time to run a query against it is the same whether or not an index exists. I am likely just misunderstanding how to make use of Hive indexes, so I am hoping to get some expert advice on what I need to do. I have included a sample CLI session below that documents what I have tried so far.

Thank you,
Chris

==== BEGIN HIVE EXAMPLE ====

hive> create table foo stored as RCFILE as select uri from uris;

hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 13 seconds 270 msec
Ended Job = job_201311060137_0028
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1   Cumulative CPU: 373.27 sec   HDFS Read: 3801117400  HDFS Write: 17  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 13 seconds 270 msec
OK
11044586
Time taken: 42.107 seconds

hive> create index foo_idx on table foo (uri) as 'COMPACT' with deferred rebuild;

hive> alter index foo_idx on foo rebuild;

hive> set hive.optimize.index.filter=true;

hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            filterExpr:
                expr: (uri = 'http://www.example.com/')
                type: boolean
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 4 seconds 730 msec
Ended Job = job_201311060137_0031
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1   Cumulative CPU: 364.73 sec   HDFS Read: 3801117400  HDFS Write: 10  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 4 seconds 730 msec
OK
0
Time taken: 42.112 seconds

==== END HIVE EXAMPLE ====
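In case it is relevant, here is the fuller set of index-related settings I came across while searching the Hive configuration documentation. I am not sure which of these, if any, are actually required on 0.10, and the values below are my own guesses rather than anything I know to be correct:

```sql
-- Enable automatic use of indexes when compiling WHERE filters
SET hive.optimize.index.filter = true;

-- Also allow indexes to be considered for GROUP BY queries
SET hive.optimize.index.groupby = true;

-- Minimum input size (bytes) before a compact index is considered.
-- Setting this to 0 should make the optimizer consider the index
-- even for smaller tables (I believe the default is several GB,
-- which my ~3.8 GB table would fall under).
SET hive.optimize.index.filter.compact.minsize = 0;

-- Upper bounds on how much data / how many entries an index-based
-- query is allowed to access
SET hive.index.compact.query.max.size = 10737418240;
SET hive.index.compact.query.max.entries = 10000000;
```

I have not yet re-run my test query with all of these set together, so pointers on which ones matter would be appreciated.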