Hi, I am new to Hive and am trying to set up an index on a Hive table to improve query performance. I am presently using the CDH 4.2 Hadoop distribution, which ships with Hive 0.10, so, from what I have read, table index support should be available.
What I am seeing, though, is that when I create a simple test table, the time to run a query against it is the same whether or not an index exists. I am likely just misunderstanding how to make use of Hive indexes, so I am hoping to get some expert advice on what I need to do. I have included a sample CLI session below that documents what I have tried so far.

Thank you,
Chris

==== BEGIN HIVE EXAMPLE ====

hive> create table foo stored as RCFILE as select uri from uris;

hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 13 seconds 270 msec
Ended Job = job_201311060137_0028
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1   Cumulative CPU: 373.27 sec   HDFS Read: 3801117400  HDFS Write: 17  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 13 seconds 270 msec
OK
11044586
Time taken: 42.107 seconds

hive> create index foo_idx on table foo (uri) as 'COMPACT' with deferred rebuild;

hive> alter index foo_idx on foo rebuild;

hive> set hive.optimize.index.filter=true;

hive> explain select count(*) from foo where uri='http://www.example.com/';
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME foo))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (= (TOK_TABLE_OR_COL uri) 'http://www.example.com/'))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        foo
          TableScan
            alias: foo
            filterExpr:
                expr: (uri = 'http://www.example.com/')
                type: boolean
            Filter Operator
              predicate:
                  expr: (uri = 'http://www.example.com/')
                  type: boolean
              Select Operator
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  mode: hash
                  outputColumnNames: _col0
                  Reduce Output Operator
                    sort order:
                    tag: -1
                    value expressions:
                          expr: _col0
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          mode: mergepartial
          outputColumnNames: _col0
          Select Operator
            expressions:
                  expr: _col0
                  type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

hive> select count(*) from foo where uri='http://www.example.com/';
MapReduce Total cumulative CPU time: 6 minutes 4 seconds 730 msec
Ended Job = job_201311060137_0031
MapReduce Jobs Launched:
Job 0: Map: 17  Reduce: 1   Cumulative CPU: 364.73 sec   HDFS Read: 3801117400  HDFS Write: 10  SUCCESS
Total MapReduce CPU Time Spent: 6 minutes 4 seconds 730 msec
OK
0
Time taken: 42.112 seconds

==== END HIVE EXAMPLE ====
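In case it is relevant, here is the fuller set of index-related settings I came across while searching the Hive configuration documentation. I am not sure which of these, if any, are actually required on 0.10, and the values below are my own guesses rather than anything I know to be correct:

```sql
-- Enable automatic use of indexes when compiling WHERE filters
SET hive.optimize.index.filter = true;

-- Also allow indexes to be considered for GROUP BY queries
SET hive.optimize.index.groupby = true;

-- Minimum input size (bytes) before a compact index is considered.
-- Setting this to 0 should make the optimizer consider the index
-- even for smaller tables (I believe the default is several GB,
-- which my ~3.8 GB table would fall under).
SET hive.optimize.index.filter.compact.minsize = 0;

-- Upper bounds on how much data / how many entries an index-based
-- query is allowed to access
SET hive.index.compact.query.max.size = 10737418240;
SET hive.index.compact.query.max.entries = 10000000;
```

I have not yet re-run my test query with all of these set together, so pointers on which ones matter would be appreciated.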