TL;DR - A Materialized view is a much more useful construct than trying to get limited indexes to work.
That is pretty lively project which has been going on for a while with Druid+LLAP https://issues.apache.org/jira/browse/HIVE-14486 > This seems out of the blue but my initial benchmarks have shown that there's > no performance gain when Hive index is used with Tez engine. Indexes in Hive are broken once columnar formats came in, because there is no concept of directly locating a row in another table from an existing table - file+offsets doesn't work for a columnar format in any practical sense. Once you have a fast format, these indexes are only good enough to answer queries directly from an index by maintaining a more compact copy of data, which is really not what an index is supposed to do. > I think that index is a perfect solution for non-ORC file format since you > can selectively build an index table and leverage Tez to only look at those > blocks and/or files that we need to scan Since LLAP is a distributed system, the concept of "looking up indexes and then looking at blocks" does not work the same way it would work for a single machine DB with common memory across all threads - the index file scans cannot go and specify which other blocks to read, because there is no single frame of reference for this. We can only get meaningful speedups if the index are co-located with the blocks on the same machine and can reference column groups (10k rows in each chunk), which is exactly what ORC+LLAP manages to do with ORC's bloom filter indexes. > we still want to have fast ad-hoc query via Hive LLAP / Tez LLAP can do fast ad-hoc queries at around ~100 million rows/sec on a single node, which is pretty fast without needing a separate index + a lookup loop. In a production cluster, with constant ETL, the global indexes were actually lowering throughput since they got stale every few minutes and rebuilding fell behind because it is lower priority operations through-out. Here's an experiment for you to try CREATE INDEX idx_COUNTRY ON TABLE customer(c_birth_country) AS 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler' WITH DEFERRED REBUILD IDXPROPERTIES('AGGREGATES'='count(c_birth_country)') STORED AS ORC; Which does build an index in ORC (why not?), can only be made in MR because the index handlers weren't implemented in Tez. Tez can still use those indexes. 0: jdbc:hive2://localhost:10007/tpcds_bin_par> set hive.optimize.index.groupby=true; 0: jdbc:hive2://localhost:10007/tpcds_bin_par> Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 vectorized File Output Operator [FS_8] Group By Operator [GBY_7] (rows=1060 width=9536) Output:["_col0","_col1"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0 <-Map 1 [SIMPLE_EDGE] SHUFFLE [RS_3] PartitionCols:_col0 Group By Operator [GBY_2] (rows=2120 width=9536) Output:["_col0","_col1"],aggregations:["sum(_count_of_c_birth_country)"],keys:c_birth_country Select Operator [SEL_1] (rows=2120 width=9536) Output:["c_birth_country","_count_of_c_birth_country"] TableScan [TS_0] (rows=2120 width=9536) testing@testing__customer_idx_country__,testing__customer_idx_country__,Tbl:COMPLETE,Col:NONE,Output:["c_birth_country"] If you look closely at the index at that point, it is clear that it is not a true index, but a materialized view. And that as a materialized view, the reason this is running slower is due to a column which won't happen in a true materialized view desc testing__customer_idx_country__; +----------------------------+----------------+----------+--+ | col_name | data_type | comment | +----------------------------+----------------+----------+--+ | c_birth_country | string | | | _bucketname | string | | | _offsets | array<bigint> | | | _count_of_c_birth_country | bigint | | +----------------------------+----------------+----------+--+ _offsets was ruining everything, from compression to vectorization. Which brings me to my actual point - Materialized views are good, but external indexes aren't great at being materialized views. At least now that we already have internal indexes and in-memory caching for them. Cheers, Gopal
