Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.

The "Hive/IndexDev/Bitmap" page has been changed by MarquisWang.
http://wiki.apache.org/hadoop/Hive/IndexDev/Bitmap

--------------------------------------------------

New page:
= Bitmap Indexing =

<<TableOfContents>>

== Introduction ==

This document explains the proposed design for adding a bitmap index handler 
(https://issues.apache.org/jira/browse/HIVE-1803).
Bitmap indexing (http://en.wikipedia.org/wiki/Bitmap_index) is a standard 
technique for indexing columns with few distinct 
values, such as gender.

== Approach ==

We want to develop a bitmap index that can reuse as much of the existing 
Compact Index code as possible. 

== Proposal ==

=== First implementation ===

This implementation confers some of the benefits of bitmap indexing and should 
be easy to implement given the already existing compact index, but it does few 
of the optimizations such as compression that a really good bitmap index should 
do.

Like the complex index, this implementation uses an index table. The index 
table on a column "key" has three columns, _bucketname, _offset, and _bitmaps. 
_bucketname is a string pointing to the hadoop file that is storing this block 
in the table, _offset is the block offset of a block, and _bitmaps is a Map 
where the keys are all the values of the column "key" that exist in this block 
and a bitmap encoding (an Array of BigInts??) of every row in that block, with 
a 1 if that row has the value, 0 if not. If a key value does not appear in a 
block at all, the value is not stored in the map.

When querying this index, we select each filename, block pair where the 
_bitmaps Map has a key that is the queried key value. If there are boolean AND 
or OR operations done on the predicates with bitmap indexes, we can use bitwise 
operations to try to eliminate blocks as well. We can use this data to generate 
the filename, array of block offsets format that the compact index handler uses 
and reuse that in the bitmap index query.

=== Second iteration ===

The basic implementation's only compression is eliminating blocks where all 
rows are 0s. This is unlikely to happen for larger blocks, so we need a better 
compression format. What we can do is do byte-aligned bitmap compression, where 
the bitmap is an array of bytes, and a byte of all 1s or all 0s implies one or 
more bytes where every value is 0 or 1. Then, we would just need to add another 
column in the bitmap index table that is an array of Ints that describe how 
long the gaps are and logic to expand the compression.

Reply via email to