On 9/6/14, 9:36 AM, Alain Petrus wrote:
I am wondering whether it is possible to use a Hive index with the ORC format.
Does it make sense?
ORC maintains its own indexes within the file - one index record every
10,000 rows (orc.row.index.stride / orc.create.index).
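As a rough sketch, the two properties above can be set per table when it is created. The table and column names here (events_orc, events_text, event_id, user_id, event_ts) are made up for illustration, and the values shown are just the defaults spelled out:

```sql
-- Hypothetical table; only the two orc.* properties come from the discussion above.
CREATE TABLE events_orc (
  event_id BIGINT,
  user_id  BIGINT,
  event_ts TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  'orc.create.index'     = 'true',   -- write the row index into each file
  'orc.row.index.stride' = '10000'   -- one index record per 10,000 rows
);

-- The index is built as a side effect of the insert; there is no separate build step.
-- events_text is an assumed source table with matching columns.
INSERT INTO TABLE events_orc
SELECT event_id, user_id, event_ts FROM events_text;
```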
You can take advantage of it during scan+filter with the following option:
hive> set hive.optimize.index.filter=true;
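For example, with that option on, a filter like the one below can be pushed down to the ORC reader, which may then skip row groups whose index record rules out a match. This is only a sketch: events_orc and its columns are hypothetical, and how much gets skipped depends on how the data is laid out (it tends to work best when the table is sorted on the filter column).

```sql
SET hive.optimize.index.filter=true;

-- events_orc is an assumed table stored as ORC.
-- The predicate on user_id is what the ORC row index gets checked against.
SELECT event_id, event_ts
FROM events_orc
WHERE user_id = 42;
```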
A recent IBM paper has some detailed analysis of ORC's indexing
performance - but the index is relatively "free" because building it takes
no step beyond inserting into an ORC table.
Where ORC does help a lot is when you then run an "ANALYZE TABLE"
to build the statistics needed to make query plans better, because it
will read the stats off the single index record at the bottom of each
ORC file (the "partial scan" mode).
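A minimal sketch of that step, again on a hypothetical events_orc table. Whether the PARTIALSCAN keyword is accepted depends on your Hive version, so treat this as an illustration rather than something guaranteed to run everywhere:

```sql
-- Gather table-level stats from the ORC file footers instead of scanning every row.
ANALYZE TABLE events_orc COMPUTE STATISTICS PARTIALSCAN;

-- The collected values (numRows, rawDataSize, ...) show up under Table Parameters.
DESCRIBE FORMATTED events_orc;
```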
Cheers,
Gopal