Oh I had very little data (a few MB) - I am just testing whether
features work at all before I try it on bigger data (for example
compression does not work for me
http://wiki.apache.org/hadoop/Hive/CompressedStorage ; also not sure
whether indexes work on compressed files).
The index got used because the second query
SELECT key, col2 FROM table WHERE key=100 -- no GROUP BY used
was almost instant as opposed to when not using the index (then it did
one full map scan). Not sure what the total speedup is but there SHOULD
be speed up for some queries or otherwise indexes would be useless. Try
comparing a query with just a WHERE condition without a GROUP BY.
Martin
On 08/06/2011 18:08, Guillaume WEILL wrote:
Thanks for your reply.
Could you say me how much time have you saved with the index and the
time used on a query without query? The amount of data of your table
could be helpful too.
This is to verify your point of view about the amount of data because
I am really not impressed by performance of my index.
Personally I work on 100 GB.
Guillaume
2011/6/8 Martin Konicek <[email protected]
<mailto:[email protected]>>
Hi,
I was testing indexes today as well and the index definitely got
used. You should be able to see this when you run two separate
queries:
INSERT OVERWRITE DIRECTORY "/tmp/index-result2" ...
SELECT ...
The SELECT was faster for me than without the index. In your case
the time might be spent in the GROUP BY and maybe you have little
data so the times look the same.
What is not so good is that index can't be partitioned on
different columns than the table. E.g. I would like to partition
the table on date and the index on region (I can't partition the
table on both date®ion bc there are thousands of regions and
that would create huge directory structure in HDFS, which I read
is not recommended).
Martin
On 08/06/2011 11:28, Guillaume WEILL wrote:
Hi,
I want to test the use of indexes in hive. For this I created
anindex, I launched a first query above, I changed the
settings on Hive and ran my query on my database table:
CREATE INDEX index ON TABLE table(key) as 'COMPACT' WITH
DEFERRED REBUILD;
ALTER INDEX index ON table REBUILD;
INSERT OVERWRITE DIRECTORY "/tmp/index-result2" SELECT
`_bucketname` , `_offsets` FROM default__table_index__ x
WHERE x.key=100;
SET hive.index.compact.file=/tmp/index_result2;
SET
hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;
SELECT key, col2, sum(col3) FROM tableWHERE key=100 group BY
col2;
No error but I am not sure that the index is really used.
Indeed I get the same performance with and without the index.
When I look at the logs (tasktracker, datanode, job_config), I
see no call to the directory / tmp / index_result2.
How do I know if my index has been really used?
Thanks for your help,
--
Guillaume WEILL