Forwarding to the hdfs and pig mailing lists for responses from a wider audience.


---------- Forwarded message ----------
From: prasenjit mukherjee <prasen....@gmail.com>
Date: Tue, Mar 16, 2010 at 11:47 AM
Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
To: hive-user <hive-u...@hadoop.apache.org>


Is there a way to avoid a full table scan for an arbitrary WHERE
clause? Partitioning/bucketing makes sense only when you know in
advance which columns will be searched. I was wondering if there is
any project that combines the SQL-like features of Hive with the
inverted-index search features of Lucene, and works in the cloud. I
guess I am asking for too much :(

I have been using Oracle until now, and my usage is mainly restricted
to summation-type queries with some WHERE clause, for example:
"Select SUM(column1)  where col2='foo' AND col3='bar'".
The output is always some aggregation, and the WHERE clauses can
include "<, >, =, IN". I would like to use some kind of distributed
processing to speed up both table generation and query time. Hive
(and to some extent Pig) seems to be the closest available tool to
what I am looking for. I am also exploring HBase, but I am not sure
whether it will be the right choice for my problem.
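
To make the query shape concrete, here is a minimal Python sketch of
the kind of distributed aggregation described above: each partition
computes a partial SUM over the rows that pass the WHERE predicates,
and the partials are then added together. The data, the partition
layout, and the names (PARTITIONS, matches, partial_sum,
distributed_sum) are all hypothetical; col3 is made numeric here only
so that "<" has something to compare. This is not Hive or Pig code.

```python
import operator

# Hypothetical partitions of one table, as if split across workers.
PARTITIONS = [
    [{"column1": 10, "col2": "foo", "col3": 1},
     {"column1": 5, "col2": "qux", "col3": 4}],
    [{"column1": 7, "col2": "foo", "col3": 9},
     {"column1": 3, "col2": "baz", "col3": 2}],
]

# The comparison operators mentioned in the post: <, >, =, IN.
OPS = {
    "<": operator.lt,
    ">": operator.gt,
    "=": operator.eq,
    "IN": lambda value, options: value in options,
}


def matches(row, predicates):
    """Return True if the row satisfies every (column, op, operand) predicate."""
    return all(OPS[op](row[col], operand) for col, op, operand in predicates)


def partial_sum(partition, column, predicates):
    """Work done independently on one partition: filter, then sum."""
    return sum(row[column] for row in partition if matches(row, predicates))


def distributed_sum(partitions, column, predicates):
    """Combine per-partition partial sums into the final aggregate."""
    partials = [partial_sum(p, column, predicates) for p in partitions]
    return sum(partials)


# Roughly: SELECT SUM(column1) WHERE col2 IN ('foo', 'baz') AND col3 < 5
WHERE = [("col2", "IN", {"foo", "baz"}), ("col3", "<", 5)]
print(distributed_sum(PARTITIONS, "column1", WHERE))
```

Note that every partition still reads all of its rows, so this spreads
the scan across workers rather than avoiding it.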

Hive can definitely help in parallelizing the search processing.
But my main concern is whether Hive does (or plans to do) any
storage optimization like Oracle or Lucene (apart from simple
partitioning/bucketing). It seems that all the Hadoop options
(Hive, Pig, HBase) will have to do an entire table scan.
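
As an illustration of the difference in question, the following
Python sketch contrasts a full scan with a lookup through a small
in-memory inverted index for the example query. It only shows the
general idea of an inverted index (a map from column value to the set
of row ids holding it); the rows and the function names are made up
for the example, and this is not how Lucene or Hive is actually
implemented.

```python
from collections import defaultdict

# Hypothetical rows; column names follow the example query in this post.
ROWS = [
    {"column1": 10, "col2": "foo", "col3": "bar"},
    {"column1": 5, "col2": "foo", "col3": "baz"},
    {"column1": 7, "col2": "qux", "col3": "bar"},
    {"column1": 3, "col2": "foo", "col3": "bar"},
]


def sum_full_scan(rows, col2, col3):
    """Full table scan: every row is read and tested."""
    return sum(r["column1"] for r in rows
               if r["col2"] == col2 and r["col3"] == col3)


def build_inverted_index(rows, columns):
    """Map each (column, value) pair to the set of row ids containing it."""
    index = defaultdict(set)
    for row_id, row in enumerate(rows):
        for col in columns:
            index[(col, row[col])].add(row_id)
    return index


def sum_with_index(rows, index, col2, col3):
    """Intersect the two posting sets, then read only the matching rows."""
    matching = index.get(("col2", col2), set()) & index.get(("col3", col3), set())
    return sum(rows[i]["column1"] for i in matching)


INDEX = build_inverted_index(ROWS, ["col2", "col3"])
print(sum_full_scan(ROWS, "foo", "bar"))
print(sum_with_index(ROWS, INDEX, "foo", "bar"))
```

Both calls return the same sum; the indexed version reads only the
rows whose ids appear in both posting sets. The index covers equality
lookups only, so range predicates such as "<" and ">" would need a
sorted structure instead, which this sketch does not attempt.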

I would appreciate any suggestions/feedback.

-thanks,
Prasen
