Hi Shimon,

You may also want to look into a text indexing solution such as Lucene or
Ferret. These systems might be able to provide the search capabilities you
require. Of course, it'd be your responsibility to keep the indexes up to
date.
-- Jim R. Wilson (jimbojw)

On Thu, May 29, 2008 at 10:51 AM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
> Right now, we don't have a plan to add indexing. It's a particularly
> complicated area of functionality. We'd love to get suggestions (or
> patches :) for how to do this the right way.
>
> Map/reduce would be the perfect way to find all records that matched on a
> column value, so long as you're able to do all the accesses in batch. If
> they need to be random and low-latency, then you're going to have some
> trouble using map/reduce. HBase provides an interface to map/reduce that
> lets you split up by a region at a time, so you can highly parallelize
> your processing, so it should be reasonably fast.
>
> -Bryan
>
> On May 29, 2008, at 7:56 AM, shimon golan wrote:
>
>> Thanx Bryan for the quick response !
>> My table is very sparse and contains 100 column families , each
>> containing about 20 items. Also, I need to search the table by each of
>> the columns so the solution you suggested seems somewhat complicated for
>> this purpose.
>>
>> So my ensuing questions are :
>> 1. Do you plan to support having indexes on columns and thus avoiding
>>    the necessity of scanning when a column value is queried ?
>> 2. I thought about using map reduce for the purpose of parallelizing the
>>    search over different regions of the table - could this be
>>    accomplished ?
>>
>> On Thu, May 29, 2008 at 5:42 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
>>
>>> In a regular hbase table, there is no "most efficient" way to do this.
>>> The only operation you have available is a table scan.
>>>
>>> If you find yourself looking for rows by their column values, then you
>>> must do some extra work to make that possible. First, make absolutely
>>> sure that you have the right row key picked out. If your accesses are
>>> dominated by searching on a column value, then perhaps that column
>>> should be your primary key.
>>>
>>> If you must have both the existing primary key and the column
>>> value-based lookups, then probably your best bet is to make an "index"
>>> table. The way that works is that every time you write or delete some
>>> value to the primary table, you also write or delete a value to the
>>> index table with the column value as the key and the row key as a
>>> column value. Then, when you are trying to find the row by its column
>>> value, you look in the index table to find the row key, and then you
>>> query the main table with the row key. It's more work, but this is the
>>> best that HBase can offer at the moment.
>>>
>>> -Bryan
>>>
>>> On May 29, 2008, at 5:47 AM, Shimon wrote:
>>>
>>>> Hi,
>>>>
>>>> What is most efficient way to retrieve a row when the value of a
>>>> certain column is specified ( rather than the row key ) ?
>>>>
>>>> Thanks in advance
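
For reference, the "index" table approach described in the quoted message
might be sketched roughly as follows. This is only an illustration under
stated assumptions: plain Python dicts stand in for the primary and index
tables, and the class and method names (IndexedTable, put, delete,
find_by_value) are made up for the example and are not part of the HBase
client API. It also assumes several rows may share one column value, so each
index entry holds a set of row keys.

```python
from collections import defaultdict


class IndexedTable:
    """In-memory stand-in for a primary table plus one index table.

    Plain dicts play the role of HBase tables; none of these names
    come from the HBase client API.
    """

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.primary = {}              # row key -> {column: value}
        self.index = defaultdict(set)  # column value -> set of row keys

    def put(self, row_key, columns):
        # Simplification: a put replaces the whole row.
        # Remove any stale index entry before writing the new one.
        self._unindex(row_key)
        self.primary[row_key] = dict(columns)
        value = columns.get(self.indexed_column)
        if value is not None:
            self.index[value].add(row_key)

    def delete(self, row_key):
        # Every delete on the primary table is mirrored in the index.
        self._unindex(row_key)
        self.primary.pop(row_key, None)

    def find_by_value(self, value):
        # Two lookups: index table for row keys, then primary table for rows.
        row_keys = sorted(self.index.get(value, ()))
        return [(key, self.primary[key]) for key in row_keys]

    def _unindex(self, row_key):
        old = self.primary.get(row_key, {}).get(self.indexed_column)
        if old is not None:
            self.index[old].discard(row_key)
            if not self.index[old]:
                del self.index[old]
```

With real tables the primary write and the index write would be two separate
operations, so the caller would have to decide what to do if one succeeds and
the other fails.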