Hey Saajan,

Does your data have any large pieces, or is it mostly just short indexed fields?
A Solr/HBase hybrid definitely sounds interesting, but it is a big undertaking.

To build on what Edward is suggesting: to run this type of query efficiently
directly in HBase, you may need a separate table for each searchable field.
Are the searchable fields usually drawn from a fixed set of values, or are
they full-text?

To give you an idea of how you could design indexed tables, consider four
different access patterns:  full data accessed by unique identifier, time,
single string values, and full-text search.

Unique identifier is the simplest:  row = <uniqueid>, columns = <metadata>
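
For example, writing and reading that table with the Java client might look
something like this (a rough, untested sketch against the 0.20-era API; the
"meta" family and "title" column are just placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EntityTable {
    static final byte[] META = Bytes.toBytes("meta");

    // row = <uniqueid>, columns = <metadata>
    static void putEntity(HTable entities, String uniqueId, String title)
            throws IOException {
        Put put = new Put(Bytes.toBytes(uniqueId));
        put.add(META, Bytes.toBytes("title"), Bytes.toBytes(title));
        entities.put(put);
    }

    // Point lookup by unique identifier.
    static Result getEntity(HTable entities, String uniqueId)
            throws IOException {
        return entities.get(new Get(Bytes.toBytes(uniqueId)));
    }
}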

Time depends on whether you want to bucket it at all (for example, if you only
ever care about searching by day, not by exact time).

Second granularity:
row = <epoch_timestamp/long>, column = <uniqueid>

Day granularity:
row = <date>, columns = [<uniqueid>] or [<stamp><uniqueid>] or 
[<descending_stamp><uniqueid>]

These tables will be ordered by time, so you will be able to do efficient scans
of time ranges by setting the startRow and stopRow accordingly.  If your
uniqueids are more like UUIDs, you may want to prefix the uniqueid in the
columns with the epoch stamp (to get a secondary sort by time).
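
Something like this for the second-granularity case (same caveats as above;
the "ids" family and table handles are placeholders), writing one index row
and scanning a [from, to) range:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeIndex {
    static final byte[] IDS = Bytes.toBytes("ids");

    // row = <epoch_timestamp/long>, column = <uniqueid>
    static void index(HTable timeIndex, long stamp, String uniqueId)
            throws IOException {
        Put put = new Put(Bytes.toBytes(stamp));
        put.add(IDS, Bytes.toBytes(uniqueId), new byte[0]);
        timeIndex.put(put);
    }

    // Rows sort by the binary stamp, so a time range is a single scan.
    static void scanRange(HTable timeIndex, long from, long to)
            throws IOException {
        Scan scan = new Scan(Bytes.toBytes(from), Bytes.toBytes(to));
        ResultScanner scanner = timeIndex.getScanner(scan);
        try {
            for (Result r : scanner) {
                long stamp = Bytes.toLong(r.getRow());
                for (byte[] id : r.getFamilyMap(IDS).keySet()) {
                    System.out.println(stamp + " " + Bytes.toString(id));
                }
            }
        } finally {
            scanner.close();
        }
    }
}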

I recommend using Bytes.toBytes(long) to store the stamps as binary data rather
than as ASCII characters.  One thing to note: if you are using epoch-style
stamps and you want descending time order instead of the default ascending
order that HBase provides, you will want to reverse the stamps by storing
(Long.MAX_VALUE - stamp) instead.
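
For example (assuming positive epoch stamps):

import org.apache.hadoop.hbase.util.Bytes;

public class Stamps {
    // Binary 8-byte longs sort numerically (for positive values);
    // subtracting from Long.MAX_VALUE flips the sort to newest-first.
    static byte[] descendingStamp(long stamp) {
        return Bytes.toBytes(Long.MAX_VALUE - stamp);
    }

    // Recover the original stamp when reading back.
    static long fromDescendingStamp(byte[] bytes) {
        return Long.MAX_VALUE - Bytes.toLong(bytes);
    }
}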


If you have a fixed number of values, you can do a simple inverted index table:
row = <value>, columns = [<descending_stamp><uniqueid>]

Again, you have the option of a secondary sort by prefixing the uniqueid with 
something like a stamp.
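
A rough sketch of writing that index (again, the "ids" family is a
placeholder, and the descending-stamp trick is the one described above):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ValueIndex {
    // row = <value>, column = <descending_stamp><uniqueid>, so a scan
    // of a single row returns matching ids newest-first.
    static void indexValue(HTable valueIndex, String value, long stamp,
            String uniqueId) throws IOException {
        byte[] qualifier = Bytes.add(Bytes.toBytes(Long.MAX_VALUE - stamp),
                Bytes.toBytes(uniqueId));
        Put put = new Put(Bytes.toBytes(value));
        put.add(Bytes.toBytes("ids"), qualifier, new byte[0]);
        valueIndex.put(put);
    }
}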

There are a couple of ways you might do full-text search, but in general you
index each word in each document, so the rows are words.  Each row contains the
list of documents that contain that word, and you can put position or scoring
information in the value.  The base model is something like:
row = <word>, columns = [<uniqueid>], values = 
[<position_info_or_other_scoring_info>]
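
A naive sketch of that (whitespace tokenization, with a placeholder "docs"
family; a real indexer would accumulate all positions of a repeated word
rather than letting the last one win):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WordIndex {
    // row = <word>, column = <uniqueid>, value = position info.
    static void indexDocument(HTable wordIndex, String uniqueId, String text)
            throws IOException {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            Put put = new Put(Bytes.toBytes(words[pos]));
            // Last position wins here; real scoring info would be richer.
            put.add(Bytes.toBytes("docs"), Bytes.toBytes(uniqueId),
                    Bytes.toBytes((long) pos));
            wordIndex.put(put);
        }
    }
}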

If you want to support cross-field full-text search, you can add information
about the fields to the columns or values.  If you prefix the column with the
field name, you basically get full-text search with a GROUP BY on the field.
You can GROUP BY / ORDER BY just about anything that way.
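
For example (the ':' separator between field and id is an arbitrary choice):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FieldWordIndex {
    // column = <field>:<uniqueid>, so within each word's row the
    // matches come back grouped (sorted) by field.
    static void indexWord(HTable wordIndex, String field, String uniqueId,
            String word) throws IOException {
        Put put = new Put(Bytes.toBytes(word.toLowerCase()));
        put.add(Bytes.toBytes("docs"),
                Bytes.toBytes(field + ":" + uniqueId), new byte[0]);
        wordIndex.put(put);
    }
}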


Hope that helps.

JG

> -----Original Message-----
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> Sent: Monday, May 03, 2010 7:14 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase Design Considerations
> 
> On Mon, May 3, 2010 at 4:04 AM, Steven Noels
> <stev...@outerthought.org>wrote:
> 
> > On Mon, May 3, 2010 at 8:42 AM, Saajan <ssangra...@veriskhealth.com>
> > wrote:
> >
> > Would highly appreciate comments on how HBase is used to support
> search
> > > applications and how we can support search / filter across multiple
> > > criteria
> > > in HBase.
> > >
> >
> > Hi,
> >
> > we were facing the same challenges during the Lily design, and
> decided to
> > build an integration between HBase and SOLR (and use an HBase-based
> WAL for
> > async operations against SOLR in a durable fashion). I realize this
> isn't
> > entirely helpful here and now (we're currently shooting for a
> prerelease
> > date of mid July), but your requirements seem to match closely what
> we are
> > building at the moment.
> >
> > Lily sources will be released under an Apache license from
> www.lilycms.org
> >
> > Cheers,
> >
> > Steven.
> > --
> > Steven Noels                            http://outerthought.org/
> > Outerthought                            Open Source Java & XML
> > stevenn at outerthought.org             Makers of the Daisy CMS
> >
> 
> A simple alternative to secondary indexes is to store the table a
> second
> time:
> 
> Key -> Value
> and
> Value -> Key
> 
> With this design you can search on the key or the value quickly. However,
> a single insert is transformed into multiple inserts, and maintaining data
> integrity falls on the user.
