RE: HBase Design Considerations

Jonathan Gray Mon, 03 May 2010 09:54:37 -0700

Oh and you would have to build a special client to execute the query.  You 
could make a nice client that would do each of the conditions in a separate 
query, in a separate thread, and then join the results together in the client.  
I'm pretty sure you could do this on huge datasets and be way under your 4 
second requirement.
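
A rough sketch of what such a client could look like, in plain Java with no HBase dependency. Each Callable stands in for one per-condition lookup and returns the set of matching uniqueids; the class and method names are hypothetical, and a real lookup would scan one of the index tables. The join here is an intersection, which assumes the conditions are ANDed together.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical client-side join; not part of any HBase API.
final class ParallelConditionClient {

    // Runs every condition lookup in its own thread, waits for all of them,
    // then intersects the returned id sets (AND semantics).
    static Set<String> queryAll(List<Callable<Set<String>>> conditions) {
        if (conditions.isEmpty()) {
            return new TreeSet<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(conditions.size());
        try {
            Set<String> result = null;
            for (Future<Set<String>> f : pool.invokeAll(conditions)) {
                Set<String> ids = f.get();
                if (result == null) {
                    result = new TreeSet<>(ids);   // first condition seeds the result
                } else {
                    result.retainAll(ids);         // keep only ids seen in every condition
                }
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("condition query failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Total latency is then roughly that of the slowest single condition plus the in-memory join, rather than the sum of all of them.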


What is the concurrency and load like for this application?  How many 
queries/sec do you expect?

> -----Original Message-----
> From: Jonathan Gray [mailto:jg...@facebook.com]
> Sent: Monday, May 03, 2010 9:49 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: HBase Design Considerations
> 
> Hey Saajan,
> 
> Does your data have any large pieces or is it mostly just short indexed
> fields?  A Solr/HBase hybrid definitely sounds interesting but is a big
> undertaking.
> 
> To build on what Edward is suggesting, to be able to efficiently do
> this type of query directly on HBase you may need to have a separate
> table for each searchable field.  Are the searchable fields usually
> based on a fixed number of values?  Or are they full-text search?
> 
> To give you an idea of how you could design indexed tables, consider
> four different types of data:  full data accessed by unique identifier,
> time, single string values, full text search.
> 
> Unique identifier is the simplest:  row = <uniqueid>, columns =
> <metadata>
> 
> Time depends on if you want to bucket it at all (for example, you only
> ever care about searching by day not time).
> 
> Second granularity:
> row = <epoch_timestamp/long>, column = <uniqueid>
> 
> Day granularity:
> row = <date>, columns = [<uniqueid>] or [<stamp><uniqueid>] or
> [<descending_stamp><uniqueid>]
> 
> These tables will be ordered by time, so you will be able to do
> efficient scans of time ranges by setting the startRow and stopRow
> accordingly.  If your uniqueids are more like uuids, you may want to
> prefix the uniqueid in the columns with the epoch stamp (to have
> secondary sort by time).
> 
> I recommend using Bytes.toBytes(long) to get binary data rather than
> using ascii characters.  One thing, if you are using epoch-style stamps
> and you want descending order time instead of the default ascending
> order that HBase provides, you will want to reverse the stamps by
> storing (Long.MAX_VALUE - stamp) instead.
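
The stamp reversal can be sketched in plain Java as below. The class name TimeKeys and its helpers are made up for illustration. ByteBuffer is used so the snippet has no HBase dependency; it writes the same 8-byte big-endian form that HBase's Bytes.toBytes(long) produces. The comparison helper mimics the unsigned byte-by-byte ordering HBase applies to row keys, and the sketch assumes non-negative epoch stamps.

```java
import java.nio.ByteBuffer;

// Hypothetical helper class for building time-ordered keys.
final class TimeKeys {

    // 8-byte big-endian encoding of the stamp (sorts oldest first).
    static byte[] ascendingKey(long epochStamp) {
        return ByteBuffer.allocate(Long.BYTES).putLong(epochStamp).array();
    }

    // Reversed stamp: larger (newer) stamps produce smaller keys (sorts newest first).
    static byte[] descendingKey(long epochStamp) {
        return ascendingKey(Long.MAX_VALUE - epochStamp);
    }

    // Unsigned lexicographic byte comparison, standing in for HBase key ordering.
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

With ASCII digits instead of binary, stamps of different lengths would not sort numerically, which is one reason to prefer the fixed-width binary form.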
> 
> 
> If you have a fixed number of values, you can do a simple reversed
> index table:
> row = <value>, columns = [<descending_stamp><uniqueid>]
> 
> Again, you have the option of a secondary sort by prefixing the
> uniqueid with something like a stamp.
> 
> There are a couple ways you might do full text search, but in general
> you index each word in each document, so the rows are words.  Each row
> contains a list of documents which have that word, and you can put
> position or scoring information in the value.  The base model is
> something like:
> row = <word>, columns = [<uniqueid>], values =
> [<position_info_or_other_scoring_info>]
> 
> If you want to support cross-field full-text search, you can add
> information to the columns or values about the fields.  If you prefix
> the column with the field name, you basically get full-text search with
> a GROUP BY on the field.  You can GROUP BY / ORDER BY just about
> anything like that.
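
To make the full-text layout concrete, here is an in-memory sketch that uses sorted maps as a stand-in for HBase rows and columns. Everything in it is an assumption for illustration: the class name, the "<field>:<uniqueid>" column format, whitespace tokenization, and storing only the first position of each word. Because columns are kept sorted, prefixing with the field name groups the hits for a word by field.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the word index table; not HBase code.
final class WordIndexSketch {

    // row = word -> sorted columns "<field>:<uniqueid>" -> value = first position
    private final TreeMap<String, TreeMap<String, Integer>> rows = new TreeMap<>();

    // Indexes every whitespace-separated word of one field of one document.
    void indexDocument(String uniqueId, String field, String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            if (words[pos].isEmpty()) {
                continue;
            }
            rows.computeIfAbsent(words[pos], w -> new TreeMap<>())
                .putIfAbsent(field + ":" + uniqueId, pos);
        }
    }

    // All columns for one word; iteration order is grouped by field.
    Map<String, Integer> lookup(String word) {
        TreeMap<String, Integer> cols = rows.get(word.toLowerCase());
        return cols == null ? new TreeMap<String, Integer>() : cols;
    }
}
```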
> 
> 
> Hope that helps.
> 
> JG
> 
> > -----Original Message-----
> > From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> > Sent: Monday, May 03, 2010 7:14 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase Design Considerations
> >
> > On Mon, May 3, 2010 at 4:04 AM, Steven Noels
> > <stev...@outerthought.org> wrote:
> >
> > > On Mon, May 3, 2010 at 8:42 AM, Saajan <ssangra...@veriskhealth.com>
> > > wrote:
> > >
> > > > Would highly appreciate comments on how HBase is used to support
> > > > search applications and how we can support search / filter across
> > > > multiple criteria in HBase.
> > > >
> > >
> > > Hi,
> > >
> > > we were facing the same challenges during the Lily design, and decided
> > > to build an integration between HBase and SOLR (and use an HBase-based
> > > WAL for async operations against SOLR in a durable fashion). I realize
> > > this isn't entirely helpful here and now (we're currently shooting for
> > > a prerelease date of mid July), but your requirements seem to match
> > > closely what we are building at the moment.
> > >
> > > Lily sources will be released under an Apache license from
> > > www.lilycms.org
> > >
> > > Cheers,
> > >
> > > Steven.
> > > --
> > > Steven Noels                            http://outerthought.org/
> > > Outerthought                            Open Source Java & XML
> > > stevenn at outerthought.org             Makers of the Daisy CMS
> > >
> >
> > A simple alternative to secondary indexes is to store the table a second
> > time:
> >
> > Key -> Value
> > and
> > Value -> Key
> >
> > With this design you can search on the key or the value quickly. With
> > this, a single insert is transformed into multiple inserts and keeping
> > data integrity falls on the user.
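
As a small illustration of the Key -> Value / Value -> Key idea, the sketch below keeps two sorted maps in memory in place of two HBase tables. The class name is hypothetical, and it assumes each value maps to at most one key. The two writes in put are not atomic, which is exactly the integrity burden described above.

```java
import java.util.TreeMap;

// Hypothetical stand-in for a table stored twice; not HBase code.
final class TwoWayTable {

    private final TreeMap<String, String> byKey = new TreeMap<>();
    private final TreeMap<String, String> byValue = new TreeMap<>();

    // One logical insert becomes two writes, plus cleanup of any stale reverse entry.
    void put(String key, String value) {
        String old = byKey.put(key, value);
        if (old != null) {
            byValue.remove(old);   // the caller-side bookkeeping the index requires
        }
        byValue.put(value, key);
    }

    String findByKey(String key) {
        return byKey.get(key);
    }

    String findByValue(String value) {
        return byValue.get(value);
    }
}
```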
