This sounds like a canonical example of where bitmap indices would be very
useful. Are you currently using bitmap indices in Oracle? I recently spoke
with the creator of another popular bitmap indexing technology about HBase
integration, but there were some issues around licensing, etc. I'll ping him
and see if we can make some progress in that direction.

-Todd

On Mon, May 3, 2010 at 9:55 AM, Jonathan Gray <jg...@facebook.com> wrote:

> Oh, and you would have to build a special client to execute the query.  You
> could make a nice client that runs each condition as a separate query, each in
> its own thread, and then joins the results together in the
> client.  I'm pretty sure you could do this on huge datasets and be way under
> your 4 second requirement.
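>
> A rough sketch of that kind of client-side join, assuming the classic
> HTable/Get API and a made-up "value_index" table whose rows are the searched
> values and whose column qualifiers are the matching uniqueids:
>
> // assumes: import org.apache.hadoop.hbase.*;
> //          import org.apache.hadoop.hbase.client.*;
> //          import org.apache.hadoop.hbase.util.Bytes;
> //          import java.util.*; import java.util.concurrent.*;
> final HBaseConfiguration conf = new HBaseConfiguration();
> // conditionRows: a List<byte[]> with one index-table row key per search condition
> ExecutorService pool = Executors.newFixedThreadPool(conditionRows.size());
> List<Future<Set<String>>> futures = new ArrayList<Future<Set<String>>>();
> for (final byte[] conditionRow : conditionRows) {
>   futures.add(pool.submit(new Callable<Set<String>>() {
>     public Set<String> call() throws Exception {
>       HTable index = new HTable(conf, "value_index");   // one query per condition
>       Set<String> ids = new HashSet<String>();
>       for (KeyValue kv : index.get(new Get(conditionRow)).raw()) {
>         ids.add(Bytes.toString(kv.getQualifier()));     // qualifier = uniqueid
>       }
>       index.close();
>       return ids;
>     }
>   }));
> }
> Set<String> joined = null;                               // intersect in the client
> for (Future<Set<String>> f : futures) {
>   Set<String> ids = f.get();
>   if (joined == null) joined = ids; else joined.retainAll(ids);
> }
> pool.shutdown();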
>
> What is the concurrency and load like for this application?  How many
> queries/sec do you expect?
>
> > > -----Original Message-----
> > > From: Jonathan Gray [mailto:jg...@facebook.com]
> > > Sent: Monday, May 03, 2010 9:49 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: RE: HBase Design Considerations
> >
> > Hey Saajan,
> >
> > Does your data have any large pieces or is it mostly just short indexed
> > fields?  A Solr/HBase hybrid definitely sounds interesting but is a big
> > undertaking.
> >
> > To build on what Edward is suggesting: to do this type of query efficiently
> > directly on HBase, you may need a separate table for each searchable
> > field.  Are the searchable fields usually drawn from a fixed set of values,
> > or are they full-text search?
> >
> > To give you an idea of how you could design indexed tables, consider
> > four different types of data:  full data accessed by unique identifier,
> > time, single string values, full text search.
> >
> > Unique identifier is the simplest:  row = <uniqueid>, columns =
> > <metadata>
> >
> > Time depends on whether you want to bucket it at all (for example, if you
> > only ever care about searching by day rather than by exact time).
> >
> > Second granularity:
> > row = <epoch_timestamp/long>, column = <uniqueid>
> >
> > Day granularity:
> > row = <date>, columns = [<uniqueid>] or [<stamp><uniqueid>] or
> > [<descending_stamp><uniqueid>]
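> >
> > As a rough sketch of the write path for the day-granularity layout (the
> > "idx" family, the table handle, and the ids here are made-up names, using
> > the classic HTable/Put API):
> >
> > // row = the day bucket, column = <descending_stamp><uniqueid>, empty value
> > String uniqueId = "record123";                         // hypothetical record id
> > long stamp = System.currentTimeMillis();
> > byte[] row = Bytes.toBytes("2010-05-03");
> > byte[] qualifier = Bytes.add(Bytes.toBytes(Long.MAX_VALUE - stamp),
> >                              Bytes.toBytes(uniqueId)); // newest sorts first
> > Put put = new Put(row);
> > put.add(Bytes.toBytes("idx"), qualifier, HConstants.EMPTY_BYTE_ARRAY);
> > dayIndexTable.put(put);                                // an HTable on the index table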
> >
> > These tables will be ordered by time, so you will be able to do
> > efficient scans of time ranges by setting the startRow and stopRow
> > accordingly.  If your uniqueids are more like uuids, you may want to
> > prefix the uniqueid in the columns with the epoch stamp (to have
> > secondary sort by time).
> >
> > I recommend using Bytes.toBytes(long) to store the stamps as binary data
> > rather than as ASCII characters.  One thing to note: if you are using
> > epoch-style stamps and you want descending time order instead of the default
> > ascending order that HBase provides, you will want to reverse the stamps by
> > storing (Long.MAX_VALUE - stamp) instead.
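> >
> > For example, a time-range scan against the second-granularity table might
> > look like this (a sketch; "timeIndexTable" and the range bounds are
> > assumptions):
> >
> > // rows are 8-byte epoch stamps, so a time range is just startRow/stopRow
> > Scan scan = new Scan(Bytes.toBytes(startMillis), Bytes.toBytes(stopMillis));
> > ResultScanner scanner = timeIndexTable.getScanner(scan);
> > for (Result r : scanner) {
> >   long stamp = Bytes.toLong(r.getRow());     // the second this row covers
> >   // the column qualifiers of r are the uniqueids written at that stamp
> > }
> > scanner.close();
> >
> > // for newest-first ordering, write the reversed stamp as the row key instead
> > // (and flip the scan bounds accordingly)
> > byte[] reversedRow = Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis());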
> >
> >
> > If you have a fixed number of values, you can do a simple reversed
> > index table:
> > row = <value>, columns = [<descending_stamp><uniqueid>]
> >
> > Again, you have the option of a secondary sort by prefixing the
> > uniqueid with something like a stamp.
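> >
> > Reading that back for a single value is then one Get on that value's row (a
> > sketch; "valueIndexTable" and the 8-byte stamp prefix match the layout above):
> >
> > // all uniqueids for one value, newest first thanks to the descending stamp prefix
> > Result r = valueIndexTable.get(new Get(Bytes.toBytes("some-value")));
> > for (KeyValue kv : r.raw()) {
> >   byte[] q = kv.getQualifier();
> >   String uniqueId = Bytes.toString(q, 8, q.length - 8);  // skip the 8-byte stamp
> > }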
> >
> > There are a couple ways you might do full text search, but in general
> > you index each word in each document, so the rows are words.  Each row
> > contains a list of documents which have that word, and you can put
> > position or scoring information in the value.  The base model is
> > something like:
> > row = <word>, columns = [<uniqueid>], values =
> > [<position_info_or_other_scoring_info>]
> >
> > If you want to support cross-field full-text search, you can add
> > information to the columns or values about the fields.  If you prefix
> > the column with the field name, you basically get full-text search with
> > a GROUP BY on the field.  You can GROUP BY / ORDER BY just about
> > anything like that.
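> >
> > A sketch of the write path for the cross-field variant (the word, field,
> > table, and family names are illustrative, not a fixed schema):
> >
> > // row = word, column = <field><uniqueid>, value = position/scoring info
> > // word, fieldName, uniqueId, positionInfo: Strings produced while indexing
> > Put put = new Put(Bytes.toBytes(word));
> > byte[] qualifier = Bytes.add(Bytes.toBytes(fieldName), Bytes.toBytes(uniqueId));
> > put.add(Bytes.toBytes("ft"), qualifier, Bytes.toBytes(positionInfo));
> > fullTextTable.put(put);
> >
> > // query side: one row read per search word, intersect the uniqueids in the
> > // client, and group on the field prefix of each qualifier if you need it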
> >
> >
> > Hope that helps.
> >
> > JG
> >
> > > -----Original Message-----
> > > From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> > > Sent: Monday, May 03, 2010 7:14 AM
> > > To: hbase-user@hadoop.apache.org
> > > Subject: Re: HBase Design Considerations
> > >
> > > On Mon, May 3, 2010 at 4:04 AM, Steven Noels
> > > <stev...@outerthought.org> wrote:
> > >
> > > > On Mon, May 3, 2010 at 8:42 AM, Saajan <ssangra...@veriskhealth.com>
> > > > wrote:
> > > >
> > > > > Would highly appreciate comments on how HBase is used to support search
> > > > > applications and how we can support search / filter across multiple
> > > > > criteria in HBase.
> > > > >
> > > >
> > > > Hi,
> > > >
> > > > we were facing the same challenges during the Lily design, and decided
> > > > to build an integration between HBase and SOLR (and use an HBase-based
> > > > WAL for async operations against SOLR in a durable fashion). I realize
> > > > this isn't entirely helpful here and now (we're currently shooting for a
> > > > prerelease date of mid July), but your requirements seem to match closely
> > > > what we are building at the moment.
> > > >
> > > > Lily sources will be released under an Apache license from
> > > > www.lilycms.org
> > > >
> > > > Cheers,
> > > >
> > > > Steven.
> > > > --
> > > > Steven Noels                            http://outerthought.org/
> > > > Outerthought                            Open Source Java & XML
> > > > stevenn at outerthought.org             Makers of the Daisy CMS
> > > >
> > >
> > > A simple alternative to secondary indexes is to store the table a second
> > > time:
> > >
> > > Key -> Value
> > > and
> > > Value -> Key
> > >
> > > With this design you can search on the key or the value quickly.  With
> > > this, a single insert becomes multiple inserts, and maintaining data
> > > integrity falls on the user.
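> > >
> > > As a sketch of the double write (the table handles and column names are
> > > hypothetical, and note there is no cross-table transaction, so a failed
> > > second put leaves the two tables out of sync until the client retries):
> > >
> > > // forward table: row = key, stores the value
> > > Put forward = new Put(key);
> > > forward.add(family, qualifier, value);
> > > forwardTable.put(forward);
> > >
> > > // reverse table: row = value, stores the key, so you can look up by value
> > > Put reverse = new Put(value);
> > > reverse.add(family, qualifier, key);
> > > reverseTable.put(reverse);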
>



-- 
Todd Lipcon
Software Engineer, Cloudera
