Hey Saajan,

Does your data have any large pieces, or is it mostly just short indexed fields? A Solr/HBase hybrid definitely sounds interesting, but it is a big undertaking.
To build on what Edward is suggesting: to efficiently do this type of query directly on HBase, you may need a separate table for each searchable field. Are the searchable fields usually based on a fixed number of values, or are they full-text search?

To give you an idea of how you could design indexed tables, consider four different types of data: full data accessed by unique identifier, time, single string values, and full-text search.

Unique identifier is the simplest:

  row = <uniqueid>, columns = <metadata>

Time depends on whether you want to bucket it at all (for example, if you only ever care about searching by day, not by time of day).

Second granularity:

  row = <epoch_timestamp/long>, column = <uniqueid>

Day granularity:

  row = <date>, columns = [<uniqueid>] or [<stamp><uniqueid>] or [<descending_stamp><uniqueid>]

These tables will be ordered by time, so you will be able to do efficient scans of time ranges by setting the startRow and stopRow accordingly. If your uniqueids are more like UUIDs, you may want to prefix the uniqueid in the columns with the epoch stamp (to get a secondary sort by time). I recommend using Bytes.toBytes(long) to store binary data rather than ASCII characters.

One thing: if you are using epoch-style stamps and you want descending time order instead of the default ascending order that HBase provides, you will want to reverse the stamps by storing (Long.MAX_VALUE - stamp) instead.

If you have a fixed number of values, you can do a simple reversed index table:

  row = <value>, columns = [<descending_stamp><uniqueid>]

Again, you have the option of a secondary sort by prefixing the uniqueid with something like a stamp.

There are a couple of ways you might do full-text search, but in general you index each word in each document, so the rows are words. Each row contains the list of documents containing that word, and you can put position or scoring information in the values.
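To make the descending-stamp trick concrete, here is a minimal, self-contained Java sketch. It uses java.nio.ByteBuffer in place of HBase's Bytes utility (both produce big-endian longs), and compareRows is a hypothetical stand-in for HBase's unsigned lexicographic row-key comparator; neither helper name comes from the HBase API.

```java
import java.nio.ByteBuffer;

public class DescendingStampKeys {
    // Encode a long as 8 big-endian bytes, like HBase's Bytes.toBytes(long).
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(8).putLong(v).array();
    }

    // Unsigned lexicographic comparison -- the order HBase applies to row keys.
    static int compareRows(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    // Reverse a stamp so that newer times sort first.
    static byte[] descendingStamp(long epochMillis) {
        return toBytes(Long.MAX_VALUE - epochMillis);
    }

    public static void main(String[] args) {
        long older = 1272800000000L; // earlier event (illustrative epoch millis)
        long newer = 1272900000000L; // later event

        // Default ascending order: the older stamp's key sorts first.
        System.out.println(compareRows(toBytes(older), toBytes(newer)) < 0);

        // Reversed stamps: the newer event's key now sorts first.
        System.out.println(compareRows(descendingStamp(newer), descendingStamp(older)) < 0);
    }
}
```

Both lines print true: plain big-endian stamps scan oldest-first, while (Long.MAX_VALUE - stamp) keys scan newest-first, which is usually what you want for "most recent N" queries.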
The base model is something like:

  row = <word>, columns = [<uniqueid>], values = [<position_info_or_other_scoring_info>]

If you want to support cross-field full-text search, you can add information about the fields to the columns or values. If you prefix the column with the field name, you basically get full-text search with a GROUP BY on the field. You can GROUP BY / ORDER BY just about anything like that.

Hope that helps.

JG

> -----Original Message-----
> From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> Sent: Monday, May 03, 2010 7:14 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: HBase Design Considerations
>
> On Mon, May 3, 2010 at 4:04 AM, Steven Noels
> <stev...@outerthought.org> wrote:
>
> > On Mon, May 3, 2010 at 8:42 AM, Saajan <ssangra...@veriskhealth.com>
> > wrote:
> >
> > > Would highly appreciate comments on how HBase is used to support
> > > search applications and how we can support search / filter across
> > > multiple criteria in HBase.
> >
> > Hi,
> >
> > we were facing the same challenges during the Lily design, and
> > decided to build an integration between HBase and SOLR (and use an
> > HBase-based WAL for async operations against SOLR in a durable
> > fashion). I realize this isn't entirely helpful here and now (we're
> > currently shooting for a prerelease date of mid July), but your
> > requirements seem to match closely what we are building at the
> > moment.
> >
> > Lily sources will be released under an Apache license from
> > www.lilycms.org
> >
> > Cheers,
> >
> > Steven.
> > --
> > Steven Noels            http://outerthought.org/
> > Outerthought            Open Source Java & XML
> > stevenn at outerthought.org       Makers of the Daisy CMS
>
> A simple alternative to secondary indexes is to store the table a
> second time:
>
> Key -> Value
> and
> Value -> Key
>
> With this design you can search on the key or the value quickly. With
> this, a single insert is transformed into multiple inserts, and
> keeping data integrity falls on the user.
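Edward's store-the-table-twice idea can be sketched in a few lines. This is a toy illustration only: two in-memory HashMaps stand in for the two HBase tables, and the class and method names (DualTableIndex, put, lookupByKey, lookupByValue) are invented for the example. The point it demonstrates is that one logical insert becomes two physical writes, so consistency between the two "tables" is the application's problem.

```java
import java.util.HashMap;
import java.util.Map;

public class DualTableIndex {
    // Two "tables" standing in for the two HBase tables.
    private final Map<String, String> forward = new HashMap<>(); // key -> value
    private final Map<String, String> reverse = new HashMap<>(); // value -> key

    // One logical insert is two physical writes. If the second write
    // fails, the tables disagree -- keeping them in sync is up to you.
    public void put(String key, String value) {
        forward.put(key, value);
        reverse.put(value, key);
    }

    public String lookupByKey(String key)     { return forward.get(key); }
    public String lookupByValue(String value) { return reverse.get(value); }

    public static void main(String[] args) {
        DualTableIndex idx = new DualTableIndex();
        idx.put("doc-42", "hadoop");
        System.out.println(idx.lookupByKey("doc-42"));   // prints hadoop
        System.out.println(idx.lookupByValue("hadoop")); // prints doc-42
    }
}
```

Note that a plain value -> key table only works when values are unique; if several keys share a value, the reverse side needs a composite row key (for example value+key), much like the [<descending_stamp><uniqueid>] columns described above.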