Oh and you would have to build a special client to execute the query. You could make a nice client that would do each of the conditions in a separate query, in a separate thread, and then join the results together in the client. I'm pretty sure you could do this on huge datasets and be way under your 4 second requirement.
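To make the client idea concrete, here is a rough sketch, not a tested HBase client. Each condition is modeled as a `Callable` that returns the set of matching unique ids. The class name `ParallelConditionClient`, the method `queryAll`, and the in-memory lookups in `main` are all made up for illustration; in practice each lookup would scan one of your index tables. I am also assuming the conditions are ANDed, so the join is a set intersection.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelConditionClient {

    // Runs one lookup per condition, each in its own thread, and keeps only
    // the ids that every condition returned (an AND across conditions).
    static Set<String> queryAll(List<Callable<Set<String>>> conditions) {
        if (conditions.isEmpty()) {
            return new HashSet<>();
        }
        ExecutorService pool = Executors.newFixedThreadPool(conditions.size());
        try {
            Set<String> result = null;
            // invokeAll blocks until every condition query has finished.
            for (Future<Set<String>> future : pool.invokeAll(conditions)) {
                Set<String> ids = future.get();
                if (result == null) {
                    result = new HashSet<>(ids);
                } else {
                    result.retainAll(ids);
                }
            }
            return result;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while querying", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a condition query failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical per-field lookups standing in for index table scans.
        List<Callable<Set<String>>> conditions = Arrays.asList(
            () -> new HashSet<>(Arrays.asList("id1", "id2", "id3")),
            () -> new HashSet<>(Arrays.asList("id2", "id3")),
            () -> new HashSet<>(Arrays.asList("id3", "id4")));
        System.out.println(queryAll(conditions)); // prints [id3]
    }
}
```

The total latency is then roughly that of the slowest single condition plus the in-memory intersection, which is why the 4 second budget looks reachable.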
What is the concurrency and load like for this application? How many queries/sec do you expect?

> -----Original Message-----
> From: Jonathan Gray [mailto:jg...@facebook.com]
> Sent: Monday, May 03, 2010 9:49 AM
> To: hbase-user@hadoop.apache.org
> Subject: RE: HBase Design Considerations
>
> Hey Saajan,
>
> Does your data have any large pieces or is it mostly just short indexed
> fields? A Solr/HBase hybrid definitely sounds interesting but is a big
> undertaking.
>
> To build on what Edward is suggesting, to be able to efficiently do
> this type of query directly on HBase you may need to have a separate
> table for each searchable field. Are the searchable fields usually
> based on a fixed number of values? Or are they full-text search?
>
> To give you an idea of how you could design indexed tables, consider
> four different types of data: full data accessed by unique identifier,
> time, single string values, full text search.
>
> Unique identifier is the simplest: row = <uniqueid>, columns =
> <metadata>
>
> Time depends on if you want to bucket it at all (for example, you only
> ever care about searching by day not time).
>
> Second granularity:
> row = <epoch_timestamp/long>, column = <uniqueid>
>
> Day granularity:
> row = <date>, columns = [<uniqueid>] or [<stamp><uniqueid>] or
> [<descending_stamp><uniqueid>]
>
> These tables will be ordered by time, so you will be able to do
> efficient scans of time ranges by setting the startRow and stopRow
> accordingly. If your uniqueids are more like uuids, you may want to
> prefix the uniqueid in the columns with the epoch stamp (to have
> secondary sort by time).
>
> I recommend using Bytes.toBytes(long) to get binary data rather than
> using ascii characters. One thing, if you are using epoch-style stamps
> and you want descending order time instead of the default ascending
> order that HBase provides, you will want to reverse the stamps by
> storing (Long.MAX_VALUE - stamp) instead.
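As a rough illustration of the binary stamps and the (Long.MAX_VALUE - stamp) trick above, here is a sketch that uses only the JDK. The class `TimeRowKeys` and its methods are made up for this example. `encode` writes a big-endian 8-byte long, which to my knowledge is the same layout `Bytes.toBytes(long)` produces, and `compareKeys` mimics the unsigned byte-by-byte ordering HBase applies to row keys. It assumes non-negative stamps, which holds for epoch times.

```java
import java.nio.ByteBuffer;

final class TimeRowKeys {

    // Big-endian 8-byte encoding of a long (assumed to match Bytes.toBytes(long)).
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Reversed stamp, so that newer events sort before older ones.
    static byte[] descendingKey(long epochMillis) {
        return encode(Long.MAX_VALUE - epochMillis);
    }

    // Unsigned lexicographic comparison of two keys, byte by byte.
    static int compareKeys(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        long older = 1272900000000L; // early May 2010, in epoch milliseconds
        long newer = older + 60_000L;
        // Plain stamps: the older row sorts first (ascending time).
        System.out.println(compareKeys(encode(older), encode(newer)) < 0);
        // Reversed stamps: the newer row sorts first (descending time).
        System.out.println(compareKeys(descendingKey(newer), descendingKey(older)) < 0);
    }
}
```

Both checks print true. ASCII stamps only sort correctly when they are zero-padded to a fixed width, which is one more reason to prefer the fixed-width binary form.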
>
> If you have a fixed number of values, you can do a simple reversed
> index table:
> row = <value>, columns = [<descending_stamp><uniqueid>]
>
> Again, you have the option of a secondary sort by prefixing the
> uniqueid with something like a stamp.
>
> There are a couple ways you might do full text search, but in general
> you index each word in each document, so the rows are words. Each row
> contains a list of documents which have that word, and you can put
> position or scoring information in the value. The base model is
> something like:
> row = <word>, columns = [<uniqueid>], values =
> [<position_info_or_other_scoring_info>]
>
> If you want to support cross-field full-text search, you can add
> information to the columns or values about the fields. If you prefix
> the column with the field name, you basically get full-text search with
> a GROUP BY on the field. You can GROUP BY / ORDER BY just about
> anything like that.
>
> Hope that helps.
>
> JG
>
> > -----Original Message-----
> > From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
> > Sent: Monday, May 03, 2010 7:14 AM
> > To: hbase-user@hadoop.apache.org
> > Subject: Re: HBase Design Considerations
> >
> > On Mon, May 3, 2010 at 4:04 AM, Steven Noels
> > <stev...@outerthought.org> wrote:
> >
> > > On Mon, May 3, 2010 at 8:42 AM, Saajan <ssangra...@veriskhealth.com>
> > > wrote:
> > >
> > > > Would highly appreciate comments on how HBase is used to support
> > > > search applications and how we can support search / filter across
> > > > multiple criteria in HBase.
> > >
> > > Hi,
> > >
> > > we were facing the same challenges during the Lily design, and
> > > decided to build an integration between HBase and SOLR (and use an
> > > HBase-based WAL for async operations against SOLR in a durable
> > > fashion).
> > > I realize this isn't entirely helpful here and now (we're currently
> > > shooting for a prerelease date of mid July), but your requirements
> > > seem to match closely what we are building at the moment.
> > >
> > > Lily sources will be released under an Apache license from
> > > www.lilycms.org
> > >
> > > Cheers,
> > >
> > > Steven.
> > > --
> > > Steven Noels                    http://outerthought.org/
> > > Outerthought                    Open Source Java & XML
> > > stevenn at outerthought.org     Makers of the Daisy CMS
> >
> > A simple alternative to secondary indexes is to store the table a
> > second time:
> >
> > Key -> Value
> > and
> > Value -> Key
> >
> > With this design you can search on the key or the value quickly. With
> > this, a single insert is transformed into multiple inserts and keeping
> > data integrity falls on the user.
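For what it's worth, the Key -> Value / Value -> Key layout Edward describes can be sketched like this. Two in-memory maps stand in for the two HBase tables, and the class `TwoWayStore` and its methods are invented for illustration. The sketch assumes each value belongs to at most one key; if several keys can share a value, the reverse row would need to hold a list of keys (for example one column per key).

```java
import java.util.Map;
import java.util.TreeMap;

final class TwoWayStore {

    // Stand-ins for the two tables: Key -> Value and Value -> Key.
    private final Map<String, String> byKey = new TreeMap<>();
    private final Map<String, String> byValue = new TreeMap<>();

    // One logical insert becomes two writes. On an update, the stale
    // reverse entry has to be removed by the client as well.
    void put(String key, String value) {
        String previous = byKey.put(key, value);
        if (previous != null) {
            byValue.remove(previous);
        }
        byValue.put(value, key);
    }

    String valueFor(String key) {
        return byKey.get(key);
    }

    String keyFor(String value) {
        return byValue.get(value);
    }

    public static void main(String[] args) {
        TwoWayStore store = new TwoWayStore();
        store.put("row1", "blue");
        store.put("row1", "green");                 // update: "blue" must disappear
        System.out.println(store.keyFor("green"));  // prints row1
        System.out.println(store.keyFor("blue"));   // prints null
    }
}
```

The cleanup in `put` is the "data integrity falls on the user" part: the two writes are not atomic across tables, so a client that fails between them leaves the reverse table out of date.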