The proposal sounds interesting.

Will the indexes be maintained in the same "transaction" as the update? That
is, if an update to a row succeeds but index maintenance fails, would you
roll back the row update?

Are you also considering composite (multi-column) indexes like those used
in Google's App Engine?

We have been thinking about adding a Lucene-based text index to HBase,
one that can be maintained asynchronously with the table. Are you interested
in text indexes too?

Jun

[EMAIL PROTECTED] wrote on 04/22/2008 10:31:02 AM:

> All,
>
> We want to put secondary indexes into hbase. The motivation is that we
> are storing data in hbase that we want to serve to users. We would
> like to be able to serve rows sorted by column values. Our queries
> will be over rows with a given key prefix, so we should not be hitting
> too many regions.
>
> I was thinking it would work roughly like this:
>
> - At table creation time, individual columns can be declared as
> indexed. By default we could sort the column values lexicographically,
> or we can provide a WritableComparatorFactory<T> which has the ability
> to make values of type T from a byte [], as well as providing a
> Comparator<T>. (Better than providing a Comparator<byte[]>, as it
> costs one deserialization per row insert rather than two on each
> comparison.)
>
> - We catch all writes/deletes and maintain a SortedMap<T, HStoreKey>
> which keeps the column values in order, and maps them back to row
> keys. First cut may just keep all this in memory, but it should be
> backed with MapFile(s).
>
> - Add to the hregion the ability to scan through keys in column order.
> Just iterate through the SortedMap, run a filter on the key, and if it
> passes do a get on the row.
>
> - Add a ColumnOrderedClientScanner which will open column order
> scanners to all applicable hregions, and continuously pick the row with
> the lowest column value from each of the client scanners.
>
> - Region splits should be easy enough, just a scan through the
> SortedMap to partition.
>
> Of course, the index could also be used for more efficient querying on
> the indexed column's values.
>
> Do other users have a need for this functionality?
>
> What do developers think about this? I know hbase is more intended for
> back-end batch style processing, but we have this need.
>
> Cheers,
> -clint
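
For reference, the steps in the quoted proposal can be sketched in plain
Java roughly as follows. This is a minimal illustration under stated
assumptions, not HBase code: RegionColumnIndex and ColumnOrderedMerge are
hypothetical names, row keys are plain Strings rather than HStoreKey,
everything is held in memory, and each column value maps to a set of row
keys (an assumption added here, since a plain SortedMap<T, HStoreKey> would
keep only one row per distinct value).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical per-region index: column value -> row keys, in value order. */
class RegionColumnIndex<T> {
    private final TreeMap<T, TreeSet<String>> byValue;
    // Reverse mapping so an update or delete can find the old index entry.
    private final Map<String, T> byRow = new HashMap<>();

    RegionColumnIndex(Comparator<T> order) {
        this.byValue = new TreeMap<>(order);
    }

    /** Called on every write to the indexed column (value assumed non-null). */
    void put(String rowKey, T value) {
        delete(rowKey); // drop any stale entry for this row first
        byRow.put(rowKey, value);
        byValue.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    /** Called on every delete of the indexed column. */
    void delete(String rowKey) {
        T old = byRow.remove(rowKey);
        if (old == null) {
            return; // row was not indexed
        }
        TreeSet<String> rows = byValue.get(old);
        rows.remove(rowKey);
        if (rows.isEmpty()) {
            byValue.remove(old);
        }
    }

    /** Scan in column order, keeping only rows whose key has the prefix. */
    List<Map.Entry<T, String>> scan(String rowPrefix) {
        List<Map.Entry<T, String>> out = new ArrayList<>();
        for (Map.Entry<T, TreeSet<String>> e : byValue.entrySet()) {
            for (String row : e.getValue()) {
                if (row.startsWith(rowPrefix)) {
                    out.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), row));
                }
            }
        }
        return out;
    }
}

/** Hypothetical client-side merge of several column-ordered region scans. */
final class ColumnOrderedMerge {
    private static final class Cursor<T> {
        final Iterator<Map.Entry<T, String>> it;
        Map.Entry<T, String> head;

        Cursor(Iterator<Map.Entry<T, String>> it) {
            this.it = it;
            this.head = it.next();
        }
    }

    /** Repeatedly pick the row with the lowest column value across regions. */
    static <T> List<String> merge(List<List<Map.Entry<T, String>>> regionScans,
                                  Comparator<T> order) {
        PriorityQueue<Cursor<T>> heap = new PriorityQueue<>(
                (a, b) -> order.compare(a.head.getKey(), b.head.getKey()));
        for (List<Map.Entry<T, String>> scan : regionScans) {
            if (!scan.isEmpty()) {
                heap.add(new Cursor<>(scan.iterator()));
            }
        }
        List<String> rows = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor<T> c = heap.poll();
            rows.add(c.head.getValue());
            if (c.it.hasNext()) {
                c.head = c.it.next();
                heap.add(c); // re-insert with its new head value
            }
        }
        return rows;
    }
}
```

In this sketch a caller would build one RegionColumnIndex per region, call
scan() on each with the shared key prefix, and pass the results to
ColumnOrderedMerge.merge(). A real scanner would stream rather than
materialize lists, and would do a get on each row key it returns.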
