Secondary indexes

2008-04-22 Thread Clint Morgan
All, We want to put secondary indexes into hbase. The motivation is that we are storing data in hbase that we want to serve to users. We would like to be able to serve rows sorted by column values. Our queries will be over rows with a given key prefix, so we should not be hitting too many regions.

Re: Secondary indexes

2008-04-22 Thread stack
Some questions interlaced below: Clint Morgan wrote: All, We want to put secondary indexes into hbase. The motivation is that we are storing data in hbase that we want to serve to users. We would like to be able to serve rows sorted by column values. Our queries will be over rows with a given k

Re: Secondary indexes

2008-04-22 Thread Bryan Duxbury
This doesn't have to be all that complicated. Why not keep another HBase table as the index? The keys would be the column values. There'd be a single matchingRows column family, and the qualifiers and values would be the rows that match that column. Then, when you want to scan in column ord
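
For illustration only, the index-table idea described above can be modelled in a few lines of Python. This is a minimal in-memory sketch, not HBase client code: the `IndexTable` class and its `put`, `remove`, and `scan` methods are hypothetical names, and a dictionary of sets stands in for a real table whose keys are column values and whose matching rows hang off each key.

```python
from collections import defaultdict


class IndexTable:
    """Hypothetical in-memory stand-in for a second table keyed by column value."""

    def __init__(self):
        # column value -> set of row keys in the main table that hold that value
        self._rows_by_value = defaultdict(set)

    def put(self, column_value, row_key):
        """Record that row_key currently holds column_value."""
        self._rows_by_value[column_value].add(row_key)

    def remove(self, column_value, row_key):
        """Drop a stale entry, e.g. after the row's column value has changed."""
        rows = self._rows_by_value.get(column_value)
        if rows is None:
            return
        rows.discard(row_key)
        if not rows:
            del self._rows_by_value[column_value]

    def scan(self):
        """Yield (column_value, row_key) pairs in column-value order."""
        for value in sorted(self._rows_by_value):
            for row_key in sorted(self._rows_by_value[value]):
                yield value, row_key


index = IndexTable()
index.put(30, "XXX-row2")
index.put(10, "XXX-row1")
index.put(30, "YYY-row9")
ordered = list(index.scan())
```

Note that the scan returns rows for every key prefix, so a query restricted to one prefix still has to filter the results.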

Re: Secondary indexes

2008-04-22 Thread Clint Morgan
> I don't follow what the Factory adds. The Factory part of the name just means that it can make an object of type T from a byte[] (the column value). This is the type that we keep in the set and sort on. > > We're talking about getting HBASE-82 into 0.2. Does that interfere with > this propos

Re: Secondary indexes

2008-04-22 Thread Clint Morgan
Yeah, that would be an easy approach. We would need HBASE-82. The main problem I see here is that we cannot take (as much) advantage of our row key prefix in weeding out rows. Say I want the first 10 rows that start with XXX ordered by A:amount, then I would have to scan through column values from r

Re: Secondary indexes

2008-04-22 Thread Bryan Duxbury
If you just want the first 10 by a certain prefix ordered by a column, then you will definitely be better off scanning them by row and ordering them clientside. Surely your idea of maintaining a separate SortedMap in each region would work, but I don't think you should discount the cost ass
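
As a rough sketch of the client-side alternative mentioned above (scan by row prefix, then order on the client), assuming the scan results arrive as `(row_key, columns)` pairs. The function name `first_n_by_column` and the sample data are made up for illustration and are not part of any HBase API.

```python
import heapq


def first_n_by_column(rows, prefix, column, n):
    """Return the n rows under `prefix` with the smallest value in `column`.

    `rows` is any iterable of (row_key, columns) pairs, where `columns` maps
    a column name to its value. Rows lacking the column are skipped.
    """
    matching = (
        (row_key, columns)
        for row_key, columns in rows
        if row_key.startswith(prefix) and column in columns
    )
    # nsmallest keeps at most n candidates in memory while consuming the scan.
    return heapq.nsmallest(n, matching, key=lambda item: item[1][column])


rows = [
    ("XXX-1", {"A:amount": 50}),
    ("XXX-2", {"A:amount": 5}),
    ("XXX-3", {"A:amount": 20}),
    ("YYY-1", {"A:amount": 1}),
]
top = first_n_by_column(rows, "XXX", "A:amount", 2)
```

The client still reads every row under the prefix; only the ordering work and the memory held are bounded by `n`.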

Re: Secondary indexes

2008-04-22 Thread stack
Clint Morgan wrote: ... This scanner would have a significant client-side component to do the arbitrage between all regions to figure the lowest column value? If you had a new type of 'region' -- one denoted by lowest and upper column then the client-side logic would fade away and your scanner

Re: Secondary indexes

2008-04-22 Thread Jun Rao
The proposal sounds interesting. Will the indexes be maintained in the same "transaction" of an update? So, if an update to a row is successful, but index maintenance fails, would you roll back the row update? Are you also considering composite (multi-column) indexes like those used in Google's A

Re: Secondary indexes

2008-04-22 Thread Clint Morgan
The way I picture it, we only need to check the 100 or 1000 regions to get the very first row of an order-by scan. Afterwards, we just need to get the next row from the region that we took the last row from, compare with the existing rows we have, and return the lowest. So if we have R regions to g
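
The merge described above can be sketched as a k-way merge over per-region scanners. This is an illustrative model only: it assumes each region can already return its rows sorted by column value as `(column_value, row_key)` pairs, and `order_by_scan` is a hypothetical name. A heap does the "compare with the existing rows we have" step.

```python
import heapq


def order_by_scan(region_scanners):
    """Merge per-region results, each already sorted by column value.

    Each scanner is an iterable of (column_value, row_key) pairs.
    """
    iterators = [iter(scanner) for scanner in region_scanners]
    heap = []
    # Initial step: fetch the first row from every region.
    for region_id, iterator in enumerate(iterators):
        first = next(iterator, None)
        if first is not None:
            heap.append((first, region_id))
    heapq.heapify(heap)
    while heap:
        row, region_id = heapq.heappop(heap)
        yield row
        # Only the region that supplied the returned row is asked for another.
        following = next(iterators[region_id], None)
        if following is not None:
            heapq.heappush(heap, (following, region_id))


regions = [
    [(1, "XXX-a"), (7, "XXX-b")],
    [(3, "XXX-c")],
    [(2, "XXX-d"), (9, "XXX-e")],
]
merged = list(order_by_scan(regions))
```

In this model, every region is contacted once up front, and each later row costs one fetch from a single region plus a heap operation.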

Re: Secondary indexes

2008-04-22 Thread Edward J. Yoon
I remember someone objecting to the scheme. :) On Wed, Apr 23, 2008 at 6:46 AM, Clint Morgan <[EMAIL PROTECTED]> wrote: > The way I picture it, we only need to check the 100 or 1000 regions to > get the very first row of a order-by scan. Afterwards, we just need to > get the next row from the regi
