Actually, with coprocessors you can create a secondary index in short order. Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive.
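A minimal sketch of that two-fetch pattern, using plain Python dicts to stand in for HBase tables (the keys and values are made up for illustration; the coprocessor that would keep the index in sync is not shown):

```python
# Primary table keyed userId-date-sessionId (Chris's row key design).
primary = {
    "u1-20120730-s1": "session data A",
    "u2-20120731-s9": "session data B",
}

# Index table keyed date-userId-sessionId; the value is the primary row key.
# In HBase, a coprocessor hook would write this index row on every put
# to the primary table.
index = {
    "20120730-u1-s1": "u1-20120730-s1",
    "20120731-u2-s9": "u2-20120731-s9",
}

def rows_for_day(day):
    # Fetch 1: prefix scan on the index. Because HBase keys are sorted,
    # this is a short range scan, not a full pass over the table.
    primary_keys = [pk for k, pk in sorted(index.items()) if k.startswith(day)]
    # Fetch 2: point gets on the primary table.
    return [primary[pk] for pk in primary_keys]

print(rows_for_day("20120731"))  # ['session data B']
```

The date-first index key is what makes the time-range query a cheap range scan instead of a full scan; the price is the extra index table and the write amplification of keeping it current.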
On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcor...@hotpads.com> wrote:

> When deciding between a table scan vs a secondary index, you should try to
> estimate what percent of the underlying data blocks will be used in the
> query. By default, each block is 64KB.
>
> If each user's data is small and you are fitting multiple users per block,
> then you're going to need all the blocks, so a table scan is better because
> it's simpler. If each user has 1MB+ of data, then you will want to pick out
> the individual blocks relevant to each date. The secondary index will help
> you go directly to those sparse blocks, but with a cost in complexity,
> consistency, and extra denormalized data that knocks primary data out of
> your block cache.
>
> If latency is not a concern, I would start with the table scan. If that's
> too slow, you add the secondary index, and if you still need it faster, you
> do the primary key lookups in parallel as Jerry mentions.
>
> Matt
>
> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
>> Hi Chris:
>>
>> I'm thinking about building a secondary index for primary key lookup,
>> then querying using the primary keys in parallel.
>>
>> I'm interested to see if there are other options too.
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3...@yahoo.de> wrote:
>>
>>> Hello there,
>>>
>>> I designed a row key for queries that need the best performance
>>> (~100 ms), which looks like this:
>>>
>>> userId-date-sessionId
>>>
>>> These queries (scans) are always based on a userId and sometimes
>>> additionally on a date, too. That's no problem with the key above.
>>>
>>> However, another kind of query is based on a given time range,
>>> where the leftmost userId is not given or known.
>>> In this case I need to get all rows covering the given time range,
>>> with their dates, to create a daily report.
>>>
>>> Since I can't set wildcards at the beginning of a left-based index
>>> for the scan, the only possibility I see is to scan the index of the
>>> whole table to collect the row keys that fall inside the time range
>>> I'm interested in.
>>>
>>> Is there a more elegant way to collect rows within time range X?
>>> (Unfortunately, the date attribute is not equal to the timestamp that
>>> HBase stores automatically.)
>>>
>>> Could/should one maybe leverage some kind of row key caching to
>>> accelerate the collection process? Is that covered by the block cache?
>>>
>>> Thanks in advance for any advice.
>>>
>>> regards,
>>> Chris
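Jerry's follow-up step, fanning out the primary-key lookups in parallel, can be sketched with a thread pool. The `fetch` function here is a stand-in for a single HBase Get, and the table contents are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the primary table; row keys follow userId-date-sessionId.
primary = {
    "u1-20120731-s1": "report row A",
    "u2-20120731-s9": "report row B",
}

def fetch(row_key):
    # In a real client, this would be one HBase Get against the primary table.
    return primary[row_key]

def parallel_get(row_keys, workers=8):
    # Issue all point gets concurrently; map() preserves input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, row_keys))

print(parallel_get(["u1-20120731-s1", "u2-20120731-s9"]))
# ['report row A', 'report row B']
```

This matches the escalation Matt describes: plain table scan first, secondary index if that's too slow, and parallel gets on the indexed row keys if latency still matters.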