Hi,

What does your schema look like?
Would it make sense to change the key to user_id '|' timestamp and then use the session_id in the column name?

On Aug 2, 2012, at 7:23 AM, Christian Schäfer <syrious3...@yahoo.de> wrote:

> OK,
>
> at first I will try the scans.
>
> If that's too slow I will have to upgrade HBase (currently 0.90.4-cdh3u2) to be able to use coprocessors.
>
> Currently I'm stuck at the scans because they require two steps (and therefore some kind of filter chaining).
>
> The key: userId-dateInMillis-sessionId
>
> First, I need to extract dateInMillis with a regex or a substring (using special delimiters around the date).
>
> Second, the extracted value must be parsed to a Long and set on a RowFilter comparator like this:
>
>
>
> ----- Original Message -----
> From: Michael Segel <michael_se...@hotmail.com>
> To: user@hbase.apache.org
> Cc:
> Sent: Wednesday, 1 August 2012, 13:52
> Subject: Re: How to query by rowKey-infix
>
> Actually, with coprocessors you can create a secondary index in short order. Then your cost is going to be two fetches. Trying to do a partial table scan will be more expensive.
>
> On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcor...@hotpads.com> wrote:
>
>> When deciding between a table scan and a secondary index, you should try to estimate what percentage of the underlying data blocks will be used in the query. By default, each block is 64 KB.
>>
>> If each user's data is small and you are fitting multiple users per block, then you're going to need all the blocks anyway, so a table scan is better because it's simpler. If each user has 1 MB+ of data, then you will want to pick out the individual blocks relevant to each date. The secondary index will help you go directly to those sparse blocks, but at a cost in complexity, consistency, and extra denormalized data that knocks primary data out of your block cache.
>>
>> If latency is not a concern, I would start with the table scan. If that's too slow, you add the secondary index, and if you still need it faster, you do the primary-key lookups in parallel as Jerry mentions.
>>
>> Matt
>>
>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Chris:
>>>
>>> I'm thinking about building a secondary index for primary-key lookup, then querying using the primary keys in parallel.
>>>
>>> I'm interested to see if there are other options, too.
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <syrious3...@yahoo.de> wrote:
>>>
>>>> Hello there,
>>>>
>>>> I designed a row key for queries that need the best performance (~100 ms), which looks like this:
>>>>
>>>> userId-date-sessionId
>>>>
>>>> These queries (scans) are always based on a userId and sometimes additionally on a date, too. That's no problem with the key above.
>>>>
>>>> However, another kind of query shall be based on a given time range, where the leftmost userId is not given or known. In this case I need to get all rows covering the given time range, with their dates, to create a daily report.
>>>>
>>>> As I can't put wildcards at the beginning of a left-based index for the scan, the only possibility I see is to scan the whole table's index to collect the row keys that fall inside the time range I'm interested in.
>>>>
>>>> Is there a more elegant way to collect rows within time range X? (Unfortunately, the date attribute is not equal to the timestamp that HBase stores automatically.)
>>>>
>>>> Could/should one maybe leverage some kind of row-key caching to accelerate the collection process? Is that covered by the block cache?
>>>>
>>>> Thanks in advance for any advice.
>>>>
>>>> regards
>>>> Chris
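[Editor's note: the RowFilter code referred to above was lost from the archive. As a minimal sketch of the two-step logic Christian describes (extract the dateInMillis infix from a userId-dateInMillis-sessionId key, parse it to a Long, compare against a range), here is a self-contained, client-side Java equivalent. The class name, sample key values, and the assumption that neither userId nor sessionId contains the '-' delimiter are illustrative only, not taken from the original mails.]

```java
// Client-side sketch of the row-key infix check discussed in the thread.
// Assumes the key layout userId-dateInMillis-sessionId, with '-' used
// only as the delimiter between the three parts (an assumption).
public class RowKeyInfix {

    /** Extracts the dateInMillis infix from a userId-dateInMillis-sessionId key. */
    public static long extractDateMillis(String rowKey) {
        String[] parts = rowKey.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("unexpected key layout: " + rowKey);
        }
        return Long.parseLong(parts[1]);
    }

    /** True if the key's date infix falls inside [startMillis, endMillis). */
    public static boolean inRange(String rowKey, long startMillis, long endMillis) {
        long millis = extractDateMillis(rowKey);
        return millis >= startMillis && millis < endMillis;
    }

    public static void main(String[] args) {
        String key = "user42-1343809200000-sessionAbc";
        System.out.println(extractDateMillis(key));                        // 1343809200000
        System.out.println(inRange(key, 1343800000000L, 1343900000000L));  // true
    }
}
```

Server-side, the same comparison would live inside a filter applied to the scan (e.g. a RowFilter with a suitable comparator, as mentioned in the thread), so that non-matching rows are skipped in the region server rather than shipped to the client.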