Hi Christian, I had similar requirements to yours. So far I have used timestamps for filtering the data, and I would say the performance is satisfactory. Here are the results of timestamp-based filtering: the table has 34 million records (average row size is 1.21 KB), and in 136 seconds I get the entire result of a query that returned 225 rows. I am running an HBase 0.92, 8-node cluster on VMware Hypervisor. Each node has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up hosts 2 slave instances (2 VMs running DataNode, NodeManager, RegionServer). I have only allocated 1200 MB for the RegionServers. I haven't made any modifications to the block size of HDFS or HBase. Considering the below-par hardware configuration of the cluster, I feel the performance is OK, and IMO it will be better than a substring comparator on column values, since with a substring comparator filter you are essentially doing a FULL TABLE scan, whereas with a timerange-based scan you can *Skip Store Files*.
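The store-file skipping mentioned above can be illustrated with a minimal sketch (plain Java, not HBase code; the class and method names here are made up for illustration). HBase tracks the min/max cell timestamp per store file, so a scan with a time range only needs to open the files whose range overlaps the query, while a substring comparator on values has to read every row of every file:

```java
import java.util.ArrayList;
import java.util.List;

public class TimeRangeSkip {
    // Simplified stand-in for an HBase store file: it remembers the
    // [minTs, maxTs] range of the cells it contains.
    static class StoreFile {
        final long minTs, maxTs;
        StoreFile(long minTs, long maxTs) { this.minTs = minTs; this.maxTs = maxTs; }
    }

    // A time-range scan only opens files whose range intersects [from, to).
    static List<StoreFile> filesToRead(List<StoreFile> files, long from, long to) {
        List<StoreFile> result = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.maxTs >= from && f.minTs < to) { // interval overlap test
                result.add(f);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<StoreFile> files = new ArrayList<>();
        files.add(new StoreFile(0, 999));     // older data: skipped
        files.add(new StoreFile(1000, 1999)); // overlaps the query: read
        files.add(new StoreFile(2000, 2999)); // newer data: skipped
        System.out.println(filesToRead(files, 1500, 1800).size()); // prints 1
    }
}
```

A value-substring filter offers no such pruning: every store file's rows must be deserialized and tested.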
On a side note, Alex created a JIRA for enhancing the current FuzzyRowFilter to also do range-based filtering. Here is the link: https://issues.apache.org/jira/browse/HBASE-6618 . You are more than welcome to chime in. HTH, Anil Gupta On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <syrious3...@yahoo.de> wrote: > Nice. Thanks, Alex, for sharing your experiences with that custom filter > implementation. > > > Currently I'm still using a key filter with a substring comparator. > As soon as I have a good amount of test data I will measure the performance of > that naive substring filter in comparison to your fuzzy row filter. > > regards, > Christian > > > > ________________________________ > From: Alex Baranau <alex.barano...@gmail.com> > To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de> > Sent: Thursday, 9 August 2012, 22:18 > Subject: Re: How to query by rowKey-infix > > > FYI: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will > add documentation to the HBase book very soon [1] > > Alex Baranau > ------ > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr > > [1] https://issues.apache.org/jira/browse/HBASE-6526 > > On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <alex.barano...@gmail.com> > wrote: > > Good! > > > > > >Submitted an initial patch of the fuzzy row key filter at > https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the > filter class, include it in your code, and use it in your setup as any > other custom filter (no need to patch HBase). > > > > > >Please let me know if you try it out (or post your comments at > HBASE-6509). > > > > > >Alex Baranau > >------ > >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr > > > > > >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <syrious3...@yahoo.de> > wrote: > > > >Hi Alex, > >> > >>thanks a lot for the hint about setting the timestamp of the put.
> >>I didn't know that this would be possible, but that solves the problem > (the first test was successful). > >>So I'm really glad that I don't need to apply a filter to extract the > time and so on for every row. > >> > >>Nevertheless I would like to see your custom filter implementation. > >>It would be nice if you could provide it to help me get into it a bit. > >> > >>And yes, that helped :) > >> > >>regards > >>Chris > >> > >> > >> > >>________________________________ > >>From: Alex Baranau <alex.barano...@gmail.com> > >>To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de> > >>Sent: Friday, 3 August 2012, 0:57 > >> > >>Subject: Re: How to query by rowKey-infix > >> > >> > >>Hi Christian! > >>If we set aside secondary indexes and assume you are going with "heavy > scans", you can try the two following things to make it much faster. If this is > appropriate to your situation, of course. > >> > >>1. > >> > >>> Is there a more elegant way to collect rows within time range X? > >>> (Unfortunately, the date attribute is not equal to the timestamp that > is stored by hbase automatically.) > >> > >>Can you set the timestamp of the Puts to the one you have in the row key? > Instead of relying on the one that HBase sets automatically (current ts). > If you can, this will improve reading speed a lot, by setting a time range on the > scanner. Depending on how you are writing your data, of course, but I assume > that you mostly write data in a "time-increasing" manner. > >> > >> > >>2. > >> > >>If your userId has a fixed length, or you can change it so that it has a > fixed length, then you can actually use something like a "wildcard" in the row key. > There's a way in a Filter implementation to fast-forward to a record with a > specific row key and, by doing this, skip many records.
This might be used as > follows: > >>* suppose your userId is 5 characters in length > >>* suppose you are scanning for records with time between 2012-08-01 > and 2012-08-08 > >>* while scanning records, when you encounter e.g. key > "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell > the scanner from your filter to fast-forward to key "aaaab_2012-08-01", > because you know that all remaining records of user "aaaaa" don't fall into > the interval you need (as the time for those records will be >= 2012-08-09). > >> > >>As of now, I believe you will have to implement your own custom filter to do > that. > Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT > >>I believe I implemented a similar thing some time ago. If this idea works > for you I could look for the implementation and share it if it helps. Or > maybe even simply add it to the HBase codebase. > >> > >>Hope this helps, > >> > >> > >>Alex Baranau > >>------ > >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch > - Solr > >> > >> > >> > >>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <syrious3...@yahoo.de> > wrote: > >> > >> > >>> > >>>Excuse my double posting. > >>>Here is the complete mail: > >>> > >>> > >>> > >>>OK, > >>> > >>>at first I will try the scans. > >>> > >>>If that's too slow I will have to upgrade HBase (currently > 0.90.4-cdh3u2) to be able to use coprocessors.
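The fast-forward step Alex describes can be sketched in plain Java. This is an illustration only: the helper below is not HBase API, and in a real filter this key would be returned from getNextKeyHint() together with ReturnCode.SEEK_NEXT_USING_HINT. It assumes the fixed 5-character userId and the userId_yyyy-MM-dd_sessionId layout from the example:

```java
public class SeekHint {
    // Given the current row key and the lower bound of the date range,
    // build the key to fast-forward to: increment the last byte of the
    // fixed-length userId prefix and append the range start date.
    static String nextKeyHint(String currentKey, int userIdLen, String fromDate) {
        String userId = currentKey.substring(0, userIdLen);
        // "Increment" the user id so the scanner jumps past all remaining
        // rows of the current user. (A real implementation must also handle
        // the carry-over case when the last byte is already at its maximum.)
        char[] next = userId.toCharArray();
        next[userIdLen - 1]++;
        return new String(next) + "_" + fromDate;
    }

    public static void main(String[] args) {
        // We met user "aaaaa" at 2012-08-09, past the range end 2012-08-08,
        // so skip straight to user "aaaab" at the range start.
        System.out.println(nextKeyHint("aaaaa_2012-08-09_3jh345j345kjh", 5, "2012-08-01"));
        // prints aaaab_2012-08-01
    }
}
```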
> >>> > >>> > >>>Currently I'm stuck at the scans, because it requires two steps > (therefore maybe some kind of filter chaining is required). > >>> > >>> > >>>The key: userId-dateInMillis-sessionId > >>> > >>> > >>>First I need to extract dateInMillis with a regex or substring (using > special delimiters for the date). > >>> > >>>Second, the extracted value must be parsed to a Long and passed to a > RowFilter comparator like this: > >>> > >>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new > BinaryComparator(Bytes.toBytes((Long)dateInMillis)))); > >>> > >>>How to chain that? > >>>Do I have to write a custom filter? > >>>(I would like to avoid that due to deployment.) > >>> > >>>regards > >>>Chris > >>> > >>> > >>>----- Original Message ----- > >>>From: Michael Segel <michael_se...@hotmail.com> > >>>To: user@hbase.apache.org > >>>CC: > >>>Sent: Wednesday, 1 August 2012, 13:52 > >>>Subject: Re: How to query by rowKey-infix > >>> > >>>Actually, with coprocessors you can create a secondary index in short order. > >>>Then your cost is going to be 2 fetches. Trying to do a partial table > scan will be more expensive. > >>> > >>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcor...@hotpads.com> wrote: > >>> > >>>> When deciding between a table scan vs. a secondary index, you should try > to > >>>> estimate what percentage of the underlying data blocks will be used in > the > >>>> query. By default, each block is 64KB. > >>>> > >>>> If each user's data is small and you are fitting multiple users per > block, > >>>> then you're going to need all the blocks, so a table scan is better > because > >>>> it's simpler. If each user has 1MB+ of data then you will want to pick > out > >>>> the individual blocks relevant to each date. The secondary index > will help > >>>> you go directly to those sparse blocks, but at a cost in complexity, > >>>> consistency, and extra denormalized data that knocks primary data out > of > >>>> your block cache.
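Matt's rule of thumb is easy to put into numbers (a sketch; only the 64 KB default block size comes from his message, the helper name is mine). The more blocks a single user's data spans, the more a secondary index that jumps straight to the relevant blocks pays off:

```java
public class BlockEstimate {
    // Blocks touched per user: if several users share one block, a plain
    // table scan reads those blocks anyway; if one user spans many blocks,
    // an index can skip straight to the few that matter.
    static long blocksPerUser(long bytesPerUser, long blockSize) {
        return (bytesPerUser + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64 * 1024;                        // 64 KB default
        System.out.println(blocksPerUser(1024, blockSize));        // small user: 1 block
        System.out.println(blocksPerUser(1024 * 1024, blockSize)); // 1 MB user: 16 blocks
    }
}
```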
> >>>> > >>>> If latency is not a concern, I would start with the table scan. If > that's > >>>> too slow you add the secondary index, and if you still need it faster > you > >>>> do the primary key lookups in parallel as Jerry mentions. > >>>> > >>>> Matt > >>>> > >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chiling...@gmail.com> > wrote: > >>>> > >>>>> Hi Chris: > >>>>> > >>>>> I'm thinking about building a secondary index for primary key > lookup, then > >>>>> querying using the primary keys in parallel. > >>>>> > >>>>> I'm interested to see if there are other options too. > >>>>> > >>>>> Best Regards, > >>>>> > >>>>> Jerry > >>>>> > >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer < > syrious3...@yahoo.de > >>>>>> wrote: > >>>>> > >>>>>> Hello there, > >>>>>> > >>>>>> I designed a row key for queries that need the best performance (~100 > ms), > >>>>>> which looks like this: > >>>>>> > >>>>>> userId-date-sessionId > >>>>>> > >>>>>> These queries (scans) are always based on a userId and sometimes > >>>>>> additionally on a date, too. > >>>>>> That's no problem with the key above. > >>>>>> > >>>>>> However, another kind of query needs to be based on a given time > range, > >>>>>> where the leftmost userId is not given or known. > >>>>>> In this case I need to get all rows covering the given time range > with > >>>>>> their date to create a daily report. > >>>>>> > >>>>>> As I can't set wildcards at the beginning of a left-aligned index for > the > >>>>>> scan, > >>>>>> I only see the possibility of scanning the index of the whole table to > >>>>> collect > >>>>>> the > >>>>>> rowKeys that are inside the time range I'm interested in. > >>>>>> > >>>>>> Is there a more elegant way to collect rows within time range X? > >>>>>> (Unfortunately, the date attribute is not equal to the timestamp > that is > >>>>>> stored by hbase automatically.)
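Since the date in this design is only recoverable from the row key itself, collecting rows in a time range without a secondary index boils down to evaluating one predicate per key, which is exactly what a custom filter's row-key check would do. A minimal sketch of that predicate (plain Java, assuming '-' delimiters and no '-' inside userId; the helper names are mine, not HBase API):

```java
public class InfixDateCheck {
    // Key layout assumed from the thread: userId-dateInMillis-sessionId,
    // with '-' as the delimiter.
    static long extractDateInMillis(String rowKey) {
        int first = rowKey.indexOf('-');
        int second = rowKey.indexOf('-', first + 1);
        return Long.parseLong(rowKey.substring(first + 1, second));
    }

    // Both steps (extract, then compare) collapse into one predicate; the
    // two-filter "chain" is not needed if a single filter evaluates this.
    static boolean inRange(String rowKey, long from, long to) {
        long ts = extractDateInMillis(rowKey);
        return ts >= from && ts < to;  // half-open range [from, to)
    }

    public static void main(String[] args) {
        System.out.println(inRange("user42-1343779200000-abc",
                1343000000000L, 1344000000000L)); // prints true
    }
}
```

Setting the cell timestamp to dateInMillis at write time, as suggested earlier in the thread, avoids even this per-row parsing by letting the scanner's time range do the work.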
> >>>>>> > >>>>>> Could/should one maybe leverage some kind of row key caching to > >>>>> accelerate > >>>>>> the collection process? > >>>>>> Is that covered by the block cache? > >>>>>> > >>>>>> Thanks in advance for any advice. > >>>>>> > >>>>>> regards > >>>>>> Chris > >>>>>> > >>>>> > >>> > >> > >> > >>-- > >> > >>Alex Baranau > >>------ > >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch > - Solr > >> > > > > > > > >-- > > > >Alex Baranau > >------ > >Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch > - Solr > > > -- Thanks & Regards, Anil Gupta