Hi Anil,

To restrict data to a certain time window, I also set a time range on the scan.

I'm slightly shocked by a processing time of more than 2 minutes to return 
225 rows.
I would actually need a response within 5-10 seconds.
In your timestamp-based filtering, do you check the timestamp as part of the 
row key, or do you use the Put timestamp (as I do)?
How many rows are scanned/touched by your timestamp-based filtering? 

Is it a full table scan where each row's key is checked against a given 
timestamp/timerange?


My use case of obtaining data via a substring comparator operates on the row key.
It really can't be replaced by setting the time range in my case. 

Btw, the scan is additionally restricted to a certain time range to increase 
the skipping of irrelevant store files and thus improve performance.
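For reference, HBase's Scan.setTimeRange(min, max) takes a half-open [min, max) interval of milliseconds. Computing those bounds for an inclusive date window can be sketched like this (class and method names are illustrative, not from this thread; dates are interpreted as UTC):

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class TimeRangeBounds {

    // Convert an inclusive date window into the half-open [min, max)
    // millisecond interval that HBase's Scan.setTimeRange(min, max) expects.
    // Dates are interpreted as UTC here; adjust the zone to your data.
    static long[] boundsFor(LocalDate first, LocalDate lastInclusive) {
        long min = first.atStartOfDay(ZoneOffset.UTC)
                        .toInstant().toEpochMilli();
        long max = lastInclusive.plusDays(1)
                                .atStartOfDay(ZoneOffset.UTC)
                                .toInstant().toEpochMilli();
        return new long[] { min, max };
    }
}
```

With such bounds, scan.setTimeRange(bounds[0], bounds[1]) lets the region server skip store files whose timestamp ranges fall entirely outside the window.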

 
regards,
Christian



----- Original Message -----
From: anil gupta <anilgupt...@gmail.com>
To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
CC: 
Sent: 20:42 Wednesday, 22 August 2012
Subject: Re: How to query by rowKey-infix

Hi Christian,

I had similar requirements to yours. So far I have used
timestamps for filtering the data, and I would say the performance is
satisfactory. Here are the results of timestamp-based filtering:
the table has 34 million records (average row size is 1.21 KB); in 136
seconds I get the entire result of a query that returned 225 rows.
I am running an HBase 0.92, 8-node cluster on VMware hypervisor. Each node
has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my set-up
hosts 2 slave instances (2 VMs running DataNode,
NodeManager, and RegionServer). I have only allocated 1200 MB for the
RegionServers. I haven't modified the block size of HDFS or HBase.
Considering the below-par hardware configuration of the cluster, I feel the
performance is OK, and IMO it'll be better than a substring comparator on
column values, since with a substring comparator filter you are essentially
doing a FULL TABLE scan, whereas with a timerange-based scan you can
*skip store files*.

On a side note, Alex created a JIRA issue for enhancing the current
FuzzyRowFilter to also do range-based filtering. Here is the link:
https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
welcome to chime in.
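For readers unfamiliar with FuzzyRowFilter: its core idea is to match fixed row-key positions and treat masked positions as wildcards. A plain-Java illustration of that matching rule (this is a sketch of the concept, not the filter's actual code; following the filter's convention, mask byte 0 means "must match" and 1 means "don't care"):

```java
public class FuzzyMatch {

    // Fuzzy row-key match: for every position i of the pattern,
    // mask[i] == 0 -> key[i] must equal pattern[i] (fixed byte),
    // mask[i] == 1 -> key[i] may be anything (wildcard byte).
    static boolean matches(byte[] key, byte[] pattern, byte[] mask) {
        if (key.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && key[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With a key layout like userId_date_sessionId and a 5-byte userId, a pattern of "?????_2012-08-01" (first 5 mask bytes set to 1) matches any user's rows for that date. The real filter additionally computes seek hints so the scanner can skip ahead instead of testing every row.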

HTH,
Anil Gupta


On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <syrious3...@yahoo.de> wrote:

> Nice. Thanks Alex for sharing your experiences with that custom filter
> implementation.
>
>
> Currently I'm still using a key filter with a substring comparator.
> As soon as I have a good amount of test data, I will measure the performance
> of that naive substring filter in comparison to your fuzzy row filter.
>
> regards,
> Christian
>
>
>
> ________________________________
> From: Alex Baranau <alex.barano...@gmail.com>
> To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
> Sent: 22:18 Thursday, 9 August 2012
> Subject: Re: How to query by rowKey-infix
>
>
> jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> add documentation to the HBase book very soon [1].
>
> Alex Baranau
> ------
> Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
>
> [1] https://issues.apache.org/jira/browse/HBASE-6526
>
> On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <alex.barano...@gmail.com>
> wrote:
>
> Good!
> >
> >
> >Submitted an initial patch of the fuzzy row-key filter at
> https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> filter class, include it in your code, and use it in your setup as any
> other custom filter (no need to patch HBase).
> >
> >
> >Please let me know if you try it out (or post your comments at
> HBASE-6509).
> >
> >
> >Alex Baranau
> >------
> >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr
> >
> >
> >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer <syrious3...@yahoo.de>
> wrote:
> >
> >Hi Alex,
> >>
> >>Thanks a lot for the hint about setting the timestamp of the Put.
> >>I didn't know that this was possible, but it solves the problem
> (the first test was successful).
> >>So I'm really glad that I don't need to apply a filter to extract the
> time and so on for every row.
> >>
> >>Nevertheless, I would like to see your custom filter implementation.
> >>It would be nice if you could provide it to help me get into it a bit.
> >>
> >>And yes that helped :)
> >>
> >>regards
> >>Chris
> >>
> >>
> >>
> >>________________________________
> >>From: Alex Baranau <alex.barano...@gmail.com>
> >>An: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
> >>Sent: 0:57 Friday, 3 August 2012
> >>
> >>Subject: Re: How to query by rowKey-infix
> >>
> >>
> >>Hi Christian!
> >>If we put aside secondary indexes and assume you are going with "heavy
> scans", you can try the following two things to make it much faster, if this
> is appropriate to your situation, of course.
> >>
> >>1.
> >>
> >>> Is there a more elegant way to collect rows within time range X?
> >>> (Unfortunately, the date attribute is not equal to the timestamp that
> >>> is stored by HBase automatically.)
> >>
> >>Can you set the timestamp of the Puts to the one you have in the row key,
> instead of relying on the one that HBase sets automatically (the current ts)?
> If you can, this will improve reading speed a lot by letting you set a time
> range on the scanner. It depends on how you are writing your data, of
> course, but I assume that you mostly write data in a "time-increasing" manner.
> >>
> >>
> >>2.
> >>
> >>If your userId has a fixed length, or you can change it so that it has a
> fixed length, then you can actually use something like a "wildcard" in the
> row key. There's a way in a Filter implementation to fast-forward to the
> record with a specific row key and thereby skip many records. This might be
> used as follows:
> >>* suppose your userId is 5 characters in length
> >>* suppose you are scanning for records with time between 2012-08-01
> and 2012-08-08
> >>* when you are scanning records and you encounter e.g. key
> "aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you can tell
> the scanner from your filter to fast-forward to key "aaaab_2012-08-01",
> because you know that all remaining records of user "aaaaa" don't fall into
> the interval you need (as the time for its records will be >= 2012-08-09).
> >>
> >>As of now, I believe you will have to implement your own custom filter to
> do that.
> Pointer: org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> >>I believe I implemented a similar thing some time ago. If this idea works
> for you, I could look for the implementation and share it. Or
> maybe even simply add it to the HBase codebase.
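The fast-forward computation Alex describes can be sketched in plain Java, ignoring the HBase filter plumbing (names here are illustrative; a real filter would return this hint to the scanner via the SEEK_NEXT_USING_HINT mechanism and operate on raw key bytes):

```java
public class HintKey {

    // Given the current row key "<userId>_<yyyy-MM-dd>_<sessionId>" with a
    // fixed-length userId, compute the seek hint: the smallest possible key
    // of the *next* user at the start of the wanted time range. All remaining
    // keys of the current user are thereby skipped.
    static String nextUserHint(String currentKey, int userIdLen,
                               String rangeStart) {
        char[] userId = currentKey.substring(0, userIdLen).toCharArray();
        // Lexicographically increment the user id ("aaaaa" -> "aaaab").
        // A real implementation works on bytes and must handle 0xFF overflow.
        userId[userIdLen - 1]++;
        return new String(userId) + "_" + rangeStart;
    }
}
```

E.g. from key "aaaaa_2012-08-09_3jh345j345kjh" with a range starting at 2012-08-01, the hint is "aaaab_2012-08-01", exactly as in the example above.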
> >>
> >>Hope this helps,
> >>
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >>
> >>
> >>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <syrious3...@yahoo.de>
> wrote:
> >>
> >>
> >>>
> >>>Excuse my double posting.
> >>>Here is the complete mail:
> >>>
> >>>
> >>>
> >>>OK,
> >>>
> >>>at first I will try the scans.
> >>>
> >>>If that's too slow, I will have to upgrade HBase (currently
> 0.90.4-cdh3u2) to be able to use coprocessors.
> >>>
> >>>
> >>>Currently I'm stuck on the scans because they require two steps
> (therefore maybe some kind of filter chaining is required).
> >>>
> >>>
> >>>The key:  userId-dateInMillis-sessionId
> >>>
> >>>
> >>>First, I need to extract dateInMillis with a regex or substring (using
> special delimiters for the date).
> >>>
> >>>Second, the extracted value must be parsed into a Long and passed to a
> RowFilter comparator like this:
> >>>
> >>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL,
> >>>    new BinaryComparator(Bytes.toBytes((Long) dateInMillis))));
> >>>
> >>>How to chain that?
> >>>Do I have to write a custom filter?
> >>>(Would like to avoid that due to deployment)
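For what it's worth, the extraction-and-compare step Christian describes can be sketched in plain Java first, independently of filter chaining (key layout as described above; class and helper names are illustrative):

```java
public class KeyDate {

    // Extract dateInMillis from a row key of the form
    // "<userId>-<dateInMillis>-<sessionId>", assuming '-' appears only
    // as the field delimiter.
    static long dateInMillis(String rowKey) {
        int first = rowKey.indexOf('-');
        int second = rowKey.indexOf('-', first + 1);
        return Long.parseLong(rowKey.substring(first + 1, second));
    }

    // True if the key's date falls into the half-open interval [from, to).
    static boolean inRange(String rowKey, long from, long to) {
        long t = dateInMillis(rowKey);
        return t >= from && t < to;
    }
}
```

Inside a custom filter this check would run per row; composing stock filters instead would require something like a FilterList, but as the question notes, extracting the infix first is the hard part that stock comparators don't do.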
> >>>
> >>>regards
> >>>Chris
> >>>
> >>>
> >>>----- Original Message -----
> >>>From: Michael Segel <michael_se...@hotmail.com>
> >>>To: user@hbase.apache.org
> >>>CC:
> >>>Sent: 13:52 Wednesday, 1 August 2012
> >>>Subject: Re: How to query by rowKey-infix
> >>>
> >>>Actually, with coprocessors you can create a secondary index in short
> order. Then your cost is going to be 2 fetches. Trying to do a partial
> table scan will be more expensive.
> >>>
> >>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> >>>
> >>>> When deciding between a table scan vs. a secondary index, you should
> >>>> try to estimate what percentage of the underlying data blocks will be
> >>>> used in the query.  By default, each block is 64KB.
> >>>>
> >>>> If each user's data is small and you are fitting multiple users per
> >>>> block, then you're going to need all the blocks, so a tablescan is
> >>>> better because it's simpler.  If each user has 1MB+ of data then you
> >>>> will want to pick out the individual blocks relevant to each date.
> >>>> The secondary index will help you go directly to those sparse blocks,
> >>>> but with a cost in complexity, consistency, and extra denormalized
> >>>> data that knocks primary data out of your block cache.
> >>>>
> >>>> If latency is not a concern, I would start with the table scan.  If
> >>>> that's too slow you add the secondary index, and if you still need it
> >>>> faster you do the primary-key lookups in parallel as Jerry mentions.
> >>>>
> >>>> Matt
> >>>>
> >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chiling...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi Chris:
> >>>>>
> >>>>> I'm thinking about building a secondary index for primary-key
> >>>>> lookup, then querying using the primary keys in parallel.
> >>>>>
> >>>>> I'm interested to see if there are other options too.
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> Jerry
> >>>>>
> >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer
> >>>>> <syrious3...@yahoo.de> wrote:
> >>>>>
> >>>>>> Hello there,
> >>>>>>
> >>>>>> I designed a row key for queries that need best performance (~100 ms),
> >>>>>> which looks like this:
> >>>>>>
> >>>>>> userId-date-sessionId
> >>>>>>
> >>>>>> These queries (scans) are always based on a userId and sometimes
> >>>>>> additionally on a date, too.
> >>>>>> That's no problem with the key above.
> >>>>>>
> >>>>>> However, another kind of query shall be based on a given time range,
> >>>>>> whereas the leftmost userId is not given or known.
> >>>>>> In this case I need to get all rows covering the given time range,
> >>>>>> with their dates, to create a daily report.
> >>>>>>
> >>>>>> As I can't set wildcards at the beginning of a left-based index for
> >>>>>> the scan, I only see the possibility to scan the index of the whole
> >>>>>> table to collect the rowKeys that are inside the timerange I'm
> >>>>>> interested in.
> >>>>>>
> >>>>>> Is there a more elegant way to collect rows within time range X?
> >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> >>>>>> that is stored by HBase automatically.)
> >>>>>>
> >>>>>> Could/should one maybe leverage some kind of row-key caching to
> >>>>>> accelerate the collection process?
> >>>>>> Is that covered by the block cache?
> >>>>>>
> >>>>>> Thanks in advance for any advice.
> >>>>>>
> >>>>>> regards
> >>>>>> Chris
> >>>>>>
> >>>>>
> >>>
> >>
> >>
> >>--
> >>
> >>Alex Baranau
> >>------
> >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >>
> >
> >
> >
> >--
> >
> >Alex Baranau
> >------
> >Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> - Solr
> >
>



-- 
Thanks & Regards,
Anil Gupta
