Christian: I'm slightly shocked by the processing time of more than 2
minutes to return 225 rows. I would actually need a response in 5-10 seconds.
Anil: I started getting the response within 1-2 seconds of firing the query,
but it took 2 minutes to get all 225 results. My table has 34 million rows,
and each row has 25 columns on average. The average size of each row is
around 1.21 KB. The size of one replica in HDFS is ~40 GB.
I haven't compared timestamp-based filtering with column-value-based
filtering. However, I strongly believe that timestamp-based filtering will
be the winner, for the reason that it can skip blocks.
Regarding the concern that my query took 2 minutes: one of the reasons is
that the hardware configuration is way below par, so I don't really look for
blazing-fast performance on this cluster. With a really well-tuned HBase,
your performance can easily improve by 3-4x (the query would be done in
20-30 seconds). But I don't think you can get blazing-fast results like the
ones we get when we scan based on the RowKey.

Christian: In your timestamp-based filtering, do you check the timestamp
as part of the row key, or do you use the put timestamp (as I do)?
Anil: I use the timestamp via Scan.setTimeRange(long, long). In my use case
I am not using the row key at all. So it is roughly a full table scan, but
the timestamp is doing all the magic. It's a definite advantage if you can
use the row key in your query.
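For reference, a minimal sketch of such a time-range scan (the table name,
timestamps, and configuration are made up; this assumes the 0.92-era HBase
client API and a running cluster):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TimeRangeScan {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "events");

        // No start/stop row, so this is still a full table scan, but
        // setTimeRange lets HBase skip store files whose cell timestamps
        // fall entirely outside the requested window.
        Scan scan = new Scan();
        scan.setTimeRange(1345600000000L, 1345686400000L); // [min, max) in ms

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                System.out.println(r);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}
```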

Christian: Is it a full table scan where each row's key is checked against a
given timestamp/time range?
Anil: Essentially it's a full table scan, since I am not using any row key
or other filters.

Christian: How many rows are scanned/touched by your timestamp-based
filtering?
Anil: I don't know how to get those stats. Can anyone enlighten me? I am
also curious to know this.

I'll also try to run the column-value-based filter so that we get some more
insight into the best option available. Let me know your thoughts on my
reply.

Thanks,
Anil Gupta


On Thu, Aug 23, 2012 at 1:41 AM, Christian Schäfer <syrious3...@yahoo.de> wrote:

> Hi Anil,
>
> to restrict data to a certain time window I also set timerange for the
> scan.
>
>
>
> How many rows are scanned/touched by your timestamp-based filtering?
>
>
>
> My use case of obtaining data by substring comparator operates on the row
> key.
> It really can't be replaced by setting the time range in my case.
>
> Btw., the scan is additionally restricted to a certain time range to
> increase skipping of irrelevant files and thus improve performance.
>
>
> regards,
> Christian
>
>
>
> ----- Original Message -----
> From: anil gupta <anilgupt...@gmail.com>
> To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
> CC:
> Sent: Wednesday, 22 August 2012, 20:42
> Subject: Re: How to query by rowKey-infix
>
> Hi Christian,
>
> I had similar requirements to yours. So far I have used timestamps for
> filtering the data, and I would say the performance is satisfactory. Here
> are the results of the timestamp-based filtering:
> The table has 34 million records (average row size is 1.21 KB); in 136
> seconds I get the entire result of the query, which had 225 rows.
> I am running an 8-node HBase 0.92 cluster on VMware hypervisor. Each node
> has 3.2 GB of memory and 500 GB of HDFS space. Each hard drive in my setup
> hosts 2 slave instances (2 VMs running Datanode, NodeManager, and
> RegionServer). I have only allocated 1200 MB for the RegionServers. I
> haven't done any modification of the block size of HDFS or HBase.
> Considering the below-par hardware configuration of the cluster, I feel
> the performance is OK, and IMO it'll be better than a substring comparator
> on column values, since with a substring comparator filter you are
> essentially doing a FULL TABLE scan, whereas with a timerange-based scan
> you can *skip store files*.
>
> On a side note, Alex created a JIRA for enhancing the current
> FuzzyRowFilter to do range-based filtering as well. Here is the link:
> https://issues.apache.org/jira/browse/HBASE-6618 . You are more than
> welcome to chime in.
>
> HTH,
> Anil Gupta
>
>
> On Thu, Aug 9, 2012 at 1:55 PM, Christian Schäfer <syrious3...@yahoo.de
> >wrote:
>
> > Nice. Thanks, Alex, for sharing your experiences with that custom
> > filter implementation.
> >
> >
> > Currently I'm still using a key filter with a substring comparator.
> > As soon as I have a good amount of test data, I will measure the
> > performance of that naive substring filter in comparison to your fuzzy
> > row filter.
> >
> > regards,
> > Christian
> >
> >
> >
> > ________________________________
> > From: Alex Baranau <alex.barano...@gmail.com>
> > To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
> > Sent: Thursday, 9 August 2012, 22:18
> > Subject: Re: How to query by rowKey-infix
> >
> >
> > jfyi: documented FuzzyRowFilter usage here: http://bit.ly/OXVdbg. Will
> > add documentation to the HBase book very soon [1]
> >
> > Alex Baranau
> > ------
> > Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-6526
> >
> > On Fri, Aug 3, 2012 at 6:14 PM, Alex Baranau <alex.barano...@gmail.com>
> > wrote:
> >
> > Good!
> > >
> > >
> > >Submitted an initial patch of the fuzzy row key filter at
> > >https://issues.apache.org/jira/browse/HBASE-6509. You can just copy the
> > >filter class, include it in your code, and use it in your setup like
> > >any other custom filter (no need to patch HBase).
> > >
> > >
> > >Please let me know if you try it out (or post your comments at
> > HBASE-6509).
> > >
> > >
> > >Alex Baranau
> > >------
> > >Sematext :: http://sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> > >
> > >
> > >On Fri, Aug 3, 2012 at 5:23 AM, Christian Schäfer
> > ><syrious3...@yahoo.de> wrote:
> > >
> > >Hi Alex,
> > >>
> > >>thanks a lot for the hint about setting the timestamp of the put.
> > >>I didn't know that this was possible, but it solves the problem (the
> > >>first test was successful).
> > >>So I'm really glad that I don't need to apply a filter to extract the
> > >>time and so on for every row.
> > >>
> > >>Nevertheless, I would like to see your custom filter implementation.
> > >>It would be nice if you could provide it, to help me get into it a bit.
> > >>
> > >>And yes, that helped :)
> > >>
> > >>regards
> > >>Chris
> > >>
> > >>
> > >>
> > >>________________________________
> > >>From: Alex Baranau <alex.barano...@gmail.com>
> > >>To: user@hbase.apache.org; Christian Schäfer <syrious3...@yahoo.de>
> > >>Sent: Friday, 3 August 2012, 0:57
> > >>
> > >>Subject: Re: How to query by rowKey-infix
> > >>
> > >>
> > >>Hi Christian!
> > >>If we put secondary indexes aside and assume you are going with
> > >>"heavy scans", you can try the following two things to make it much
> > >>faster, if this is appropriate to your situation, of course.
> > >>
> > >>1.
> > >>
> > >>> Is there a more elegant way to collect rows within time range X?
> > >>> (Unfortunately, the date attribute is not equal to the timestamp that
> > is stored by hbase automatically.)
> > >>
> > >>Can you set the timestamp of the Puts to the one you have in the row
> > >>key, instead of relying on the one HBase sets automatically (the
> > >>current ts)? If you can, this will improve reading speed a lot by
> > >>setting a time range on the scanner. It depends on how you are writing
> > >>your data, of course, but I assume that you mostly write data in a
> > >>"time-increasing" manner.
> > >>
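A minimal sketch of point 1 above (table, family, and qualifier names are
hypothetical; this assumes the 0.92-era client API, where Put.add takes an
explicit timestamp):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampedPut {
    public static void main(String[] args) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "events");

        long eventTime = 1343980800000L; // the dateInMillis from the row key
        byte[] rowKey = Bytes.toBytes("aaaaa_" + eventTime + "_session42");

        // Use the event time, not the default current time, as the cell
        // timestamp, so that Scan.setTimeRange can skip irrelevant files.
        Put put = new Put(rowKey);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("payload"),
                eventTime, Bytes.toBytes("..."));
        table.put(put);
        table.close();
    }
}
```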
> > >>
> > >>2.
> > >>
> > >>If your userId has a fixed length, or you can change it so that it
> > >>has a fixed length, then you can actually use something like a
> > >>"wildcard" in the row key. There's a way in a Filter implementation to
> > >>fast-forward to the record with a specific row key and thereby skip
> > >>many records. This might be used as follows:
> > >>* suppose your userId is 5 characters in length
> > >>* suppose you are scanning for records with time between 2012-08-01
> > >>and 2012-08-08
> > >>* when you are scanning records and you encounter e.g. key
> > >>"aaaaa_2012-08-09_3jh345j345kjh", where "aaaaa" is the user id, you
> > >>can tell the scanner from your filter to fast-forward to key
> > >>"aaaab_2012-08-01", because you know that all remaining records of
> > >>user "aaaaa" don't fall into the interval you need (as the time for
> > >>its records will be >= 2012-08-09).
> > >>
> > >>As of now, I believe you will have to implement your own custom filter
> > >>to do that. Pointer:
> > >>org.apache.hadoop.hbase.filter.Filter.ReturnCode.SEEK_NEXT_USING_HINT
> > >>I believe I implemented a similar thing some time ago. If this idea
> > >>works for you, I could look for the implementation and share it if it
> > >>helps. Or maybe even simply add it to the HBase codebase.
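The hint computation described above can be sketched in plain Java (the
helper name and key layout are illustrative only; in a real filter this
hint key would be returned alongside ReturnCode.SEEK_NEXT_USING_HINT):

```java
// Illustrative helper (not from the thread): given a row key of the form
// userId_date_sessionId with a fixed-length 5-char userId, compute the
// fast-forward hint: the next userId with the scan's start date appended.
public class FuzzyHint {
    static String nextHintKey(String currentKey, String startDate) {
        String userId = currentKey.substring(0, 5);
        // Increment the last character of the userId: "aaaaa" -> "aaaab".
        // (A real implementation must also handle carry-over, e.g. "aaaaz".)
        char[] next = userId.toCharArray();
        next[next.length - 1]++;
        return new String(next) + "_" + startDate;
    }

    public static void main(String[] args) {
        System.out.println(
            nextHintKey("aaaaa_2012-08-09_3jh345j345kjh", "2012-08-01"));
        // prints: aaaab_2012-08-01
    }
}
```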
> > >>
> > >>Hope this helps,
> > >>
> > >>
> > >>Alex Baranau
> > >>------
> > >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> ElasticSearch
> > - Solr
> > >>
> > >>
> > >>
> > >>On Thu, Aug 2, 2012 at 8:40 AM, Christian Schäfer <
> syrious3...@yahoo.de>
> > wrote:
> > >>
> > >>
> > >>>
> > >>>Excuse my double posting.
> > >>>Here is the complete mail:
> > >>>
> > >>>
> > >>>
> > >>>OK,
> > >>>
> > >>>at first I will try the scans.
> > >>>
> > >>>If that's too slow, I will have to upgrade HBase (currently
> > >>>0.90.4-cdh3u2) to be able to use coprocessors.
> > >>>
> > >>>
> > >>>Currently I'm stuck on the scans because they require two steps
> > >>>(therefore maybe some kind of filter chaining is required).
> > >>>
> > >>>
> > >>>The key:  userId-dateInMillis-sessionId
> > >>>
> > >>>
> > >>>First, I need to extract dateInMillis with a regex or substring
> > >>>(using special delimiters for the date).
> > >>>
> > >>>Second, the extracted value must be parsed to a Long and set on a
> > >>>RowFilter comparator like this:
> > >>>
> > >>>scan.setFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new
> > BinaryComparator(Bytes.toBytes((Long)dateInMillis))));
> > >>>
> > >>>How can I chain these?
> > >>>Do I have to write a custom filter?
> > >>>(I would like to avoid that due to deployment.)
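For what it's worth, HBase's general mechanism for chaining filters is
FilterList, although, as the thread goes on to note, comparing only the
extracted date portion of the key would still need a custom filter. A
sketch against the 0.90-era API (the regex pattern and comparator values
are placeholders, not a working solution to the infix problem):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ChainedFilters {
    static Scan buildScan() {
        // MUST_PASS_ALL ANDs the chained filters together;
        // MUST_PASS_ONE would OR them.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new RowFilter(CompareOp.EQUAL,
                new RegexStringComparator("^[^-]+-134\\d+-.*"))); // placeholder
        filters.addFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL,
                new BinaryComparator(Bytes.toBytes("someUser")))); // placeholder

        Scan scan = new Scan();
        scan.setFilter(filters);
        return scan;
    }
}
```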
> > >>>
> > >>>regards
> > >>>Chris
> > >>>
> > >>>
> > >>>----- Original Message -----
> > >>>From: Michael Segel <michael_se...@hotmail.com>
> > >>>To: user@hbase.apache.org
> > >>>CC:
> > >>>Sent: Wednesday, 1 August 2012, 13:52
> > >>>Subject: Re: How to query by rowKey-infix
> > >>>
> > >>>Actually, with coprocessors you can create a secondary index in
> > >>>short order.
> > >>>Then your cost is going to be two fetches. Trying to do a partial
> > >>>table scan will be more expensive.
> > >>>
> > >>>On Jul 31, 2012, at 12:41 PM, Matt Corgan <mcor...@hotpads.com>
> wrote:
> > >>>
> > >>>> When deciding between a table scan vs. a secondary index, you
> > >>>> should try to estimate what percent of the underlying data blocks
> > >>>> will be used in the query.  By default, each block is 64KB.
> > >>>>
> > >>>> If each user's data is small and you are fitting multiple users per
> > >>>> block, then you're going to need all the blocks, so a table scan is
> > >>>> better because it's simpler.  If each user has 1MB+ data then you
> > >>>> will want to pick out the individual blocks relevant to each date.
> > >>>> The secondary index will help you go directly to those sparse
> > >>>> blocks, but with a cost in complexity, consistency, and extra
> > >>>> denormalized data that knocks primary data out of your block cache.
> > >>>>
> > >>>> If latency is not a concern, I would start with the table scan.  If
> > >>>> that's too slow you add the secondary index, and if you still need
> > >>>> it faster you do the primary key lookups in parallel as Jerry
> > >>>> mentions.
> > >>>>
> > >>>> Matt
> > >>>>
> > >>>> On Tue, Jul 31, 2012 at 10:10 AM, Jerry Lam <chiling...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi Chris:
> > >>>>>
> > >>>>> I'm thinking about building a secondary index for primary key
> > lookup, then
> > >>>>> query using the primary keys in parallel.
> > >>>>>
> > >>>>> I'm interested to see if there is other option too.
> > >>>>>
> > >>>>> Best Regards,
> > >>>>>
> > >>>>> Jerry
> > >>>>>
> > >>>>> On Tue, Jul 31, 2012 at 11:27 AM, Christian Schäfer <
> > syrious3...@yahoo.de
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Hello there,
> > >>>>>>
> > >>>>>> I designed a row key for queries that need best performance (~100
> > ms)
> > >>>>>> which looks like this:
> > >>>>>>
> > >>>>>> userId-date-sessionId
> > >>>>>>
> > >>>>>> These queries (scans) are always based on a userId and sometimes
> > >>>>>> additionally on a date, too.
> > >>>>>> That's no problem with the key above.
> > >>>>>>
> > >>>>>> However, another kind of query shall be based on a given time
> > >>>>>> range, where the leftmost userId is not given or known.
> > >>>>>> In this case I need to get all rows covering the given time
> > >>>>>> range, with their dates, to create a daily report.
> > >>>>>>
> > >>>>>> As I can't set wildcards at the beginning of a left-based index
> > >>>>>> for the scan, I only see the possibility of scanning the index of
> > >>>>>> the whole table to collect the row keys that are inside the time
> > >>>>>> range I'm interested in.
> > >>>>>>
> > >>>>>> Is there a more elegant way to collect rows within time range X?
> > >>>>>> (Unfortunately, the date attribute is not equal to the timestamp
> > >>>>>> that is stored by HBase automatically.)
> > >>>>>>
> > >>>>>> Could/should one maybe leverage some kind of row key caching to
> > >>>>>> accelerate the collection process?
> > >>>>>> Is that covered by the block cache?
> > >>>>>>
> > >>>>>> Thanks in advance for any advice.
> > >>>>>>
> > >>>>>> regards
> > >>>>>> Chris
> > >>>>>>
> > >>>>>
> > >>>
> > >>
> > >>
> > >>--
> > >>
> > >>Alex Baranau
> > >>------
> > >>Sematext :: http://blog.sematext.com/ :: Hadoop - HBase -
> ElasticSearch
> > - Solr
> > >>
> > >
> > >
> > >
> > >--
> > >
> > >Alex Baranau
> > >------
> > >Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch
> > - Solr
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>
>


-- 
Thanks & Regards,
Anil Gupta
