Tony: Take a look at http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
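For readers who want the idea without the cluster: FuzzyRowFilter takes a row-key pattern plus a mask marking which positions are fixed, and it can tell the scanner which key to seek to next instead of having every row read and rejected. Below is a rough plain-Java sketch of that idea, not the HBase class itself. It assumes fixed-width key fields (the comma-delimited keys in this thread may not be fixed-width), and the class name, method names, and demo data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java sketch of fuzzy row-key scanning; illustrative only, not HBase code. */
final class FuzzyScanSketch {

    /** mask.charAt(i) == '0': position i is fixed; '1': any character is accepted. */
    static boolean matches(String row, String pattern, String mask) {
        if (row.length() != pattern.length()) return false;
        for (int i = 0; i < pattern.length(); i++) {
            if (mask.charAt(i) == '0' && row.charAt(i) != pattern.charAt(i)) return false;
        }
        return true;
    }

    /** Smallest key greater than row that can still match, or null if none exists.
     *  Assumes row, pattern and mask all have the same length. */
    static String nextHint(String row, String pattern, String mask) {
        char[] out = row.toCharArray();
        int bump = -1;      // rightmost fuzzy position seen so far that can be incremented
        int tailFrom = -1;  // index from which the key is reset to its smallest matching form
        for (int i = 0; i < out.length; i++) {
            if (mask.charAt(i) == '1') {
                if (out[i] < Character.MAX_VALUE) bump = i;
            } else if (out[i] < pattern.charAt(i)) {
                tailFrom = i;  // raising this fixed position to the pattern moves forward
                break;
            } else if (out[i] > pattern.charAt(i)) {
                break;         // overshoot: an earlier fuzzy position must be incremented
            }
        }
        if (tailFrom < 0) {
            if (bump < 0) return null;
            out[bump]++;
            tailFrom = bump + 1;
        }
        for (int i = tailFrom; i < out.length; i++) {
            out[i] = mask.charAt(i) == '0' ? pattern.charAt(i) : Character.MIN_VALUE;
        }
        return new String(out);
    }

    /** Returns matching rows; examined[0] receives the number of rows looked at. */
    static List<String> scan(TreeSet<String> table, String pattern, String mask,
                             boolean useHints, int[] examined) {
        List<String> hits = new ArrayList<>();
        String cur = table.isEmpty() ? null : table.first();
        while (cur != null) {
            examined[0]++;
            if (matches(cur, pattern, mask)) {
                hits.add(cur);
                cur = table.higher(cur);
            } else if (useHints) {
                String hint = nextHint(cur, pattern, mask);
                cur = hint == null ? null : table.ceiling(hint);  // seek, skipping rows
            } else {
                cur = table.higher(cur);                          // plain row-by-row scan
            }
        }
        return hits;
    }

    /** Hypothetical data: 100 sessions with 5 fixed-width event names each. */
    static TreeSet<String> demoTable() {
        TreeSet<String> table = new TreeSet<>();
        String[] events = {"Abort", "Close", "Logon", "Query", "Write"};
        for (int sid = 0; sid < 100; sid++) {
            for (String e : events) {
                table.add(String.format("vid,s%03d,%s", sid, e));
            }
        }
        return table;
    }

    public static void main(String[] args) {
        TreeSet<String> table = demoTable();
        String pattern = "vid,s000,Logon";   // characters under a '1' in the mask are ignored
        String mask    = "00001111000000";
        int[] full = {0};
        int[] hinted = {0};
        List<String> a = scan(table, pattern, mask, false, full);
        List<String> b = scan(table, pattern, mask, true, hinted);
        System.out.println("same rows: " + a.equals(b));
        System.out.println("rows examined without hints: " + full[0]);
        System.out.println("rows examined with hints:    " + hinted[0]);
    }
}
```

How much the seeking saves depends on how many non-matching rows sit between one candidate key and the next, so the gain varies with the data.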
Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <tony.d...@sas.com> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> Row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created 6000 rows with ~40 columns in each row. The table
> contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon event:customer value=?
>
> -------
>
> Case 1:
> Use a fully qualified row specification in the start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> The avg response time to get the Scan iterator and iterate the single
> result is ~5ms. This is expected.
>
> Case 2:
> This is the normal case, where position2 in the row key is unknown at the
> time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to get
> the Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case 2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter. It's a good candidate
> since I know position1 and position3.
> The avg response time to get the Scan iterator and iterate the single
> result was ~5ms (pretty much the same response time as Case 1, where I
> knew the complete row key).
>
> I didn't expect such an improvement. Can you explain how FuzzyRowFilter
> optimizes scanning rows from disk? In my case it needs to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon". Then it can just stop
> after that, thereby optimizing the scan, correct? So, optimization using
> FuzzyRowFilter is very dependent upon the data that you are scanning.
>
> Thanks for any insight.
>
> -----Original Message-----
> From: lars hofhansl [mailto:la...@apache.org]
> Sent: Monday, June 24, 2013 5:05 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block size
> (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information for
> all tables could be substantial, so it does not make much sense to prime
> the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or
> HConnection you should be OK.
>
> -- Lars
>
> ________________________________
> From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
> Lars,
> I'm waiting for some time to swap out the HBase jars in the cluster (for
> ones that support FuzzyRowFilter) in order to try it out. In the meantime,
> I'm wondering why a RowFilter regex is not more helpful. I'm guessing that
> FuzzyRowFilter helps with disk IO while RowFilter just filters after the
> disk IO has completed. Also, I turned on the row-level bloom filter, which
> does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to prime
> client connection information and such so that the first client query isn't
> miserably slow. After the first query, response times do get considerably
> better because the necessary information has been cached. Is there a way to
> get around this initial hit? I assume any such priming would have to be
> application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:la...@apache.org]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but want to
> return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
>
> ________________________________
> From: Vladimir Rodionov <vrodio...@carrieriq.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <la...@apache.org>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
> Lars,
> I thought that a column family is the locality group, and that placing
> columns which are frequently accessed together into the same column family
> (locality group) is the obvious performance improvement tip. What are the
> "essential column families" for in this context?
>
> As for the original question: unless you place your column into a separate
> column family in Table 2, you will need to scan (load from disk if not
> cached) ~40x more data for the second table (because you have 40 columns).
> This may explain why you see such a difference in execution time if all
> data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [la...@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV; a
> row is just a set of KVs that happen to have the same row key (which is the
> first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row.
> So more disk and memory IO is needed.
>
> If you are using 0.94 there is a new feature, "essential column families".
> If you always search by the same column you can place that one in its own
> column family and all other columns in another column family. In that case
> your scan performance should be close to identical.
>
> -- Lars
>
> ________________________________
> From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on. And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> // the start/stop rows should include all 6000 rows
> Scan scan = new Scan(startRow, stopRow);
> // only return the column that I am interested in
> // (should only be in 1 row and only 1 version)
> scan.addColumn(cf, qualifier);
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs)
> {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result. Why is it taking so much longer
> on table 2 to get the same result? The scan depth is the same. The only
> difference is the column width. But I’m filtering on a single column and
> returning only that column.
>
> Am I missing something?
> As I increase the number of columns, the response time gets worse. I do
> expect the response time to get worse when increasing the number of rows,
> but not when increasing the number of columns, since I’m returning only
> 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704 …
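A rough way to picture the explanation given earlier in this thread (each column is its own KV, and the KVs of a store are sorted in row/column order) is to count how many KVs a scan walks past to pick one column out of every row. The plain-Java toy model below does only that. It is not HBase code, the class and method names are hypothetical, and it deliberately walks every KV, whereas a real scanner works on HFile blocks and may seek.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a sorted store of KVs; illustrative only, not HBase code. */
final class KvScanSketch {

    /** Builds one store with a KV per (row, column), generated in sorted order. */
    static List<String> buildStore(int rows, int columnsPerRow) {
        List<String> kvs = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                kvs.add(String.format("row%05d/col%02d", r, c));
            }
        }
        return kvs;
    }

    /** Walks every KV in order and returns how many were passed to collect one column. */
    static int kvsTouched(List<String> store, String qualifier, List<String> found) {
        int touched = 0;
        for (String kv : store) {
            touched++;
            if (kv.endsWith("/" + qualifier)) {
                found.add(kv);
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> narrowHits = new ArrayList<>();
        List<String> wideHits = new ArrayList<>();
        int narrow = kvsTouched(buildStore(6000, 1), "col00", narrowHits);
        int wide = kvsTouched(buildStore(6000, 40), "col00", wideHits);
        System.out.println("same cells returned: " + narrowHits.equals(wideHits));
        System.out.println("KVs touched, 1 column per row:   " + narrow);
        System.out.println("KVs touched, 40 columns per row: " + wide);
    }
}
```

In this model, moving the filtered column into its own column family makes its store look like the 1-column case, which is the layout the "essential column families" suggestion relies on.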