Have a look at FuzzyRowFilter.

-Anoop-
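To make the suggestion concrete, here is a minimal sketch of the matching rule FuzzyRowFilter is built around: a template key plus a mask in which 0 marks a byte that must match and 1 marks a byte that may be anything. It is plain Java with no HBase dependency, so it only simulates the rule. The fixed 4-character widths for vid and sid, the class name and the helper names are assumptions made for illustration; they do not come from the thread. FuzzyRowFilter works on byte positions, so a layout like vid,sid,event would need fixed-width vid and sid fields for this approach to apply.

```java
import java.nio.charset.StandardCharsets;

/**
 * Simplified stand-in for the fuzzy row key matching rule.
 * Assumed key layout (hypothetical): 4-char vid, ',', 4-char sid, ',', event.
 */
class FuzzyKeySketch {
    static final int VID_LEN = 4; // assumed fixed width of vid
    static final int SID_LEN = 4; // assumed fixed width of sid

    /** Template key: known vid and event, placeholder bytes where sid goes. */
    static byte[] fuzzyKey(String vid, String event) {
        StringBuilder sb = new StringBuilder(vid).append(',');
        for (int i = 0; i < SID_LEN; i++) {
            sb.append('?'); // placeholder; ignored because the mask is 1 here
        }
        sb.append(',').append(event);
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Mask: 0 = byte must equal the template, 1 = byte may be anything. */
    static byte[] fuzzyMask(int keyLength) {
        byte[] mask = new byte[keyLength]; // all zeros: every byte fixed
        for (int i = 0; i < SID_LEN; i++) {
            mask[VID_LEN + 1 + i] = 1;     // sid bytes are not fixed
        }
        return mask;
    }

    /** True if every fixed position of the template matches the row key. */
    static boolean matches(byte[] row, byte[] key, byte[] mask) {
        if (row.length < key.length) {
            return false; // simplification chosen for this sketch
        }
        for (int i = 0; i < key.length; i++) {
            if (mask[i] == 0 && row[i] != key[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] key = fuzzyKey("v001", "Logon");
        byte[] mask = fuzzyMask(key.length);
        String[] rows = {"v001,s123,Logon", "v001,s123,Logoff", "v002,s123,Logon"};
        for (String r : rows) {
            byte[] row = r.getBytes(StandardCharsets.UTF_8);
            System.out.println(r + " -> " + matches(row, key, mask));
        }
    }
}
```

With HBase on the classpath, the same key and mask pair would be handed to the real filter, which as far as I know takes a list of key/mask pairs: `new FuzzyRowFilter(Arrays.asList(new Pair<byte[], byte[]>(key, mask)))`. The practical difference from a RowFilter with a regex is that the fuzzy filter can hint the scanner to seek ahead to the next possible match instead of testing every row, which is what closes the gap left by the unknown sid.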
On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <tony.d...@sas.com> wrote:

> I understand more, but have additional questions about the internals...
>
> So, in this example I have 6000 rows X 40 columns in this table. In this
> test my startRow and stopRow do not narrow the scan criteria, therefore all
> 6000x40 KVs must be included in the search and thus read from disk and into
> memory.
>
> The first filter that I used was:
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
>
> This means that HBase must look for the qualifier column on all 6000 rows.
> As you mention I could add certain columns to a different cf; but
> unfortunately, in my case there is no such small set of columns that will
> need to be compared (filtered on). I could try to use indexes so that a
> complete row key can be calculated from a secondary index in order to
> perform a faster search against data in a primary table. This requires
> additional tables and maintenance that I would like to avoid.
>
> I did try a row key filter with regex hoping that it would limit the
> number of rows that were read from disk.
> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL,
>     new RegexStringComparator(row_regexpr));
>
> My row keys are something like: vid,sid,event. sid is not known at query
> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
> that I am looking for in a particular visit. In my test data this should
> have narrowed the scan to 1 row X 40 columns. The best I could do for
> start/stop row is: vid,0 and vid,~ respectively. I guess that is still
> going to cause all 6000 rows to be scanned, but the filtering should be
> more specific with the rowKey filter. However, I did not see any
> performance improvement. Anything obvious?
>
> Do you have any other ideas to help out with performance when row key is:
> vid,sid,event and sid is not known at query time which leaves a gap in the
> start/stop row? Too bad regex can't be used in start/stop row
> specification. That's really what I need.
>
> Thanks again.
> -Tony
>
> -----Original Message-----
> From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
> Sent: Friday, June 21, 2013 8:00 PM
> To: user@hbase.apache.org; lars hofhansl
> Subject: RE: Scan performance
>
> Lars,
> I thought that column family is the locality group and placing columns
> which are frequently accessed together into the same column family
> (locality group) is the obvious performance improvement tip. What are the
> "essential column families" for in this context?
>
> As for the original question: unless you place your column into a separate
> column family in Table 2, you will need to scan (load from disk if not
> cached) ~ 40x more data for the second table (because you have 40 columns).
> This may explain why you see such a difference in execution time if all
> data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [la...@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV; a
> row is just a set of KVs that happen to have the same row key (which is
> the first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row. So more disk
> and memory IO is needed.
>
> If you are using 0.94 there is a new feature, "essential column families".
> If you always search by the same column you can place that one in its own
> column family and all other columns in another column family. In that case
> your scan performance should be close to identical.
>
> -- Lars
> ________________________________
>
> From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on. And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> // the start/stop rows should include all 6000 rows
> Scan scan = new Scan(startRow, stopRow);
> // only return the column that I am interested in
> // (should only be in 1 row and only 1 version)
> scan.addColumn(cf, qualifier);
>
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result: rs)
> {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result. Why is it taking so much longer
> on table 2 to get the same result? The scan depth is the same. The only
> difference is the column width. But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something? As I increase the number of columns, the response
> time gets worse. I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704 ...
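The cost argument made by Lars and Vladimir in the quoted thread can be put into numbers with a toy model. The sketch below is not HBase code: it only counts how many KVs a sequential scan would step over under the thread's figures (6000 rows, 40 columns, 1 matching row). The class name, method name and parameters are hypothetical, and the model ignores block caching, seeking and everything else a real region server does.

```java
/**
 * Toy model of scan cost in KVs read. Each row has some columns in the
 * family that holds the filtered column and some in a second family.
 */
class KvScanCostSketch {
    /**
     * @param rows             number of rows inside the scan range
     * @param filterFamilyCols columns per row in the filtered column's family
     * @param otherFamilyCols  columns per row in the other family
     * @param matchingRows     rows that pass the filter (assumed to come first)
     * @param loadOnDemand     if true, the other family is read only for rows
     *                         that pass the filter ("essential column families")
     */
    static long cellsRead(int rows, int filterFamilyCols, int otherFamilyCols,
                          int matchingRows, boolean loadOnDemand) {
        long read = 0;
        for (int r = 0; r < rows; r++) {
            read += filterFamilyCols;       // always read to evaluate the filter
            boolean passes = r < matchingRows;
            if (!loadOnDemand || passes) {
                read += otherFamilyCols;    // skipped for non-matching rows on demand
            }
        }
        return read;
    }

    public static void main(String[] args) {
        System.out.println("Table 1, 1 column:            "
                + cellsRead(6000, 1, 0, 1, false));
        System.out.println("Table 2, 40 cols, one family: "
                + cellsRead(6000, 40, 0, 1, false));
        System.out.println("Table 2, filter col split:    "
                + cellsRead(6000, 1, 39, 1, true));
    }
}
```

Under these assumptions the single-family layout steps over 6000 x 40 = 240,000 KVs against 6000 for Table 1, which is the "~40x more data" Vladimir mentions, while splitting the filtered column into its own family brings the count back to roughly Table 1's. The measured gap in the thread (30ms vs 180ms) is smaller than 40x, so the model shows the direction of the effect, not its size. In 0.94 the on-demand behaviour is, I believe, switched on per scan with `scan.setLoadColumnFamiliesOnDemand(true)`; please verify against the documentation for your version.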