Hi Tony,

Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? It has a skip scan that will let you model a multi-part row key and skip through it efficiently, as you've described. Take a look at this blog post for more info: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1

Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <la...@apache.org> wrote:

> Yep, generally you should design your keys such that start/stopKey can
> efficiently narrow the scope.
>
> If that really cannot be done (and you should try hard), the second-best
> option is "skip scans".
>
> Filters in HBase allow for providing the scanner framework with hints about
> where to go next.
> They can skip to the next column (to avoid looking at many versions), to the
> next row (to avoid looking at many columns), or they can provide a custom
> seek hint to a specific key value. The latter is what FuzzyRowFilter does.
>
> -- Lars
>
> ________________________________
> From: Anoop John <anoop.hb...@gmail.com>
> To: user@hbase.apache.org
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
>
> Have a look at FuzzyRowFilter
>
> -Anoop-
>
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <tony.d...@sas.com> wrote:
>
>> I understand more, but have additional questions about the internals...
>>
>> So, in this example I have 6000 rows X 40 columns in this table. In this
>> test my startRow and stopRow do not narrow the scan criteria, therefore all
>> 6000x40 KVs must be included in the search and thus read from disk and into
>> memory.
>>
>> The first filter that I used was:
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>>
>> This means that HBase must look for the qualifier column on all 6000 rows.
>> As you mention, I could add certain columns to a different cf; but
>> unfortunately, in my case there is no such small set of columns that will
>> need to be compared (filtered on). I could try to use indexes so that a
>> complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table. This requires
>> additional tables and maintenance that I would like to avoid.
>>
>> I did try a row key filter with regex, hoping that it would limit the
>> number of rows that were read from disk.
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL,
>>     new RegexStringComparator(row_regexpr));
>>
>> My row keys are something like: vid,sid,event. sid is not known at query
>> time, so I can use a regex similar to: vid,.*,Logon where Logon is the event
>> that I am looking for in a particular visit. In my test data this should
>> have narrowed the scan to 1 row X 40 columns. The best I could do for
>> start/stop row is: vid,0 and vid,~ respectively. I guess that is still
>> going to cause all 6000 rows to be scanned, but the filtering should be
>> more specific with the rowKey filter. However, I did not see any
>> performance improvement. Anything obvious?
>>
>> Do you have any other ideas to help out with performance when the row key
>> is vid,sid,event and sid is not known at query time, which leaves a gap in
>> the start/stop row? Too bad regex can't be used in the start/stop row
>> specification. That's really what I need.
>>
>> Thanks again.
>> -Tony
>>
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: user@hbase.apache.org; lars hofhansl
>> Subject: RE: Scan performance
>>
>> Lars,
>> I thought that a column family is the locality group, and that placing
>> columns which are frequently accessed together into the same column family
>> (locality group) is the obvious performance improvement tip. What are the
>> "essential column families" for in this context?
>>
>> As for the original question: unless you place your column into a separate
>> column family in Table 2, you will need to scan (load from disk if not
>> cached) ~40x more data for the second table (because you have 40 columns).
>> This may explain why you see such a difference in execution time if all
>> data needs to be loaded first from HDFS.
>>
>> Best regards,
>> Vladimir Rodionov
>> Principal Platform Engineer
>> Carrier IQ, www.carrieriq.com
>> e-mail: vrodio...@carrieriq.com
>>
>> ________________________________________
>> From: lars hofhansl [la...@apache.org]
>> Sent: Friday, June 21, 2013 3:37 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan performance
>>
>> HBase is a key value (KV) store. Each column is stored in its own KV; a
>> row is just a set of KVs that happen to have the same row key (which is the
>> first part of the key).
>> I tried to summarize this here:
>> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>>
>> In the StoreFiles all KVs are sorted in row/column order, but HBase still
>> needs to skip over many KVs in order to "reach" the next row. So more disk
>> and memory IO is needed.
>>
>> If you are using 0.94 there is a new feature, "essential column families".
>> If you always search by the same column you can place that one in its own
>> column family and all other columns in another column family. In that case
>> your scan performance should be close to identical.
>>
>> -- Lars
>> ________________________________
>>
>> From: Tony Dean <tony.d...@sas.com>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Friday, June 21, 2013 2:08 PM
>> Subject: Scan performance
>>
>> Hi,
>>
>> I hope that you can shed some light on these 2 scenarios below.
>>
>> I have 2 small tables of 6000 rows.
>> Table 1 has only 1 column in each of its rows.
>> Table 2 has 40 columns in each of its rows.
>> Other than that the two tables are identical.
>>
>> In both tables there is only 1 row that contains a matching column that I
>> am filtering on. And the Scan performs correctly in both cases by
>> returning only the single result.
>>
>> The code looks something like the following:
>>
>> // the start/stop rows should include all 6000 rows
>> Scan scan = new Scan(startRow, stopRow);
>> // only return the column that I am interested in
>> // (should only be in 1 row and only 1 version)
>> scan.addColumn(cf, qualifier);
>>
>> Filter f1 = new InclusiveStopFilter(stopRow);
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>> scan.setFilter(new FilterList(f1, f2));
>>
>> scan.setTimeRange(0, Long.MAX_VALUE);
>> scan.setMaxVersions(1);
>>
>> ResultScanner rs = t.getScanner(scan);
>> for (Result result : rs)
>> {
>>
>> }
>>
>> For table 1, rs.next() takes about 30ms.
>> For table 2, rs.next() takes about 180ms.
>>
>> Both are returning the exact same result. Why is it taking so much longer
>> on table 2 to get the same result? The scan depth is the same. The only
>> difference is the column width. But I'm filtering on a single column and
>> returning only that column.
>>
>> Am I missing something? As I increase the number of columns, the response
>> time gets worse. I do expect the response time to get worse when
>> increasing the number of rows, but not by increasing the number of columns
>> since I'm returning only 1 column in both cases.
>>
>> I appreciate any comments that you have.
>>
>> -Tony
>>
>> Tony Dean
>> SAS Institute Inc.
>> Principal Software Developer
>> 919-531-6704 ...
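
For readers following the thread, the seek-hint idea that Lars describes (and that FuzzyRowFilter and the Phoenix skip scan build on) can be illustrated with a toy model. The sketch below does not use the HBase API at all: it stands a sorted in-memory TreeSet in for the sorted row keys, and the names SkipScanSketch, skipScan and ScanResult, as well as the sample keys, are hypothetical and chosen only for illustration. It assumes well-formed keys of the form vid,sid,event whose parts contain no commas. For a known vid and event but unknown sid, it jumps within each sid straight to the wanted event, then jumps past the rest of that sid, instead of visiting every key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model of a "skip scan" over sorted row keys. NOT HBase code: a TreeSet
// stands in for the sorted key space, and all names here are hypothetical.
final class SkipScanSketch {

    /** Matching keys plus the number of keys the scan actually looked at. */
    static final class ScanResult {
        final List<String> matches = new ArrayList<>();
        int examined = 0;
    }

    /**
     * Finds all keys "vid,<any sid>,event" for a known vid and event.
     * Assumes every key has the form "vid,sid,event" with comma-free parts.
     */
    static ScanResult skipScan(TreeSet<String> sortedKeys, String vid, String event) {
        ScanResult result = new ScanResult();
        String vidPrefix = vid + ",";
        // Start at the first key belonging to this vid (like a start row).
        String current = sortedKeys.ceiling(vidPrefix);
        while (current != null && current.startsWith(vidPrefix)) {
            result.examined++;
            String sid = current.split(",", 3)[1];
            String sidPrefix = vidPrefix + sid + ",";
            String target = sidPrefix + event;
            int cmp = current.compareTo(target);
            if (cmp < 0) {
                // "Seek hint": jump forward to the wanted event within this sid.
                current = sortedKeys.ceiling(target);
            } else {
                if (cmp == 0) {
                    result.matches.add(current);
                }
                // "Seek hint": nothing more to find in this sid, so jump past it.
                current = sortedKeys.higher(sidPrefix + Character.MAX_VALUE);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList(
                "v1,s1,Click", "v1,s1,Logon", "v1,s1,View",
                "v1,s2,Click", "v1,s2,View",
                "v1,s3,Click", "v1,s3,Logon", "v1,s3,Zoom",
                "v2,s1,Logon"));
        ScanResult r = skipScan(keys, "v1", "Logon");
        System.out.println("matches:  " + r.matches);
        System.out.println("examined: " + r.examined + " of " + keys.size() + " keys");
    }
}
```

With only a handful of events per sid the saving is small, but the number of keys examined grows with the number of distinct sids rather than with the total number of keys under the vid, which is the point of the technique. In a real table the jumps would be seeks issued by the scanner based on a filter's hint, not TreeSet lookups.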