Re: Scan performance

James Taylor Sat, 22 Jun 2013 10:19:15 -0700

Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL 
skin over HBase? It has a skip scan that lets you model a multi-part row 
key and skip through it efficiently, as you've described. Take a look at this 
blog post for more info: 
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1
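
To make the skip-scan idea concrete, here is a minimal sketch in plain Java. It is not Phoenix or HBase code: a TreeSet stands in for the sorted row keys, the class and method names are made up for illustration, and the keys are assumed to be well-formed ASCII strings of the form vid,sid,event with no commas inside a field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy model of a skip scan over sorted "vid,sid,event" keys; not Phoenix or HBase code. */
final class SkipScanSketch {

    /** Finds every key "vid,<any sid>,event" by seeking instead of stepping row by row. */
    static List<String> find(TreeSet<String> sortedKeys, String vid, String event) {
        List<String> hits = new ArrayList<>();
        String prefix = vid + ",";
        // Seek to the first key of this vid.
        String cursor = sortedKeys.ceiling(prefix);
        while (cursor != null && cursor.startsWith(prefix)) {
            // Assumption: well-formed keys with exactly three comma-separated fields.
            int sidEnd = cursor.indexOf(',', prefix.length());
            String sidPrefix = cursor.substring(0, sidEnd); // "vid,sid"
            // Point lookup for the wanted event inside this sid.
            String target = sidPrefix + "," + event;
            if (sortedKeys.contains(target)) {
                hits.add(target);
            }
            // Skip the rest of this sid: '-' is the ASCII character right after ','.
            cursor = sortedKeys.ceiling(sidPrefix + "-");
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList(
                "v0,s9,Logon", "v1,s1,Logoff", "v1,s1,Logon", "v1,s1,View",
                "v1,s2,View", "v1,s3,Logon", "v2,s1,Logon"));
        System.out.println(find(keys, "v1", "Logon"));
    }
}
```

The loop does a constant number of seeks per distinct sid, so the cost follows the number of sids under a vid rather than the number of rows.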


Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <la...@apache.org> wrote:

> Yep, generally you should design your keys such that start/stopKey can 
> efficiently narrow the scope.
> 
> If that really cannot be done (and you should try hard), the second-best option 
> is "skip scans".
> 
> Filters in HBase allow for providing the scanner framework with hints where 
> to go next.
> They can skip to the next column (to avoid looking at many versions), to the 
> next row (to avoid looking at many columns), or they can provide a custom 
> seek hint to a specific key value. The latter is what FuzzyRowFilter does.
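
As a rough illustration of the fuzzy-matching idea, here is a small plain-Java sketch. It is a toy model and not HBase's FuzzyRowFilter implementation: the class, the '?' wildcard syntax, and the fixed-width key layout are assumptions made for the example. The mask convention (0 means the byte is fixed, 1 means it may be anything) follows my understanding of the FuzzyRowFilter documentation, so check it against the version you run. The sketch only decides whether a key matches; the real filter additionally tells the scanner which key to seek to next.

```java
import java.nio.charset.StandardCharsets;

/** Toy model of per-byte fuzzy row-key matching; not HBase's FuzzyRowFilter code. */
final class FuzzyMatchSketch {

    /** Builds a mask from a pattern: 1 where the pattern has '?', 0 where the byte is fixed. */
    static byte[] maskFor(String pattern) {
        byte[] mask = new byte[pattern.length()];
        for (int i = 0; i < mask.length; i++) {
            mask[i] = (byte) (pattern.charAt(i) == '?' ? 1 : 0);
        }
        return mask;
    }

    /** True if rowKey equals pattern at every fixed position; keys must have the same length. */
    static boolean matches(String rowKey, String pattern) {
        byte[] key = rowKey.getBytes(StandardCharsets.US_ASCII);
        byte[] pat = pattern.getBytes(StandardCharsets.US_ASCII);
        byte[] mask = maskFor(pattern);
        if (key.length != pat.length) {
            return false;
        }
        for (int i = 0; i < pat.length; i++) {
            if (mask[i] == 0 && key[i] != pat[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical fixed-width layout: 4-char vid, 4-char sid, then the event name.
        String pattern = "v001,????,Logon";
        System.out.println(matches("v001,s042,Logon", pattern));
        System.out.println(matches("v002,s042,Logon", pattern));
    }
}
```

Per-byte masks only work when the unknown part of the key sits at known byte offsets, which is why the sketch pads vid and sid to a fixed width.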
> 
> 
> -- Lars
> 
> 
> 
> ________________________________
> From: Anoop John <anoop.hb...@gmail.com>
> To: user@hbase.apache.org
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
> 
> 
> Have a look at FuzzyRowFilter
> 
> -Anoop-
> 
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <tony.d...@sas.com> wrote:
> 
>> I understand more, but have additional questions about the internals...
>> 
>> So, in this example I have 6000 rows X 40 columns in this table.  In this
>> test my startRow and stopRow do not narrow the scan criteria, so all
>> 6000x40 KVs must be included in the search and thus read from disk and into
>> memory.
>> 
>> The first filter that I used was:
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>> CompareFilter.CompareOp.EQUAL, value);
>> 
>> This means that HBase must look for the qualifier column on all 6000 rows.
>> As you mention I could add certain columns to a different cf; but
>> unfortunately, in my case there is no such small set of columns that will
>> need to be compared (filtered on).  I could try to use indexes so that a
>> complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table.  This requires
>> additional tables and maintenance that I would like to avoid.
>> 
>> I did try a row key filter with regex hoping that it would limit the
>> number of rows that were read from disk.
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
>> RegexStringComparator(row_regexpr));
>> 
>> My row keys are something like: vid,sid,event.  sid is not known at query
>> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
>> that I am looking for in a particular visit.  In my test data this should
>> have narrowed the scan to 1 row X 40 columns.  The best I could do for
>> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
>> going to cause all 6000 rows to be scanned, but the filtering should be
>> more specific with the rowKey filter.  However, I did not see any
>> performance improvement.  Anything obvious?
>> 
>> Do you have any other ideas to help out with performance when row key is:
>> vid,sid,event and sid is not known at query time which leaves a gap in the
>> start/stop row?  Too bad regex can't be used in start/stop row
>> specification.  That's really what I need.
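
One small thing that can be done on the start/stop side is to derive the tightest range for a known prefix instead of guessing boundary characters such as 0 and ~. The sketch below is plain Java with a made-up helper name; it computes the smallest byte string that sorts after every key beginning with the prefix, which can serve as an exclusive stop row when the prefix itself is the start row. It does not close the gap left by the unknown sid: every row under the vid is still read, and a row filter only decides what is returned afterwards.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: tightest exclusive stop row for a row-key prefix. */
final class PrefixRangeSketch {

    /**
     * Returns the smallest key that sorts after every key starting with prefix.
     * If the prefix is empty or all 0xFF bytes there is no such key, and an empty
     * array is returned (an empty stop row conventionally means "to the end").
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                // Drop trailing 0xFF bytes and increment the last byte that can grow.
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        return new byte[0];
    }

    public static void main(String[] args) {
        byte[] start = "v42,".getBytes(StandardCharsets.US_ASCII);
        byte[] stop = stopRowForPrefix(start);
        System.out.println(new String(stop, StandardCharsets.US_ASCII));
    }
}
```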
>> 
>> Thanks again.
>> -Tony
>> 
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:vrodio...@carrieriq.com]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: user@hbase.apache.org; lars hofhansl
>> Subject: RE: Scan performance
>> 
>> Lars,
>> I thought that a column family is the locality group, and that placing columns
>> that are frequently accessed together into the same column family
>> (locality group) is the obvious performance improvement tip. What are the
>> "essential column families" for in this context?
>> 
>> As for the original question: unless you place your column into a separate
>> column family in Table 2, you will need to scan (load from disk if not
>> cached) ~40x more data for the second table (because you have 40 columns).
>> This may explain why you see such a difference in execution time if all
>> data needs to be loaded first from HDFS.
>> 
>> Best regards,
>> Vladimir Rodionov
>> Principal Platform Engineer
>> Carrier IQ, www.carrieriq.com
>> e-mail: vrodio...@carrieriq.com
>> 
>> ________________________________________
>> From: lars hofhansl [la...@apache.org]
>> Sent: Friday, June 21, 2013 3:37 PM
>> To: user@hbase.apache.org
>> Subject: Re: Scan performance
>> 
>> HBase is a key value (KV) store. Each column is stored in its own KV; a
>> row is just a set of KVs that happen to have the same row key (which is the
>> first part of the key).
>> I tried to summarize this here:
>> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>> 
>> In the StoreFiles all KVs are sorted in row/column order, but HBase still
>> needs to skip over many KVs in order to "reach" the next row. So more disk
>> and memory IO is needed.
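
The effect of the KV layout on the two tables from the original question can be sketched with a toy model in plain Java. This is not how HBase is implemented: a TreeMap of "row/qualifier" strings stands in for a sorted StoreFile, the names are invented for the example, and the scan is a plain linear pass, whereas a real scanner may seek past some KVs.

```java
import java.util.TreeMap;

/** Toy model of KVs sorted in row/column order; not HBase's storage code. */
final class KvLayoutSketch {

    /** Builds rows x cols KVs keyed "rowNNNNN/qNN", sorted the way the map orders strings. */
    static TreeMap<String, String> buildTable(int rows, int cols) {
        TreeMap<String, String> kvs = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                kvs.put(String.format("row%05d/q%02d", r, c), "v");
            }
        }
        return kvs;
    }

    /** Linear pass over all KVs; returns {KVs visited, KVs belonging to the qualifier}. */
    static long[] scanForQualifier(TreeMap<String, String> kvs, String qualifier) {
        long visited = 0;
        long matched = 0;
        for (String key : kvs.keySet()) {
            visited++;
            if (key.endsWith("/" + qualifier)) {
                matched++;
            }
        }
        return new long[] {visited, matched};
    }

    public static void main(String[] args) {
        long[] narrow = scanForQualifier(buildTable(6000, 1), "q00");
        long[] wide = scanForQualifier(buildTable(6000, 40), "q00");
        System.out.println("1 column:   visited=" + narrow[0] + " matched=" + narrow[1]);
        System.out.println("40 columns: visited=" + wide[0] + " matched=" + wide[1]);
    }
}
```

Both tables yield the same number of matching KVs, but the wide table forces the pass to step over 40 times as many entries to find them.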
>> 
>> If you are using 0.94 there is a new feature, "essential column families". If
>> you always search by the same column you can place that one in its own
>> column family and all other columns in another column family. In that case
>> your scan performance should be close to identical.
>> 
>> 
>> -- Lars
>> ________________________________
>> 
>> From: Tony Dean <tony.d...@sas.com>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Friday, June 21, 2013 2:08 PM
>> Subject: Scan performance
>> 
>> 
>> 
>> 
>> Hi,
>> 
>> I hope that you can shed some light on these 2 scenarios below.
>> 
>> I have 2 small tables of 6000 rows.
>> Table 1 has only 1 column in each of its rows.
>> Table 2 has 40 columns in each of its rows.
>> Other than that the two tables are identical.
>> 
>> In both tables there is only 1 row that contains a matching column that I
>> am filtering on.   And the Scan performs correctly in both cases by
>> returning only the single result.
>> 
>> The code looks something like the following:
>> 
>> // The start/stop rows should include all 6000 rows.
>> Scan scan = new Scan(startRow, stopRow);
>> // Only return the column that I am interested in
>> // (should only be in 1 row and only 1 version).
>> scan.addColumn(cf, qualifier);
>> 
>> Filter f1 = new InclusiveStopFilter(stopRow);
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>     CompareFilter.CompareOp.EQUAL, value);
>> scan.setFilter(new FilterList(f1, f2));
>> 
>> scan.setTimeRange(0, MAX_LONG);
>> scan.setMaxVersions(1);
>> 
>> ResultScanner rs = t.getScanner(scan);
>> for (Result result: rs)
>> {
>> 
>> }
>> 
>> For table 1, rs.next() takes about 30ms.
>> For table 2, rs.next() takes about 180ms.
>> 
>> Both are returning the exact same result.  Why is it taking so much longer
>> on table 2 to get the same result?  The scan depth is the same.  The only
>> difference is the column width.  But I'm filtering on a single column and
>> returning only that column.
>> 
>> Am I missing something?  As I increase the number of columns, the response
>> time gets worse.  I do expect the response time to get worse when
>> increasing the number of rows, but not by increasing the number of columns
>> since I'm returning only 1 column in both cases.
>> 
>> I appreciate any comments that you have.
>> 
>> -Tony
>> 
>> 
>> 
>> Tony Dean
>> SAS Institute Inc.
>> Principal Software Developer
>> 919-531-6704          ...
>> 
