Re: Scan performance

Anoop John Fri, 21 Jun 2013 23:59:41 -0700

Have a look at FuzzyRowFilter

-Anoop-


On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[email protected]> wrote:

> I understand more, but have additional questions about the internals...
>
> So, in this example I have 6000 rows X 40 columns in this table.  In this
> test my startRow and stopRow do not narrow the scan criterior therefore all
> 6000x40 KVs must be included in the search and thus read from disk and into
> memory.
>
> The first filter that I used was:
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>  CompareFilter.CompareOp.EQUALS, value);
>
> This means that HBase must look for the qualifier column on all 6000 rows.
>  As you mention I could add certain columns to a different cf; but
> unfortunately, in my case there is no such small set of columns that will
> need to be compared (filtered on).  I could try to use indexes so that a
> complete row key can be calculated from a secondary index in order to
> perform a faster search against data in a primary table.  This requires
> additional tables and maintenance that I would like to avoid.
>
> I did try a row key filter with regex hoping that it would limit the
> number of rows that were read from disk.
> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
> RegexStringComparator(row_regexpr));
>
> My row keys are something like: vid,sid,event.  sid is not known at query
> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
> that I am looking for in a particular visit.  In my test data this should
> have narrowed the scan to 1 row X 40 columns.  The best I could do for
> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
> going to cause all 6000 rows to be scanned, but the filtering should be
> more specific with the rowKey filter.  However, I did not see any
> performance improvement.  Anything obvious?
>
> Do you have any other ideas to help out with performance when row key is:
> vid,sid,event and sid is not known at query time which leaves a gap in the
> start/stop row?  Too bad regex can't be used in start/stop row
> specification.  That's really what I need.
>
> Thanks again.
> -Tony
>
> -----Original Message-----
> From: Vladimir Rodionov [mailto:[email protected]]
> Sent: Friday, June 21, 2013 8:00 PM
> To: [email protected]; lars hofhansl
> Subject: RE: Scan performance
>
> Lars,
> I thought that column family is the locality group and placement columns
> which are frequently accessed together into the same column family
> (locality group) is the obvious performance improvement tip. What are the
> "essential column families" for in this context?
>
> As for original question..  Unless you place your column into a separate
> column family in Table 2, you will need to scan (load from disk if not
> cached) ~ 40x more data for the second table (because you have 40 columns).
> This may explain why do  see such a difference in execution time if all
> data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: lars hofhansl [[email protected]]
> Sent: Friday, June 21, 2013 3:37 PM
> To: [email protected]
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV, a
> row is just a set of KVs that happen to have the row key (which is the
> first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html)
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row. So more disk
> and memory IO is needed.
>
> If you using 0.94 there is a new feature "essential column families". If
> you always search by the same column you can place that one in its own
> column family and all other column in another column family. In that case
> your scan performance should be close identical.
>
>
> -- Lars
> ________________________________
>
> From: Tony Dean <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.   And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> Scan scan = new Scan(startRow, stopRow);   // the start/stop rows should
> include all 6000 rows
> scan.addColumn(cf, qualifier); // only return the column that I am
> interested in (should only be in 1 row and only 1 version)
>
> Filter f1 = new InclusiveStopFilter(stopRow); Filter f2 = new
> SingleColumnValueFilter(cf, qualifier,  CompareFilter.CompareOp.EQUALS,
> value); scan.setFilter(new FilterList(f1, f2));
>
> scan .setTimeRange(0, MAX_LONG);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result: rs)
> {
>
> }
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.  Why is it taking so much longer
> on table 2 to get the same result?  The scan depth is the same.  The only
> difference is the column width.  But I'm filtering on a single column and
> returning only that column.
>
> Am I missing something?  As I increase the number of columns, the response
> time gets worse.  I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I'm returning only 1 column in both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704          ...
>
> Confidentiality Notice:  The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited.  If you have received this message in error, please
> immediately notify the sender and/or [email protected] and
> delete or destroy any copy of this message and its attachments.
>
>
>

Re: Scan performance

Reply via email to