Re: Scan performance

Ted Yu Tue, 02 Jul 2013 15:12:12 -0700

Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/


Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <tony.d...@sas.com> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row.  The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon    event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result is
> ~5ms.  This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at the
> time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to get
> Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter.  It's a good candidate
> since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single result
> was ~5ms (pretty much the same response time as case 1 where I knew the
> complete row key).
>
> I didn't expect such an improvement.  Can you explain how FuzzyRowFilter
> optimizes scanning rows from disk?  In my case it needs to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon".  Then it can just stop
> after that; thereby optimizing the scan, correct?  So, optimization using
> FuzzyRowFilter is very dependent upon the data that you are scanning.
>
> Thanks for any insight.
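
On the question just above: as far as I understand it, the filter does not stop once it passes "Logon". When a row cannot match, it hands the scanner a hint for the smallest key that still could match, and the scanner seeks there instead of reading every row in between. Below is a rough, self-contained model of that idea in plain Java, not the HBase implementation. Everything in it is an assumption for illustration: the class and method names are made up, '?' marks a "don't care" position (the real filter takes a byte[] key plus a byte[] mask), keys are 7-bit ASCII, and every field is fixed width, which keys like sid-0 in this thread may not be.

```java
import java.util.TreeSet;

final class FuzzyScanSketch {
    static final char WILD = '?';      // sketch convention: '?' marks a "don't care" position
    static final char MIN = '\u0000';  // smallest char a wildcard position can hold
    static final char MAX = '\u007f';  // largest char, assuming 7-bit ASCII keys

    /** True if row agrees with pattern at every fixed (non-'?') position. */
    static boolean matches(String row, String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != WILD && row.charAt(i) != p) return false;
        }
        return true;
    }

    /** Smallest possible completion of pattern from index 'from' onward. */
    static String minSuffix(String pattern, int from) {
        return pattern.substring(from).replace(WILD, MIN);
    }

    /** Smallest key above a non-matching row that could still match, or null if none. */
    static String nextHint(String row, String pattern) {
        int toInc = -1; // last wildcard position seen so far that can be incremented
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i), r = row.charAt(i);
            if (p == WILD) {
                if (r < MAX) toInc = i;
            } else if (r < p) {
                // Row is below the pattern here: keep the prefix, jump up to the pattern.
                return row.substring(0, i) + minSuffix(pattern, i);
            } else if (r > p) {
                // Row is above the pattern here: bump an earlier wildcard, reset the rest.
                if (toInc < 0) return null;
                return row.substring(0, toInc) + (char) (row.charAt(toInc) + 1)
                        + minSuffix(pattern, toInc + 1);
            }
        }
        return null; // row already matches, so no hint is needed
    }

    /** Returns {rowsMatched, rowsExamined} for a hint-driven scan over sorted rows. */
    static int[] scan(TreeSet<String> rows, String pattern) {
        int matched = 0, examined = 0;
        String cur = rows.isEmpty() ? null : rows.first();
        while (cur != null) {
            examined++;
            if (matches(cur, pattern)) {
                matched++;
                cur = rows.higher(cur);
            } else {
                String hint = nextHint(cur, pattern);
                cur = (hint == null) ? null : rows.ceiling(hint); // "seek" past non-candidates
            }
        }
        return new int[] {matched, examined};
    }

    /** Hypothetical data: 100 fixed-width session ids with 5 events each. */
    static TreeSet<String> sampleRows() {
        TreeSet<String> rows = new TreeSet<>();
        String[] events = {"Click", "Logon", "Order", "Query", "Visit"};
        for (int sid = 0; sid < 100; sid++) {
            for (String e : events) {
                rows.add(String.format("vid,%04d,%s", sid, e));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = sampleRows();
        int[] r = scan(rows, "vid,????,Logon");
        System.out.println(r[0] + " matches, " + r[1] + " of " + rows.size() + " rows examined");
    }
}
```

In this model the number of rows examined stays well below the table size, and the gap grows with the number of non-matching rows between candidates. So yes, the benefit depends on how the data is laid out around the fixed positions.
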
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:la...@apache.org]
> Sent: Monday, June 24, 2013 5:05 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block size
> (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information for
> all tables could be substantial, so it does not make much sense to prime
> the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or
> HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <
> la...@apache.org>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for a window to swap the HBase jars in the cluster for ones that
> support FuzzyRowFilter so that I can try it out.  In the meantime, I'm
> wondering why RowFilter regex is not more helpful.  I'm guessing that
> FuzzyRow filter helps in disk io while Row filter just filters after the
> disk io has completed.  Also, I turned on row level bloom filter which does
> not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to prime
> client connection information and such so that the first client query isn't
> miserably slow.  After the first query, response times do get considerably
> better due to caching necessary information.  Is there a way to get around
> this first initial hit?  I assume any such priming would have to be
> application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:la...@apache.org]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but want to
> return *other* columns for the rows that matched the column.
>
> Check out HBASE-5416.
>
> -- Lars
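
A toy model of the essential column family idea described above, in plain Java rather than HBase code. The class, method names, and data are hypothetical. It assumes the filter column sits alone in one family and the remaining columns in another, and it only counts cells read: every filter cell is read, but the second family is loaded only for rows that pass the filter. If I remember right, in 0.94 this behaviour is switched on per scan with Scan.setLoadColumnFamiliesOnDemand(true), but please check that against your version.

```java
import java.util.Map;
import java.util.TreeMap;

final class EssentialFamilySketch {
    /** Counts cells read when the filter column lives in its own family. */
    static long cellsRead(Map<String, String> filterFamily,
                          Map<String, Map<String, String>> dataFamily,
                          String wanted) {
        long read = 0;
        for (Map.Entry<String, String> cell : filterFamily.entrySet()) {
            read++; // every cell of the filter family is read
            if (wanted.equals(cell.getValue())) {
                // Only matching rows pull in the other family.
                read += dataFamily.get(cell.getKey()).size();
            }
        }
        return read;
    }

    /** Hypothetical table with exactly one matching row and 'otherColumns' extra columns per row. */
    static long demo(int rows, int otherColumns) {
        Map<String, String> filterFamily = new TreeMap<>();
        Map<String, Map<String, String>> dataFamily = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            String row = String.format("row%05d", r);
            filterFamily.put(row, r == rows / 2 ? "match" : "other");
            Map<String, String> rest = new TreeMap<>();
            for (int c = 0; c < otherColumns; c++) {
                rest.put(String.format("col%02d", c), "v");
            }
            dataFamily.put(row, rest);
        }
        return cellsRead(filterFamily, dataFamily, "match");
    }
}
```

With 6000 rows and 39 other columns, demo(6000, 39) reads 6039 cells in this model, against 240000 if all 40 columns shared one family and every cell had to be walked.
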
>
>
>
> ________________________________
> From: Vladimir Rodionov <vrodio...@carrieriq.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <
> la...@apache.org>
> Sent: Friday, June 21, 2013 5:00 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I thought that a column family is the locality group and that placing columns
> which are frequently accessed together into
> the same column family (locality group) is the obvious performance
> improvement tip. What are the "essential column families" for in this
> context?
>
> As for original question..  Unless you place your column into a separate
> column family in Table 2, you will
> need to scan (load from disk if not cached) ~ 40x more data for the second
> table (because you have 40 columns). This may explain why you see such a
> difference in execution time if all data needs to be loaded first from HDFS.
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: vrodio...@carrieriq.com
>
> ________________________________________
> From: lars hofhansl [la...@apache.org]
> Sent: Friday, June 21, 2013 3:37 PM
> To: user@hbase.apache.org
> Subject: Re: Scan performance
>
> HBase is a key value (KV) store. Each column is stored in its own KV; a
> row is just a set of KVs that happen to share the same row key (which is the
> first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html
>
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row. So more disk
> and memory IO is needed.
>
> If you are using 0.94 there is a new feature, "essential column families". If
> you always search by the same column you can place that one in its own
> column family and all other columns in another column family. In that case
> your scan performance should be close to identical.
>
>
> -- Lars
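
To make the point above about skipping over KVs concrete, here is a small stand-alone Java model (not HBase code; all names are made up). It assumes a single sorted store keyed by row/column and a scan that simply walks every KV in order, ignoring blocks, seeks, and caching. It counts how many KVs are touched to return one column per row.

```java
import java.util.TreeMap;

final class KvLayoutSketch {
    /** Builds a sorted store; key = row + "/" + qualifier stands in for row/column order. */
    static TreeMap<String, String> buildStore(int rows, int columnsPerRow) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                store.put(String.format("row%05d/col%02d", r, c), "v");
            }
        }
        return store;
    }

    /** Walks every KV in order; returns {kvsTouched, kvsReturned} for one wanted qualifier. */
    static long[] scanForColumn(TreeMap<String, String> store, String qualifier) {
        long touched = 0, returned = 0;
        String suffix = "/" + qualifier;
        for (String key : store.keySet()) {
            touched++;                      // every KV is passed over on the way to the next row
            if (key.endsWith(suffix)) {
                returned++;                 // only the wanted column is handed back
            }
        }
        return new long[] {touched, returned};
    }

    public static void main(String[] args) {
        long[] narrow = scanForColumn(buildStore(6000, 1), "col00");
        long[] wide = scanForColumn(buildStore(6000, 40), "col00");
        System.out.println("1 column:   touched " + narrow[0] + ", returned " + narrow[1]);
        System.out.println("40 columns: touched " + wide[0] + ", returned " + wide[1]);
    }
}
```

Both stores return the same 6000 cells, but the 40-column store passes over 40 times as many KVs to do it. That lines up with the direction of the timings reported below, though the real cost also depends on block size and what is already cached.
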
> ________________________________
>
> From: Tony Dean <tony.d...@sas.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Friday, June 21, 2013 2:08 PM
> Subject: Scan performance
>
>
>
>
> Hi,
>
> I hope that you can shed some light on these 2 scenarios below.
>
> I have 2 small tables of 6000 rows.
> Table 1 has only 1 column in each of its rows.
> Table 2 has 40 columns in each of its rows.
> Other than that the two tables are identical.
>
> In both tables there is only 1 row that contains a matching column that I
> am filtering on.   And the Scan performs correctly in both cases by
> returning only the single result.
>
> The code looks something like the following:
>
> Scan scan = new Scan(startRow, stopRow);   // start/stop rows span all 6000 rows
> // Only return the column of interest (present in 1 row, 1 version).
> scan.addColumn(cf, qualifier);
>
> // Make the stop row inclusive and keep only rows whose column equals value.
> Filter f1 = new InclusiveStopFilter(stopRow);
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>     CompareFilter.CompareOp.EQUAL, value);
> scan.setFilter(new FilterList(f1, f2));
>
> scan.setTimeRange(0, Long.MAX_VALUE);
> scan.setMaxVersions(1);
>
> ResultScanner rs = t.getScanner(scan);
> for (Result result : rs)
> {
>     // consume the single matching row
> }
> rs.close();   // release the scanner when done
>
> For table 1, rs.next() takes about 30ms.
> For table 2, rs.next() takes about 180ms.
>
> Both are returning the exact same result.  Why is it taking so much longer
> on table 2 to get the same result?  The scan depth is the same.  The only
> difference is the column width.  But I’m filtering on a single column and
> returning only that column.
>
> Am I missing something?  As I increase the number of columns, the response
> time gets worse.  I do expect the response time to get worse when
> increasing the number of rows, but not by increasing the number of columns
> since I’m returning only 1 column in
> both cases.
>
> I appreciate any comments that you have.
>
> -Tony
>
>
>
> Tony Dean
> SAS Institute Inc.
> Principal Software Developer
> 919-531-6704          …
>
