RowFilter can help, but it depends on the setup. RowFilter skips all columns of a row when the row key does not match. That helps with IO *if* your rows are larger than the HFile block size (64k by default). Otherwise it still needs to touch each block.
An HTable does some priming when it is created. The region information for all tables could be substantial, so it does not make much sense to prime the cache for all tables. How are you using the client? If you pre-create and reuse an HTable and/or HConnection you should be OK.

-- Lars

________________________________
From: Tony Dean <tony.d...@sas.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <la...@apache.org>
Sent: Monday, June 24, 2013 1:48 PM
Subject: RE: Scan performance

Lars,

I'm waiting for some time to swap out the HBase jars in the cluster (for ones that support the FuzzyRow filter) in order to try it out. In the meantime, I'm wondering why the RowFilter regex is not more helpful. I'm guessing that the FuzzyRow filter helps with disk IO while RowFilter just filters after the disk IO has completed. Also, I turned on the row-level bloom filter, which does not seem to help either.

On a different performance note, I'm wondering if there is a way to prime client connection information and such so that the first client query isn't miserably slow. After the first query, response times do get considerably better because the necessary information is cached. Is there a way to get around this initial hit? I assume any such priming would have to be application specific.

Thanks.

-----Original Message-----
From: lars hofhansl [mailto:la...@apache.org]
Sent: Saturday, June 22, 2013 9:24 AM
To: user@hbase.apache.org
Subject: Re: Scan performance

"Essential column families" help when you filter on one column but want to return *other* columns for the rows that matched the column. Check out HBASE-5416.
-- Lars

________________________________
From: Vladimir Rodionov <vrodio...@carrieriq.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>; lars hofhansl <la...@apache.org>
Sent: Friday, June 21, 2013 5:00 PM
Subject: RE: Scan performance

Lars,

I thought that a column family is the locality group, and that placing columns which are frequently accessed together into the same column family (locality group) is the obvious performance improvement tip. What are the "essential column families" for in this context?

As for the original question: unless you place your column into a separate column family in Table 2, you will need to scan (load from disk if not cached) ~40x more data for the second table (because you have 40 columns). This may explain why you see such a difference in execution time if all data needs to be loaded first from HDFS.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: vrodio...@carrieriq.com

________________________________________
From: lars hofhansl [la...@apache.org]
Sent: Friday, June 21, 2013 3:37 PM
To: user@hbase.apache.org
Subject: Re: Scan performance

HBase is a key value (KV) store. Each column is stored in its own KV; a row is just a set of KVs that happen to have the same row key (which is the first part of the key).
I tried to summarize this here: http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html

In the StoreFiles all KVs are sorted in row/column order, but HBase still needs to skip over many KVs in order to "reach" the next row. So more disk and memory IO is needed.

If you are using 0.94 there is a new feature, "essential column families". If you always search by the same column you can place that one in its own column family and all other columns in another column family. In that case your scan performance should be close to identical.
-- Lars

________________________________
From: Tony Dean <tony.d...@sas.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Friday, June 21, 2013 2:08 PM
Subject: Scan performance

Hi,

I hope that you can shed some light on the 2 scenarios below.

I have 2 small tables of 6000 rows. Table 1 has only 1 column in each of its rows. Table 2 has 40 columns in each of its rows. Other than that the two tables are identical.

In both tables there is only 1 row that contains a matching column that I am filtering on. The Scan performs correctly in both cases by returning only the single result.

The code looks something like the following:

    // the start/stop rows should include all 6000 rows
    Scan scan = new Scan(startRow, stopRow);
    // only return the column that I am interested in
    // (should only be in 1 row and only 1 version)
    scan.addColumn(cf, qualifier);
    Filter f1 = new InclusiveStopFilter(stopRow);
    Filter f2 = new SingleColumnValueFilter(cf, qualifier, CompareFilter.CompareOp.EQUAL, value);
    scan.setFilter(new FilterList(f1, f2));
    scan.setTimeRange(0, Long.MAX_VALUE);
    scan.setMaxVersions(1);
    ResultScanner rs = t.getScanner(scan);
    for (Result result : rs) {
    }

For table 1, rs.next() takes about 30ms.
For table 2, rs.next() takes about 180ms.

Both return the exact same result. Why is it taking so much longer on table 2 to get the same result? The scan depth is the same. The only difference is the column width. But I’m filtering on a single column and returning only that column.

Am I missing something? As I increase the number of columns, the response time gets worse. I do expect the response time to get worse when increasing the number of rows, but not when increasing the number of columns, since I’m returning only 1 column in both cases.

I appreciate any comments that you have.

-Tony

Tony Dean
SAS Institute Inc.
Principal Software Developer
919-531-6704 …
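
Note on the thread above: the slowdown reported in the original question follows from the KV layout described in the replies. The following is a minimal sketch in plain Java, not HBase code. The names `KvScanModel`, `Cell`, `buildStore` and `scan` are hypothetical and exist only for illustration, and the model assumes the scanner steps over every KV in a store (no seek hints, block cache, or bloom filters). It only counts how many cells sit in the store that holds the filter column.

```java
import java.util.ArrayList;
import java.util.List;

public class KvScanModel {

    /** Toy stand-in for an HBase KeyValue: only the key parts that matter here. */
    public static final class Cell {
        final int row;
        final String qualifier;

        Cell(int row, String qualifier) {
            this.row = row;
            this.qualifier = qualifier;
        }
    }

    /** Builds one store (one column family) with cells in row/column order. */
    public static List<Cell> buildStore(int rows, int columnsPerRow) {
        List<Cell> store = new ArrayList<>();
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < columnsPerRow; c++) {
                // Zero-padded so qualifiers sort the same way as the loop order.
                store.add(new Cell(r, String.format("q%02d", c)));
            }
        }
        return store;
    }

    /** Walks the whole store and returns {cells visited, cells matching the qualifier}. */
    public static long[] scan(List<Cell> store, String qualifier) {
        long visited = 0;
        long matched = 0;
        for (Cell cell : store) {
            visited++;
            if (cell.qualifier.equals(qualifier)) {
                matched++;
            }
        }
        return new long[] {visited, matched};
    }

    public static void main(String[] args) {
        int rows = 6000;
        // Table 1: one column per row.
        long[] narrow = scan(buildStore(rows, 1), "q00");
        // Table 2: 40 columns per row, all in the same column family.
        long[] wide = scan(buildStore(rows, 40), "q00");
        // Table 2 with the filter column moved into its own family:
        // the store being scanned again holds one cell per row.
        long[] split = scan(buildStore(rows, 1), "q00");

        System.out.println("table 1:                 visited=" + narrow[0] + " matched=" + narrow[1]);
        System.out.println("table 2, one family:     visited=" + wide[0] + " matched=" + wide[1]);
        System.out.println("table 2, split families: visited=" + split[0] + " matched=" + split[1]);
    }
}
```

With 6000 rows the model visits 6,000 cells for the one-column layout and 240,000 for the 40-column layout, the same 40x factor estimated in the thread. Moving the filter column into its own family brings the count back to 6,000, which is why scan times are expected to be close to identical in that setup. Real timings will not scale this cleanly, since the model ignores caching and cell sizes.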