[jira] [Commented] (HBASE-5416) Improve performance of scans with some kind of filters.

stack (JIRA) Wed, 19 Dec 2012 13:05:13 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13536425#comment-13536425
 ]


stack commented on HBASE-5416:
------------------------------

I do not yet see enough tests to allay the strong concerns raised above about 
the intrusiveness of this patch.  There is a lot of new code here in the 
HRegion scanning path, and not all of it is skipped when the new feature flag 
is unset.  I have not reviewed the changes closely (this is neither a critical 
nor a blocker issue, so it is secondary in my thinking), and I do not see 
evidence of close review by others.  Has it been done?

This changes the Filter interface, so it can go into 0.96 only (as said above).

+   * This can deliver huge perf gains when there's a cf with lots of data; however, it can
+   * also lead to some inconsistent results (e.g. due to concurrent updates, or splits).

Can we have more detail on what the inconsistency referred to above is about?

What is happening in SingleColumnValueExcludeFilter?  Are we removing 
filterKeyValue and putting filterRow and hasFilterRow in its place?

Should FilterBase do 'return filter.isFamilyEssential(name);' rather than just 
'return true' in isFamilyEssential?

Why is the field below in Region and not in RegionScanner?

+    // Heap of key-values that are not essential for the provided filters and are thus read
+    // on demand, if lazy column family loading is enabled.
+    KeyValueHeap joinedHeap = null;


This is a little obscene:

+                Collections.sort(results, comparator);

inside HRegion, where it merges the results of 'essential' and 'non-essential' 
data (this probably should be rephrased...).  It can't be avoided though, given 
what is going on here.
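
To illustrate why the sort is hard to avoid, here is a minimal, self-contained 
sketch. It does not use HBase classes: Cell, COMPARATOR and merge are 
hypothetical names, and the ordering (row, then family, then qualifier) is a 
simplifying assumption. Each group of cells arrives sorted on its own, but 
appending the 'joined' cells after the 'essential' ones breaks the order 
whenever a joined family sorts before an essential one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class MergeSortedResultsSketch {
    /** Hypothetical stand-in for a key-value; not the HBase KeyValue class. */
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;

        Cell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier;
        }
    }

    /** Assumed ordering: row, then family, then qualifier. */
    static final Comparator<Cell> COMPARATOR =
        Comparator.comparing((Cell c) -> c.row)
            .thenComparing(c -> c.family)
            .thenComparing(c -> c.qualifier);

    /** Appends the lazily loaded cells, then restores the overall order. */
    static List<Cell> merge(List<Cell> essential, List<Cell> joined) {
        List<Cell> results = new ArrayList<>(essential);
        results.addAll(joined);
        Collections.sort(results, COMPARATOR);
        return results;
    }

    public static void main(String[] args) {
        List<Cell> essential = Arrays.asList(new Cell("r1", "flags", "a"));
        List<Cell> joined =
            Arrays.asList(new Cell("r1", "data", "x"), new Cell("r1", "snap", "y"));
        System.out.println(merge(essential, joined));
    }
}
```

In this sketch the 'data' family sorts before 'flags', so without the sort the 
row would come back with its families out of order.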

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt, 
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch, 
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch, 
> Filtered_scans_v7.patch, HBASE-5416-v10.patch, HBASE-5416-v7-rebased.patch, 
> HBASE-5416-v8.patch, HBASE-5416-v9.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and 
> only then is the filter (if one exists) applied to decide whether the row is 
> needed.
> But when a scan covers several CFs and the filter checks data from only a 
> subset of them, data from the CFs the filter does not check is not needed at 
> the filter stage. It is needed only once we have decided to include the 
> current row. In that case we can significantly reduce the amount of IO a scan 
> performs by loading only the values the filter actually checks.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch 
> of megabytes) and is used to filter large entries from snap. Snap is very 
> large (10s of GB) and is quite costly to scan. If we need only rows with some 
> flag specified, we use SingleColumnValueFilter to limit the result to a small 
> subset of the region. But the current implementation loads both CFs to 
> perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a filter 
> to specify which CFs are needed for its operation. In HRegion, we separate 
> all scanners into two groups: those needed for the filter and the rest 
> (joined). When a new row is considered, only the needed data is loaded, the 
> filter is applied, and only if the filter accepts the row is the rest of the 
> data loaded. On our data, this speeds up such scans 30-50 times. It also 
> gives us a way to better normalize the data into separate columns by 
> optimizing the scans performed.
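
The two-phase loading described in the issue can be sketched as follows. This 
is a simplified illustration, not the patch itself: the class, the in-memory 
maps standing in for the flags and snap families, and the snapReads counter 
are all hypothetical, and the sketch assumes every row has an entry in the 
small family. The filter sees only the small family, and the large family is 
read only for rows the filter accepts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class LazyFamilyScanSketch {
    /** Counts reads of the large family so the saving is observable. */
    static int snapReads = 0;

    /** Hypothetical stand-in for the costly read of the large 'snap' family. */
    static String readSnap(Map<String, String> snap, String row) {
        snapReads++;
        return snap.get(row);
    }

    /**
     * Phase 1: load only the family the filter needs and apply the filter.
     * Phase 2: load the remaining (joined) family only for accepted rows.
     */
    static List<String> scan(TreeMap<String, String> flags,
                             Map<String, String> snap,
                             Predicate<String> flagFilter) {
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, String> e : flags.entrySet()) {
            if (!flagFilter.test(e.getValue())) {
                continue; // row rejected: the large family is never touched
            }
            String big = readSnap(snap, e.getKey());
            results.add(e.getKey() + ":" + e.getValue() + ":" + big);
        }
        return results;
    }

    public static void main(String[] args) {
        TreeMap<String, String> flags = new TreeMap<>();
        flags.put("r1", "on");
        flags.put("r2", "off");
        flags.put("r3", "on");
        Map<String, String> snap = new HashMap<>();
        snap.put("r1", "big1");
        snap.put("r2", "big2");
        snap.put("r3", "big3");

        System.out.println(scan(flags, snap, "on"::equals));
        System.out.println("snap reads: " + snapReads);
    }
}
```

With three rows and one rejected by the filter, the large family is read twice 
rather than three times; the saving grows with the fraction of rows the filter 
rejects.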

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
