[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215519#comment-13215519 ]
Hadoop QA commented on HBASE-5416:
----------------------------------

-1 overall. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12515904/Filtered_scans_v3.patch
  against trunk revision .

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 6 new or modified tests.

    -1 javadoc. The javadoc tool appears to have generated -133 warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs. The patch appears to introduce 155 new Findbugs (version 1.3.9) warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1041//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1041//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1041//console

This message is automatically generated.

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: filters, performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>         Attachments: Filtered_scans.patch, Filtered_scans_v2.patch,
>                      Filtered_scans_v3.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and
> only after that is the filter (if one exists) applied to decide whether the
> row is needed.
> But when a scan covers several CFs and the filter checks data from only a
> subset of them, the data from the CFs the filter does not check is not needed
> at the filter stage. It is needed only once we have decided to include the
> current row. In that case we can significantly reduce the amount of IO
> performed by the scan by loading only the values actually checked by the
> filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch
> of megabytes) and is used to filter large entries from snap. Snap is very
> large (tens of GB) and is quite costly to scan. If we need only rows with
> some flag set, we use SingleColumnValueFilter to limit the result to a small
> subset of the region. But the current implementation loads both CFs to
> perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface that lets a
> filter specify which CF is needed for its operation. In HRegion, we separate
> all scanners into two groups: those needed for the filter and the rest
> (joined). When a new row is considered, only the needed data is loaded and
> the filter is applied; only if the filter accepts the row is the rest of the
> data loaded. On our data, this speeds up such scans 30-50 times. It also
> gives us a way to better normalize the data into separate columns by
> optimizing the scans performed.
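
The two-phase loading described in the issue can be illustrated with a small, self-contained sketch. This is not the patch itself and does not use the HBase API: the names RowFilter, needsFamily, FamilyStore and scan are hypothetical stand-ins, and each column family is modelled as an in-memory map with a read counter so the saved IO is visible.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoPhaseScanSketch {

    /** Hypothetical filter contract; not the HBase Filter interface. */
    interface RowFilter {
        /** True if the filter must see this family to make its decision. */
        boolean needsFamily(String family);

        /** Decides on a row given only the values of the needed families. */
        boolean accept(Map<String, String> neededValues);
    }

    /** Hypothetical in-memory column family that counts how many values were read. */
    static final class FamilyStore {
        private final Map<String, String> valuesByRow;
        int reads = 0;

        FamilyStore(Map<String, String> valuesByRow) {
            this.valuesByRow = valuesByRow;
        }

        String read(String rowKey) {
            reads++;
            return valuesByRow.get(rowKey);
        }
    }

    /** Scans the given rows, reading non-needed families only for accepted rows. */
    static List<Map<String, String>> scan(List<String> rowKeys,
                                          Map<String, FamilyStore> families,
                                          RowFilter filter) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String rowKey : rowKeys) {
            // Phase 1: read only the families the filter says it needs.
            Map<String, String> row = new LinkedHashMap<>();
            for (Map.Entry<String, FamilyStore> e : families.entrySet()) {
                if (filter.needsFamily(e.getKey())) {
                    row.put(e.getKey(), e.getValue().read(rowKey));
                }
            }
            if (!filter.accept(row)) {
                continue; // rejected: the remaining families are never read
            }
            // Phase 2: the row is accepted, so read the remaining ("joined") families.
            for (Map.Entry<String, FamilyStore> e : families.entrySet()) {
                if (!filter.needsFamily(e.getKey())) {
                    row.put(e.getKey(), e.getValue().read(rowKey));
                }
            }
            results.add(row);
        }
        return results;
    }

    /** Filter that accepts rows whose value in one family equals an expected value. */
    static RowFilter singleFamilyEquals(String family, String expected) {
        return new RowFilter() {
            @Override
            public boolean needsFamily(String f) {
                return family.equals(f);
            }

            @Override
            public boolean accept(Map<String, String> values) {
                return expected.equals(values.get(family));
            }
        };
    }
}
```

With a small "flags" family and a large "snap" family, a filter built by singleFamilyEquals("flags", ...) causes "snap" to be read only for rows whose flag matches, which is the effect the issue describes. The real change lives in HRegion's scanner code and works on key/values rather than on whole per-row strings.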