[ https://issues.apache.org/jira/browse/HBASE-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13532040#comment-13532040 ]
Ted Yu commented on HBASE-5416:
-------------------------------

Here is the test result from Linux:
{code}
grep 'scanner finished in' testJoinedScanners-output.txt
2012-12-13 20:28:36,780 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 29.421479079 seconds, got 100 rows
2012-12-13 20:28:47,617 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 10.836890451 seconds, got 100 rows
2012-12-13 20:28:58,637 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 11.019543361 seconds, got 100 rows
2012-12-13 20:29:07,865 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.227820454 seconds, got 100 rows
2012-12-13 20:29:17,690 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.824966218 seconds, got 100 rows
2012-12-13 20:29:26,317 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.626794601 seconds, got 100 rows
2012-12-13 20:29:36,288 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.97033987 seconds, got 100 rows
2012-12-13 20:29:45,033 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.745137076 seconds, got 100 rows
2012-12-13 20:29:55,023 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.989630848 seconds, got 100 rows
2012-12-13 20:30:03,416 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.392952897 seconds, got 100 rows
2012-12-13 20:30:12,267 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 8.850649054 seconds, got 100 rows
2012-12-13 20:30:20,985 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.718266736 seconds, got 100 rows
2012-12-13 20:30:30,108 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.122057799 seconds, got 100 rows
2012-12-13 20:30:38,669 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.561782079 seconds, got 100 rows
2012-12-13 20:30:47,898 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.228045508 seconds, got 100 rows
2012-12-13 20:30:57,057 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.158965127 seconds, got 100 rows
2012-12-13 20:31:07,428 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 10.370526135 seconds, got 100 rows
2012-12-13 20:31:16,586 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 9.157627332 seconds, got 100 rows
2012-12-13 20:31:25,612 INFO [main] regionserver.TestJoinedScanners(172): Slow scanner finished in 9.026821302 seconds, got 100 rows
2012-12-13 20:31:34,553 INFO [main] regionserver.TestJoinedScanners(172): Joined scanner finished in 8.93992941 seconds, got 100 rows
{code}

> Improve performance of scans with some kind of filters.
> -------------------------------------------------------
>
>                 Key: HBASE-5416
>                 URL: https://issues.apache.org/jira/browse/HBASE-5416
>             Project: HBase
>          Issue Type: Improvement
>          Components: Filters, Performance, regionserver
>    Affects Versions: 0.90.4
>            Reporter: Max Lapan
>            Assignee: Max Lapan
>             Fix For: 0.96.0
>
>         Attachments: 5416-Filtered_scans_v6.patch, 5416-v5.txt, 5416-v6.txt,
> Filtered_scans.patch, Filtered_scans_v2.patch, Filtered_scans_v3.patch,
> Filtered_scans_v4.patch, Filtered_scans_v5.1.patch, Filtered_scans_v5.patch,
> Filtered_scans_v7.patch, HBASE-5416-v7-rebased.patch
>
>
> When a scan is performed, the whole row is loaded into the result list, and
> only then is the filter (if one exists) applied to decide whether the row is
> needed.
> But when a scan covers several CFs and the filter checks data from only a
> subset of them, data from the CFs not checked by the filter is not needed at
> the filter stage; it is needed only once we decide to include the current
> row. In that case we can significantly reduce the amount of IO performed by
> a scan by loading only the values actually checked by the filter.
> For example, we have two CFs: flags and snap. Flags is quite small (a bunch
> of megabytes) and is used to filter large entries from snap. Snap is very
> large (10s of GB) and is quite costly to scan. If we need only rows with
> some flag specified, we use SingleColumnValueFilter to limit the result to a
> small subset of the region. But the current implementation loads both CFs to
> perform the scan, when only a small subset is needed.
> The attached patch adds one routine to the Filter interface to allow a
> filter to specify which CFs are needed for its operation. In HRegion, we
> separate all scanners into two groups: those needed for the filter and the
> rest (joined). When a new row is considered, only the needed data is loaded,
> the filter is applied, and only if the filter accepts the row is the rest of
> the data loaded. On our data, this speeds up such scans 30-50 times. It also
> gives us a way to better normalize the data into separate columns by
> optimizing the scans performed.
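
For readers skimming the thread, the two-phase load described in the quoted issue can be sketched roughly as below. This is only an illustrative in-memory model, not the code from the attached patches: `JoinedScanSketch`, `Family`, `load`, and `joinedScan` are hypothetical names, and the `reads` counter stands in for the IO a real region scanner would perform.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical in-memory model; none of these names are HBase APIs.
class JoinedScanSketch {

    /** One column family: row key -> value, kept in insertion (row) order. */
    static final class Family {
        final Map<String, String> rows = new LinkedHashMap<>();
        int reads = 0; // number of rows actually loaded from this family

        String load(String rowKey) {
            reads++;
            return rows.get(rowKey);
        }
    }

    /**
     * Loads the family the filter needs for every row, and touches the
     * remaining ("joined") family only for rows the filter accepts.
     */
    static List<String> joinedScan(Family essential, Family joined,
                                   Predicate<String> filter) {
        List<String> results = new ArrayList<>();
        for (String rowKey : essential.rows.keySet()) {
            String flag = essential.load(rowKey);   // phase 1: cheap family only
            if (!filter.test(flag)) {
                continue;                           // rejected: joined family is never read
            }
            String payload = joined.load(rowKey);   // phase 2: expensive family
            results.add(rowKey + "=" + payload);
        }
        return results;
    }

    public static void main(String[] args) {
        Family flags = new Family(); // small family checked by the filter
        Family snap = new Family();  // large family, costly to read
        for (int i = 0; i < 10; i++) {
            flags.rows.put("row" + i, i % 5 == 0 ? "keep" : "skip");
            snap.rows.put("row" + i, "payload" + i);
        }
        List<String> out = joinedScan(flags, snap, "keep"::equals);
        System.out.println(out + " flagsReads=" + flags.reads
                + " snapReads=" + snap.reads);
    }
}
```

In this toy setup every row of `flags` is read, but `snap` is read only for the rows that pass the filter, which is the source of the IO savings the issue describes. How much that helps in practice depends on filter selectivity and on the relative size of the families.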