[ https://issues.apache.org/jira/browse/HBASE-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-29107.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.3
                   2.5.12
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~junegunn] for contributing!

> shell: Improve 'count' performance
> ----------------------------------
>
>                 Key: HBASE-29107
>                 URL: https://issues.apache.org/jira/browse/HBASE-29107
>             Project: HBase
>          Issue Type: Improvement
>          Components: shell
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> I propose two changes to the 'count' command of HBase shell to improve its
> performance.
> h2. Not setting scanner caching
> The command currently sets the scanner caching to 10 rows by default, and
> instructs users to increase it if necessary. According to HBASE-2331, the
> default value was kept that low in case the table has large records.
> {quote}Default value of 10 is really slow, but should be kept as low for
> clients with huge rows.
> {quote}
> However, current versions of HBase have a better mechanism,
> {{hbase.client.scanner.max.result.size}}, which is 2MB by default. So
> just by not setting the scanner caching, we automatically get better
> performance, and we don't have to worry about huge rows.
> h3. Test
> {code:java}
> # Create table
> create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> # Insert data
> data = '_' * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
> (8 * 1024 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't', INTERVAL => 100000
> # 8388608 row(s)
> # Took 53.5826 seconds
> # Before patch with custom 'CACHE'
> count 't', INTERVAL => 100000, CACHE => 2000
> # 8388608 row(s)
> # Took 13.6717 seconds
> # After patch
> count 't', INTERVAL => 100000
> # 8388608 row(s)
> # Took 14.0911 seconds
> {code}
> The test was performed locally on my machine, so the difference in performance
> in a real cluster should be larger.
> h2. KeyOnlyFilter
> Another thing we can do is to apply {{KeyOnlyFilter}} as well, because we're
> not interested in the values. This helps when the records are large.
> h3. Test
> {code:java}
> create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> data = '_' * 1024 * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
> (8 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't2'
> # 8192 row(s)
> # Took 8.8952 seconds
> # After patch
> count 't2'
> # 8192 row(s)
> # Took 3.4052 seconds
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
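
As a rough back-of-the-envelope check on the first test, the sketch below estimates how many scanner round trips each setting needs for the 8388608-row table. It is plain Ruby, not HBase code, and rests on simplifying assumptions that are not in the issue: every row is counted as exactly 1024 bytes (key and cell metadata ignored), every fetch is filled up to its cap, and region boundaries are ignored. The helper names are hypothetical.

```ruby
# Rough estimate of scanner round trips. All sizes are simplifying assumptions.
ROWS = 8 * 1024 * 1024              # rows inserted in the first test
ROW_BYTES = 1024                    # value size per row; key/metadata overhead ignored
MAX_RESULT_BYTES = 2 * 1024 * 1024  # hbase.client.scanner.max.result.size default (2MB)

# Round trips when each fetch is capped at a fixed number of rows.
def round_trips_by_rows(rows, rows_per_fetch)
  (rows + rows_per_fetch - 1) / rows_per_fetch # integer ceiling division
end

# Round trips when each fetch is capped by a byte budget instead.
def round_trips_by_size(rows, row_bytes, max_bytes)
  rows_per_fetch = [max_bytes / row_bytes, 1].max # always return at least one row
  round_trips_by_rows(rows, rows_per_fetch)
end

cache_10 = round_trips_by_rows(ROWS, 10)
cache_2000 = round_trips_by_rows(ROWS, 2000)
size_limited = round_trips_by_size(ROWS, ROW_BYTES, MAX_RESULT_BYTES)

puts "CACHE => 10:    #{cache_10} round trips"
puts "CACHE => 2000:  #{cache_2000} round trips"
puts "2MB size limit: #{size_limited} round trips"
```

Under these assumptions, a caching of 10 needs roughly 200 times as many round trips as either of the other two settings, which need about the same number as each other. That is consistent with the timings reported in the issue, where the patched command and the unpatched command with CACHE => 2000 finish in about the same time, though the estimate says nothing about per-call cost on a real cluster.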