Junegunn Choi created HBASE-29107:
-------------------------------------

             Summary: shell: Improve 'count' performance
                 Key: HBASE-29107
                 URL: https://issues.apache.org/jira/browse/HBASE-29107
             Project: HBase
          Issue Type: Improvement
          Components: shell
            Reporter: Junegunn Choi

I propose two changes to the 'count' command of the HBase shell to improve its performance.

h2. Not setting scanner caching

The command currently sets the scanner caching to 10 rows by default and tells users to increase it if necessary. According to HBASE-2331, the default was kept low in case the table has large rows.

{quote}Default value of 10 is really slow, but should be kept as low for clients with huge rows.{quote}

However, current versions of HBase have a better mechanism, {{hbase.client.scanner.max.result.size}}, which is 2MB by default. So just by not setting the scanner caching, we automatically get better performance, and we don't have to worry about huge rows.

h3. Test

{code:java}
# Create table
create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}

# Insert data
data = '_' * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
(8 * 1024 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close

# Before patch
count 't', INTERVAL => 100000
# 8388608 row(s)
# Took 53.5826 seconds

# Before patch with custom 'CACHE'
count 't', INTERVAL => 100000, CACHE => 2000
# 8388608 row(s)
# Took 13.6717 seconds

# After patch
count 't', INTERVAL => 100000
# 8388608 row(s)
# Took 14.0911 seconds
{code}

The test was performed locally on my machine, so the difference in performance on a real cluster should be larger.

h2. KeyOnlyFilter

Another thing we can do is apply {{KeyOnlyFilter}} as well, because we're not interested in the values. This helps when the records are large.

h3. Test

{code:java}
# Create table
create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}

# Insert data (8192 rows with 1MB values)
data = '_' * 1024 * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
(8 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close

# Before patch
count 't2'
# 8192 row(s)
# Took 8.8952 seconds

# After patch
count 't2'
# 8192 row(s)
# Took 3.4052 seconds
{code}
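
To make the scanner caching argument concrete, here is a back-of-the-envelope sketch in plain Ruby (it does not need HBase) that estimates how many scanner round trips a full scan of the first test table needs. It is only a simplified model, not how HBase actually computes batches: it assumes each batch ends at the row limit or the byte limit, whichever comes first, and it ignores row key and cell overhead, region boundaries, and scanner time limits. The names {{rows_per_batch}} and {{round_trips}} are hypothetical helpers introduced for this illustration.

```ruby
# Simplified model (an assumption, not HBase's implementation): a scanner
# batch is cut off by the row limit ("caching") or by the byte limit
# (max result size), whichever is reached first.

DEFAULT_MAX_RESULT_SIZE = 2 * 1024 * 1024 # 2MB default mentioned above

# Estimated rows returned per round trip.
# caching: nil models "scanner caching not set", i.e. no row limit.
def rows_per_batch(row_bytes, caching: nil, max_result_size: DEFAULT_MAX_RESULT_SIZE)
  by_size = [max_result_size / row_bytes, 1].max # at least one row per batch
  caching ? [caching, by_size].min : by_size
end

# Estimated round trips needed to scan total_rows rows.
def round_trips(total_rows, row_bytes, caching: nil, max_result_size: DEFAULT_MAX_RESULT_SIZE)
  per_batch = rows_per_batch(row_bytes, caching: caching, max_result_size: max_result_size)
  (total_rows + per_batch - 1) / per_batch # ceiling division
end

total_rows = 8 * 1024 * 1024 # rows in table 't'
row_bytes  = 1024            # value size per row in table 't'

puts round_trips(total_rows, row_bytes, caching: 10)   # prints 838861
puts round_trips(total_rows, row_bytes, caching: 2000) # prints 4195
puts round_trips(total_rows, row_bytes)                # prints 4096
```

Under these assumptions, {{CACHE => 2000}} and the size-limited default yield a similar number of round trips, which is consistent with the two configurations taking roughly the same time in the first test. For the 1MB rows of the second table, the model gives about two rows per batch regardless of the caching value, so the improvement there would have to come from {{KeyOnlyFilter}} rather than from the caching change.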