[
https://issues.apache.org/jira/browse/HBASE-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang resolved HBASE-29107.
-------------------------------
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
Hadoop Flags: Reviewed
Resolution: Fixed
Pushed to all active branches.
Thanks [~junegunn] for contributing!
> shell: Improve 'count' performance
> ----------------------------------
>
> Key: HBASE-29107
> URL: https://issues.apache.org/jira/browse/HBASE-29107
> Project: HBase
> Issue Type: Improvement
> Components: shell
> Reporter: Junegunn Choi
> Assignee: Junegunn Choi
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> I propose two changes to the 'count' command of HBase shell to improve its
> performance.
> h2. Not setting scanner caching
> The command currently sets the scanner caching to 10 rows by default, and
> instructs the users to increase it if necessary. According to HBASE-2331, the
> default value was chosen as such in case the table has large records.
> {quote}Default value of 10 is really slow, but should be kept as low for
> clients with huge rows.
> {quote}
> However, current versions of HBase have a better mechanism,
> {{hbase.client.scanner.max.result.size}}, which is 2MB by default. So
> just by not setting the scanner caching, we automatically get better
> performance, and we don't have to worry about huge rows.
> h3. Test
> {code:ruby}
> # Create table
> create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> # Insert data
> data = '_' * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
> (8 * 1024 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't', INTERVAL => 100000
> # 8388608 row(s)
> # Took 53.5826 seconds
> # Before patch with custom 'CACHE'
> count 't', INTERVAL => 100000, CACHE => 2000
> # 8388608 row(s)
> # Took 13.6717 seconds
> # After patch
> count 't', INTERVAL => 100000
> # 8388608 row(s)
> # Took 14.0911 seconds
> {code}
> The test was performed locally on my machine, so the difference in performance
> in a real cluster should be larger.
> h2. KeyOnlyFilter
> Another thing we can do is to apply {{KeyOnlyFilter}} as well because we're
> not interested in the values. This helps when the records are large.
> h3. Test
> {code:ruby}
> create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> data = '_' * 1024 * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
> (8 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't2'
> # 8192 row(s)
> # Took 8.8952 seconds
> # After patch
> count 't2'
> # 8192 row(s)
> # Took 3.4052 seconds
> {code}
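For a rough sense of why the first test behaves as it does, the sketch below estimates the number of scan round trips for table 't' under each setting. It is only a back-of-the-envelope model and not part of the patch: it assumes each scan RPC returns rows until either the row-count limit (scanner caching) or the 2MB result-size limit is hit, and it ignores per-cell overhead and any time limits. The helper names rows_per_rpc and estimated_rpcs are made up for illustration.

```ruby
# Rough estimate of scan round trips for the first test table ('t').
# Assumption: an RPC is bounded by whichever limit is reached first,
# row count (scanner caching) or bytes (max result size). Row key,
# family, and timestamp overhead are ignored.

ROWS = 8 * 1024 * 1024               # rows inserted into 't'
VALUE_BYTES = 1024                   # size of each value
MAX_RESULT_BYTES = 2 * 1024 * 1024   # 2MB default cited in the description

# Hypothetical helper: rows returned per RPC under the simplified model.
# A nil caching value means the scanner caching is not set.
def rows_per_rpc(caching:, value_bytes:, max_result_bytes:)
  by_size = [max_result_bytes / value_bytes, 1].max
  caching ? [caching, by_size].min : by_size
end

# Hypothetical helper: ceiling division of total rows by rows per RPC.
def estimated_rpcs(rows, per_rpc)
  (rows + per_rpc - 1) / per_rpc
end

[10, 2000, nil].each do |caching|
  per_rpc = rows_per_rpc(caching: caching,
                         value_bytes: VALUE_BYTES,
                         max_result_bytes: MAX_RESULT_BYTES)
  label = caching ? "CACHE => #{caching}" : 'no CACHE'
  puts format('%-14s ~%d rows/RPC, ~%d RPCs',
              label, per_rpc, estimated_rpcs(ROWS, per_rpc))
end
```

Under these assumptions, a caching value of 10 needs roughly 200 times as many round trips as either of the other two settings, while CACHE => 2000 and the size-based limit land close to each other. That is consistent with the measured timings above, where the patched default and the custom CACHE run take about the same time.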
--
This message was sent by Atlassian Jira
(v8.20.10#820010)