[jira] [Resolved] (HBASE-29107) shell: Improve 'count' performance

Duo Zhang (Jira) Fri, 28 Feb 2025 09:06:51 -0800


     [ 
https://issues.apache.org/jira/browse/HBASE-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Duo Zhang resolved HBASE-29107.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.6.3
                   2.5.12
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~junegunn] for contributing!

> shell: Improve 'count' performance
> ----------------------------------
>
>                 Key: HBASE-29107
>                 URL: https://issues.apache.org/jira/browse/HBASE-29107
>             Project: HBase
>          Issue Type: Improvement
>          Components: shell
>            Reporter: Junegunn Choi
>            Assignee: Junegunn Choi
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.6.3, 2.5.12
>
>
> I propose two changes to the 'count' command of HBase shell to improve its 
> performance.
> h2. Not setting scanner caching
> The command currently sets the scanner caching to 10 rows by default, and 
> instructs the users to increase it if necessary. According to HBASE-2331, the 
> default value was chosen as such in case the table has large records.
> {quote}Default value of 10 is really slow, but should be kept as low for 
> clients with huge rows.
> {quote}
> However, current versions of HBase have a better mechanism, 
> {{hbase.client.scanner.max.result.size}}, which is 2MB by default. So 
> just by not setting the scanner caching, we automatically get better 
> performance, and we don't have to worry about huge rows.
> h3. Test
> {code:ruby}
> # Create table
> create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> # Insert data
> data = '_' * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
> (8 * 1024 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't', INTERVAL => 100000
>   # 8388608 row(s)
>   # Took 53.5826 seconds
> # Before patch with custom 'CACHE'
> count 't', INTERVAL => 100000, CACHE => 2000
>   # 8388608 row(s)
>   # Took 13.6717 seconds
> # After patch
> count 't', INTERVAL => 100000
>   # 8388608 row(s)
>   # Took 14.0911 seconds
> {code}
> The test was performed locally on my machine, so the difference in performance 
> in a real cluster should be larger.
> h2. KeyOnlyFilter
> Another thing we can do is to apply {{KeyOnlyFilter}} as well because we're 
> not interested in the values. This helps when the records are large.
> h3. Test
> {code:ruby}
> create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
> data = '_' * 1024 * 1024
> bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
> (8 * 1024).times do |i|
>   row = format('%010x', i).reverse.to_java_bytes
>   p = org.apache.hadoop.hbase.client.Put.new(row)
>   p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
>   bm.mutate(p)
> end
> bm.close
> # Before patch
> count 't2'
>   # 8192 row(s)
>   # Took 8.8952 seconds
> # After patch
> count 't2'
>   # 8192 row(s)
>   # Took 3.4052 seconds
> {code}
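
The timings quoted above can be related to a rough count of scan round trips. The following is a minimal sketch in plain Ruby, not HBase code: it assumes each scan RPC returns rows until either the caching row limit or the 2MB size limit is reached, and it ignores row key and per-cell overhead, so real batches would be somewhat smaller. The helper names rows_per_rpc and round_trips are hypothetical and exist only for this illustration.

```ruby
# Simplified model, not HBase code: estimates scan RPC round trips and
# value bytes for the two test tables described in the issue.

# Default of hbase.client.scanner.max.result.size (2MB), as stated in the issue.
MAX_RESULT_SIZE = 2 * 1024 * 1024

# Rows returned per scan RPC: bounded by the caching row limit (if set)
# and by how many rows fit in the size limit. Always at least one row.
# Assumption: row size is approximated by the value size alone.
def rows_per_rpc(row_bytes, caching: nil, max_result_size: MAX_RESULT_SIZE)
  by_size = [max_result_size / row_bytes, 1].max
  caching ? [caching, by_size].min : by_size
end

# Round trips needed to scan the whole table (ceiling division).
def round_trips(total_rows, row_bytes, caching: nil)
  per_rpc = rows_per_rpc(row_bytes, caching: caching)
  (total_rows + per_rpc - 1) / per_rpc
end

# Table 't': 8 * 1024 * 1024 rows with 1KB values.
rows_t = 8 * 1024 * 1024
with_default_cache = round_trips(rows_t, 1024, caching: 10)
with_custom_cache = round_trips(rows_t, 1024, caching: 2000)
size_limit_only = round_trips(rows_t, 1024)
puts "t: caching 10 => #{with_default_cache} round trips"
puts "t: caching 2000 => #{with_custom_cache} round trips"
puts "t: size limit only => #{size_limit_only} round trips"

# Table 't2': 8 * 1024 rows with 1MB values. A count only needs the row
# keys, so these value bytes are what a key-only scan avoids returning.
rows_t2 = 8 * 1024
value_bytes_t2 = rows_t2 * 1024 * 1024
puts "t2: value bytes not needed for counting: #{value_bytes_t2}"
```

Under these assumptions, a caching value of 2000 and the size limit alone give a similar number of round trips for table 't', which is in line with the similar timings reported for those two runs.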



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
