Junegunn Choi created HBASE-29107:
-------------------------------------

             Summary: shell: Improve 'count' performance
                 Key: HBASE-29107
                 URL: https://issues.apache.org/jira/browse/HBASE-29107
             Project: HBase
          Issue Type: Improvement
          Components: shell
            Reporter: Junegunn Choi


I propose two changes to the 'count' command of the HBase shell to improve its 
performance.
h2. Not setting scanner caching

The command currently sets the scanner caching to 10 rows by default, and 
instructs users to increase it if necessary. According to HBASE-2331, the 
default was kept this low in case the table has large rows.
{quote}Default value of 10 is really slow, but should be kept as low for 
clients with huge rows.
{quote}
However, current versions of HBase have a better mechanism, 
{{hbase.client.scanner.max.result.size}}, which limits the size of each scan 
response and is 2MB by default. So just by not setting the scanner caching, we 
automatically get better performance, and we don't have to worry about huge rows.
h3. Test
{code:java}
# Create table
create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}

# Insert data
data = '_' * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
(8 * 1024 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close

# Before patch
count 't', INTERVAL => 100000
  # 8388608 row(s)
  # Took 53.5826 seconds

# Before patch with custom 'CACHE'
count 't', INTERVAL => 100000, CACHE => 2000
  # 8388608 row(s)
  # Took 13.6717 seconds

# After patch
count 't', INTERVAL => 100000
  # 8388608 row(s)
  # Took 14.0911 seconds
{code}
The test was performed locally on my machine, so the difference in performance 
in a real cluster should be larger.
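
The timings above are consistent with a simple round-trip model. The sketch below is plain Ruby (no HBase classes) and only estimates how many scanner fetches a full count needs under each setting. It assumes every row costs about 1024 bytes and ignores key and cell overhead, region boundaries, and everything else the real client does; the method names are made up for illustration.

```ruby
# Rough model of how many scanner fetches a full-table count needs.
# Assumption: each row costs ROW_BYTES; key and cell overhead are ignored.

ROWS = 8 * 1024 * 1024              # rows inserted into 't' in the test
ROW_BYTES = 1024                    # approximate value size per row (assumed)
MAX_RESULT_SIZE = 2 * 1024 * 1024   # 2MB default of hbase.client.scanner.max.result.size

# Fetches needed when a fixed number of rows is returned per call.
def round_trips_by_caching(rows, caching)
  (rows + caching - 1) / caching
end

# Fetches needed when each call is filled up to a byte budget.
def round_trips_by_size(rows, row_bytes, max_result_size)
  rows_per_call = [max_result_size / row_bytes, 1].max
  (rows + rows_per_call - 1) / rows_per_call
end

puts round_trips_by_caching(ROWS, 10)                        # CACHE => 10
puts round_trips_by_caching(ROWS, 2000)                      # CACHE => 2000
puts round_trips_by_size(ROWS, ROW_BYTES, MAX_RESULT_SIZE)   # size limit only
```

Under these assumptions the size limit alone yields roughly the same number of fetches as CACHE => 2000, which would explain why the two timings are so close, while the default of 10 needs about 200 times as many.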
h2. KeyOnlyFilter

Another thing we can do is to apply {{KeyOnlyFilter}} as well because we're not 
interested in the values. This helps when the records are large.
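
To illustrate why this matters for large records, here is a toy model in plain Ruby of what a key-only scan leaves in each cell. The Cell struct and the key_only method are hypothetical stand-ins, not HBase classes, and the byte counts ignore HBase's own per-cell overhead.

```ruby
# Toy model of a key-only scan: keep each cell's key, drop its value.
# Cell is a hypothetical stand-in, not an HBase class.
Cell = Struct.new(:row, :family, :qualifier, :value) do
  # Payload bytes carried by this cell (HBase's own overhead is ignored).
  def bytesize
    row.bytesize + family.bytesize + qualifier.bytesize + value.bytesize
  end
end

# Return a copy of the cell with an empty value, as a key-only filter would.
def key_only(cell)
  Cell.new(cell.row, cell.family, cell.qualifier, '')
end

# One row shaped like the 't2' test data: 10-character row key, 1 MiB value.
cell = Cell.new(format('%010x', 1).reverse, 'd', '', '_' * 1024 * 1024)
stripped = key_only(cell)

puts cell.bytesize      # payload bytes with the value
puts stripped.bytesize  # payload bytes with the key only
```

Counting only needs to know that a row exists, so the stripped cell carries everything the command uses.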
h3. Test
{code:java}
create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}

data = '_' * 1024 * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
(8 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close

# Before patch
count 't2'
  # 8192 row(s)
  # Took 8.8952 seconds

# After patch
count 't2'
  # 8192 row(s)
  # Took 3.4052 seconds
{code}


