Junegunn Choi created HBASE-29107:
-------------------------------------
Summary: shell: Improve 'count' performance
Key: HBASE-29107
URL: https://issues.apache.org/jira/browse/HBASE-29107
Project: HBase
Issue Type: Improvement
Components: shell
Reporter: Junegunn Choi
I propose two changes to the 'count' command of the HBase shell to improve its
performance.
h2. Not setting scanner caching
The command currently sets scanner caching to 10 rows by default and tells
users to increase it if necessary. According to HBASE-2331, this low default
was chosen in case the table has large records.
{quote}Default value of 10 is really slow, but should be kept as low for
clients with huge rows.
{quote}
However, current versions of HBase have a better mechanism,
{{hbase.client.scanner.max.result.size}}, which is 2MB by default. It limits
each scanner response by size in bytes rather than by row count. So just by not
setting the scanner caching, we automatically get better performance, and we
don't have to worry about huge rows.
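To make the difference concrete, here is a rough back-of-the-envelope estimate of how many scanner round trips each setting needs for the test table below. It is only a sketch: it assumes each row costs exactly its 1 KB value, ignoring key and per-cell overhead and region boundaries, so the real numbers will differ. The method names are made up for this illustration.

```ruby
# Rough estimate only: assumes every row costs exactly value_bytes,
# ignoring key/cell overhead and region boundaries.

# Integer division rounded up.
def ceil_div(a, b)
  (a + b - 1) / b
end

# Round trips when each call returns at most `caching` rows.
def rpcs_with_row_caching(rows, caching)
  ceil_div(rows, caching)
end

# Round trips when each call is capped by bytes instead of row count.
def rpcs_with_size_limit(rows, value_bytes, max_result_size)
  rows_per_call = [max_result_size / value_bytes, 1].max
  ceil_div(rows, rows_per_call)
end

rows            = 8 * 1024 * 1024   # rows in table 't'
value_bytes     = 1024              # size of each value
max_result_size = 2 * 1024 * 1024   # default hbase.client.scanner.max.result.size

puts rpcs_with_row_caching(rows, 10)                          # 838861
puts rpcs_with_size_limit(rows, value_bytes, max_result_size) # 4096
```

Under these assumptions the size-based limit needs roughly 200 times fewer round trips than a caching value of 10, which is in line with the direction of the measurements below.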
h3. Test
{code:java}
# Create table
create 't', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
# Insert data
data = '_' * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t'))
(8 * 1024 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close
# Before patch
count 't', INTERVAL => 100000
# 8388608 row(s)
# Took 53.5826 seconds
# Before patch with custom 'CACHE'
count 't', INTERVAL => 100000, CACHE => 2000
# 8388608 row(s)
# Took 13.6717 seconds
# After patch
count 't', INTERVAL => 100000
# 8388608 row(s)
# Took 14.0911 seconds
{code}
The test was performed locally on my machine, so the difference in performance
in a real cluster should be larger.
h2. KeyOnlyFilter
Another thing we can do is to apply {{KeyOnlyFilter}} as well, because we're
not interested in the values. This helps when the records are large.
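As a rough illustration of why this matters for large records, the sketch below compares the value bytes a count has to move for the test table below against the row key bytes alone. It is an estimate under simplifying assumptions: it counts only the 10-byte row key per row and ignores family, timestamp, and other per-cell overhead. The method names are made up for this illustration.

```ruby
# Rough estimate only: counts value bytes vs. row key bytes and ignores
# family, timestamp, and other per-cell overhead.

# Total bytes of values the scan would return without KeyOnlyFilter.
def value_payload_bytes(rows, value_bytes)
  rows * value_bytes
end

# Total bytes of row keys, which is roughly what remains once values are stripped.
def key_payload_bytes(rows, row_key_bytes)
  rows * row_key_bytes
end

rows          = 8 * 1024       # rows in table 't2'
value_bytes   = 1024 * 1024    # 1 MB value per row
row_key_bytes = 10             # length of format('%010x', i)

puts value_payload_bytes(rows, value_bytes)   # 8589934592 (8 GiB)
puts key_payload_bytes(rows, row_key_bytes)   # 81920 (80 KiB)
```

The server still has to read the cells, so the saving shows up mainly in what is sent back to the shell, not in what is read from storage.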
h3. Test
{code:java}
create 't2', 'd', {NUMREGIONS => 4, SPLITALGO => 'HexStringSplit'}
data = '_' * 1024 * 1024
bm = @hbase.connection.getBufferedMutator(TableName.valueOf('t2'))
(8 * 1024).times do |i|
  row = format('%010x', i).reverse.to_java_bytes
  p = org.apache.hadoop.hbase.client.Put.new(row)
  p.addColumn('d'.to_java_bytes, ''.to_java_bytes, data.to_java_bytes)
  bm.mutate(p)
end
bm.close
# Before patch
count 't2'
# 8192 row(s)
# Took 8.8952 seconds
# After patch
count 't2'
# 8192 row(s)
# Took 3.4052 seconds
{code}