Your analysis seems pretty accurate so far. Ultimately, it sounds like
your SAN is the bottleneck here.
You were able to work around the slow WAL syncs by skipping the WAL entirely
(never forget that this means the data you wrote to HBase is *not* guaranteed
to be there).
It sounds like compactions are the next bottleneck for you.
Specifically, your compactions can't complete fast enough to drive down
the number of storefiles you have.
You have two straightforward approaches to try:
1. Increase the number of compaction threads inside your regionservers.
hbase.regionserver.thread.compaction.small is likely the one you want to
increase first. Eventually, you may also need to increase
hbase.regionserver.thread.compaction.large.
2. Increase hbase.client.retries.number to a larger value and/or increase
hbase.client.pause so that the client will retry more times before giving
up, or wait longer in between retry attempts. (Example settings for both
options are sketched below.)
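For example, the compaction thread counts go in hbase-site.xml on the
regionservers (the values here are only illustrative starting points for
your cluster, not recommendations; the regionservers need a restart to
pick them up):

<property>
  <name>hbase.regionserver.thread.compaction.small</name>
  <value>4</value>
</property>
<property>
  <name>hbase.regionserver.thread.compaction.large</name>
  <value>2</value>
</property>

The client-side retry settings can be passed straight to the Import job
as -D overrides, the same way you pass the SKIP_WAL flag (again,
illustrative numbers; hbase.client.pause is in milliseconds):

yarn jar $HBASE_HOME/lib/hbase-mapreduce-2.2.5.jar import \
  -Dhbase.client.retries.number=50 \
  -Dhbase.client.pause=500 \
  ...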
Of course, you can also change your application (the Import m/r job)
such that you can inject sleeps, but I assume you don't want to do that.
We don't expose an option in that job (to my knowledge) that would
inject slowdowns.
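If you ever did decide to go that route, a rough, untested sketch of what
it could look like (this assumes Import.Importer is accessible in your
HBase version, and "import.throttle.ms" is a made-up property name, not a
real one):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.Import;

// Hypothetical mapper that delegates to the stock Import mapper and then
// sleeps, capping the write rate of each mapper task.
public class ThrottledImporter extends Import.Importer {
  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException {
    super.map(row, value, context);
    // made-up property; would be passed as -Dimport.throttle.ms=<millis>
    long sleepMs = context.getConfiguration().getLong("import.throttle.ms", 0L);
    if (sleepMs > 0) {
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
  }
}

You would also need your own small driver that sets up the job the same way
Import does but calls job.setMapperClass(ThrottledImporter.class), since the
stock tool doesn't let you swap the mapper - which is exactly the application
change I mentioned above.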
On 4/28/21 7:56 AM, Udo Offermann wrote:
Hello everybody
We are migrating from HBase 1.0 to HBase 2.2.5 and are observing problems importing
data into the new HBase 2 cluster. The HBase clusters are connected to a SAN.
For the import we are using the standard HBase Import tool (i.e. no bulk load).
We tested the import several times on the HBase 1.0 cluster and never faced any
problems.
The problem we observe is: org.apache.hadoop.hbase.RegionTooBusyException
In the log files of the region servers we found
regionserver.MemStoreFlusher: ... has too many store files
It seems that other people have faced similar problems to those described in this
blog post: https://gbif.blogspot.com/2012/07/optimizing-writes-in-hbase.html
However, the solution provided there does not help in our case (especially
increasing hbase.hstore.blockingStoreFiles).
In fact, the overall problem seems to be that the Import mappers are too fast
for the region servers, so that they cannot flush and compact the HFiles in
time, even though they stop accepting further writes once
the value of hbase.hstore.blockingStoreFiles is exceeded.
Increasing hbase.hstore.blockingStoreFiles means that the region server is
allowed to keep more HFiles, but as long as the write throughput of the mappers
stays that high, the region server will never be able to flush and compact the
written data in time, so that in the end the region servers are too busy and
are finally treated as crashed!
IMHO it simply comes down to the point that the incoming rate (mapper write
operations) > processing rate (writing to the MemStore, flushes and compactions),
which always leads to disaster - if I remember my queueing theory lecture at
university correctly ;-)
We also found lots of "Slow sync cost" messages in the logs, so we also turned
off the WAL for the import:
yarn jar $HBASE_HOME/lib/hbase-mapreduce-2.2.5.jar import
-Dimport.wal.durability=SKIP_WAL …
which eliminated the "Slow sync cost" messages but it didn't solve our overall
problem.
So my question is: isn’t there a way to somehow slow down the import mapper so
that the incoming rate < region server’s processing rate?
Are there other possibilities that we can try? One thing that might help (at
least for the import scenario) is using bulk load, but the question is whether
other scenarios with a high write load will lead to similar problems!
Best regards
Udo