Hello,

We've noticed an issue in our HBase cluster where one of the region-servers 
experiences a spike in I/O wait accompanied by a spike in load on that node. As a 
result, our request times to the cluster increase dramatically. Initially, we 
suspected hotspotting, but the issue persisted even after temporarily blocking 
requests to the highest-volume regions on that region-server. Moreover, the 
request counts for the regions on that region-server shown in the HBase UI were 
not particularly high, and our own application-level metrics on the requests we 
were making were not very high either. From a thread dump of the region-server, 
it appears that our get and scan requests are getting stuck reading blocks from 
our bucket cache, leaving the threads in a 'runnable' state.

For context, we are running HBase 1.3.0 on an S3-backed cluster on EMR, and our 
bucket cache runs in file mode. Our region-servers all have SSDs. We use a 
combined cache: the standard L1 LRU cache plus the L2 file-mode bucket cache. 
Our bucket cache utilization is less than 50% of the allocated space.

We suspect that part of the issue is disk space utilization on the 
region-server, since our maximum disk space utilization also increased while 
this was happening. What can we do to minimize disk space utilization? The 
actual HFiles are on S3 -- only the bucket cache, application logs, and 
write-ahead logs are on the region-servers. Aside from disk space utilization, 
what other factors could cause high I/O wait in HBase, and is there anything we 
can do to minimize it?

Right now, the only thing that works is terminating and recreating the cluster 
(which we can do safely because it's S3-backed).

Thanks!
Srinidhi
