Hi Srinidhi,

Thanks for the details. As Stack said, can you get a thread dump and I/O
stats while the issue is happening? You can then compare them with the
same data from an RS that is in good shape.
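
If it helps, something along the lines of the script below could grab
matching snapshots -- just a rough sketch, not tested on your setup: it
assumes jps, jstack and iostat are on the PATH, that the RS shows up as
HRegionServer in jps, and it writes to a made-up /tmp/rs-diagnostics
directory. Run it on the slow RS and on a healthy one so the outputs
line up.

#!/usr/bin/env python
# Rough collection helper: take a few paired jstack + iostat snapshots so a
# stuck RS can be compared against a healthy one. Assumes jps, jstack and
# iostat are installed and on the PATH.
import os
import subprocess
import time

OUT_DIR = "/tmp/rs-diagnostics"   # made-up output location, change as needed
SAMPLES = 5                       # a handful of snapshots is usually enough
INTERVAL = 10                     # seconds between snapshots

def region_server_pid():
    # jps lists running JVMs; the region server main class is HRegionServer
    for line in subprocess.check_output(["jps"]).decode().splitlines():
        pid, _, name = line.partition(" ")
        if "HRegionServer" in name:
            return pid
    raise RuntimeError("no HRegionServer process found")

def main():
    os.makedirs(OUT_DIR, exist_ok=True)
    pid = region_server_pid()
    for _ in range(SAMPLES):
        stamp = time.strftime("%Y%m%d-%H%M%S")
        with open(os.path.join(OUT_DIR, "jstack-%s.txt" % stamp), "w") as f:
            subprocess.call(["jstack", pid], stdout=f)
        with open(os.path.join(OUT_DIR, "iostat-%s.txt" % stamp), "w") as f:
            subprocess.call(["iostat", "-dxm", "1", "3"], stdout=f)
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()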

If SSD reads and writes are indeed the reason the bucket cache reads are
slow, it might be better to put the bucket cache on a separate SSD. But
let's first check the dumps to confirm that this is the real cause.
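
Once you have the iostat captures, a quick way to see whether the device
holding the bucket cache file is the bottleneck is to check its %util
column. A rough sketch below, assuming the usual sysstat layout where
%util is the last column of 'iostat -dx' output; the device name nvme1n1
is only a placeholder for whatever disk actually backs your cache file.

#!/usr/bin/env python
# Quick reader for the iostat files captured above: flags samples where the
# device backing the bucket cache file looks saturated. Assumes the stock
# sysstat "iostat -dx" layout with %util as the last column.
import sys

BUCKET_CACHE_DEVICE = "nvme1n1"   # placeholder; use the device under your cache file

def flag_saturation(path):
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == BUCKET_CACHE_DEVICE:
                util = float(parts[-1])   # %util: share of time the device was busy
                if util > 90.0:
                    print("%s: %s is %.1f%% busy -- likely saturated"
                          % (path, BUCKET_CACHE_DEVICE, util))

if __name__ == "__main__":
    for p in sys.argv[1:]:
        flag_saturation(p)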

Regards
Ram

On Fri, Mar 29, 2019 at 3:11 AM Stack <[email protected]> wrote:

> Mind putting up a thread dump?
>
> How many spindles?
>
> If you compare the i/o stats between a good RS and a stuck one, how do they
> compare?
>
> Thanks,
> S
>
>
> On Wed, Mar 27, 2019 at 11:57 AM Srinidhi Muppalla <[email protected]>
> wrote:
>
> > Hello,
> >
> > We've noticed an issue in our HBase cluster where one of the
> > region-servers has a spike in I/O wait associated with a spike in Load
> > for that node. As a result, our request times to the cluster increase
> > dramatically. Initially, we suspected that we were experiencing
> > hotspotting, but even after temporarily blocking requests to the
> > highest-volume regions on that region-server, the issue persisted.
> > Moreover, when looking at request counts to the regions on the
> > region-server from the HBase UI, they were not particularly high, and
> > our own application-level metrics on the requests we were making were
> > not very high either. From looking at a thread dump of the
> > region-server, it appears that our get and scan requests are getting
> > stuck when trying to read from the blocks in our bucket cache, leaving
> > the threads in a 'runnable' state. For context, we are running HBase
> > 1.3.0 on a cluster backed by S3 running on EMR, and our bucket cache
> > is running in file mode. Our region-servers all have SSDs. We have a
> > combined cache with the L1 standard LRU cache and the L2 file-mode
> > bucket cache. Our bucket cache utilization is less than 50% of the
> > allocated space.
> >
> > We suspect that part of the issue is our disk space utilization on the
> > region-server as our max disk space utilization also increased as this
> > happened. What things can we do to minimize disk space utilization? The
> > actual HFiles are on S3 -- only the cache, application logs, and write
> > ahead logs are on the region-servers. Other than the disk space
> > utilization, what factors could cause high I/O wait in HBase and is there
> > anything we can do to minimize it?
> >
> > Right now, the only thing that works is terminating and recreating the
> > cluster (which we can do safely because it's S3 backed).
> >
> > Thanks!
> > Srinidhi
> >
>
