Hi Srinidhi
You are running a file-mode bucket cache. What is the size of the cache?
Did you configure a single file path for the cache, or more than one? If it
is a single path, splitting the cache across multiple files (the paths can
be given as a comma-separated list in the config) may help.
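As a rough sketch only (the exact ioengine prefix depends on your HBase
version, and the paths and size below are placeholders, not recommendations):

    hbase.bucketcache.ioengine = files:/mnt1/hbase/bucketcache,/mnt2/hbase/bucketcache
    hbase.bucketcache.size     = 16384

On versions that only accept a single file, the value would instead look
like file:/mnt1/hbase/bucketcache.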
Anoop
On Fri, Apr 5, 2019 at 2:58 AM Srinidhi Muppalla <[email protected]>
wrote:
> After some more digging, I discovered that during the time that the RS is
> stuck the kernel message buffer outputted only this message
>
> "[1031214.108110] XFS: java(6522) possible memory allocation deadlock size
> 32944 in kmem_alloc (mode:0x2400240)"
>
> From my reading online, this error generally points to excessive memory and
> file fragmentation. We haven't changed the MSLAB config, and since we are
> running HBase 1.3.0 it should be enabled by default. The issue tends to arise
> consistently and regularly (every 10 or so days), and once one node is
> affected, other nodes start to follow after a few hours. What could be causing
> this to happen, and is there any way to prevent or minimize fragmentation?
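>
> (For reference, the MSLAB default we are relying on is, as far as I can tell,
> just this setting, which we have not overridden:
>
>     hbase.hregion.memstore.mslab.enabled = true
>
> so the chunk allocator should be active out of the box.)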
>
> Best,
> Srinidhi
>
> On 3/29/19, 11:02 AM, "Srinidhi Muppalla" <[email protected]> wrote:
>
> Stack and Ram,
>
> Attached the thread dumps. 'Jstack normal' is the normal node. 'Jstack
> problematic' was taken when the node was stuck.
>
> We don't have full I/O stats for the problematic node. Unfortunately,
> it was impacting production, so we had to recreate the cluster as soon as
> possible and couldn't get full data. I attached the dashboards with the
> I/O wait and other CPU stats. Thanks for helping look into the issue!
>
> Best,
> Srinidhi
>
>
>
> On 3/28/19, 2:41 PM, "Stack" <[email protected]> wrote:
>
> Mind putting up a thread dump?
>
> How many spindles?
>
> If you compare the i/o stats between a good RS and a stuck one, how do
> they compare?
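>
> Something along these lines would be plenty (the PID and sampling interval
> are just placeholders):
>
>     jstack -l <regionserver-pid> > rs-threads.txt
>     iostat -x 5 12 > rs-iostat.txt
>
> run on both a healthy RS and the stuck one while the problem is happening.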
>
> Thanks,
> S
>
>
> On Wed, Mar 27, 2019 at 11:57 AM Srinidhi Muppalla <[email protected]>
> wrote:
>
> > Hello,
> >
> > We've noticed an issue in our HBase cluster where one of the
> > region-servers has a spike in I/O wait associated with a spike in Load
> > for that node. As a result, our request times to the cluster increase
> > dramatically. Initially, we suspected that we were experiencing
> > hotspotting, but even after temporarily blocking requests to the
> > highest-volume regions on that region-server, the issue persisted.
> > Moreover, when looking at request counts to the regions on the
> > region-server from the HBase UI, they were not particularly high, and
> > our own application-level metrics on the requests we were making were
> > not very high either. From looking at a thread dump of the
> > region-server, it appears that our get and scan requests are getting
> > stuck when trying to read from the blocks in our bucket cache, leaving
> > the threads in a 'runnable' state. For context, we are running HBase
> > 1.3.0 on a cluster backed by S3 running on EMR, and our bucket cache is
> > running in File mode. Our region-servers all have SSDs. We have a
> > combined cache with the L1 standard LRU cache and the L2 file-mode
> > bucket cache. Our bucket cache utilization is less than 50% of the
> > allocated space.
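> >
> > (Roughly, the cache-related settings look like the following -- the path
> > and sizes here are placeholders rather than our exact values:
> >
> >     hbase.bucketcache.ioengine = file:/mnt/hbase/bucketcache
> >     hbase.bucketcache.size     = 16384
> >     hfile.block.cache.size     = 0.2
> >
> > with the combined L1/L2 cache behavior left at its defaults.)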
> >
> > We suspect that part of the issue is our disk space utilization on the
> > region-server, as our max disk space utilization also increased as this
> > happened. What things can we do to minimize disk space utilization? The
> > actual HFiles are on S3 -- only the cache, application logs, and
> > write-ahead logs are on the region-servers. Other than disk space
> > utilization, what factors could cause high I/O wait in HBase, and is
> > there anything we can do to minimize it?
> >
> > Right now, the only thing that works is terminating and recreating the
> > cluster (which we can do safely because it's S3 backed).
> >
> > Thanks!
> > Srinidhi
> >
>