Hi,

We are using HBase 1.0 on CDH 5.5.2. We have taken great care to avoid
hotspotting due to inadvertent data patterns by prepending an MD5-based
4-digit hash prefix to all our data keys. This works fine most of the time,
but recently, more and more often (as much as once or twice a day), one
region server suddenly becomes "hot" (CPU around or above 95% in various
monitoring tools). When it happens, it lasts for hours. Occasionally the
hotspot jumps to another region server when the master decides the region
is unresponsive and gives it to another server.
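
To illustrate the kind of key prefixing described above, here is a minimal
sketch. It assumes the prefix is the first four hex digits of the MD5 digest
of the original key; the class and method names are hypothetical and the
actual scheme in use may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not taken from the actual application code.
class SaltedKey {

    // Returns the key with a 4-hex-digit prefix derived from its MD5 digest.
    // Assumption: "4 digit hash prefix" means the first two digest bytes in hex.
    static String prefixedKey(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            String prefix = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
            return prefix + key;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(prefixedKey("user123"));
    }
}
```

Because the prefix is a deterministic function of the key, a reader can
recompute it for a Get, while lexicographically adjacent keys end up
scattered across the key space.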

For the longest time, we thought this must be some single rogue key in our
input data that is being hammered. All attempts to track this down have
failed though, and the following behavior argues against this being
application based:

1. Plotting Get and Put rates by region on the "hot" region server in
Cloudera Manager Charts shows that no single region is an outlier.

2. Cleanly restarting just the region server process causes its regions to
randomly migrate to other region servers, after which it gets new ones from
the HBase master (basically a sort of shuffling), and the hotspot goes away.
If it were application based, you'd expect the hotspot to just jump to
another region server.

3. We have pored through the region server logs and can't see anything out
of the ordinary happening.

The only other pertinent thing to mention might be that we have a special
process of our own running outside the cluster that does cluster-wide major
compaction in a rolling fashion, where each batch consists of one region
from each region server, and it waits until one batch is completely done
before starting another. We have seen no real impact on the hotspot from
shutting this down and in normal times it doesn't impact our read or write
performance much.
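
The batching used by the compaction process can be sketched roughly as
follows. This is only an illustration of the "one region from each region
server per batch" idea: the class name, the method name, and the server and
region names are hypothetical, and the step that actually triggers major
compaction and waits for it to finish is left as a comment.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the batching logic only; no HBase calls are made.
class RollingCompactionPlan {

    // Batch i contains the i-th region of every server that still has one,
    // so each batch holds at most one region per region server.
    static List<List<String>> batches(Map<String, List<String>> regionsByServer) {
        int longest = 0;
        for (List<String> regions : regionsByServer.values()) {
            longest = Math.max(longest, regions.size());
        }
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < longest; i++) {
            List<String> batch = new ArrayList<>();
            for (List<String> regions : regionsByServer.values()) {
                if (i < regions.size()) {
                    batch.add(regions.get(i));
                }
            }
            result.add(batch);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> regionsByServer = new LinkedHashMap<>();
        regionsByServer.put("rs1", List.of("regionA", "regionB"));
        regionsByServer.put("rs2", List.of("regionC"));
        for (List<String> batch : batches(regionsByServer)) {
            // Here the real process would request major compaction for every
            // region in the batch and wait for all of them to complete
            // before moving on to the next batch.
            System.out.println(batch);
        }
    }
}
```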

We are at our wits' end. Does anyone have experience with a scenario like this?
Any help/guidance would be most appreciated.

-----
Saad
