Someone in our team found this:

http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101

It looks like we've been bitten by this bug. Unfortunately, it is only
fixed in HBase 1.4.0, so we'll have to undertake a version upgrade, which
is not trivial.
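
Until an upgrade is possible, the "Perform GC" workaround described in the
original message below could in principle be scripted instead of clicked by
hand in VisualVM. The following is only a minimal sketch, assuming remote
JMX is enabled on the region server. The class name RemoteGcTrigger, the
default host, and the default port 10102 are placeholders for illustration,
not values taken from this thread.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RemoteGcTrigger {

    // Builds the standard RMI-based JMX service URL for a host and port.
    static String jmxUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    // Asks the JVM behind the connection to run a GC via its Memory MXBean;
    // this is the same java.lang:type=Memory "gc" operation a JMX console exposes.
    static void requestGc(MBeanServerConnection conn) throws IOException {
        MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        memory.gc();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults (assumptions): pass the real host and JMX port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 10102;

        JMXServiceURL url = new JMXServiceURL(jmxUrl(host, port));
        // JMXConnector is Closeable, so try-with-resources releases the connection.
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            requestGc(connector.getMBeanServerConnection());
            System.out.println("GC requested on " + host + ":" + port);
        }
    }
}
```

This would be run as, for example, `java RemoteGcTrigger <regionserver-host>
<jmx-port>`. Two caveats: the request is a no-op if the target JVM runs with
-XX:+DisableExplicitGC, and with G1 an explicit GC is a full stop-the-world
collection unless -XX:+ExplicitGCInvokesConcurrent is set, so this only
automates the workaround and does not address the underlying bug.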

-----
Saad


On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni
<sbpothin...@gmail.com> wrote:

> The first obvious thing to check: is a major compaction happening at the
> same time the CPU goes to 100%?
> See if this helps:
> https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
>
>
>
> Sent from my iPhone
>
> > On Mar 1, 2017, at 6:06 AM, Saad Mufti <saad.mu...@teamaol.com> wrote:
> >
> > Hi,
> >
> > We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase
> > is heavy and a mix of reads and writes. For a few months we have had a
> > problem where occasionally (once a day or more) one of the region servers
> > starts consuming close to 100% CPU. This causes the entire client thread
> > pool to fill up serving the slow region server, so overall response
> > times slow to a crawl and many calls start timing out, either right in
> > the client or at a higher level.
> >
> > We have done a lot of analysis and looked at various metrics but could
> > never pin it down to any particular kind of traffic or specific "hot
> > keys". Looking at region server logs has not turned up anything. The
> > only vague evidence we have is that, from the reported metrics, reads
> > per second on the hot server look higher than on the others, not at a
> > steady level but in a spiky yet persistent fashion, while gets per
> > second look no different from any other server.
> >
> > Until now, our hacky workaround has been to just restart the region
> > server. This works because, while some calls error out while the
> > regions are in transition, this is a batch-oriented system with a
> > retry strategy built in.
> >
> > But just yesterday we discovered something interesting: if we connect
> > to the region server in VisualVM and press the "Perform GC" button,
> > there seems to be a brief pause and then CPU settles back down to
> > normal. This is despite the fact that memory appears to be under no
> > pressure and that, before we do this, VisualVM indicates a very low
> > percentage of CPU time spent in GC. So we're baffled, and hoping
> > someone with deeper insight into the HBase code could explain this
> > behavior.
> >
> > Our region server processes are configured with 32GB of RAM and the
> > following GC-related JVM settings:
> >
> > HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
> > -XX:MaxGCPauseMillis=100
> > -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14
> > -XX:InitiatingHeapOccupancyPercent=70
> >
> > Any insight anyone can provide would be most appreciated.
> >
> > ----
> > Saad
>
