Someone in our team found this: http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101
Looks like we're bitten by this bug. Unfortunately this is only fixed in
HBase 1.4.0 so we'll have to undertake a version upgrade which is not
trivial.

-----
Saad

On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni <sbpothin...@gmail.com> wrote:

> First obvious thing to check is "major compaction" happening at the same
> time when it goes to 100% CPU?
> See this helps:
> https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
>
> Sent from my iPhone
>
> > On Mar 1, 2017, at 6:06 AM, Saad Mufti <saad.mu...@teamaol.com> wrote:
> >
> > Hi,
> >
> > We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase
> > is heavy and a mix of reads and writes. For a few months we have had a
> > problem where occasionally (once a day or more) one of the region servers
> > starts consuming close to 100% CPU. This causes all the client thread pool
> > to get filled up serving the slow region server, causing overall response
> > times to slow to a crawl and many calls either start timing out right in
> > the client, or at a higher level.
> >
> > We have done lots of analysis and looked at various metrics but could never
> > pin it down to any particular kind of traffic or specific "hot keys".
> > Looking at region server logs has not resulted in any findings. The only
> > sort of vague evidence we have is that from the reported metrics, reads per
> > second on the hot server looks more than the other but not in a steady
> > state but in a spiky but steady fashion, but gets per second looks no
> > different than any other server.
> >
> > Until now our hacky way that we discovered to get around this was to just
> > restart the region server. This works because while some calls error out
> > while the regions are in transition, this is a batch oriented system with a
> > retry strategy built in.
> >
> > But just yesterday we discovered something interesting, if we connect to
> > the region server in VisualVM and press the "Perform GC" button, there
> > seems to be a brief pause and then CPU settles down back to normal. This is
> > despite the fact that memory appears to be under no pressure and before we
> > do this, VisualVM indicates very low percentage of CPU time spent in GC, so
> > we're baffled, and hoping someone with deeper insight into the HBase code
> > could explain this behavior.
> >
> > Our region server processes are configured with 32GB of RAM and the
> > following GC related JVM settings :
> >
> > HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
> > -XX:MaxGCPauseMillis=100
> > -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14
> > -XX:InitiatingHeapOccupancyPercent=70
> >
> > Any insight anyone can provide would be most appreciated.
> >
> > ----
> > Saad
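
For anyone who wants to repeat the "Perform GC" observation from the original message without VisualVM, here is a minimal sketch in Java. It is not taken from HBase or from the linked bug report. It only uses the standard java.lang.management API, and it assumes that requesting a collection through MemoryMXBean.gc() (documented as effectively equivalent to System.gc()) is comparable to what the VisualVM button does. The class name GcProbe and its helper methods are made up for illustration. As written it acts on the JVM it runs in; pointing it at a region server would need a remote JMX connection, which is left out here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of HBase: reports cumulative GC activity
// for the current JVM and requests an explicit collection.
final class GcProbe {

    // Sum of collection counts over all collectors registered in this JVM.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() returns -1 when the value is undefined.
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    // Sum of approximate accumulated collection time in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the value is undefined.
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        long countBefore = totalGcCount();
        long timeBefore = totalGcTimeMillis();

        // Requests a collection; the JVM ignores the request if it was
        // started with -XX:+DisableExplicitGC.
        ManagementFactory.getMemoryMXBean().gc();

        System.out.printf("collections: %d -> %d%n", countBefore, totalGcCount());
        System.out.printf("gc time (ms): %d -> %d%n", timeBefore, totalGcTimeMillis());
    }
}
```

Comparing the counters before and after the explicit request gives a rough cross-check of the "very low percentage of CPU time spent in GC" reading, since the same beans are exposed over JMX.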