Re: Region Server Hotspot/CPU Problem

2017-03-02 Thread Stack
You could try the CMS GC; the thread-local collection seems less prominent when
CMS is in place. (Try thread dumping and comparing against the thread dumps
posted to HBASE-17072 and related issues; the original poster did a nice job
describing the problem.)
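
A minimal sketch of capturing dumps for that comparison with `jstack` (the
pid variable and the tiny sample dump below are illustrative, not from this
cluster):

```shell
# In practice, against the live region server:
#   jstack "$RS_PID" > dump1.txt; sleep 10; jstack "$RS_PID" > dump2.txt
# Here a fabricated two-thread dump stands in for real jstack output.
cat > dump1.txt <<'EOF'
"RpcServer.handler=1,port=60020" daemon prio=5 runnable
   java.lang.Thread.State: RUNNABLE
"RpcServer.handler=2,port=60020" daemon prio=5 waiting
   java.lang.Thread.State: WAITING (parking)
EOF
# Count threads actually burning CPU; a hot server shows many more RUNNABLE
# handler threads than its peers.
grep -c 'State: RUNNABLE' dump1.txt   # prints 1 for the sample above
```

Diffing the RUNNABLE stacks between two dumps taken a few seconds apart is
usually enough to see where the CPU is going.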

St.Ack

On Wed, Mar 1, 2017 at 2:49 PM, Saad Mufti  wrote:



Re: Region Server Hotspot/CPU Problem

2017-03-01 Thread Sean Busbey
Several of those jiras are fixed in later versions of CDH. Since the
inclusion of jiras in a particular vendor's packaging is a vendor-specific
issue, please seek help from the vendor (e.g., on the community forum
you've just mentioned).

On Wed, Mar 1, 2017 at 8:49 AM, Saad Mufti  wrote:


Re: Region Server Hotspot/CPU Problem

2017-03-01 Thread Saad Mufti
Someone in our team found this:

http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101

It looks like we're bitten by this bug. Unfortunately, the fix only landed in
HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
trivial.

-
Saad


On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni  wrote:



Re: Region Server Hotspot/CPU Problem

2017-03-01 Thread Sudhir Babu Pothineni
The first obvious thing to check: is a "major compaction" happening at the same
time the CPU goes to 100%?
See if this helps:
https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
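
One common tuning in that article's family of advice (a sketch, not a
recommendation specific to this thread): disable time-based major compactions
and trigger them off-peak instead. The property below is standard HBase
configuration; where and when you trigger the manual compaction is up to you.

```xml
<!-- hbase-site.xml: 0 disables periodic automatic major compactions -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```

You would then run `major_compact 'your_table'` from the HBase shell (e.g.
from a cron job) during a quiet window.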



Sent from my iPhone

> On Mar 1, 2017, at 6:06 AM, Saad Mufti  wrote:

Region Server Hotspot/CPU Problem

2017-03-01 Thread Saad Mufti
Hi,

We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase
is heavy and a mix of reads and writes. For a few months we have had a
problem where occasionally (once a day or more) one of the region servers
starts consuming close to 100% CPU. This causes the entire client thread pool
to fill up with requests to the slow region server, so overall response times
slow to a crawl and many calls start timing out, either right in the client or
at a higher level.
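
For reference, the client-side timeouts involved here are governed by standard
HBase client settings like the following (the values shown are illustrative,
not our actual configuration):

```xml
<!-- hbase-site.xml on the client side; values are illustrative -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value> <!-- milliseconds allowed per RPC -->
</property>
<property>
  <name>hbase.client.operation.timeout</name>
  <value>120000</value> <!-- milliseconds per operation, retries included -->
</property>
```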

We have done lots of analysis and looked at various metrics but could never
pin it down to any particular kind of traffic or specific "hot keys".
Looking at the region server logs has not turned up anything either. The only
vague evidence we have comes from the reported metrics: reads per second on the
hot server look higher than on the others, not steadily but in a consistently
spiky fashion, while gets per second look no different from any other server.

Until now our hacky way that we discovered to get around this was to just
restart the region server. This works because while some calls error out
while the regions are in transition, this is a batch oriented system with a
retry strategy built in.

But just yesterday we discovered something interesting: if we connect to
the region server in VisualVM and press the "Perform GC" button, there is
a brief pause and then CPU settles back down to normal. This is despite the
fact that memory appears to be under no pressure, and before we do this
VisualVM indicates a very low percentage of CPU time spent in GC. So we're
baffled, and hoping someone with deeper insight into the HBase code
could explain this behavior.

Our region server processes are configured with a 32 GB heap and the
following GC-related JVM settings:

HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:ParallelGCThreads=14
-XX:InitiatingHeapOccupancyPercent=70
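
(The -Xms/-Xmx value above is in bytes; as a sanity check, it works out to
exactly 32 GiB:)

```shell
# 34359738368 bytes expressed in GiB (1 GiB = 1024^3 bytes)
echo $(( 34359738368 / (1024 * 1024 * 1024) ))   # prints 32
```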

Any insight anyone can provide would be most appreciated.


Saad