Re: Region Server Hotspot/CPU Problem
Could try CMS GC; the thread-local collection is less prominent when CMS is in place, it seems. (Try thread dumping and compare against the thread dumps posted to HBASE-17072 and related issues; the original poster did a nice job describing the problem.)

St.Ack

On Wed, Mar 1, 2017 at 2:49 PM, Saad Mufti wrote:
> Someone in our team found this:
>
> http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101
>
> Looks like we're bitten by this bug. Unfortunately this is only fixed in
> HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
> trivial.
>
> Saad
>
> [earlier quoted messages snipped]
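To compare against the dumps attached to HBASE-17072, it helps to capture several thread dumps a few seconds apart and see which thread names stay RUNNABLE across all of them. A rough sketch (the `pgrep` pattern and file paths are assumptions to adapt to your deployment):

```shell
# Summarize which threads are RUNNABLE across a set of jstack dumps.
# Usage: summarize_dumps dump1.txt dump2.txt ...
summarize_dumps() {
  # In jstack output the thread header line ("name" #id ...) sits just
  # above the Thread.State line, so pull the preceding context lines and
  # keep only the quoted thread names, counted by frequency.
  grep -B2 'Thread.State: RUNNABLE' "$@" | grep -o '"[^"]*"' | sort | uniq -c | sort -rn
}

# Capturing the dumps (pid lookup pattern is an assumption):
#   RS_PID=$(pgrep -f HRegionServer)
#   for i in 1 2 3; do jstack -l "$RS_PID" > "/tmp/rs-dump-$i.txt"; sleep 5; done
#   summarize_dumps /tmp/rs-dump-*.txt
```

Thread names that appear RUNNABLE in every dump are the likely hot spots to compare against the jira.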
Re: Region Server Hotspot/CPU Problem
Several of those jiras are fixed in later versions of CDH. Since the inclusion of jiras in a particular vendor's packaging is a vendor-specific issue, please seek help from the vendor (e.g. on the community forum you've just mentioned).

On Wed, Mar 1, 2017 at 8:49 AM, Saad Mufti wrote:
> Someone in our team found this:
>
> http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101
>
> Looks like we're bitten by this bug. Unfortunately this is only fixed in
> HBase 1.4.0, so we'll have to undertake a version upgrade, which is not
> trivial.
>
> Saad
>
> [earlier quoted messages snipped]
Re: Region Server Hotspot/CPU Problem
Someone in our team found this:

http://community.cloudera.com/t5/Storage-Random-Access-HDFS/CPU-Usage-high-when-using-G1GC/td-p/48101

Looks like we're bitten by this bug. Unfortunately this is only fixed in HBase 1.4.0, so we'll have to undertake a version upgrade, which is not trivial.

Saad

On Wed, Mar 1, 2017 at 9:38 AM, Sudhir Babu Pothineni wrote:
> First obvious thing to check: is a "major compaction" happening at the
> same time the server goes to 100% CPU?
> See if this helps:
> https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html
>
> [earlier quoted message snipped]
Re: Region Server Hotspot/CPU Problem
First obvious thing to check: is a "major compaction" happening at the same time the server goes to 100% CPU? See if this helps:
https://community.hortonworks.com/articles/52616/hbase-compaction-tuning-tips.html

Sent from my iPhone

> On Mar 1, 2017, at 6:06 AM, Saad Mufti wrote:
> [original message snipped]
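One quick way to test the compaction theory is to pull compaction-related lines from the region server log for the time window of a CPU spike. The exact log wording and timestamp format vary by HBase version, so the patterns below are assumptions to adapt to your own logs:

```shell
# List compaction-related log lines within a given timestamp window.
compactions_around() {
  # $1 = region server log file, $2 = timestamp prefix, e.g. "2017-03-01 14:4"
  # Filter to the window first, then to anything mentioning compaction.
  grep "^$2" "$1" | grep -i 'compact'
}

# Example (paths/timestamps illustrative):
#   compactions_around /var/log/hbase/regionserver.log "2017-03-01 14:4"
```

If major compactions consistently line up with the spikes, the compaction tuning article above is the place to start.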
Region Server Hotspot/CPU Problem
Hi,

We are using HBase 1.0.0-cdh5.5.2 on AWS EC2 instances. The load on HBase is heavy, with a mix of reads and writes. For a few months we have had a problem where occasionally (once a day or more) one of the region servers starts consuming close to 100% CPU. This fills the entire client thread pool with calls to the slow region server, so overall response times slow to a crawl and many calls start timing out, either in the client or at a higher level.

We have done lots of analysis and looked at various metrics but could never pin it down to any particular kind of traffic or specific "hot keys", and looking at region server logs has not turned up anything. The only vague evidence we have from the reported metrics is that reads per second on the hot server run higher than on the other servers, in a spiky but sustained fashion, while gets per second look no different from any other server.

Until now, our hacky workaround has been to simply restart the region server. This works because, although some calls error out while the regions are in transition, this is a batch-oriented system with a retry strategy built in.

Just yesterday we discovered something interesting: if we connect to the region server in VisualVM and press the "Perform GC" button, there is a brief pause and then CPU settles back to normal. This is despite the fact that memory appears to be under no pressure, and before we do this VisualVM shows a very low percentage of CPU time spent in GC. We're baffled, and hoping someone with deeper insight into the HBase code can explain this behavior.

Our region server processes are configured with a 32 GB heap and the following GC-related JVM settings:

HBASE_REGIONSERVER_OPTS=-Xms34359738368 -Xmx34359738368 -XX:+UseG1GC
-XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB
-XX:ParallelGCThreads=14 -XX:InitiatingHeapOccupancyPercent=70

Any insight anyone can provide would be most appreciated.

Saad
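For anyone wanting to try the CMS suggestion made earlier in the thread, a hypothetical CMS counterpart of the G1 settings above might look like the following. This is a sketch, not a tuned recommendation: heap size and GC thread count are carried over unchanged, the occupancy fraction mirrors the G1 InitiatingHeapOccupancyPercent=70 value, and the G1-only flags (-XX:MaxGCPauseMillis as used here, -XX:-ResizePLAB) are dropped.

```shell
# Hypothetical CMS equivalent of the G1 config above (JDK 8 era flags).
export HBASE_REGIONSERVER_OPTS="-Xms34359738368 -Xmx34359738368 \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=14"
```

While testing either collector, `jcmd <pid> GC.run` is a scriptable stand-in for VisualVM's "Perform GC" button, which makes the observed CPU-settling effect easier to reproduce from the command line.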