Re: High Cpu sys usage
Both of these are anti-patterns. The soft commit interval of 1 second is usually far too aggressive, and committing after every add is also something to avoid.

Your original problem statement is high CPU usage. To see if your committing is the culprit, I'd stop committing after adding entirely and make the soft commit interval, say, 60 seconds. Keep the hard commit interval whatever it is now, but make sure openSearcher is set to false. That should pinpoint whether the CPU usage is just because of your committing. From there you can figure out the right balance. If that's _not_ the source of your CPU usage, then at least you'll have eliminated it as a potential problem.

Best,
Erick

On Wed, Mar 30, 2016 at 12:37 AM, YouPeng Yang wrote:
> [...]
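Erick's suggested experiment, expressed as a solrconfig.xml sketch (these elements live inside <updateHandler>; the 15-second hard commit is an example value, the 60-second soft commit is the interval he suggests trying):

```xml
<!-- hard commit: flush the transaction log to disk regularly,
     but do NOT open a new searcher (no cache warming, little CPU) -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit: controls when new documents become visible to searches;
     60 seconds rather than 1 second, per the advice above -->
<autoSoftCommit>
  <maxTime>60000</maxTime>
</autoSoftCommit>
```

With this in place the indexing client should issue no explicit commits at all; visibility is governed entirely by autoSoftCommit.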
Re: High Cpu sys usage
Hi

Thank you, Erick. The main collection that stores our trade data is set to soft commit when we import data using DIH. As you guessed, the softcommit interval is 1000 and we have autowarm counts set to 0. However, there are some collections that store our metadata in which we commit after each add, and these metadata collections hold just a few docs.

Best Regards

2016-03-30 12:25 GMT+08:00 Erick Erickson :
> [...]
Re: High Cpu sys usage
Do not, repeat NOT, try to "cure" the "Overlapping onDeckSearchers" by bumping this limit! What that means is that your commits (either hard commits with openSearcher=true, or soft commits) are happening far too frequently, and your Solr instance is trying to do all sorts of work that is immediately thrown away and chewing up lots of CPU. Perhaps this will help:

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I'd guess that you're

> committing every second, or perhaps your indexing client is committing after each add. If the latter, do not do this; rely on the autocommit settings instead. If the former, make those intervals as long as you can stand.

> you may have your autowarm counts in your solrconfig.xml file set at very high numbers (let's see the filterCache settings, the queryResultCache settings, etc.).

I'd _strongly_ recommend that you put the on-deck searchers back to 2 and figure out why you have so many overlapping searchers.

Best,
Erick

On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang wrote:
> [...]
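The settings being discussed, as they would appear in solrconfig.xml (a sketch; 2 and false are the values Erick recommends returning to, and 2 is the historical default for maxWarmingSearchers):

```xml
<maxWarmingSearchers>2</maxWarmingSearchers>
<useColdSearcher>false</useColdSearcher>
```

The point is that raising maxWarmingSearchers only hides the symptom: each warming searcher duplicates cache-warming work, so twenty of them concurrently is exactly the wasted CPU described above.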
Re: High Cpu sys usage
Hi Toke

The number of collections is just 10. One collection has 43 shards, and each shard has two replicas. We continuously import data from Oracle all the time while our systems provide searching service.

There are "Overlapping onDeckSearchers" entries in my solr.logs. What is the meaning of "Overlapping onDeckSearchers"? We set <maxWarmingSearchers>20</maxWarmingSearchers> and <useColdSearcher>true</useColdSearcher>. Is it right?

Best Regards.

2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
> [...]
Re: High Cpu sys usage
On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
> Our system still goes down as time goes on. We found lots of threads are
> WAITING. Here is the thread dump that I copied from the web page, and 4
> pictures for it.
> Is there any relationship with my problem?

That is a lot of commitScheduler threads. Do you have hundreds of collections in your cloud?

Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see if you got caught in a downwards spiral of concurrent commits.

- Toke Eskildsen, State and University Library, Denmark
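Toke's check can be run as a one-liner. Shown here against a sample log line, since the real log path is deployment-specific (e.g. /var/solr/logs/solr.log is one common location):

```shell
# Create a sample log line of the kind Solr emits when searchers overlap
printf 'PERFORMANCE WARNING: Overlapping onDeckSearchers=2\nINFO normal line\n' > /tmp/solr-sample.log

# Count occurrences; anything beyond an occasional hit means commits come too fast
grep -c "Overlapping onDeckSearchers" /tmp/solr-sample.log
```

A steadily growing count over time is the "downwards spiral" signature: each commit starts warming a new searcher before the previous one finishes.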
Re: High Cpu sys usage
Hi

Our system still goes down as time goes on. We found lots of threads are WAITING. Here is the thread dump that I copied from the web page, and 4 pictures for it. Is there any relationship with my problem?

https://www.dropbox.com/s/h3wyez091oouwck/threaddump?dl=0
https://www.dropbox.com/s/p3ctuxb3t1jgo2e/threaddump1.jpg?dl=0
https://www.dropbox.com/s/w0uy15h6z984ntw/threaddump2.jpg?dl=0
https://www.dropbox.com/s/0frskxdllxlz9ha/threaddump3.jpg?dl=0
https://www.dropbox.com/s/46ptnly1ngi9nb6/threaddump4.jpg?dl=0

Best Regards

2016-03-18 14:35 GMT+08:00 YouPeng Yang :
> [...]
Re: High Cpu sys usage
Hi,

On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje wrote:
> Hi,
>
> From the sar output you supplied, it looks like you might have a memory
> issue on your hosts. The memory usage just before your crash seems to be
> *very* close to 100%. Even the slightest increase (Solr itself, or possibly
> by a system service) could cause the system crash. What are the
> specifications of your hosts and how much memory are you allocating?

That's normal actually - http://www.linuxatemyram.com/

You *want* Linux to be using all your memory - you paid for it :)

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 16/03/2016, 14:52, "YouPeng Yang" wrote:
> [...]
Re: High Cpu sys usage
Yeah, I didn’t pay attention to the cached memory at all, my bad!

I remember running into a similar situation a couple of years ago; one of the things we did to investigate our memory profile was to produce a full heap dump and manually analyse it using a tool like MAT.

Cheers,
-patrick

On 17/03/2016, 21:58, "Otis Gospodnetić" wrote:
> [...]
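For reference, a full heap dump of the kind Patrick describes can be produced with the JDK's jmap tool (a sketch; <pid> stands for the Solr process id and the output path is arbitrary):

```shell
# Dump only live objects, in binary format, suitable for opening in MAT
jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <pid>
```

Note that "live" forces a full GC first, and the dump briefly pauses the JVM, so this is best done on a node already taken out of rotation.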
Re: High Cpu sys usage
On 3/16/2016 8:27 PM, YouPeng Yang wrote:
> Hi Shawn
> Here is my top screenshot:
>
> https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0
>
> It is captured when my system is normal. And I have reduced the memory
> size down to 48GB originating from 64GB.

It looks like you have at least two Solr instances on this machine, one of which has over 600GB of index data, and the other has over 500GB of data. There may be as many as ten Solr instances, but I cannot tell for sure what those Java processes are.

If my guess is correct, this means that there's over a terabyte of index data, but you only have about 100GB of RAM available to cache it. I don't think this is enough RAM for good performance, even if the disks are SSD. You'll either need a lot more memory in each machine, or more machines. The data may need to be divided into more shards.

I am not seeing any evidence here of high CPU. The system only shows about 12 percent total CPU usage, and very little of it is system (kernel).

Thanks,
Shawn
Re: High Cpu sys usage
Hi

It happened again, and the worse thing is that my system crashed; we could not even connect to it with ssh. I used the sar command to capture statistics about it. Here are my details:

[1] CPU (by using sar -u). We had to restart our system, as the LINUX RESTART entry in the logs shows.
--
              CPU  %user  %nice  %system  %iowait  %steal  %idle
03:00:01 PM   all   7.61   0.00     0.92     0.07    0.00  91.40
03:10:01 PM   all   7.71   0.00     1.29     0.06    0.00  90.94
03:20:01 PM   all   7.62   0.00     1.98     0.06    0.00  90.34
03:30:35 PM   all   5.65   0.00    31.08     0.04    0.00  63.23
03:42:40 PM   all  47.58   0.00    52.25     0.00    0.00   0.16
Average:      all   8.21   0.00     1.57     0.05    0.00  90.17

04:42:04 PM LINUX RESTART

04:50:01 PM   CPU  %user  %nice  %system  %iowait  %steal  %idle
05:00:01 PM   all   3.49   0.00     0.62     0.15    0.00  95.75
05:10:01 PM   all   9.03   0.00     0.92     0.28    0.00  89.77
05:20:01 PM   all   7.06   0.00     0.78     0.05    0.00  92.11
05:30:01 PM   all   6.67   0.00     0.79     0.06    0.00  92.48
05:40:01 PM   all   6.26   0.00     0.76     0.05    0.00  92.93
05:50:01 PM   all   5.49   0.00     0.71     0.05    0.00  93.75
--

[2] Memory (by using sar -r)
--
            kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
03:00:01 PM   1519272  196633272     99.23     361112  76364340  143574212    47.77
03:10:01 PM   1451764  196700780     99.27     361196  76336340  143581608    47.77
03:20:01 PM   1453400  196699144     99.27     361448  76248584  143551128    47.76
03:30:35 PM   1513844  196638700     99.24     361648  76022016  143828244    47.85
03:42:40 PM   1481108  196671436     99.25     361676  75718320  144478784    48.07
Average:      5051607  193100937     97.45     362421  81775777  142758861    47.50

04:42:04 PM LINUX RESTART

04:50:01 PM kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
05:00:01 PM 154357132   43795412     22.10      92012  18648644  134950460    44.90
05:10:01 PM 136468244   61684300     31.13     219572  31709216  134966548    44.91
05:20:01 PM 135092452   63060092     31.82     221488  32162324  134949788    44.90
05:30:01 PM 133410464   64742080     32.67     233848  32793848  134976828    44.91
05:40:01 PM 132022052   66130492     33.37     235812  33278908  135007268    44.92
05:50:01 PM 130630408   67522136     34.08     237140  33900912  135099764    44.95
Average:    136996792   61155752     30.86     206645  30415642  134991776    44.91
--

As the entries above show, the hardware started to crash at 03:30:35. It hung until I restarted it manually at 04:42:04. All the above information is just a snapshot of the performance when it crashed, while nothing covers the reason. I have also checked /var/log/messages and found nothing useful.

Note that I ran the command sar -v. It shows something abnormal:

            dentunusd  file-nr  inode-nr  pty-nr
02:50:01 PM  11542262     9216     76446     258
03:00:01 PM  11645526     9536     76421     258
03:10:01 PM  11748690     9216     76451     258
03:20:01 PM  11850191     9152     76331     258
03:30:35 PM  11972313    10112    132625     258
03:42:40 PM  12177319    13760    340227     258
Average:      8293601     8950     68187     161

04:42:04 PM LINUX RESTART

04:50:01 PM dentunusd  file-nr  inode-nr  pty-nr
05:00:01 PM     35410     7616     35223       4
05:10:01 PM    137320     7296     42632       6
05:20:01 PM    247010     7296     42839       9
05:30:01 PM    358434     7360     42697       9
05:40:01 PM    471543     7040     42929      10
05:50:01 PM    583787     7296     42837      13

And I checked the man info about the -v option:

*-v* Report status of inode, file and other kernel tables. The following values are displayed:
*dentunusd* Number of unused cache entries in the directory cache.
*file-nr* Number of file handles used by the system.
*inode-nr* Number of inode handlers used by the system.
*pty-nr* Number of pseudo-terminals used by the system.

Is there any clue about the crash? Would you please give me some suggestions?

Best Regards.

2016-03-16 14:01 GMT+08:00 YouPeng Yang :
> Hello
> The problem appears several times, however I could not capture the top
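The abnormal pattern in the sar -v output can be made explicit with a small awk filter over the four columns (dentunusd, file-nr, inode-nr, pty-nr), here run on a sample of the post-restart rows from the mail:

```shell
# Sample rows from the sar -v output above: time AM/PM dentunusd file-nr inode-nr pty-nr
cat > /tmp/sar-v.txt <<'EOF'
05:00:01 PM 35410 7616 35223 4
05:10:01 PM 137320 7296 42632 6
05:20:01 PM 247010 7296 42839 9
05:30:01 PM 358434 7360 42697 9
EOF

# Flag intervals where the unused dentry cache grew by more than 50000 entries
awk 'NR>1 && $3-prev>50000 {print $1, $2, "dentunusd grew by", $3-prev} {prev=$3}' /tmp/sar-v.txt
```

Every 10-minute interval here adds roughly 100k unused dentry-cache entries, which matches the steady climb visible in both the pre-crash and post-restart blocks.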
Re: High Cpu sys usage
Hi Shawn

Here is my top screenshot:

https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0

It is captured when my system is normal. And I have reduced the memory size down to 48GB, originating from 64GB.

We have two hardware clusters, each comprised of 3 machines, and on one cluster we deploy 3 different SolrCloud application clusters. The above top screenshot is from the machine that crashed at 4:30 PM yesterday. To be convenient, I post a top screenshot of another machine of the other cluster:

https://www.dropbox.com/s/p3j3bpcl8l2i1nt/another64GBnodeTop.jpg?dl=0

On this machine, the biggest SolrCloud node, whose JVM memory size is 64GB, holds 730GB of index. The machine hung up for a long time just at yesterday midnight. We also captured the iotop output when it hung up:

https://www.dropbox.com/s/keqqjabmon9f1ea/anthoer64GBnodeIotop.jpg?dl=0

As the iotop shows, the jbd2 process is writing heavily. I think it will be helpful.

Best Regards

2016-03-17 7:35 GMT+08:00 Shawn Heisey :
> [...]
Re: High Cpu sys usage
On 3/16/2016 8:59 AM, Patrick Plaatje wrote:
> From the sar output you supplied, it looks like you might have a memory issue
> on your hosts. The memory usage just before your crash seems to be *very*
> close to 100%. Even the slightest increase (Solr itself, or possibly by a
> system service) could cause the system crash. What are the specifications of
> your hosts and how much memory are you allocating?

It's completely normal for a machine, especially a machine running Solr with a very large index, to run at nearly 100% memory usage. The "Average" line from the sar output indicates 97.45 percent usage, but it also shows 81GB of memory in the "kbcached" column -- this is memory that can be instantly claimed by any program that asks for it. If we discount this 81GB, since it is instantly available, the "true" memory usage is closer to 70 percent than 100.

https://en.wikipedia.org/wiki/Page_cache

If YouPeng can run top and sort it by memory usage (press shift-M), then grab a screenshot, that will be helpful for more insight. Here's an example of this from one of my servers, shared on dropbox:

https://www.dropbox.com/s/qfuxhw20q0y1ckx/linux-8gb-heap.png?dl=0

This is a server with 64GB of RAM and 110GB of index data. About 48GB of my memory is used by the disk cache. I've got slightly less than half my index data in the cache.

Thanks,
Shawn
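Shawn's adjustment can be redone numerically from the "Average" sar -r line quoted in the thread (kbmemfree 5051607, kbmemused 193100937, kbcached 81775777). The exact figure depends on which reclaimable pools one discounts, but subtracting the page cache brings usage well below the alarming 97.45% headline number:

```shell
# Effective usage = (kbmemused - kbcached) / (kbmemfree + kbmemused)
awk 'BEGIN {
  kbmemfree = 5051607; kbmemused = 193100937; kbcached = 81775777
  printf "%.1f%%\n", 100 * (kbmemused - kbcached) / (kbmemfree + kbmemused)
}'
```

This is the same correction linuxatemyram.com describes: cached pages count as "used" in the raw numbers but are handed back the moment an application asks for memory.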
Re: High Cpu sys usage
Hi To Patrick: Never mind .Thank you for your suggestion all the same. To Otis. We do not use SPM. We monintor the JVM just use jstat becasue my system went well before ,so we do not need other tools. But SPM is really awesome . Still looking for help. Best Regards 2016-03-18 6:01 GMT+08:00 Patrick Plaatje : > Yeah, I did’t pay attention to the cached memory at all, my bad! > > I remember running into a similar situation a couple of years ago, one of > the things to investigate our memory profile was to produce a full heap > dump and manually analyse that using a tool like MAT. > > Cheers, > -patrick > > > > > On 17/03/2016, 21:58, "Otis Gospodnetić" > wrote: > > >Hi, > > > >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje > >wrote: > > > >> Hi, > >> > >> From the sar output you supplied, it looks like you might have a memory > >> issue on your hosts. The memory usage just before your crash seems to be > >> *very* close to 100%. Even the slightest increase (Solr itself, or > possibly > >> by a system service) could caused the system crash. What are the > >> specifications of your hosts and how much memory are you allocating? > > > > > >That's normal actually - http://www.linuxatemyram.com/ > > > >You *want* Linux to be using all your memory - you paid for it :) > > > >Otis > >-- > >Monitoring - Log Management - Alerting - Anomaly Detection > >Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > > > > > > >> > > > > > >> > >> > >> On 16/03/2016, 14:52, "YouPeng Yang" wrote: > >> > >> >Hi > >> > It happened again,and worse thing is that my system went to crash.we > can > >> >even not connect to it with ssh. > >> > I use the sar command to capture the statistics information about > it.Here > >> >are my details: > >> > > >> > > >> >[1]cpu(by using sar -u),we have to restart our system just as the red > font > >> >LINUX RESTART in the logs. 
> >> > >> > >-- > >> >03:00:01 PM all 7.61 0.00 0.92 0.07 0.00 > >> >91.40 > >> >03:10:01 PM all 7.71 0.00 1.29 0.06 0.00 > >> >90.94 > >> >03:20:01 PM all 7.62 0.00 1.98 0.06 0.00 > >> >90.34 > >> >03:30:35 PM all 5.65 0.00 31.08 0.04 0.00 > >> >63.23 > >> >03:42:40 PM all 47.58 0.00 52.25 0.00 0.00 > >> > 0.16 > >> >Average:all 8.21 0.00 1.57 0.05 0.00 > >> >90.17 > >> > > >> >04:42:04 PM LINUX RESTART > >> > > >> >04:50:01 PM CPU %user %nice %system %iowait%steal > >> >%idle > >> >05:00:01 PM all 3.49 0.00 0.62 0.15 0.00 > >> >95.75 > >> >05:10:01 PM all 9.03 0.00 0.92 0.28 0.00 > >> >89.77 > >> >05:20:01 PM all 7.06 0.00 0.78 0.05 0.00 > >> >92.11 > >> >05:30:01 PM all 6.67 0.00 0.79 0.06 0.00 > >> >92.48 > >> >05:40:01 PM all 6.26 0.00 0.76 0.05 0.00 > >> >92.93 > >> >05:50:01 PM all 5.49 0.00 0.71 0.05 0.00 > >> >93.75 > >> > >> > >-- > >> > > >> >[2]mem(by using sar -r) > >> > >> > >-- > >> >03:00:01 PM 1519272 196633272 99.23361112 76364340 143574212 > >> >47.77 > >> >03:10:01 PM 1451764 196700780 99.27361196 76336340 143581608 > >> >47.77 > >> >03:20:01 PM 1453400 196699144 99.27361448 76248584 143551128 > >> >47.76 > >> >03:30:35 PM 1513844 196638700 99.24361648 76022016 143828244 > >> >47.85 > >> >03:42:40 PM 1481108 196671436 99.25361676 75718320 144478784 > >> >48.07 > >> >Average: 5051607 193100937 97.45362421 81775777 142758861 > >> >47.50 > >> > > >> >04:42:04 PM LINUX RESTART > >> > > >> >04:50:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit > >> >%commit > >> >05:00:01 PM 154357132 43795412 22.10 92012 18648644 134950460 > >> >44.90 > >> >05:10:01 PM 136468244 61684300 31.13219572 31709216 134966548 > >> >44.91 > >> >05:20:01 PM 135092452 63060092 31.82221488 32162324 134949788 > >> >44.90 > >> >05:30:01 PM 133410464 64742080 32.67233848 32793848 134976828 > >> >44.91 > >> >05:40:01 PM 132022052 66130492 33.37235812 33278908 135007268 > >> >44.92 > >> >05:50:01 PM 130630408 67522136 34.08237140 33900912 135099764 > >> >44.95 > >> 
>Average:136996792 61155752 30.86206645 30415642 134991776 > >> >44.91 > >> > >> > >-- > >> > > >> > > >> >As the blue font parts show that my hardware crash from 03:30:35.It is
Re: High Cpu sys usage
Hi Shawn

Actually, there are three Solr instances (the top three PIDs are the three instances), and their data file sizes are 851G, 592G, and 49G respectively, and more and more data will be added as time goes on. I think a SolrCloud service at this large a scale may be rare, and it is now one of the most important core services in my company.

Just as you suggest, the increasing size of data is making us divide our SolrCloud service into smaller application clusters, and we have separated our collections into smaller shards. I know there must be some abnormal things happening on the service as time goes on; however, the high sys CPU of unknown cause is right now a nightmare. So I am looking for help from the community. Do you have experience like mine, and how did you solve this problem?

Best Regards

2016-03-17 14:16 GMT+08:00 Shawn Heisey :

> On 3/16/2016 8:27 PM, YouPeng Yang wrote:
> > Hi Shawn
> >   Here is my top screenshot:
> >
> > https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0
> >
> >   It is captured when my system is normal. And I have reduced the memory
> > size down to 48GB, originating from 64GB.
>
> It looks like you have at least two Solr instances on this machine, one
> of which has over 600GB of index data, and the other has over 500GB of
> data. There may be as many as ten Solr instances, but I cannot tell for
> sure what those Java processes are.
>
> If my guess is correct, this means that there's over a terabyte of index
> data, but you only have about 100GB of RAM available to cache it. I
> don't think this is enough RAM for good performance, even if the disks
> are SSD. You'll either need a lot more memory in each machine, or more
> machines. The data may need to be divided into more shards.
>
> I am not seeing any evidence here of high CPU. The system only shows
> about 12 percent total CPU usage, and very little of it is system (kernel).
>
> Thanks,
> Shawn
Re: High Cpu sys usage
Hi,

From the sar output you supplied, it looks like you might have a memory issue on your hosts. The memory usage just before your crash seems to be *very* close to 100%. Even the slightest increase (Solr itself, or possibly by a system service) could cause the system crash. What are the specifications of your hosts and how much memory are you allocating?

Cheers,
-patrick

On 16/03/2016, 14:52, "YouPeng Yang" wrote:

>Hi
> It happened again, and the worse thing is that my system crashed; we could
>not even connect to it with ssh. I used the sar command to capture
>statistics about it. Here are my details:
>
>[1] cpu (by using sar -u). We had to restart our system, as the red font
>"LINUX RESTART" in the logs shows.
>--------------------------------------------------------------------
>             CPU   %user  %nice  %system  %iowait  %steal   %idle
>03:00:01 PM  all    7.61   0.00     0.92     0.07    0.00   91.40
>03:10:01 PM  all    7.71   0.00     1.29     0.06    0.00   90.94
>03:20:01 PM  all    7.62   0.00     1.98     0.06    0.00   90.34
>03:30:35 PM  all    5.65   0.00    31.08     0.04    0.00   63.23
>03:42:40 PM  all   47.58   0.00    52.25     0.00    0.00    0.16
>Average:     all    8.21   0.00     1.57     0.05    0.00   90.17
>
>04:42:04 PM LINUX RESTART
>
>04:50:01 PM  CPU   %user  %nice  %system  %iowait  %steal   %idle
>05:00:01 PM  all    3.49   0.00     0.62     0.15    0.00   95.75
>05:10:01 PM  all    9.03   0.00     0.92     0.28    0.00   89.77
>05:20:01 PM  all    7.06   0.00     0.78     0.05    0.00   92.11
>05:30:01 PM  all    6.67   0.00     0.79     0.06    0.00   92.48
>05:40:01 PM  all    6.26   0.00     0.76     0.05    0.00   92.93
>05:50:01 PM  all    5.49   0.00     0.71     0.05    0.00   93.75
>--------------------------------------------------------------------
>
>[2] mem (by using sar -r)
>--------------------------------------------------------------------
>             kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
>03:00:01 PM    1519272  196633272     99.23     361112  76364340  143574212    47.77
>03:10:01 PM    1451764  196700780     99.27     361196  76336340  143581608    47.77
>03:20:01 PM    1453400  196699144     99.27     361448  76248584  143551128    47.76
>03:30:35 PM    1513844  196638700     99.24     361648  76022016  143828244    47.85
>03:42:40 PM    1481108  196671436     99.25     361676  75718320  144478784    48.07
>Average:       5051607  193100937     97.45     362421  81775777  142758861    47.50
>
>04:42:04 PM LINUX RESTART
>
>04:50:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
>05:00:01 PM  154357132   43795412     22.10      92012  18648644  134950460    44.90
>05:10:01 PM  136468244   61684300     31.13     219572  31709216  134966548    44.91
>05:20:01 PM  135092452   63060092     31.82     221488  32162324  134949788    44.90
>05:30:01 PM  133410464   64742080     32.67     233848  32793848  134976828    44.91
>05:40:01 PM  132022052   66130492     33.37     235812  33278908  135007268    44.92
>05:50:01 PM  130630408   67522136     34.08     237140  33900912  135099764    44.95
>Average:     136996792   61155752     30.86     206645  30415642  134991776    44.91
>--------------------------------------------------------------------
>
>As the blue font parts show, my hardware crashed from 03:30:35. It was hung
>up until I restarted it manually at 04:42:04. All the above information just
>snapshots the performance when it crashed, while nothing covers the reason.
>I have also checked /var/log/messages and found nothing useful.
>
>Note that I ran the command sar -v. It shows something abnormal:
>
>             dentunusd  file-nr  inode-nr  pty-nr
>02:50:01 PM   11542262     9216     76446     258
>03:00:01 PM   11645526     9536     76421     258
>03:10:01 PM   11748690     9216     76451     258
>03:20:01 PM   11850191     9152     76331     258
>03:30:35 PM   11972313    10112    132625     258
>03:42:40 PM   12177319    13760    340227     258
>Average:       8293601     8950     68187     161
>
>04:42:04 PM LINUX RESTART
>
>04:50:01 PM  dentunusd  file-nr  inode-nr  pty-nr
>05:00:01 PM      35410     7616     35223       4
>05:10:01 PM     137320     7296     42632       6
>05:20:01 PM     247010     7296     42839       9
>05:30:01 PM     358434     7360     42697       9
>05:40:01 PM     471543     7040     42929      10
>05:50:01 PM     583787     7296     42837      13
>
>and I checked the man info about the -v option:
>
>*-v* Report status of inode, file and other kernel tables. The following
>values are displayed:
> *dentun
Re: High Cpu sys usage
Hi, I looked at those metrics outputs, but nothing jumps out at me as problematic. How full are your JVM heap memory pools? If you are using SPM to monitor your Solr/Tomcat/Jetty/... look for a chart that looks like this: https://apps.sematext.com/spm-reports/s/zB3JcdZyRn If some of these lines are close to 100% and stay close or at 100%, that's typically a bad sign. Next, look at your Garbage Collection times and counts. If you look at your GC metrics for e.g. a month and see a recent increase in GC times or counts then, yes, you have an issue with your memory/heap and that is what is increasing your CPU usage. If it looks like heap/GC are not the issue and it's really something inside Solr, you could profile it with either one of the standard profilers or something like https://sematext.com/blog/2016/03/17/on-demand-java-profiling/ . If there is something in Solr chewing on the CPU, this should show it. I hope this helps. Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting Support Training - http://sematext.com/ On Wed, Mar 16, 2016 at 10:52 AM, YouPeng Yang wrote: > Hi > It happened again,and worse thing is that my system went to crash.we can > even not connect to it with ssh. > I use the sar command to capture the statistics information about it.Here > are my details: > > > [1]cpu(by using sar -u),we have to restart our system just as the red font > LINUX RESTART in the logs. 
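To put a number on "GC times increasing", one rough approach is to diff two full-GC time readings (for example the FGCT column of `jstat -gc`, reported in seconds) taken a known interval apart. The sample values below are hypothetical, purely for illustration:

```shell
# fraction of wall-clock time spent in full GC between two samples;
# fgct1/fgct2 are hypothetical FGCT readings (seconds) taken 60 s apart
fgct1=12.4
fgct2=18.9
interval=60
awk -v a="$fgct1" -v b="$fgct2" -v t="$interval" \
    'BEGIN { printf "full-GC overhead: %.1f%%\n", 100 * (b - a) / t }'
# prints: full-GC overhead: 10.8%
```

Anything in the several-percent range sustained over time is a sign the heap is too small or GC is misconfigured, and that GC may be the source of the extra CPU.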
>
> --------------------------------------------------------------------
>              CPU   %user  %nice  %system  %iowait  %steal   %idle
> 03:00:01 PM  all    7.61   0.00     0.92     0.07    0.00   91.40
> 03:10:01 PM  all    7.71   0.00     1.29     0.06    0.00   90.94
> 03:20:01 PM  all    7.62   0.00     1.98     0.06    0.00   90.34
> 03:30:35 PM  all    5.65   0.00    31.08     0.04    0.00   63.23
> 03:42:40 PM  all   47.58   0.00    52.25     0.00    0.00    0.16
> Average:     all    8.21   0.00     1.57     0.05    0.00   90.17
>
> 04:42:04 PM LINUX RESTART
>
> 04:50:01 PM  CPU   %user  %nice  %system  %iowait  %steal   %idle
> 05:00:01 PM  all    3.49   0.00     0.62     0.15    0.00   95.75
> 05:10:01 PM  all    9.03   0.00     0.92     0.28    0.00   89.77
> 05:20:01 PM  all    7.06   0.00     0.78     0.05    0.00   92.11
> 05:30:01 PM  all    6.67   0.00     0.79     0.06    0.00   92.48
> 05:40:01 PM  all    6.26   0.00     0.76     0.05    0.00   92.93
> 05:50:01 PM  all    5.49   0.00     0.71     0.05    0.00   93.75
> --------------------------------------------------------------------
>
> [2] mem (by using sar -r)
> --------------------------------------------------------------------
>              kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
> 03:00:01 PM    1519272  196633272     99.23     361112  76364340  143574212    47.77
> 03:10:01 PM    1451764  196700780     99.27     361196  76336340  143581608    47.77
> 03:20:01 PM    1453400  196699144     99.27     361448  76248584  143551128    47.76
> 03:30:35 PM    1513844  196638700     99.24     361648  76022016  143828244    47.85
> 03:42:40 PM    1481108  196671436     99.25     361676  75718320  144478784    48.07
> Average:       5051607  193100937     97.45     362421  81775777  142758861    47.50
>
> 04:42:04 PM LINUX RESTART
>
> 04:50:01 PM  kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
> 05:00:01 PM  154357132   43795412     22.10      92012  18648644  134950460    44.90
> 05:10:01 PM  136468244   61684300     31.13     219572  31709216  134966548    44.91
> 05:20:01 PM  135092452   63060092     31.82     221488  32162324  134949788    44.90
> 05:30:01 PM  133410464   64742080     32.67     233848  32793848  134976828    44.91
> 05:40:01 PM  132022052   66130492     33.37     235812  33278908  135007268    44.92
> 05:50:01 PM  130630408   67522136     34.08     237140  33900912  135099764    44.95
> Average:     136996792   61155752     30.86     206645  30415642  134991776    44.91
> --------------------------------------------------------------------
>
> As the blue font parts show, my hardware crashed from 03:30:35. It was hung
> up until I restarted it manually at 04:42:04. All the above information just
> snapshots the performance when it crashed, while nothing covers the reason.
> I have also checked /var/log/messages and found nothing useful.
>
> Note that I ran the command sar -v. It shows something abnormal:
>
>              dentunusd  file-nr  inode-nr  pty-nr
> 02:50:01 PM   11542262     9216     76446     258
> 03:00:01 PM   11645526     9536     76421     258
> 03:10:01 PM   11748690     9216     76451     258
> 03:20:01 PM   11850191     9152     76331     258
> 03:30:35 PM   11972313    10112    132625     258
> 03:42:40 PM   12177319    13760    340227     258
> Average:       8293601     8950     68187     161
Re: High Cpu sys usage
Hello

The problem has appeared several times; however, I could not capture the top output. My script is the following code. I check whether the sys CPU usage exceeds 30%; the other metric information can be dumped successfully, except for top. Could you check my script? I am not able to figure out what is wrong.

----------------------------------------------------------
#!/bin/bash
while :
do
    # mpstat field $6 is %sys on this system; print 1 while it stays below 30%
    sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')
    if [ $sysusage -eq 0 ]; then
        #echo $sysusage
        #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000 sleep 30
        file=$(date +%Y%m%d%H%M%S)
        # top must run in batch mode (-b) when its output is redirected,
        # otherwise it fails without a tty and the capture file stays empty
        top -b -n 2 >> top$file.data
        iotop -b -n 2 >> iotop$file.data
        iostat >> iostat$file.data
        netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
    fi
    sleep 5
done
----------------------------------------------------------

2016-03-08 21:39 GMT+08:00 YouPeng Yang :

> Hi all
>   Thanks for your reply. I have done some investigation over much time, and
> I will post some logs of 'top' and IO in a few days when the crash comes again.
>
> 2016-03-08 10:45 GMT+08:00 Shawn Heisey :
>
>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>> > How does this relate to YouPeng reporting that the CPU usage increases?
>> >
>> > This is not a snark. YouPeng mentions kernel issues. It might very well
>> > be that IO is the real problem, but that it manifests in a non-intuitive
>> > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
>> > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
>> > system is struggling, even if IO-Wait is low?
>>
>> It might turn out to be not directly related to memory, you're right
>> about that. A very high query rate or particularly CPU-heavy queries or
>> analysis could cause high CPU usage even when memory is plentiful, but
>> in that situation I would expect high user percentage, not kernel.
I'm >> not completely sure what might cause high kernel usage if iowait is low, >> but no specific information was given about iowait. I've seen iowait >> percentages of 10% or less with problems clearly caused by iowait. >> >> With the available information (especially seeing 700GB of index data), >> I believe that the "not enough memory" scenario is more likely than >> anything else. If the OP replies and says they have plenty of memory, >> then we can move on to the less common (IMHO) reasons for high CPU with >> a large index. >> >> If the OS is one that reports load average, I am curious what the 5 >> minute average is, and how many real (non-HT) CPU cores there are. >> >> Thanks, >> Shawn >> >> >
Re: High Cpu sys usage
Hi all

Thanks for your reply. I have done some investigation over much time, and I will post some logs of 'top' and IO in a few days when the crash comes again.

2016-03-08 10:45 GMT+08:00 Shawn Heisey :

> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> > How does this relate to YouPeng reporting that the CPU usage increases?
> >
> > This is not a snark. YouPeng mentions kernel issues. It might very well
> > be that IO is the real problem, but that it manifests in a non-intuitive
> > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
> > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
> > system is struggling, even if IO-Wait is low?
>
> It might turn out to be not directly related to memory, you're right
> about that. A very high query rate or particularly CPU-heavy queries or
> analysis could cause high CPU usage even when memory is plentiful, but
> in that situation I would expect high user percentage, not kernel. I'm
> not completely sure what might cause high kernel usage if iowait is low,
> but no specific information was given about iowait. I've seen iowait
> percentages of 10% or less with problems clearly caused by iowait.
>
> With the available information (especially seeing 700GB of index data),
> I believe that the "not enough memory" scenario is more likely than
> anything else. If the OP replies and says they have plenty of memory,
> then we can move on to the less common (IMHO) reasons for high CPU with
> a large index.
>
> If the OS is one that reports load average, I am curious what the 5
> minute average is, and how many real (non-HT) CPU cores there are.
>
> Thanks,
> Shawn
Re: High Cpu sys usage
On 3/7/2016 2:23 AM, Toke Eskildsen wrote: > How does this relate to YouPeng reporting that the CPU usage increases? > > This is not a snark. YouPeng mentions kernel issues. It might very well > be that IO is the real problem, but that it manifests in a non-intuitive > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO > system is struggling, even if IO-Wait is low? It might turn out to be not directly related to memory, you're right about that. A very high query rate or particularly CPU-heavy queries or analysis could cause high CPU usage even when memory is plentiful, but in that situation I would expect high user percentage, not kernel. I'm not completely sure what might cause high kernel usage if iowait is low, but no specific information was given about iowait. I've seen iowait percentages of 10% or less with problems clearly caused by iowait. With the available information (especially seeing 700GB of index data), I believe that the "not enough memory" scenario is more likely than anything else. If the OP replies and says they have plenty of memory, then we can move on to the less common (IMHO) reasons for high CPU with a large index. If the OS is one that reports load average, I am curious what the 5 minute average is, and how many real (non-HT) CPU cores there are. Thanks, Shawn
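For reference, the two figures asked about here (5-minute load average and real core count) can be pulled on Linux roughly like this; the lscpu output format can vary by version, so treat it as a sketch:

```shell
# 5-minute load average is the second field of /proc/loadavg
awk '{print "5-min load average: " $2}' /proc/loadavg

# real (non-HT) core count = unique (socket, core) pairs reported by lscpu
lscpu -p=SOCKET,CORE 2>/dev/null | grep -v '^#' | sort -u | wc -l
```

If the 5-minute load average stays well above the physical core count, the box is oversubscribed regardless of where the cycles are going.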
Re: High Cpu sys usage
On Sun, 2016-03-06 at 08:26 -0700, Shawn Heisey wrote: > On 3/5/2016 11:44 PM, YouPeng Yang wrote: > > We are using Solr Cloud 4.6 in our production for searching service > > since 2 years ago.And now it has 700GB in one cluster which is comprised > > of 3 machines with ssd. At beginning ,everything go well,while more and > > more business services interfered with our searching service .And a problem > > which we haunted with is just like a nightmare . That is the cpu sys > > usage is often growing up to over 10% even higher, and as a result the > > machine will hang down because system resources have be drained out.We have > > to restart the machine manually. > One of the most common reasons for performance issues with Solr is not > having enough system memory to effectively cache the index. [...] How does this relate to YouPeng reporting that the CPU usage increases? This is not a snark. YouPeng mentions kernel issues. It might very well be that IO is the real problem, but that it manifests in a non-intuitive way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO system is struggling, even if IO-Wait is low? YouPeng: If you are on a *nix-system, can you call 'top' on a machine and copy-paste the output somewhere we can see? - Toke Eskildsen, State and University Library, Denmark
Re: High Cpu sys usage
On 3/5/2016 11:44 PM, YouPeng Yang wrote: > We are using Solr Cloud 4.6 in our production for searching service > since 2 years ago.And now it has 700GB in one cluster which is comprised > of 3 machines with ssd. At beginning ,everything go well,while more and > more business services interfered with our searching service .And a problem > which we haunted with is just like a nightmare . That is the cpu sys > usage is often growing up to over 10% even higher, and as a result the > machine will hang down because system resources have be drained out.We have > to restart the machine manually. One of the most common reasons for performance issues with Solr is not having enough system memory to effectively cache the index. Another is running with a heap that's too small, or a heap that's really large with ineffective garbage collection tuning. All of these problems get worse as query rate climbs. Running on SSD can reduce, but not eliminate, the requirement for plenty of system memory. With 700GB of index data, you are likely to need somewhere between 128GB and 512GB of memory for good performance. If the query rate is high, then requirements are more likely to land in the upper end of that range. There's no way for me to narrow that range down -- it depends on a number of factors, and usually has to be determined through trial and error. If the data were on regular disks instead of SSD, I would be recommending even more memory. https://wiki.apache.org/solr/SolrPerformanceProblems https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ If you want a single number recommendation for memory size, I would recommend starting with 256GB, and being ready to add more. It is very common for servers to be incapable of handling that much memory, though. The servers that I use for Solr max out at 64GB. You might need to split your index onto additional machines by sharding it, and gain the additional memory that way. Thanks, Shawn
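As a back-of-the-envelope check on the figures in this thread: roughly 700GB of index across 3 machines, and (by Shawn's later estimate, an assumption here) about 100GB of RAM per node left over for the page cache:

```shell
# rough estimate of how much of each node's index the OS can cache;
# index_gb, nodes and cache_gb are assumed figures from this thread
awk 'BEGIN {
    index_gb = 700; nodes = 3; cache_gb = 100
    per_node = index_gb / nodes
    printf "index per node: %.0f GB, cached fraction: %.0f%%\n",
           per_node, 100 * cache_gb / per_node
}'
# prints: index per node: 233 GB, cached fraction: 43%
```

Well under half the index cacheable per node is consistent with the "not enough memory" scenario described above.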