Re: High Cpu sys usage

2016-03-30 Thread Erick Erickson
Both of these are anti-patterns. The soft commit interval of 1 second
is usually far too aggressive. And committing after every add is
also something to avoid.

Your original problem statement is high CPU usage. To see if your
committing is the culprit, I'd stop committing after adding altogether and
make the soft commit interval, say, 60 seconds. And keep the
hard commit interval whatever it is now, but make sure openSearcher is
set to false.
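
For reference, a minimal solrconfig.xml sketch of that setup (the hard commit
maxTime of 15000 below is only a placeholder for whatever interval is already
in use; the 60-second soft commit follows the suggestion above):

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <!-- keep whatever hard commit interval you already have; 15000 is a placeholder -->
      <maxTime>15000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
    <autoSoftCommit>
      <!-- 60 seconds, as suggested above -->
      <maxTime>60000</maxTime>
    </autoSoftCommit>
  </updateHandler>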

That should pinpoint whether the CPU usage is just because of your
committing. From there you can figure out the right balance...

If that's _not_ the source of your CPU usage, then at least you'll have
eliminated it as a potential problem.

Best,
Erick

On Wed, Mar 30, 2016 at 12:37 AM, YouPeng Yang
 wrote:
> Hi
>   Thank you, Erick.
>The main collection that stores our trade data is set to soft commit
> when we import data using DIH. As you guessed, the soft commit interval
> is 1000 (ms) and we have the autowarm counts set to 0. However
> there are some collections that store our meta info in which we commit after
> each add, and these metadata collections just hold a few docs.
>
>
> Best Regards
>
>
> 2016-03-30 12:25 GMT+08:00 Erick Erickson :
>
>> Do not, repeat NOT try to "cure" the "Overlapping onDeckSearchers"
>> by bumping this limit! What that means is that your commits
>> (either hard commit with openSearcher=true or softCommit) are
>> happening far too frequently and your Solr instance is trying to do
>> all sorts of work that is immediately thrown away and chewing up
>> lots of CPU. Perhaps this will help:
>>
>>
>> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> I'd guess that you're
>>
>> > committing every second, or perhaps your indexing client is committing
>> after each add. If the latter, do not do this and rely on the
>> autocommit settings,
>> and if the former make those intervals as long as you can stand.
>>
>> > you may have your autowarm counts in your solrconfig.xml file set at
>> very high numbers (let's see the filterCache settings, the queryResultCache
>> settings etc.).
>>
>> I'd _strongly_ recommend that you put the on deck searchers back to
>> 2 and figure out why you have so many overlapping searchers.
>>
>> Best,
>> Erick
>>
>> On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang 
>> wrote:
>> > Hi Toke
>> >   The number of collections is just 10. One collection has 43 shards,
>> > each shard has two replicas. We continue importing data from Oracle all
>> > the time while our systems provide the searching service.
>> >There are "Overlapping onDeckSearchers" in my solr.logs. What is the
>> > meaning of "Overlapping onDeckSearchers"? We set
>> > <maxWarmingSearchers>20</maxWarmingSearchers> and
>> > <useColdSearcher>true</useColdSearcher>. Is it right?
>> >
>> >
>> >
>> > Best Regard.
>> >
>> >
>> > 2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
>> >
>> >> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
>> >> >   Our system still goes down as times going.We found lots of threads
>> are
>> >> > WAITING.Here is the threaddump that I copy from the web page.And 4
>> >> pictures
>> >> > for it.
>> >> >   Is there any relationship with my problem?
>> >>
>> >> That is a lot of commitScheduler-threads. Do you have hundreds of
>> >> collections in your cloud?
>> >>
>> >>
>> >> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
>> >> if you got caught in a downwards spiral of concurrent commits.
>> >>
>> >> - Toke Eskildsen, State and University Library, Denmark
>> >>
>> >>
>> >>
>>


Re: High Cpu sys usage

2016-03-30 Thread YouPeng Yang
Hi
  Thank you, Erick.
   The main collection that stores our trade data is set to soft commit
when we import data using DIH. As you guessed, the soft commit interval
is 1000 (ms) and we have the autowarm counts set to 0. However
there are some collections that store our meta info in which we commit after
each add, and these metadata collections just hold a few docs.
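
For reference, a sketch of the 1-second soft commit described here, in standard
solrconfig.xml syntax (reconstructed from this message, not copied from the
actual config; this is the interval the reply above calls far too aggressive):

  <autoSoftCommit>
    <maxTime>1000</maxTime>
  </autoSoftCommit>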


Best Regards


2016-03-30 12:25 GMT+08:00 Erick Erickson :

> Do not, repeat NOT try to "cure" the "Overlapping onDeckSearchers"
> by bumping this limit! What that means is that your commits
> (either hard commit with openSearcher=true or softCommit) are
> happening far too frequently and your Solr instance is trying to do
> all sorts of work that is immediately thrown away and chewing up
> lots of CPU. Perhaps this will help:
>
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> I'd guess that you're
>
> > committing every second, or perhaps your indexing client is committing
> after each add. If the latter, do not do this and rely on the
> autocommit settings,
> and if the former make those intervals as long as you can stand.
>
> > you may have your autowarm counts in your solrconfig.xml file set at
> very high numbers (let's see the filterCache settings, the queryResultCache
> settings etc.).
>
> I'd _strongly_ recommend that you put the on deck searchers back to
> 2 and figure out why you have so many overlapping searchers.
>
> Best,
> Erick
>
> On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang 
> wrote:
> > Hi Toke
> >   The number of collections is just 10. One collection has 43 shards,
> > each shard has two replicas. We continue importing data from Oracle all
> > the time while our systems provide the searching service.
> >There are "Overlapping onDeckSearchers" in my solr.logs. What is the
> > meaning of "Overlapping onDeckSearchers"? We set
> > <maxWarmingSearchers>20</maxWarmingSearchers> and
> > <useColdSearcher>true</useColdSearcher>. Is it right?
> >
> >
> >
> > Best Regard.
> >
> >
> > 2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
> >
> >> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
> >> >   Our system still goes down as times going.We found lots of threads
> are
> >> > WAITING.Here is the threaddump that I copy from the web page.And 4
> >> pictures
> >> > for it.
> >> >   Is there any relationship with my problem?
> >>
> >> That is a lot of commitScheduler-threads. Do you have hundreds of
> >> collections in your cloud?
> >>
> >>
> >> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
> >> if you got caught in a downwards spiral of concurrent commits.
> >>
> >> - Toke Eskildsen, State and University Library, Denmark
> >>
> >>
> >>
>


Re: High Cpu sys usage

2016-03-29 Thread Erick Erickson
Do not, repeat NOT try to "cure" the "Overlapping onDeckSearchers"
by bumping this limit! What that means is that your commits
(either hard commit with openSearcher=true or softCommit) are
happening far too frequently and your Solr instance is trying to do
all sorts of work that is immediately thrown away and chewing up
lots of CPU. Perhaps this will help:

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

I'd guess that you're

> committing every second, or perhaps your indexing client is committing
after each add. If the latter, do not do this and rely on the
autocommit settings,
and if the former make those intervals as long as you can stand.

> you may have your autowarm counts in your solrconfig.xml file set at
very high numbers (let's see the filterCache settings, the queryResultCache
settings etc.).

I'd _strongly_ recommend that you put the on deck searchers back to
2 and figure out why you have so many overlapping searchers.
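
A hedged sketch of the corresponding solrconfig.xml fragment (the cache sizes
and the autowarmCount of 16 are illustrative values, not taken from the thread;
the point is modest autowarm counts and maxWarmingSearchers back at 2):

  <query>
    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="16"/>
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>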

Best,
Erick

On Tue, Mar 29, 2016 at 8:57 PM, YouPeng Yang  wrote:
> Hi Toke
>   The number of collections is just 10. One collection has 43 shards, each
> shard has two replicas. We continue importing data from Oracle all the time
> while our systems provide the searching service.
>There are "Overlapping onDeckSearchers" in my solr.logs. What is the
> meaning of "Overlapping onDeckSearchers"? We set
> <maxWarmingSearchers>20</maxWarmingSearchers> and
> <useColdSearcher>true</useColdSearcher>. Is it right?
>
>
>
> Best Regard.
>
>
> 2016-03-29 22:31 GMT+08:00 Toke Eskildsen :
>
>> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
>> >   Our system still goes down as times going.We found lots of threads are
>> > WAITING.Here is the threaddump that I copy from the web page.And 4
>> pictures
>> > for it.
>> >   Is there any relationship with my problem?
>>
>> That is a lot of commitScheduler-threads. Do you have hundreds of
>> collections in your cloud?
>>
>>
>> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
>> if you got caught in a downwards spiral of concurrent commits.
>>
>> - Toke Eskildsen, State and University Library, Denmark
>>
>>
>>


Re: High Cpu sys usage

2016-03-29 Thread YouPeng Yang
Hi Toke
  The number of collections is just 10. One collection has 43 shards, each
shard has two replicas. We continue importing data from Oracle all the time
while our systems provide the searching service.
   There are "Overlapping onDeckSearchers" in my solr.logs. What is the
meaning of "Overlapping onDeckSearchers"? We set
<maxWarmingSearchers>20</maxWarmingSearchers> and
<useColdSearcher>true</useColdSearcher>. Is it right?
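
For reference, the settings described above would sit in the <query> section of
solrconfig.xml roughly like this (reconstructed from this message, not copied
from an actual config; as the replies note, raising maxWarmingSearchers to 20
hides the "Overlapping onDeckSearchers" warning rather than fixing its cause):

  <query>
    <maxWarmingSearchers>20</maxWarmingSearchers>
    <useColdSearcher>true</useColdSearcher>
  </query>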



Best Regard.


2016-03-29 22:31 GMT+08:00 Toke Eskildsen :

> On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
> >   Our system still goes down as times going.We found lots of threads are
> > WAITING.Here is the threaddump that I copy from the web page.And 4
> pictures
> > for it.
> >   Is there any relationship with my problem?
>
> That is a lot of commitScheduler-threads. Do you have hundreds of
> collections in your cloud?
>
>
> Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
> if you got caught in a downwards spiral of concurrent commits.
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>


Re: High Cpu sys usage

2016-03-29 Thread Toke Eskildsen
On Tue, 2016-03-29 at 20:12 +0800, YouPeng Yang wrote:
>   Our system still goes down as times going.We found lots of threads are
> WAITING.Here is the threaddump that I copy from the web page.And 4 pictures
> for it.
>   Is there any relationship with my problem?

That is a lot of commitScheduler-threads. Do you have hundreds of
collections in your cloud?


Try grepping for "Overlapping onDeckSearchers" in your solr.logs to see
if you got caught in a downwards spiral of concurrent commits.

- Toke Eskildsen, State and University Library, Denmark




Re: High Cpu sys usage

2016-03-29 Thread YouPeng Yang
Hi
  Our system still goes down as time goes on. We found lots of threads are
WAITING. Here is the thread dump that I copied from the web page, and 4
pictures of it.
  Is there any relationship with my problem?


https://www.dropbox.com/s/h3wyez091oouwck/threaddump?dl=0
https://www.dropbox.com/s/p3ctuxb3t1jgo2e/threaddump1.jpg?dl=0
https://www.dropbox.com/s/w0uy15h6z984ntw/threaddump2.jpg?dl=0
https://www.dropbox.com/s/0frskxdllxlz9ha/threaddump3.jpg?dl=0
https://www.dropbox.com/s/46ptnly1ngi9nb6/threaddump4.jpg?dl=0


Best Regards

2016-03-18 14:35 GMT+08:00 YouPeng Yang :

> Hi
>   To Patrick: Never mind. Thank you for your suggestion all the same.
>   To Otis: We do not use SPM. We monitor the JVM just using jstat because
> my system went well before, so we did not need other tools.
> But SPM is really awesome.
>
>   Still looking for help.
>
> Best Regards
>
> 2016-03-18 6:01 GMT+08:00 Patrick Plaatje :
>
>> Yeah, I didn't pay attention to the cached memory at all, my bad!
>>
>> I remember running into a similar situation a couple of years ago, one of
>> the things to investigate our memory profile was to produce a full heap
>> dump and manually analyse that using a tool like MAT.
>>
>> Cheers,
>> -patrick
>>
>>
>>
>>
>> On 17/03/2016, 21:58, "Otis Gospodnetić" 
>> wrote:
>>
>> >Hi,
>> >
>> >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje 
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> From the sar output you supplied, it looks like you might have a memory
>> >> issue on your hosts. The memory usage just before your crash seems to
>> be
>> >> *very* close to 100%. Even the slightest increase (Solr itself, or
>> possibly
>> >> by a system service) could caused the system crash. What are the
>> >> specifications of your hosts and how much memory are you allocating?
>> >
>> >
>> >That's normal actually - http://www.linuxatemyram.com/
>> >
>> >You *want* Linux to be using all your memory - you paid for it :)
>> >
>> >Otis
>> >--
>> >Monitoring - Log Management - Alerting - Anomaly Detection
>> >Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >
>> >
>> >
>> >
>> >>
>> >
>> >
>> >>
>> >>
>> >> On 16/03/2016, 14:52, "YouPeng Yang" 
>> wrote:
>> >>
>> >> >Hi
>> >> > It happened again,and worse thing is that my system went to crash.we
>> can
>> >> >even not connect to it with ssh.
>> >> > I use the sar command to capture the statistics information about
>> it.Here
>> >> >are my details:
>> >> >
>> >> >
>> >> >[1]cpu(by using sar -u),we have to restart our system just as the red
>> font
>> >> >LINUX RESTART in the logs.
>> >>
>> >>
>> >--
>> >> >03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
>> >> >91.40
>> >> >03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
>> >> >90.94
>> >> >03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
>> >> >90.34
>> >> >03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
>> >> >63.23
>> >> >03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
>> >> > 0.16
>> >> >Average:all  8.21  0.00  1.57  0.05  0.00
>> >> >90.17
>> >> >
>> >> >04:42:04 PM   LINUX RESTART
>> >> >
>> >> >04:50:01 PM CPU %user %nice   %system   %iowait%steal
>> >> >%idle
>> >> >05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
>> >> >95.75
>> >> >05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
>> >> >89.77
>> >> >05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
>> >> >92.11
>> >> >05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
>> >> >92.48
>> >> >05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
>> >> >92.93
>> >> >05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
>> >> >93.75
>> >>
>> >>
>> >--
>> >> >
>> >> >[2]mem(by using sar -r)
>> >>
>> >>
>> >--
>> >> >03:00:01 PM   1519272 196633272 99.23361112  76364340
>> 143574212
>> >> >47.77
>> >> >03:10:01 PM   1451764 196700780 99.27361196  76336340
>> 143581608
>> >> >47.77
>> >> >03:20:01 PM   1453400 196699144 99.27361448  76248584
>> 143551128
>> >> >47.76
>> >> >03:30:35 PM   1513844 196638700 99.24361648  76022016
>> 143828244
>> >> >47.85
>> >> >03:42:40 PM   1481108 196671436 99.25361676  75718320
>> 144478784
>> >> >48.07
>> >> >Average:  5051607 193100937 97.45362421  81775777
>> 142758861
>> >> >47.50
>> >> >
>> >> >04:42:04 PM   LINUX RESTART
>> >> >
>> >> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached
>> kbcommit
>> >> >%commit
>> >> >05:00:01 PM 154357132  43795412 22.10 92012  18648644
>> 134950460
>> >> >44.

Re: High Cpu sys usage

2016-03-20 Thread Otis Gospodnetić
Hi,

On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje 
wrote:

> Hi,
>
> From the sar output you supplied, it looks like you might have a memory
> issue on your hosts. The memory usage just before your crash seems to be
> *very* close to 100%. Even the slightest increase (Solr itself, or possibly
> by a system service) could caused the system crash. What are the
> specifications of your hosts and how much memory are you allocating?


That's normal actually - http://www.linuxatemyram.com/

You *want* Linux to be using all your memory - you paid for it :)

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/




>


>
>
> On 16/03/2016, 14:52, "YouPeng Yang"  wrote:
>
> >Hi
> > It happened again,and worse thing is that my system went to crash.we can
> >even not connect to it with ssh.
> > I use the sar command to capture the statistics information about it.Here
> >are my details:
> >
> >
> >[1]cpu(by using sar -u),we have to restart our system just as the red font
> >LINUX RESTART in the logs.
>
> >--
> >03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
> >91.40
> >03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
> >90.94
> >03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
> >90.34
> >03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
> >63.23
> >03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
> > 0.16
> >Average:all  8.21  0.00  1.57  0.05  0.00
> >90.17
> >
> >04:42:04 PM   LINUX RESTART
> >
> >04:50:01 PM CPU %user %nice   %system   %iowait%steal
> >%idle
> >05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
> >95.75
> >05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
> >89.77
> >05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
> >92.11
> >05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
> >92.48
> >05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
> >92.93
> >05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
> >93.75
>
> >--
> >
> >[2]mem(by using sar -r)
>
> >--
> >03:00:01 PM   1519272 196633272 99.23361112  76364340 143574212
> >47.77
> >03:10:01 PM   1451764 196700780 99.27361196  76336340 143581608
> >47.77
> >03:20:01 PM   1453400 196699144 99.27361448  76248584 143551128
> >47.76
> >03:30:35 PM   1513844 196638700 99.24361648  76022016 143828244
> >47.85
> >03:42:40 PM   1481108 196671436 99.25361676  75718320 144478784
> >48.07
> >Average:  5051607 193100937 97.45362421  81775777 142758861
> >47.50
> >
> >04:42:04 PM   LINUX RESTART
> >
> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit
> >%commit
> >05:00:01 PM 154357132  43795412 22.10 92012  18648644 134950460
> >44.90
> >05:10:01 PM 136468244  61684300 31.13219572  31709216 134966548
> >44.91
> >05:20:01 PM 135092452  63060092 31.82221488  32162324 134949788
> >44.90
> >05:30:01 PM 133410464  64742080 32.67233848  32793848 134976828
> >44.91
> >05:40:01 PM 132022052  66130492 33.37235812  33278908 135007268
> >44.92
> >05:50:01 PM 130630408  67522136 34.08237140  33900912 135099764
> >44.95
> >Average:136996792  61155752 30.86206645  30415642 134991776
> >44.91
>
> >--
> >
> >
> >As the blue font parts show that my hardware crash from 03:30:35.It is
> hung
> >up until I restart it manually at 04:42:04
> >ALl the above information just snapshot the performance when it crashed
> >while there is nothing cover the reason.I have also
> >check the /var/log/messages and find nothing useful.
> >
> >Note that I run the command- sar -v .It shows something abnormal:
>
> >
> >02:50:01 PM  11542262  9216 76446   258
> >03:00:01 PM  11645526  9536 76421   258
> >03:10:01 PM  11748690  9216 76451   258
> >03:20:01 PM  11850191  9152 76331   258
> >03:30:35 PM  11972313 10112132625   258
> >03:42:40 PM  12177319 13760340227   258
> >Average:  8293601  8950 68187   161
> >
> >04:42:04 PM   LINUX RESTART
> >
> >04:50:01 PM dentunusd   file-nr  inode-nrpty-nr
> >05:00:01 PM 35410  7616 35223 4
> >05:10:01 PM137320  7296 42632 6
> >05:20:01 PM247010  7296 

Re: High Cpu sys usage

2016-03-19 Thread Patrick Plaatje
Yeah, I didn't pay attention to the cached memory at all, my bad!

I remember running into a similar situation a couple of years ago; one of the
things we did to investigate our memory profile was to produce a full heap dump
and manually analyse it using a tool like MAT.

Cheers,
-patrick




On 17/03/2016, 21:58, "Otis Gospodnetić"  wrote:

>Hi,
>
>On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje 
>wrote:
>
>> Hi,
>>
>> From the sar output you supplied, it looks like you might have a memory
>> issue on your hosts. The memory usage just before your crash seems to be
>> *very* close to 100%. Even the slightest increase (Solr itself, or possibly
>> by a system service) could caused the system crash. What are the
>> specifications of your hosts and how much memory are you allocating?
>
>
>That's normal actually - http://www.linuxatemyram.com/
>
>You *want* Linux to be using all your memory - you paid for it :)
>
>Otis
>--
>Monitoring - Log Management - Alerting - Anomaly Detection
>Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>
>>
>
>
>>
>>
>> On 16/03/2016, 14:52, "YouPeng Yang"  wrote:
>>
>> >Hi
>> > It happened again,and worse thing is that my system went to crash.we can
>> >even not connect to it with ssh.
>> > I use the sar command to capture the statistics information about it.Here
>> >are my details:
>> >
>> >
>> >[1]cpu(by using sar -u),we have to restart our system just as the red font
>> >LINUX RESTART in the logs.
>>
>> >--
>> >03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
>> >91.40
>> >03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
>> >90.94
>> >03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
>> >90.34
>> >03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
>> >63.23
>> >03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
>> > 0.16
>> >Average:all  8.21  0.00  1.57  0.05  0.00
>> >90.17
>> >
>> >04:42:04 PM   LINUX RESTART
>> >
>> >04:50:01 PM CPU %user %nice   %system   %iowait%steal
>> >%idle
>> >05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
>> >95.75
>> >05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
>> >89.77
>> >05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
>> >92.11
>> >05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
>> >92.48
>> >05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
>> >92.93
>> >05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
>> >93.75
>>
>> >--
>> >
>> >[2]mem(by using sar -r)
>>
>> >--
>> >03:00:01 PM   1519272 196633272 99.23361112  76364340 143574212
>> >47.77
>> >03:10:01 PM   1451764 196700780 99.27361196  76336340 143581608
>> >47.77
>> >03:20:01 PM   1453400 196699144 99.27361448  76248584 143551128
>> >47.76
>> >03:30:35 PM   1513844 196638700 99.24361648  76022016 143828244
>> >47.85
>> >03:42:40 PM   1481108 196671436 99.25361676  75718320 144478784
>> >48.07
>> >Average:  5051607 193100937 97.45362421  81775777 142758861
>> >47.50
>> >
>> >04:42:04 PM   LINUX RESTART
>> >
>> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit
>> >%commit
>> >05:00:01 PM 154357132  43795412 22.10 92012  18648644 134950460
>> >44.90
>> >05:10:01 PM 136468244  61684300 31.13219572  31709216 134966548
>> >44.91
>> >05:20:01 PM 135092452  63060092 31.82221488  32162324 134949788
>> >44.90
>> >05:30:01 PM 133410464  64742080 32.67233848  32793848 134976828
>> >44.91
>> >05:40:01 PM 132022052  66130492 33.37235812  33278908 135007268
>> >44.92
>> >05:50:01 PM 130630408  67522136 34.08237140  33900912 135099764
>> >44.95
>> >Average:136996792  61155752 30.86206645  30415642 134991776
>> >44.91
>>
>> >--
>> >
>> >
>> >As the blue font parts show that my hardware crash from 03:30:35.It is
>> hung
>> >up until I restart it manually at 04:42:04
>> >ALl the above information just snapshot the performance when it crashed
>> >while there is nothing cover the reason.I have also
>> >check the /var/log/messages and find nothing useful.
>> >
>> >Note that I run the command- sar -v .It shows something abnormal:
>>
>> >
>> >02:50:01 PM  11542262  9216 76446   258
>> >03:00:01 PM  11645526  9536 76421   258
>> >03:10:01 PM  11748690  9216 76451   258

Re: High Cpu sys usage

2016-03-19 Thread Shawn Heisey
On 3/16/2016 8:27 PM, YouPeng Yang wrote:
> Hi Shawn
>Here is my top screenshot:
>
>https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0
>
>It is captured when my system is normal.And I have reduced the memory
> size down to 48GB originating  from 64GB.

It looks like you have at least two Solr instances on this machine, one
of which has over 600GB of index data, and the other has over 500GB of
data.  There may be as many as ten Solr instances, but I cannot tell for
sure what those Java processes are.

If my guess is correct, this means that there's over a terabyte of index
data, but you only have about 100GB of RAM available to cache it.  I
don't think this is enough RAM for good performance, even if the disks
are SSD.  You'll either need a lot more memory in each machine, or more
machines.  The data may need to be divided into more shards.

I am not seeing any evidence here of high CPU.  The system only shows
about 12 percent total CPU usage, and very little of it is system (kernel).

Thanks,
Shawn



Re: High Cpu sys usage

2016-03-19 Thread YouPeng Yang
Hi
 It happened again, and the worse thing is that my system crashed. We cannot
even connect to it with ssh.
 I used the sar command to capture statistics about it. Here
are my details:


[1] CPU (by using sar -u). We had to restart our system, marked by the
LINUX RESTART line in the logs (originally highlighted in red).
--
            CPU  %user  %nice  %system  %iowait  %steal   %idle
03:00:01 PM all   7.61   0.00     0.92     0.07    0.00   91.40
03:10:01 PM all   7.71   0.00     1.29     0.06    0.00   90.94
03:20:01 PM all   7.62   0.00     1.98     0.06    0.00   90.34
03:30:35 PM all   5.65   0.00    31.08     0.04    0.00   63.23
03:42:40 PM all  47.58   0.00    52.25     0.00    0.00    0.16
Average:    all   8.21   0.00     1.57     0.05    0.00   90.17

04:42:04 PM   LINUX RESTART

04:50:01 PM CPU  %user  %nice  %system  %iowait  %steal   %idle
05:00:01 PM all   3.49   0.00     0.62     0.15    0.00   95.75
05:10:01 PM all   9.03   0.00     0.92     0.28    0.00   89.77
05:20:01 PM all   7.06   0.00     0.78     0.05    0.00   92.11
05:30:01 PM all   6.67   0.00     0.79     0.06    0.00   92.48
05:40:01 PM all   6.26   0.00     0.76     0.05    0.00   92.93
05:50:01 PM all   5.49   0.00     0.71     0.05    0.00   93.75
--

[2] mem (by using sar -r)
--
            kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
03:00:01 PM   1519272  196633272     99.23     361112  76364340  143574212    47.77
03:10:01 PM   1451764  196700780     99.27     361196  76336340  143581608    47.77
03:20:01 PM   1453400  196699144     99.27     361448  76248584  143551128    47.76
03:30:35 PM   1513844  196638700     99.24     361648  76022016  143828244    47.85
03:42:40 PM   1481108  196671436     99.25     361676  75718320  144478784    48.07
Average:      5051607  193100937     97.45     362421  81775777  142758861    47.50

04:42:04 PM   LINUX RESTART

04:50:01 PM kbmemfree  kbmemused  %memused  kbbuffers  kbcached   kbcommit  %commit
05:00:01 PM 154357132   43795412     22.10      92012  18648644  134950460    44.90
05:10:01 PM 136468244   61684300     31.13     219572  31709216  134966548    44.91
05:20:01 PM 135092452   63060092     31.82     221488  32162324  134949788    44.90
05:30:01 PM 133410464   64742080     32.67     233848  32793848  134976828    44.91
05:40:01 PM 132022052   66130492     33.37     235812  33278908  135007268    44.92
05:50:01 PM 130630408   67522136     34.08     237140  33900912  135099764    44.95
Average:    136996792   61155752     30.86     206645  30415642  134991776    44.91
--


As the blue font parts show, my hardware crashed from 03:30:35. It was hung
up until I restarted it manually at 04:42:04.
All the above information just snapshots the performance when it crashed,
while nothing in it covers the reason. I have also
checked /var/log/messages and found nothing useful.

Note that I ran the command sar -v. It shows something abnormal:

02:50:01 PM  11542262  9216 76446   258
03:00:01 PM  11645526  9536 76421   258
03:10:01 PM  11748690  9216 76451   258
03:20:01 PM  11850191  9152 76331   258
03:30:35 PM  11972313 10112132625   258
03:42:40 PM  12177319 13760340227   258
Average:  8293601  8950 68187   161

04:42:04 PM   LINUX RESTART

04:50:01 PM dentunusd   file-nr  inode-nrpty-nr
05:00:01 PM 35410  7616 35223 4
05:10:01 PM137320  7296 42632 6
05:20:01 PM247010  7296 42839 9
05:30:01 PM358434  7360 42697 9
05:40:01 PM471543  7040 4292910
05:50:01 PM583787  7296 4283713


and I check the man info about the -v option :

*-v*  Report status of inode, file and other kernel tables.  The following
values are displayed:
   *dentunusd*
Number of unused cache entries in the directory cache.
*file-nr*
Number of file handles used by the system.
*inode-nr*
Number of inode handlers used by the system.
*pty-nr*
Number of pseudo-terminals used by the system.


Is there any clue about the crash? Would you please give me some suggestions?


Best Regards.


2016-03-16 14:01 GMT+08:00 YouPeng Yang :

> Hello
>The problem appears several times ,however I could not capture the top
> 

Re: High Cpu sys usage

2016-03-19 Thread YouPeng Yang
Hi Shawn
   Here is my top screenshot:

   https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0

   It is captured when my system is normal. And I have reduced the memory
size down to 48GB from the original 64GB.


  We have two hardware clusters, each comprised of 3 machines, and on one
cluster we deploy 3 different SolrCloud application clusters. The above top
screenshot is from the machine that crashed at 4:30 PM yesterday.

  For convenience, I post a top screenshot of another machine in the other
cluster:

   https://www.dropbox.com/s/p3j3bpcl8l2i1nt/another64GBnodeTop.jpg?dl=0

  On this machine, the biggest SolrCloud node, whose JVM memory size is
64GB, holds 730GB of index. The machine hung up for a long time just at
midnight yesterday.

We also captured the iotop output when it hung up.

   https://www.dropbox.com/s/keqqjabmon9f1ea/anthoer64GBnodeIotop.jpg?dl=0

As the iotop output shows, the process jbd2 is writing heavily. I think it will
be helpful.

Best Regards





2016-03-17 7:35 GMT+08:00 Shawn Heisey :

> On 3/16/2016 8:59 AM, Patrick Plaatje wrote:
> > From the sar output you supplied, it looks like you might have a memory
> issue on your hosts. The memory usage just before your crash seems to be
> *very* close to 100%. Even the slightest increase (Solr itself, or possibly
> by a system service) could caused the system crash. What are the
> specifications of your hosts and how much memory are you allocating?
>
> It's completely normal for a machine, especially a machine running Solr
> with a very large index, to run at nearly 100% memory usage.  The
> "Average" line from the sar output indicates 97.45 percent usage, but it
> also shows 81GB of memory in the "kbcached" column -- this is memory
> that can be instantly claimed by any program that asks for it.  If we
> discount this 81GB, since it is instantly available, the "true" memory
> usage is closer to 70 percent than 100.
>
> https://en.wikipedia.org/wiki/Page_cache
>
> If YouPeng can run top and sort it by memory usage (press shift-M), then
> grab a screenshot, that will be helpful for more insight.  Here's an
> example of this from one of my servers, shared on dropbox:
>
> https://www.dropbox.com/s/qfuxhw20q0y1ckx/linux-8gb-heap.png?dl=0
>
> This is a server with 64GB of RAM and 110GB of index data.  About 48GB
> of my memory is used by the disk cache.  I've got slightly less than
> half my index data in the cache.
>
> Thanks,
> Shawn
>
>


Re: High Cpu sys usage

2016-03-19 Thread Shawn Heisey
On 3/16/2016 8:59 AM, Patrick Plaatje wrote:
> From the sar output you supplied, it looks like you might have a memory issue 
> on your hosts. The memory usage just before your crash seems to be *very* 
> close to 100%. Even the slightest increase (Solr itself, or possibly by a 
> system service) could caused the system crash. What are the specifications of 
> your hosts and how much memory are you allocating?

It's completely normal for a machine, especially a machine running Solr
with a very large index, to run at nearly 100% memory usage.  The
"Average" line from the sar output indicates 97.45 percent usage, but it
also shows 81GB of memory in the "kbcached" column -- this is memory
that can be instantly claimed by any program that asks for it.  If we
discount this 81GB, since it is instantly available, the "true" memory
usage is closer to 70 percent than 100.

https://en.wikipedia.org/wiki/Page_cache

If YouPeng can run top and sort it by memory usage (press shift-M), then
grab a screenshot, that will be helpful for more insight.  Here's an
example of this from one of my servers, shared on dropbox:

https://www.dropbox.com/s/qfuxhw20q0y1ckx/linux-8gb-heap.png?dl=0

This is a server with 64GB of RAM and 110GB of index data.  About 48GB
of my memory is used by the disk cache.  I've got slightly less than
half my index data in the cache.

Thanks,
Shawn



Re: High Cpu sys usage

2016-03-19 Thread YouPeng Yang
Hi
  To Patrick: Never mind. Thank you for your suggestion all the same.
  To Otis: We do not use SPM. We monitor the JVM just using jstat because my
system went well before, so we did not need other tools.
But SPM is really awesome.

  Still looking for help.

Best Regards

2016-03-18 6:01 GMT+08:00 Patrick Plaatje :

> Yeah, I didn't pay attention to the cached memory at all, my bad!
>
> I remember running into a similar situation a couple of years ago, one of
> the things to investigate our memory profile was to produce a full heap
> dump and manually analyse that using a tool like MAT.
>
> Cheers,
> -patrick
>
>
>
>
> On 17/03/2016, 21:58, "Otis Gospodnetić" 
> wrote:
>
> >Hi,
> >
> >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje 
> >wrote:
> >
> >> Hi,
> >>
> >> From the sar output you supplied, it looks like you might have a memory
> >> issue on your hosts. The memory usage just before your crash seems to be
> >> *very* close to 100%. Even the slightest increase (Solr itself, or
> possibly
> >> by a system service) could caused the system crash. What are the
> >> specifications of your hosts and how much memory are you allocating?
> >
> >
> >That's normal actually - http://www.linuxatemyram.com/
> >
> >You *want* Linux to be using all your memory - you paid for it :)
> >
> >Otis
> >--
> >Monitoring - Log Management - Alerting - Anomaly Detection
> >Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >
> >>
> >
> >
> >>
> >>
> >> On 16/03/2016, 14:52, "YouPeng Yang"  wrote:
> >>
> >> >Hi
> >> > It happened again,and worse thing is that my system went to crash.we
> can
> >> >even not connect to it with ssh.
> >> > I use the sar command to capture the statistics information about
> it.Here
> >> >are my details:
> >> >
> >> >
> >> >[1]cpu(by using sar -u),we have to restart our system just as the red
> font
> >> >LINUX RESTART in the logs.
> >>
> >>
> >--
> >> >03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
> >> >91.40
> >> >03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
> >> >90.94
> >> >03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
> >> >90.34
> >> >03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
> >> >63.23
> >> >03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
> >> > 0.16
> >> >Average:all  8.21  0.00  1.57  0.05  0.00
> >> >90.17
> >> >
> >> >04:42:04 PM   LINUX RESTART
> >> >
> >> >04:50:01 PM CPU %user %nice   %system   %iowait%steal
> >> >%idle
> >> >05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
> >> >95.75
> >> >05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
> >> >89.77
> >> >05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
> >> >92.11
> >> >05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
> >> >92.48
> >> >05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
> >> >92.93
> >> >05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
> >> >93.75
> >>
> >>
> >--
> >> >
> >> >[2]mem(by using sar -r)
> >>
> >>
> >--
> >> >03:00:01 PM   1519272 196633272 99.23361112  76364340 143574212
> >> >47.77
> >> >03:10:01 PM   1451764 196700780 99.27361196  76336340 143581608
> >> >47.77
> >> >03:20:01 PM   1453400 196699144 99.27361448  76248584 143551128
> >> >47.76
> >> >03:30:35 PM   1513844 196638700 99.24361648  76022016 143828244
> >> >47.85
> >> >03:42:40 PM   1481108 196671436 99.25361676  75718320 144478784
> >> >48.07
> >> >Average:  5051607 193100937 97.45362421  81775777 142758861
> >> >47.50
> >> >
> >> >04:42:04 PM   LINUX RESTART
> >> >
> >> >04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit
> >> >%commit
> >> >05:00:01 PM 154357132  43795412 22.10 92012  18648644 134950460
> >> >44.90
> >> >05:10:01 PM 136468244  61684300 31.13219572  31709216 134966548
> >> >44.91
> >> >05:20:01 PM 135092452  63060092 31.82221488  32162324 134949788
> >> >44.90
> >> >05:30:01 PM 133410464  64742080 32.67233848  32793848 134976828
> >> >44.91
> >> >05:40:01 PM 132022052  66130492 33.37235812  33278908 135007268
> >> >44.92
> >> >05:50:01 PM 130630408  67522136 34.08237140  33900912 135099764
> >> >44.95
> >> >Average:136996792  61155752 30.86206645  30415642 134991776
> >> >44.91
> >>
> >>
> >--
> >> >
> >> >
> >> >As the blue font parts show that my hardware crash from 03:30:35.It is

Re: High Cpu sys usage

2016-03-19 Thread YouPeng Yang
Hi Shawn

   Actually, there are three Solr instances (the top three PIDs are the three
instances), and their data file sizes are 851G, 592G and 49G respectively,
and more and more data will be added as time goes on. I think a SolrCloud
service of this scale may be rare, and
it is now one of the most important core services in my company.
Just as you suggest, the increasing size of the data makes us divide our
SolrCloud service into smaller application clusters, and we do have
separated our collections into smaller shards. And I know there must be
some abnormal things on the service as time goes on; however, the
high sys CPU with unknown cause is right now a nightmare. So I look for help
from our community.
   Would you have had the same experience as me, and how did you solve this problem?




Best Regards




2016-03-17 14:16 GMT+08:00 Shawn Heisey :

> On 3/16/2016 8:27 PM, YouPeng Yang wrote:
> > Hi Shawn
> >Here is my top screenshot:
> >
> >https://www.dropbox.com/s/jaw10mkmipz943y/topscreen.jpg?dl=0
> >
> >It is captured when my system is normal.And I have reduced the memory
> > size down to 48GB originating  from 64GB.
>
> It looks like you have at least two Solr instances on this machine, one
> of which has over 600GB of index data, and the other has over 500GB of
> data.  There may be as many as ten Solr instances, but I cannot tell for
> sure what those Java processes are.
>
> If my guess is correct, this means that there's over a terabyte of index
> data, but you only have about 100GB of RAM available to cache it.  I
> don't think this is enough RAM for good performance, even if the disks
> are SSD.  You'll either need a lot more memory in each machine, or more
> machines.  The data may need to be divided into more shards.
>
> I am not seeing any evidence here of high CPU.  The system only shows
> about 12 percent total CPU usage, and very little of it is system (kernel).
>
> Thanks,
> Shawn
>
>


Re: High Cpu sys usage

2016-03-19 Thread Patrick Plaatje
Hi,

From the sar output you supplied, it looks like you might have a memory issue
on your hosts. The memory usage just before your crash seems to be *very*
close to 100%. Even the slightest increase (Solr itself, or possibly by a
system service) could have caused the system crash. What are the specifications
of your hosts and how much memory are you allocating?

Cheers,
-patrick




On 16/03/2016, 14:52, "YouPeng Yang"  wrote:

>Hi
> It happened again,and worse thing is that my system went to crash.we can
>even not connect to it with ssh.
> I use the sar command to capture the statistics information about it.Here
>are my details:
>
>
>[1]cpu(by using sar -u),we have to restart our system just as the red font
>LINUX RESTART in the logs.
>--
>03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
>91.40
>03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
>90.94
>03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
>90.34
>03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
>63.23
>03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
> 0.16
>Average:all  8.21  0.00  1.57  0.05  0.00
>90.17
>
>04:42:04 PM   LINUX RESTART
>
>04:50:01 PM CPU %user %nice   %system   %iowait%steal
>%idle
>05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
>95.75
>05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
>89.77
>05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
>92.11
>05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
>92.48
>05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
>92.93
>05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
>93.75
>--
>
>[2]mem(by using sar -r)
>--
>03:00:01 PM   1519272 196633272 99.23361112  76364340 143574212
>47.77
>03:10:01 PM   1451764 196700780 99.27361196  76336340 143581608
>47.77
>03:20:01 PM   1453400 196699144 99.27361448  76248584 143551128
>47.76
>03:30:35 PM   1513844 196638700 99.24361648  76022016 143828244
>47.85
>03:42:40 PM   1481108 196671436 99.25361676  75718320 144478784
>48.07
>Average:  5051607 193100937 97.45362421  81775777 142758861
>47.50
>
>04:42:04 PM   LINUX RESTART
>
>04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit
>%commit
>05:00:01 PM 154357132  43795412 22.10 92012  18648644 134950460
>44.90
>05:10:01 PM 136468244  61684300 31.13219572  31709216 134966548
>44.91
>05:20:01 PM 135092452  63060092 31.82221488  32162324 134949788
>44.90
>05:30:01 PM 133410464  64742080 32.67233848  32793848 134976828
>44.91
>05:40:01 PM 132022052  66130492 33.37235812  33278908 135007268
>44.92
>05:50:01 PM 130630408  67522136 34.08237140  33900912 135099764
>44.95
>Average:136996792  61155752 30.86206645  30415642 134991776
>44.91
>--
>
>
>As the blue font parts show that my hardware crash from 03:30:35.It is hung
>up until I restart it manually at 04:42:04
>ALl the above information just snapshot the performance when it crashed
>while there is nothing cover the reason.I have also
>check the /var/log/messages and find nothing useful.
>
>Note that I run the command- sar -v .It shows something abnormal:
>
>02:50:01 PM  11542262  9216 76446   258
>03:00:01 PM  11645526  9536 76421   258
>03:10:01 PM  11748690  9216 76451   258
>03:20:01 PM  11850191  9152 76331   258
>03:30:35 PM  11972313 10112132625   258
>03:42:40 PM  12177319 13760340227   258
>Average:  8293601  8950 68187   161
>
>04:42:04 PM   LINUX RESTART
>
>04:50:01 PM dentunusd   file-nr  inode-nrpty-nr
>05:00:01 PM 35410  7616 35223 4
>05:10:01 PM137320  7296 42632 6
>05:20:01 PM247010  7296 42839 9
>05:30:01 PM358434  7360 42697 9
>05:40:01 PM471543  7040 4292910
>05:50:01 PM583787  7296 4283713
>
>
>and I check the man info about the -v option :
>
>*-v*  Report status of inode, file and other kernel tables.  The following
>values are displayed:
>   *dentun

Re: High Cpu sys usage

2016-03-19 Thread Otis Gospodnetić
Hi,

I looked at those metrics outputs, but nothing jumps out at me as
problematic.

How full are your JVM heap memory pools?  If you are using SPM to monitor
your Solr/Tomcat/Jetty/... look for a chart that looks like this:
https://apps.sematext.com/spm-reports/s/zB3JcdZyRn

If some of these lines are close to 100% and stay close or at 100%, that's
typically a bad sign.
Next, look at your Garbage Collection times and counts.  If you look at
your GC metrics for e.g. a month and see a recent increase in GC times or
counts then, yes, you have an issue with your memory/heap and that is what
is increasing your CPU usage.

If it looks like heap/GC are not the issue and it's really something inside
Solr, you could profile it with either one of the standard profilers or
something like
https://sematext.com/blog/2016/03/17/on-demand-java-profiling/ .  If there
is something in Solr chewing on the CPU, this should show it.

I hope this helps.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Wed, Mar 16, 2016 at 10:52 AM, YouPeng Yang 
wrote:

> Hi
>  It happened again,and worse thing is that my system went to crash.we can
> even not connect to it with ssh.
>  I use the sar command to capture the statistics information about it.Here
> are my details:
>
>
> [1]cpu(by using sar -u),we have to restart our system just as the red font
> LINUX RESTART in the logs.
>
> --
> 03:00:01 PM all  7.61  0.00  0.92  0.07  0.00
> 91.40
> 03:10:01 PM all  7.71  0.00  1.29  0.06  0.00
> 90.94
> 03:20:01 PM all  7.62  0.00  1.98  0.06  0.00
> 90.34
> 03:30:35 PM all  5.65  0.00 31.08  0.04  0.00
> 63.23
> 03:42:40 PM all 47.58  0.00 52.25  0.00  0.00
>  0.16
> Average:all  8.21  0.00  1.57  0.05  0.00
> 90.17
>
> 04:42:04 PM   LINUX RESTART
>
> 04:50:01 PM CPU %user %nice   %system   %iowait%steal
> %idle
> 05:00:01 PM all  3.49  0.00  0.62  0.15  0.00
> 95.75
> 05:10:01 PM all  9.03  0.00  0.92  0.28  0.00
> 89.77
> 05:20:01 PM all  7.06  0.00  0.78  0.05  0.00
> 92.11
> 05:30:01 PM all  6.67  0.00  0.79  0.06  0.00
> 92.48
> 05:40:01 PM all  6.26  0.00  0.76  0.05  0.00
> 92.93
> 05:50:01 PM all  5.49  0.00  0.71  0.05  0.00
> 93.75
>
> --
>
> [2]mem(by using sar -r)
>
> --
> 03:00:01 PM   1519272 196633272 99.23361112  76364340 143574212
> 47.77
> 03:10:01 PM   1451764 196700780 99.27361196  76336340 143581608
> 47.77
> 03:20:01 PM   1453400 196699144 99.27361448  76248584 143551128
> 47.76
> 03:30:35 PM   1513844 196638700 99.24361648  76022016 143828244
> 47.85
> 03:42:40 PM   1481108 196671436 99.25361676  75718320 144478784
> 48.07
> Average:  5051607 193100937 97.45362421  81775777 142758861
> 47.50
>
> 04:42:04 PM   LINUX RESTART
>
> 04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit
> %commit
> 05:00:01 PM 154357132  43795412 22.10 92012  18648644 134950460
> 44.90
> 05:10:01 PM 136468244  61684300 31.13219572  31709216 134966548
> 44.91
> 05:20:01 PM 135092452  63060092 31.82221488  32162324 134949788
> 44.90
> 05:30:01 PM 133410464  64742080 32.67233848  32793848 134976828
> 44.91
> 05:40:01 PM 132022052  66130492 33.37235812  33278908 135007268
> 44.92
> 05:50:01 PM 130630408  67522136 34.08237140  33900912 135099764
> 44.95
> Average:136996792  61155752 30.86206645  30415642 134991776
> 44.91
>
> --
>
>
> As the blue font parts show that my hardware crash from 03:30:35.It is hung
> up until I restart it manually at 04:42:04
> ALl the above information just snapshot the performance when it crashed
> while there is nothing cover the reason.I have also
> check the /var/log/messages and find nothing useful.
>
> Note that I run the command- sar -v .It shows something abnormal:
>
> 
> 02:50:01 PM  11542262  9216 76446   258
> 03:00:01 PM  11645526  9536 76421   258
> 03:10:01 PM  11748690  9216 76451   258
> 03:20:01 PM  11850191  9152 76331   258
> 03:30:35 PM  11972313 10112132625   258
> 03:42:40 PM  12177319 13760340227   258
> Average:  8293601

Re: High Cpu sys usage

2016-03-15 Thread YouPeng Yang
Hello
   The problem has appeared several times, however I could not capture the top
output. My script is as follows.
I check whether the sys CPU usage exceeds 30%; the other metric
information can be dumped successfully, except the top output.
Would you please check my script? I am not able to figure out what is
wrong.

-
#!/bin/bash

while :
do
    # prints 1 if the %sys column (6th field of the mpstat sample line) is below 30, else 0
    sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')

    if [ "$sysusage" -eq 0 ]; then
        #echo $sysusage
        #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000 sleep 30
        file=$(date +%Y%m%d%H%M%S)
        # top must run in batch mode (-b) when its output is redirected to a file,
        # otherwise it produces no usable snapshot -- likely why the top dump was missing
        top -b -n 2 >> top$file.data
        iotop -b -n 2 >> iotop$file.data
        iostat >> iostat$file.data
        netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
    fi
    sleep 5
done


-

2016-03-08 21:39 GMT+08:00 YouPeng Yang :

> Hi all
>   Thanks for your reply.I do some investigation for much time.and I will
> post some logs of the 'top' and IO in a few days when the crash come again.
>
> 2016-03-08 10:45 GMT+08:00 Shawn Heisey :
>
>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
>> > How does this relate to YouPeng reporting that the CPU usage increases?
>> >
>> > This is not a snark. YouPeng mentions kernel issues. It might very well
>> > be that IO is the real problem, but that it manifests in a non-intuitive
>> > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
>> > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
>> > system is struggling, even if IO-Wait is low?
>>
>> It might turn out to be not directly related to memory, you're right
>> about that.  A very high query rate or particularly CPU-heavy queries or
>> analysis could cause high CPU usage even when memory is plentiful, but
>> in that situation I would expect high user percentage, not kernel.  I'm
>> not completely sure what might cause high kernel usage if iowait is low,
>> but no specific information was given about iowait.  I've seen iowait
>> percentages of 10% or less with problems clearly caused by iowait.
>>
>> With the available information (especially seeing 700GB of index data),
>> I believe that the "not enough memory" scenario is more likely than
>> anything else.  If the OP replies and says they have plenty of memory,
>> then we can move on to the less common (IMHO) reasons for high CPU with
>> a large index.
>>
>> If the OS is one that reports load average, I am curious what the 5
>> minute average is, and how many real (non-HT) CPU cores there are.
>>
>> Thanks,
>> Shawn
>>
>>
>


Re: High Cpu sys usage

2016-03-08 Thread YouPeng Yang
Hi all
  Thanks for your reply. I have been doing some investigation for a while, and I will
post some logs of 'top' and IO in a few days when the crash comes again.

2016-03-08 10:45 GMT+08:00 Shawn Heisey :

> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> > How does this relate to YouPeng reporting that the CPU usage increases?
> >
> > This is not a snark. YouPeng mentions kernel issues. It might very well
> > be that IO is the real problem, but that it manifests in a non-intuitive
> > way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
> > not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
> > system is struggling, even if IO-Wait is low?
>
> It might turn out to be not directly related to memory, you're right
> about that.  A very high query rate or particularly CPU-heavy queries or
> analysis could cause high CPU usage even when memory is plentiful, but
> in that situation I would expect high user percentage, not kernel.  I'm
> not completely sure what might cause high kernel usage if iowait is low,
> but no specific information was given about iowait.  I've seen iowait
> percentages of 10% or less with problems clearly caused by iowait.
>
> With the available information (especially seeing 700GB of index data),
> I believe that the "not enough memory" scenario is more likely than
> anything else.  If the OP replies and says they have plenty of memory,
> then we can move on to the less common (IMHO) reasons for high CPU with
> a large index.
>
> If the OS is one that reports load average, I am curious what the 5
> minute average is, and how many real (non-HT) CPU cores there are.
>
> Thanks,
> Shawn
>
>


Re: High Cpu sys usage

2016-03-07 Thread Shawn Heisey
On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> How does this relate to YouPeng reporting that the CPU usage increases?
>
> This is not a snark. YouPeng mentions kernel issues. It might very well
> be that IO is the real problem, but that it manifests in a non-intuitive
> way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
> not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
> system is struggling, even if IO-Wait is low?

It might turn out to be not directly related to memory, you're right
about that.  A very high query rate or particularly CPU-heavy queries or
analysis could cause high CPU usage even when memory is plentiful, but
in that situation I would expect high user percentage, not kernel.  I'm
not completely sure what might cause high kernel usage if iowait is low,
but no specific information was given about iowait.  I've seen iowait
percentages of 10% or less with problems clearly caused by iowait.

With the available information (especially seeing 700GB of index data),
I believe that the "not enough memory" scenario is more likely than
anything else.  If the OP replies and says they have plenty of memory,
then we can move on to the less common (IMHO) reasons for high CPU with
a large index.

If the OS is one that reports load average, I am curious what the 5
minute average is, and how many real (non-HT) CPU cores there are.

Thanks,
Shawn



Re: High Cpu sys usage

2016-03-07 Thread Toke Eskildsen
On Sun, 2016-03-06 at 08:26 -0700, Shawn Heisey wrote:
> On 3/5/2016 11:44 PM, YouPeng Yang wrote:
> >   We are using Solr Cloud 4.6 in our production for searching service
> > since 2 years ago.And now it has 700GB in one cluster which is  comprised
> > of 3 machines with ssd. At beginning ,everything go well,while more and
> > more business services interfered with our searching service .And a problem
> >  which we haunted with is just like a  nightmare . That is the cpu sys
> > usage is often growing up to  over 10% even higher, and as a result the
> > machine will hang down because system resources have be drained out.We have
> > to restart the machine manually.

> One of the most common reasons for performance issues with Solr is not
> having enough system memory to effectively cache the index. [...]

How does this relate to YouPeng reporting that the CPU usage increases?

This is not a snark. YouPeng mentions kernel issues. It might very well
be that IO is the real problem, but that it manifests in a non-intuitive
way. Before memory-mapping it was easy: Just look at IO-Wait. Now I am
not so sure. Can high kernel load (Sy% in *nix top) indicate that the IO
system is struggling, even if IO-Wait is low?

YouPeng: If you are on a *nix-system, can you call 'top' on a machine
and copy-paste the output somewhere we can see?


- Toke Eskildsen, State and University Library, Denmark




Re: High Cpu sys usage

2016-03-06 Thread Shawn Heisey
On 3/5/2016 11:44 PM, YouPeng Yang wrote:
>   We have been using Solr Cloud 4.6 in our production for searching service
> for 2 years. And now it has 700GB in one cluster which is comprised
> of 3 machines with SSD. At the beginning everything went well, while more and
> more business services interfered with our searching service. And a problem
> which haunts us is just like a nightmare: the CPU sys
> usage often grows up to over 10% or even higher, and as a result the
> machine will hang because system resources have been drained out. We have
> to restart the machine manually.

One of the most common reasons for performance issues with Solr is not
having enough system memory to effectively cache the index.  Another is
running with a heap that's too small, or a heap that's really large with
ineffective garbage collection tuning.  All of these problems get worse
as query rate climbs.

Running on SSD can reduce, but not eliminate, the requirement for plenty
of system memory.

With 700GB of index data, you are likely to need somewhere between 128GB
and 512GB of memory for good performance.  If the query rate is high,
then requirements are more likely to land in the upper end of that
range.  There's no way for me to narrow that range down -- it depends on
a number of factors, and usually has to be determined through trial and
error.  If the data were on regular disks instead of SSD, I would be
recommending even more memory.

https://wiki.apache.org/solr/SolrPerformanceProblems
https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

If you want a single number recommendation for memory size, I would
recommend starting with 256GB, and being ready to add more.  It is very
common for servers to be incapable of handling that much memory,
though.  The servers that I use for Solr max out at 64GB.

You might need to split your index onto additional machines by sharding
it, and gain the additional memory that way.

Thanks,
Shawn