Hi, I looked at those metrics outputs, but nothing jumps out at me as
problematic.

How full are your JVM heap memory pools? If you are using SPM to monitor
your Solr/Tomcat/Jetty/..., look for a chart that looks like this:
https://apps.sematext.com/spm-reports/s/zB3JcdZyRn
If some of those lines are close to 100% and stay there, that's typically
a bad sign.

Next, look at your garbage collection times and counts. If you look at
your GC metrics over, say, a month and see a recent increase in GC times
or counts, then yes, you have an issue with your memory/heap, and that is
what is increasing your CPU usage.
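If you are not running SPM against that JVM, jstat from the JDK gives a
rough view of the same pools and GC counters; a minimal sketch (assuming a
HotSpot JDK, with pgrep -f start.jar as just one hypothetical way to find
the Solr PID):

    # Sample heap pool occupancy (%) and cumulative GC counts/times every 5s.
    # An old gen (O column) that stays near 100% while FGC/GCT keep climbing
    # is the same bad sign described above.
    SOLR_PID=$(pgrep -f start.jar)   # adjust to however your Solr is launched
    jstat -gcutil "$SOLR_PID" 5000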
If it looks like heap/GC are not the issue and it's really something
inside Solr, you could profile it with one of the standard profilers or
something like https://sematext.com/blog/2016/03/17/on-demand-java-profiling/
If there is something in Solr chewing on the CPU, this should show it.

I hope this helps.

Otis
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/


On Wed, Mar 16, 2016 at 10:52 AM, YouPeng Yang <yypvsxf19870...@gmail.com> wrote:
> Hi
> It happened again, and this time it was worse: the system crashed, and we
> could not even connect to it over SSH.
> I used the sar command to capture statistics about it. Here are the
> details:
>
> [1] CPU (sar -u). We had to restart the system manually; the restart
> shows up as the LINUX RESTART marker in the log below.
>
> --------------------------------------------------------------------------------------------------
>                    CPU     %user     %nice   %system   %iowait    %steal     %idle
> 03:00:01 PM        all      7.61      0.00      0.92      0.07      0.00     91.40
> 03:10:01 PM        all      7.71      0.00      1.29      0.06      0.00     90.94
> 03:20:01 PM        all      7.62      0.00      1.98      0.06      0.00     90.34
> 03:30:35 PM        all      5.65      0.00     31.08      0.04      0.00     63.23
> 03:42:40 PM        all     47.58      0.00     52.25      0.00      0.00      0.16
> Average:           all      8.21      0.00      1.57      0.05      0.00     90.17
>
> 04:42:04 PM        LINUX RESTART
>
> 04:50:01 PM        CPU     %user     %nice   %system   %iowait    %steal     %idle
> 05:00:01 PM        all      3.49      0.00      0.62      0.15      0.00     95.75
> 05:10:01 PM        all      9.03      0.00      0.92      0.28      0.00     89.77
> 05:20:01 PM        all      7.06      0.00      0.78      0.05      0.00     92.11
> 05:30:01 PM        all      6.67      0.00      0.79      0.06      0.00     92.48
> 05:40:01 PM        all      6.26      0.00      0.76      0.05      0.00     92.93
> 05:50:01 PM        all      5.49      0.00      0.71      0.05      0.00     93.75
> --------------------------------------------------------------------------------------------------
>
> [2] Memory (sar -r)
>
> --------------------------------------------------------------------------------------------------
>             kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
> 03:00:01 PM   1519272 196633272     99.23    361112  76364340 143574212     47.77
> 03:10:01 PM   1451764 196700780     99.27    361196  76336340 143581608     47.77
> 03:20:01 PM   1453400 196699144     99.27    361448  76248584 143551128     47.76
> 03:30:35 PM   1513844 196638700     99.24    361648  76022016 143828244     47.85
> 03:42:40 PM   1481108 196671436     99.25    361676  75718320 144478784     48.07
> Average:      5051607 193100937     97.45    362421  81775777 142758861     47.50
>
> 04:42:04 PM   LINUX RESTART
>
> 04:50:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
> 05:00:01 PM 154357132  43795412     22.10     92012  18648644 134950460     44.90
> 05:10:01 PM 136468244  61684300     31.13    219572  31709216 134966548     44.91
> 05:20:01 PM 135092452  63060092     31.82    221488  32162324 134949788     44.90
> 05:30:01 PM 133410464  64742080     32.67    233848  32793848 134976828     44.91
> 05:40:01 PM 132022052  66130492     33.37    235812  33278908 135007268     44.92
> 05:50:01 PM 130630408  67522136     34.08    237140  33900912 135099764     44.95
> Average:    136996792  61155752     30.86    206645  30415642 134991776     44.91
> --------------------------------------------------------------------------------------------------
>
> As the 03:30:35 and 03:42:40 samples show, the machine began to fail at
> about 03:30:35 and hung until I restarted it manually at 04:42:04.
> All of the above only snapshots the system's state around the crash;
> nothing in it explains the cause. I have also checked /var/log/messages
> and found nothing useful.
>
> Note that running sar -v shows something abnormal:
>
> ------------------------------------------------------------------------------------------------
>              dentunusd   file-nr  inode-nr    pty-nr
> 02:50:01 PM   11542262      9216     76446       258
> 03:00:01 PM   11645526      9536     76421       258
> 03:10:01 PM   11748690      9216     76451       258
> 03:20:01 PM   11850191      9152     76331       258
> 03:30:35 PM   11972313     10112    132625       258
> 03:42:40 PM   12177319     13760    340227       258
> Average:       8293601      8950     68187       161
>
> 04:42:04 PM   LINUX RESTART
>
> 04:50:01 PM  dentunusd   file-nr  inode-nr    pty-nr
> 05:00:01 PM      35410      7616     35223         4
> 05:10:01 PM     137320      7296     42632         6
> 05:20:01 PM     247010      7296     42839         9
> 05:30:01 PM     358434      7360     42697         9
> 05:40:01 PM     471543      7040     42929        10
> 05:50:01 PM     583787      7296     42837        13
> ------------------------------------------------------------------------------------------------
>
> and here is what the man page says about the -v option:
>
> ------------------------------------------------------------------------------------------------
> *-v* Report status of inode, file and other kernel tables. The following
> values are displayed:
> *dentunusd* Number of unused cache entries in the directory cache.
> *file-nr* Number of file handles used by the system.
> *inode-nr* Number of inode handlers used by the system.
> *pty-nr* Number of pseudo-terminals used by the system.
> ------------------------------------------------------------------------------------------------
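>
> The next time this happens I will also poll these kernel tables directly
> from /proc, at a finer grain than sar's 10-minute samples; a rough sketch
> (assuming the standard Linux /proc layout):
>
>     # dentry-state fields: nr_dentry nr_unused age_limit want_pages ...
>     # file-nr fields:      allocated free max
>     while :; do
>         date
>         cat /proc/sys/fs/dentry-state /proc/sys/fs/file-nr
>         sleep 10
>     done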
>
> Is there any clue about the crash in this? Could you please give me some
> suggestions?
>
> Best Regards.
>
>
> 2016-03-16 14:01 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
>
> > Hello
> > The problem has appeared several times; however, I could not capture
> > the top output. My script is below: it checks whether the sys CPU
> > usage exceeds 30%, and every metric is dumped successfully except for
> > top.
> > Could you check my script? I am not able to figure out what is wrong.
> >
> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > #!/bin/bash
> >
> > while :
> > do
> >     sysusage=$(mpstat 2 1 | grep -A 1 "%sys" | tail -n 1 | awk '{if($6 < 30) print 1; else print 0;}')
> >
> >     if [ $sysusage -eq 0 ]; then
> >         #echo $sysusage
> >         #perf record -o perf$(date +%Y%m%d%H%M%S).data -a -g -F 1000
> >         sleep 30
> >         file=$(date +%Y%m%d%H%M%S)
> >         top -n 2 >> top$file.data
> >         iotop -b -n 2 >> iotop$file.data
> >         iostat >> iostat$file.data
> >         netstat -an | awk '/^tcp/ {++state[$NF]} END {for(i in state) print i,"\t",state[i]}' >> netstat$file.data
> >     fi
> >     sleep 5
> > done
> > -------------------------------------------------------------------------------------------------------------------------------------------------------------------------
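> >
> > One thing I will try next time, though I have not verified it yet: run
> > top in batch mode. Interactive top emits terminal escape sequences, so
> > redirecting it to a file may be exactly what garbles the capture:
> >
> >     top -b -n 2 >> top$file.data    # -b = batch mode, plain-text output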
> >
> > 2016-03-08 21:39 GMT+08:00 YouPeng Yang <yypvsxf19870...@gmail.com>:
> >
> >> Hi all
> >> Thanks for your replies. I have been investigating this for a while,
> >> and I will post some logs of top and I/O in a few days, when the
> >> crash comes again.
> >>
> >> 2016-03-08 10:45 GMT+08:00 Shawn Heisey <apa...@elyograg.org>:
> >>
> >>> On 3/7/2016 2:23 AM, Toke Eskildsen wrote:
> >>> > How does this relate to YouPeng reporting that the CPU usage increases?
> >>> >
> >>> > This is not a snark. YouPeng mentions kernel issues. It might very well
> >>> > be that IO is the real problem, but that it manifests in a
> >>> > non-intuitive way. Before memory-mapping it was easy: just look at
> >>> > IO-wait. Now I am not so sure. Can high kernel load (sy% in *nix top)
> >>> > indicate that the IO system is struggling, even if IO-wait is low?
> >>>
> >>> It might turn out to be not directly related to memory; you're right
> >>> about that. A very high query rate, or particularly CPU-heavy queries
> >>> or analysis, could cause high CPU usage even when memory is plentiful,
> >>> but in that situation I would expect a high user percentage, not
> >>> kernel. I'm not completely sure what might cause high kernel usage
> >>> when iowait is low, and no specific information was given about
> >>> iowait. I've seen iowait percentages of 10% or less with problems
> >>> clearly caused by iowait.
> >>>
> >>> With the available information (especially seeing 700GB of index
> >>> data), I believe that the "not enough memory" scenario is more likely
> >>> than anything else. If the OP replies and says they have plenty of
> >>> memory, then we can move on to the less common (IMHO) reasons for
> >>> high CPU with a large index.
> >>>
> >>> If the OS is one that reports load average, I am curious what the
> >>> 5-minute average is, and how many real (non-HT) CPU cores there are.
> >>>
> >>> Thanks,
> >>> Shawn
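> >>>
> >>> P.S. On Linux, both numbers are quick to grab; a small sketch
> >>> (assuming util-linux's lscpu is available):
> >>>
> >>>     uptime                                    # 1/5/15-minute load averages
> >>>     lscpu | grep -E '^(Socket|Core|Thread)'   # sockets, cores per socket, HT threads per core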