yes i needed to see behavior ovr time from 14 servers and correlate it. Putting it into zabbix gave me behavior type graphs... like this unusual gc activity...
[image: image.png] On Mon, Mar 18, 2019 at 1:32 PM Erick Erickson <erickerick...@gmail.com> wrote: > GCViewer will work on the GC logging file created by Solr. It has some > nice summaries, particularly of stop-the-world GC events. > > From there, you can pinpoint the exact times from manual inspection of the > GC log. It usually looks like this: > > Your replica went into recovery when again? sometime between 14:00 and > 15:00. > > GCViewer: Oh, look. Sometime between 14:30 and 14:35 there was a 20 second > stop-the-world GC pause. > > Ok, back to the Solr log and I see “leader initiated recovery” sent by the > leader to this replica at 14:32. Oh look, the GC log shows the > stop-the-world GC pause starting at 14:31:45. Let's tune GC. > > That’s if you want to pin it exactly, but usually it looks more like this: > > Oh, look. GCViewer shows long stop-the-world GC pauses about the time your > replica went into recovery, let’s start tuning GC. > > There are other products that give you a good way to navigate the GC > events, GCViewer is free though. > > Best, > Erick > > > On Mar 18, 2019, at 10:17 AM, Jeff Courtade <courtadej...@gmail.com> > wrote: > > > > So, > > > > I had a problem when at a customer site. They use zabbix for data > > collection and alerting. > > > > The solr server had been setup to use only jmx metrics. > > > > the jvm was unstable and would lock up for a period of time and the > metrics > > and counters would be all screwed up. Because it was using jmx to alert > it > > was screwing up as the jvm needed to be working to be used. > > > > So I turned on gclogging and wrote a script to collect data points about > > for instance how long the jvm was stopped in the last minute. > > > > I eventually got the gc tuned and behaving well but it was difficult. > > > > > > turn on gcloging > > > > i use these options > > > > -Xloggc:../var/logs/gclog.log \ > > -XX:+PrintHeapAtGC \ > > -XX:+PrintGCDetails \ > > -XX:+PrintGCDateStamps \ > > -XX:+PrintGCTimeStamps \ > > -XX:+PrintTenuringDistribution \ > > -XX:+PrintGCApplicationStoppedTime \ > > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \ > > -XX:GCLogFileSize=100M > > > > in the solr runtme users crontab on each system .... > > > > * * * * * nohup /opt/scripts/getstats-gclog & > > > > this is the script > > > > /opt/scripts/getstats-gclog > > #/bin/bash -x > > # > > # get some statistics > > # > > # GC time stamp > > # 2018-06-27T12:52:57.200+0000 > > # > > FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'` > > FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'` > > THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'` > > TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'` > > ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'` > > YEAR=`date --date '1 minute ago' +'%Y'` > > MONTH=`date --date '1 minute ago' +'%m'` > > DAY=`date --date '1 minute ago' +'%d'` > > HOUR=`date --date '1 minute ago' +'%H'` > > MINUTE=`date --date '1 minute ago' +'%M'` > > SECOND=`date --date '1 minute ago' +'%S'` > > # > > # > > STATSDIR=/opt/stats/ > > WORKDIR=/$STATSDIR/working_gc > > # > > # > > LOGDIR=/u01/app/solr/var/logs/ > > LOGNAME=gclog.log > > Prep() { > > mkdir -p $STATSDIR > > chmod 755 $STATSDIR > > mkdir -p $WORKDIR > > chmod 755 $WORKDIR > > } > > GetStats() { > > cd $WORKDIR > > grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC > > COUNT=`cat ALLGC |wc -l` > > # number under .00X XDecimalplaces U3D for example > > U3D=`grep "^0\.00[1-9]" ALLGC|wc -l` > > if [ -z $U3D ] > > then > > U3D=0 > > fi > > U2D=`grep "^0\.0[1-9]" ALLGC|wc -l` > > if [ -z $U2D ] > > then > > U2D=0 > > fi > > U1D=`grep "^0\.[1-9]" ALLGC|wc -l` > > if [ -z $U1D ] > > then > > U1D=0 > > fi > > O1S=`cat ALLGC | grep -v "^0\."|wc -l` > > if [ -z $O1S ] > > then > > O1S=0 > > fi > > O10S=`grep "[0-9]\+[0-9]\.[0-9]*" ALLGC|wc -l` > > if [ -z $O10S ] > > then > > O10S=0 > > fi > > cat ALLGC | grep -v "^0\.">OVER1SECDATA > > > > TOTAL=0 > > COUNT=0 > > while read DAT > > do > > TOTAL=`echo "$TOTAL + $DAT"|bc` > > COUNT=`expr $COUNT + 1` > > done <$WORKDIR/ALLGC > > #AO1S=$(printf "%.2f\n" `echo "scale=10;$COUNT/60" | bc`) > > #AVGQT=$(printf "%.0f\n" `echo "scale=10;$TOTAL/$COUNT"|bc`) > > TOTSTOP=$TOTAL > > AVGSTOPT=`echo "scale=7;$TOTAL/$COUNT"|bc` > > if [ -z $AVGSTOPT ] > > then > > AVGSTOPT=0 > > fi > > # get top gc times > > # > > > > #echo 0.0000000>ALLGCU1S > > #echo 0.0000000>ALLGCO1S > > > > grep '^0.' $WORKDIR/ALLGC >ALLGCU1S > > grep -v '^0.' $WORKDIR/ALLGC >ALLGCO1S > > TOPGCTIMEU1S=`cat $WORKDIR/ALLGCU1S |sort |tail -1` > > if [ -z $TOPGCTIMEU1S ] > > then > > TOPGCTIMEU1S=0 > > fi > > TOPGCTIMEO1S=`cat $WORKDIR/ALLGCO1S |sort |tail -1` > > if [ -z $TOPGCTIMEO1S ] > > then > > TOPGCTIMEO1S=0 > > fi > > > > } > > PrintStats() { > > # > > ## stats > > #COUNT= total number of garbage collection this minute > > ## U3d = Total number of GC that are under 0.00Xseconds > > # U2D total number of GC that are under 0.0X seconds > > # U1D total number of GC that are under 0.X seconds > > # O1S total number of GC that are over 1 second but under 10 > > # O10S total number of GC that are over 10 seconds > > # TOTSTOPT the total time stopped for all GCs > > # AVGSTOPT the average time of all the GCs > > # TOPGCTIMEU1S the highest GC time Under 1 sec this minute > > # TOPGCTIMEO1S the highest GC time Over 1 sec this minute > > echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S > > $TOPGCTIMEO1S >$STATSDIR/GCSTATS > > #echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S > > $TOPGCTIMEO1S > > } > > > > Prep > > GetStats > > PrintStats > > > > then in the zabbix-agentd.conf > > add these parameters > > > > # total number of GC in the last minute > > UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}' > > # total number of GC 0.00[1-9] - 0.0000000 second in the last minute > > UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}' > > # total number of GC 0.0[1-9] - 0.009 second in the last minute > > UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}' > > # total number of GC 0.[1-9] - 0.09 second in the last minute > > UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}' > > # total number of GC [1-9].X seconds in the last minute > > UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}' > > # total number of GC OVER 10 seconds in the last minute > > UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}' > > # these are all 0.0000000 time > > # Total time the JVM was stopped for GC in the last minute > > UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}' > > # Average time the JVM was stopped for ALL GC in the last minute > > UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}' > > # Highest Time the JVM was stopped for GC under 1 second > > UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}' > > # Highest Time the JVM was stopped for GC OVER 1 second > > UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}' > > > > you have to confgure zabbix items, triggers and graphs for each of the > data > > points > > > > > > > > On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com > > > > wrote: > > > >> Attachments are pretty aggressively stripped by the apache mail server, > so > >> it didn’t come through. > >> > >> That said, I’m not sure how much use just the last GC time is. What do > you > >> want it for? This > >> sounds a bit like an XY problem. > >> > >> Best, > >> Erick > >> > >>> On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> > wrote: > >>> > >>> Hi Team, > >>> > >>> I was looking for Old GC duration time metrics, but all I could find > was > >> the API for this "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1- > >> Old-Generation", but I am not sure if this is for > >> 'gc_g1_gen_o_lastgc_duration'. I tried to hookup the IP to the jconsole > and > >> was looking for the metrics, but all I could see was the collection time > >> but not last GC duration as attached in the screenshot. Can you please > help > >> here with finding the correct metrics. I strongly believe we are not > >> capturing this information. Please correct me if I am wrong. > >>> > >>> Thanks & Regards, > >>> Karthik > >> > >> > >