Yes, I needed to see behavior over time from 14 servers and correlate it.

Putting it into Zabbix gave me behavior-over-time graphs, like this one
showing unusual GC activity:

[image: image.png]

On Mon, Mar 18, 2019 at 1:32 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> GCViewer will work on the GC logging file created by Solr. It has some
> nice summaries, particularly of stop-the-world GC events.
>
> From there, you can pinpoint the exact times from manual inspection of the
> GC log. It usually looks like this:
>
> Your replica went into recovery when again? sometime between 14:00 and
> 15:00.
>
> GCViewer: Oh, look. Sometime between 14:30 and 14:35 there was a 20 second
> stop-the-world GC pause.
>
> Ok, back to the Solr log and I see “leader initiated recovery” sent by the
> leader to this replica at 14:32. Oh look, the GC log shows the
> stop-the-world GC pause starting at 14:31:45. Let's tune GC.
>
> That’s if you want to pin it exactly, but usually it looks more like this:
>
> Oh, look. GCViewer shows long stop-the-world GC pauses about the time your
> replica went into recovery, let’s start tuning GC.
>
> There are other products that give you a good way to navigate the GC
> events; GCViewer is free, though.
>
> Best,
> Erick
>
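A minimal sketch of the "pin it exactly" step described above, assuming the
GC log was written with JDK 8 style flags that include
-XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime, so each pause
is logged on a line like the one in the comment. The function name
long_pauses and its threshold argument are made up for illustration; this is
not part of GCViewer or Solr.

```shell
#!/bin/bash
# List pauses at or above a threshold (in seconds) from one or more GC logs.
# Assumed line layout (JDK 8 logging):
# 2019-03-18T14:31:45.100+0000: 100.000: Total time for which application threads were stopped: 20.5000000 seconds, Stopping threads took: 0.0001000 seconds
long_pauses() {
    threshold=$1
    shift
    awk -v t="$threshold" '
        /application threads were stopped/ {
            # the duration is the field right after "stopped:"
            for (i = 1; i < NF; i++) {
                if ($i == "stopped:" && $(i + 1) + 0 >= t + 0) {
                    ts = $1
                    sub(/:$/, "", ts)      # drop the colon after the date stamp
                    print ts, $(i + 1)
                    break
                }
            }
        }' "$@"
}
```

For example, "long_pauses 5 /u01/app/solr/var/logs/gclog.log.*" would print
the date stamp and duration of every pause of 5 seconds or more, which can
then be lined up against the recovery messages in the Solr log.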
> > On Mar 18, 2019, at 10:17 AM, Jeff Courtade <courtadej...@gmail.com> wrote:
> >
> > So,
> >
> > I had a problem when at a customer site. They use zabbix for data
> > collection and alerting.
> >
> > The Solr server had been set up to use only JMX metrics.
> >
> > The JVM was unstable and would lock up for a period of time, and the
> > metrics and counters would be all screwed up. Because alerting relied on
> > JMX, it was screwing up too, as the JVM needs to be working for JMX to be
> > usable.
> >
> > So I turned on GC logging and wrote a script to collect data points such
> > as how long the JVM was stopped in the last minute.
> >
> > I eventually got the gc tuned and behaving well but it was difficult.
> >
> >
> > Turn on GC logging.
> >
> > I use these options:
> >
> > -Xloggc:../var/logs/gclog.log \
> > -XX:+PrintHeapAtGC \
> > -XX:+PrintGCDetails \
> > -XX:+PrintGCDateStamps \
> > -XX:+PrintGCTimeStamps \
> > -XX:+PrintTenuringDistribution \
> > -XX:+PrintGCApplicationStoppedTime \
> > -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 \
> > -XX:GCLogFileSize=100M
> >
> > In the Solr runtime user's crontab on each system:
> >
> > * * * * * nohup /opt/scripts/getstats-gclog &
> >
> > this is the script
> >
> > /opt/scripts/getstats-gclog
> > #!/bin/bash -x
> > #
> > # get some statistics
> > #
> > # GC time stamp
> > # 2018-06-27T12:52:57.200+0000
> > #
> > FIVEMIN=`date --date '5 minutes ago' +'%Y-%m-%dT%H:%M'`
> > FOURMIN=`date --date '4 minutes ago' +'%Y-%m-%dT%H:%M'`
> > THREEMIN=`date --date '3 minutes ago' +'%Y-%m-%dT%H:%M'`
> > TWOMIN=`date --date '2 minutes ago' +'%Y-%m-%dT%H:%M'`
> > ONEMIN=`date --date '1 minute ago' +'%Y-%m-%dT%H:%M'`
> > YEAR=`date --date '1 minute ago' +'%Y'`
> > MONTH=`date --date '1 minute ago' +'%m'`
> > DAY=`date --date '1 minute ago' +'%d'`
> > HOUR=`date --date '1 minute ago' +'%H'`
> > MINUTE=`date --date '1 minute ago' +'%M'`
> > SECOND=`date --date '1 minute ago' +'%S'`
> > #
> > #
> > STATSDIR=/opt/stats/
> > WORKDIR=/$STATSDIR/working_gc
> > #
> > #
> > LOGDIR=/u01/app/solr/var/logs/
> > LOGNAME=gclog.log
> > Prep() {
> > mkdir -p $STATSDIR
> > chmod 755 $STATSDIR
> > mkdir -p $WORKDIR
> > chmod 755 $WORKDIR
> > }
> > GetStats() {
> > cd $WORKDIR
> > grep $ONEMIN $LOGDIR/$LOGNAME.*|grep stopped|awk '{print $11}'>ALLGC
> > COUNT=`cat ALLGC |wc -l`
> > # number under .00X XDecimalplaces U3D for example
> > U3D=`grep "^0\.00[1-9]" ALLGC|wc -l`
> > if [ -z $U3D ]
> > then
> >  U3D=0
> > fi
> > U2D=`grep "^0\.0[1-9]" ALLGC|wc -l`
> > if [ -z $U2D ]
> > then
> >  U2D=0
> > fi
> > U1D=`grep "^0\.[1-9]" ALLGC|wc -l`
> > if [ -z $U1D ]
> > then
> >  U1D=0
> > fi
> > O1S=`cat ALLGC | grep -v "^0\."|wc -l`
> > if [ -z $O1S ]
> > then
> >  O1S=0
> > fi
> > O10S=`grep "[0-9]\+[0-9]\.[0-9]*" ALLGC|wc -l`
> > if [ -z $O10S ]
> > then
> >  O10S=0
> > fi
> > cat ALLGC | grep -v "^0\.">OVER1SECDATA
> >
> > TOTAL=0
> > COUNT=0
> > while read DAT
> > do
> > TOTAL=`echo "$TOTAL + $DAT"|bc`
> > COUNT=`expr $COUNT + 1`
> > done <$WORKDIR/ALLGC
> > #AO1S=$(printf "%.2f\n"  `echo "scale=10;$COUNT/60" | bc`)
> > #AVGQT=$(printf "%.0f\n"  `echo "scale=10;$TOTAL/$COUNT"|bc`)
> > TOTSTOP=$TOTAL
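> > # skip the division when no pauses were logged, to avoid a bc divide-by-zero
> > AVGSTOPT=`[ $COUNT -gt 0 ] && echo "scale=7;$TOTAL/$COUNT"|bc`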
> > if [ -z $AVGSTOPT ]
> > then
> >  AVGSTOPT=0
> > fi
> > # get top gc times
> > #
> >
> > #echo 0.0000000>ALLGCU1S
> > #echo 0.0000000>ALLGCO1S
> >
> > grep '^0\.' $WORKDIR/ALLGC >ALLGCU1S
> > grep -v '^0\.' $WORKDIR/ALLGC >ALLGCO1S
> > TOPGCTIMEU1S=`cat $WORKDIR/ALLGCU1S |sort -n |tail -1`
> > if [ -z $TOPGCTIMEU1S ]
> > then
> >  TOPGCTIMEU1S=0
> > fi
> > # sort numerically so that e.g. 10.2 ranks above 9.5
> > TOPGCTIMEO1S=`cat $WORKDIR/ALLGCO1S |sort -n |tail -1`
> > if [ -z $TOPGCTIMEO1S ]
> > then
> >  TOPGCTIMEO1S=0
> > fi
> >
> > }
> > PrintStats() {
> > #
> > ## stats
> > # COUNT total number of GC this minute
> > # U3D total number of GC of 0.00X seconds (0.001 up to 0.01)
> > # U2D total number of GC of 0.0X seconds (0.01 up to 0.1)
> > # U1D total number of GC of 0.X seconds (0.1 up to 1)
> > # O1S total number of GC of 1 second or more (includes those over 10)
> > # O10S total number of GC that are over 10 seconds
> > # TOTSTOP the total time stopped for all GCs
> > # AVGSTOPT the average time of all the GCs
> > # TOPGCTIMEU1S the highest GC time under 1 sec this minute
> > # TOPGCTIMEO1S the highest GC time over 1 sec this minute
> > echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S \
> >  $TOPGCTIMEO1S >$STATSDIR/GCSTATS
> > #echo $COUNT $U3D $U2D $U1D $O1S $O10S $TOTSTOP $AVGSTOPT $TOPGCTIMEU1S \
> > # $TOPGCTIMEO1S
> > }
> >
> > Prep
> > GetStats
> > PrintStats
> >
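As a point of comparison, here is a sketch of the same ten statistics
computed in a single awk pass instead of repeated grep/bc calls. It is an
alternative written for illustration, not the script that ran at the
customer site, and the function name gc_summary is made up. It expects one
pause duration in seconds per line on stdin (the content of the ALLGC file)
and mirrors the bucket boundaries of the grep patterns, so O1S also counts
pauses over 10 seconds. Unlike bc, it always prints a leading zero for
values below 1.

```shell
#!/bin/bash
# Output fields, in the same order as the GCSTATS file:
# COUNT U3D U2D U1D O1S O10S TOTSTOP AVGSTOPT TOPGCTIMEU1S TOPGCTIMEO1S
gc_summary() {
    awk '
        {
            v = $1 + 0
            n++
            total += v
            if (v >= 10) o10++                 # 10 seconds or more
            if (v >= 1) o1++                   # 1 second or more, includes the above
            else if (v >= 0.1) u1++            # 0.X seconds
            else if (v >= 0.01) u2++           # 0.0X seconds
            else if (v >= 0.001) u3++          # 0.00X seconds
            if (v < 1 && v > topu) topu = v    # longest pause under 1 second
            if (v >= 1 && v > topo) topo = v   # longest pause of 1 second or more
        }
        END {
            avg = (n > 0) ? total / n : 0      # no division when nothing was logged
            printf "%d %d %d %d %d %d %.7f %.7f %.7f %.7f\n",
                   n, u3, u2, u1, o1, o10, total, avg, topu, topo
        }'
}
```

It could be fed from the same pipeline the script already uses, for example
by piping the output of the "awk '{print $11}'" step into gc_summary and
redirecting the result to /opt/stats/GCSTATS.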
> > Then, in zabbix_agentd.conf, add these parameters:
> >
> > # total number of GC in the last minute
> > UserParameter=gc-num,cat /opt/stats/GCSTATS |awk '{print $1}'
> > # total number of GC of 0.00[1-9] second (0.001 up to 0.01) in the last minute
> > UserParameter=gc-n3d,cat /opt/stats/GCSTATS |awk '{print $2}'
> > # total number of GC of 0.0[1-9] second (0.01 up to 0.1) in the last minute
> > UserParameter=gc-n2d,cat /opt/stats/GCSTATS |awk '{print $3}'
> > # total number of GC of 0.[1-9] second (0.1 up to 1) in the last minute
> > UserParameter=gc-n1d,cat /opt/stats/GCSTATS |awk '{print $4}'
> > # total number of GC of 1 second or more in the last minute
> > UserParameter=gc-no1s,cat /opt/stats/GCSTATS |awk '{print $5}'
> > # total number of GC OVER 10  seconds in the last minute
> > UserParameter=gc-no10s,cat /opt/stats/GCSTATS |awk '{print $6}'
> > # the remaining values are times in seconds
> > # Total time the JVM was stopped for GC in the last minute
> > UserParameter=gc-tst,cat /opt/stats/GCSTATS |awk '{print $7}'
> > # Average time the JVM was stopped for ALL GC in the last minute
> > UserParameter=gc-ast,cat /opt/stats/GCSTATS |awk '{print $8}'
> > # Highest Time the JVM was stopped for GC under 1 second
> > UserParameter=gc-ttu1s,cat /opt/stats/GCSTATS |awk '{print $9}'
> > # Highest Time the JVM was stopped for GC OVER 1 second
> > UserParameter=gc-tto1s,cat /opt/stats/GCSTATS |awk '{print $10}'
> >
> > You have to configure Zabbix items, triggers, and graphs for each of the
> > data points.
> >
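Before creating the items in Zabbix, it can help to check that the stats
file really has the ten fields the UserParameter lines read. The sketch
below prints each item key next to the value it would return. The key list
and the default path are taken from the configuration above; the function
name show_gc_items is made up for illustration.

```shell
#!/bin/bash
# Print "key value" for each of the ten GCSTATS fields, in file order.
# Returns non-zero if the file is empty or has an unexpected field count.
show_gc_items() {
    awk -v keys="gc-num gc-n3d gc-n2d gc-n1d gc-no1s gc-no10s gc-tst gc-ast gc-ttu1s gc-tto1s" '
        NR == 1 {
            n = split(keys, k, " ")
            if (NF != n) {
                printf "expected %d fields, found %d\n", n, NF > "/dev/stderr"
                exit 1
            }
            for (i = 1; i <= n; i++) print k[i], $i
        }
        END {
            if (NR == 0) {
                print "no data in stats file" > "/dev/stderr"
                exit 1
            }
        }' "${1:-/opt/stats/GCSTATS}"
}
```

Run it as "show_gc_items" on each server after the cron job has fired at
least once. The agent can also test a single key on its own with
"zabbix_agentd -t gc-num".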
> >
> >
> > On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Attachments are pretty aggressively stripped by the Apache mail server,
> >> so it didn’t come through.
> >>
> >> That said, I’m not sure how much use just the last GC time is. What do
> >> you want it for? This sounds a bit like an XY problem.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Mar 17, 2019, at 2:43 PM, Karthik K G <kgkarthi...@gmail.com> wrote:
> >>>
> >>> Hi Team,
> >>>
> >>> I was looking for Old GC duration time metrics, but all I could find was
> >>> the API "/solr/admin/metrics?wt=json&group=jvm&prefix=gc.G1-Old-Generation",
> >>> and I am not sure if this covers 'gc_g1_gen_o_lastgc_duration'. I tried
> >>> to hook up the IP to jconsole and looked for the metric, but all I could
> >>> see was the collection time, not the last GC duration, as shown in the
> >>> attached screenshot. Can you please help with finding the correct
> >>> metric? I strongly believe we are not capturing this information.
> >>> Please correct me if I am wrong.
> >>>
> >>> Thanks & Regards,
> >>> Karthik
> >>
> >>
>
>
