Hi,

btw, this is the bug that REMOVE_BOGUS_SPIKES is/was supposed to fix:

https://bugzilla.redhat.com/show_bug.cgi?id=515274

Cheers
Martin

----- Original Message ----
> From: Martin Knoblauch <kn...@knobisoft.de>
> To: 左扬 <weichon...@gmail.com>; ganglia-developers@lists.sourceforge.net
> Sent: Wed, April 28, 2010 6:32:32 PM
> Subject: Re: [Ganglia-developers] bogus spikes of network_report, is that a bug on the kernel?
>
> Hi,
>
> can you tell us which NIC you are using (/sbin/lspci) and which version of the driver? When I wrote that REMOVE_BOGUS_SPIKES hack, it was because of a HW/FW problem in certain Broadcom devices. It was supposed to be fixed after kernel 2.6.9.
>
> The debug output from gmond suggests the overflow is coming from the bytes_out counter (BO). And you are right, just lowering the thresholds is not useful in general.
>
> Cheers
> Martin
>
>> From: 左扬 <weichon...@gmail.com>
>> To: ganglia-developers@lists.sourceforge.net
>> Sent: Wed, April 28, 2010 1:48:58 PM
>> Subject: [Ganglia-developers] bogus spikes of network_report, is that a bug on the kernel?
>>
>> Hello all,
>>
>> We use Ganglia to generate the network traffic report, but because of bogus spikes of up to 400p I can see nothing. (See the graph in the attachment; I modified graph.d/network_report.php to change the unit from bytes/s to bits/s.)
>>
>> I read the code and ran tests for several days. In libmetrics/linux/metrics.c, line 287, there is a switch, so I rebuilt Ganglia with CFLAGS=-DREMOVE_BOGUS_SPIKES and restarted gmond. After some days I found there were still spikes (about 4T), so I had to change line 292 from
>>
>>     if ((l_bin > 1.0e13) || (l_bout > 1.0e13) ||
>>
>> to
>>
>>     if ((l_bin > 2.5e8) || (l_bout > 2.5e8) || /* 2 Gbps: there are 2 gigabit NICs on our server */
>>
>> to avoid the spikes.
>>
>> I think that is not a good idea, since others may use faster NICs. So I added some code to update_ifdata() to log the contents of /proc/net/dev (the value of proc_net_dev.buffer).
>>
>> Logs from /var/log/message:
>>
>> Apr 27 23:19:13 hostname /opt/ganglia/sbin/gmond[18465]: update_ifdata(BO) - Overflow in rbo: 304634803029227 -> 630666266 [1272381553]
>> Apr 27 23:20:13 hostname /opt/ganglia/sbin/gmond[18465]: update_ifdata(BO) - Overflow in rbi: 10458900526801464705 -> 38016437180368 [1272381613]
>> Apr 27 23:20:13 hostname /opt/ganglia/sbin/gmond[18465]: update_ifdata(BO) - Overflow in rpo: 219388676028 -> 219365592250 [1272381613]
>>
>> Logs of /proc/net/dev:
>>
>> ------------------ 1272381433.117603 -----------------
>> Inter-|   Receive                                                |  Transmit
>>  face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
>>     lo: 3143390051 39831988 0 0 0 0 0 0 3143390051 39831988 0 0 0 0 0 0
>>  tunl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   eth0: 38015520377153 135587033135 0 8587116 0 0 0 6 304631801519418 219359254753 0 0 0 0 0 0
>>   eth1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> ------------------ 1272381493.118502 -----------------
>> Inter-|   Receive                                                |  Transmit
>>  face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
>>     lo: 3143407797 39832216 0 0 0 0 0 0 3143407797 39832216 0 0 0 0 0 0
>>  tunl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   eth0: 38015973907827 135588437010 0 8587116 0 0 0 6 304634803029227 219361451245 0 0 0 0 0 0
>>   eth1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> ------------------ 1272381553.121013 -----------------
>> Inter-|   Receive                                                |  Transmit
>>  face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
>>     lo: 3143407797 39832216 0 0 0 0 0 0 3143407797 39832216 0 0 0 0 0 0
>>  tunl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   eth0: 10458900526801464705 135564674293 0 8587116 0 0 0 219363599555 630666266 219388676028 7723 0 0 0 7723 0
>>   eth1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> ------------------ 1272381613.123535 -----------------
>> Inter-|   Receive                                                |  Transmit
>>  face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
>>     lo: 3143444605 39832676 0 0 0 0 0 0 3143444605 39832676 0 0 0 0 0 0
>>  tunl0: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>   eth0: 38016437180368 135590918375 0 8587116 0 0 0 6 304640653909921 219365592250 0 0 0 0 0 0
>>   eth1: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>>
>> The value at 1272381493 is OK, but the value at 1272381553 is abnormal, and the value at 1272381613 has recovered.
>>
>> I don't think this is caused by a HW error; it looks like a bug in the kernel. (We're using 2.6.20-pm and 2.6.9-34.ELsmp, both x86_64.)
>>
>> But I don't know much about the kernel, so can anyone confirm?
>>
>> Thanks.
>>
>> --
>> A few plum branches at the corner of the wall, blooming alone in the cold.
>> From afar I know they are not snow, for a faint fragrance drifts over.
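
For anyone re-checking the numbers in this thread, here is a minimal sketch of how the two thresholds behave on the logged eth0 transmit byte counters. It assumes a simplified model in which the rate is the unsigned 64-bit counter delta divided by the sampling interval, and a sample is rejected when that rate exceeds a cap. The function name sample_rate and its parameters are hypothetical and are not taken from libmetrics/linux/metrics.c. The 2.5e8 cap corresponds to 2 x 1 Gbit/s = 2e9 bits/s = 2.5e8 bytes/s.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only (not gmond code).
 * Returns the rate in bytes/s between two counter readings taken
 * interval_s seconds apart, or -1.0 if the rate exceeds
 * max_bytes_per_s and the sample should be treated as bogus.
 */
double sample_rate(uint64_t prev, uint64_t cur,
                   double interval_s, double max_bytes_per_s)
{
    assert(interval_s > 0.0);

    /* Unsigned subtraction wraps modulo 2^64, so a counter that
     * went backwards shows up as a very large delta. */
    uint64_t delta = cur - prev;
    double rate = (double)delta / interval_s;

    return (rate > max_bytes_per_s) ? -1.0 : rate;
}
```

Under this model, with a 60 s interval, the normal step 304631801519418 -> 304634803029227 gives roughly 5.0e7 bytes/s and passes both caps. The drop 304634803029227 -> 630666266 wraps to a huge delta and is rejected by either cap. The recovery step 630666266 -> 304640653909921 gives roughly 5.1e12 bytes/s, which is below 1.0e13 but above 2.5e8, so in this model only the lower cap would reject it. Whether gmond computes the rate exactly this way is not confirmed here.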