Packet loss every 30.999 seconds
While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated November 8, 2007, it looks like I've stumbled across a broken driver or kernel routine which stops interrupt processing long enough to severely degrade network performance every 30.99 seconds. Packets appear to make it as far as ether_input(), then get lost.

Test setup:

A - ethernet_switch - B

A sends UDP packets to B through an ethernet switch. The interface input packet count and output packet count on the switch match what A is sending and B should be receiving. A UDP receiver running on B sees windows of packet loss with a period of 30.99 seconds. The lost packets are counted based on an incrementing sequence number. On an isolated network the Ipkts counter on B matches what A is sending, but the packets never show up in any of the IP/UDP counters or in the program trying to receive them.

This behavior can be seen with both em and fxp interfaces. The problem is it only occurs after the receiving host has been up about a day; reboot, and the problem clears. GENERIC kernel, nothing more than default daemons running. Behavior seen on three different motherboards so far.

It also appears this is not just lost network interrupts. Whatever is spinning in the kernel also impacts syscall latency. An easy way to replicate what I'm seeing is to run gettimeofday() in a tight loop and note when the real-time syscall delay exceeds some value (which is dependent on processor speed). As an example, on a 3.20GHz CPU a small program will output when the syscall latency is > 5000 usecs. Note the periodic behavior at 30.99 seconds. These big jumps in latency correspond to when packets are being dropped.
usecs (epoch)      latency   diff
1197861705805078    478199   0
1197861721012298     25926   15207220
1197861726332036     11729   5319738
1197861757331549     11691   30999513
1197861788331266     11878   30999717
1197861819330647     11708   30999381
1197861850330192     11698   30999545
1197861881329733     11667   30999541
1197861900018297      6516   18688564
1197861912329282     11684   12310985
1197861943328849     11699   30999567
1197861974328413     11692   30999564
1197862005328228     11916   30999815
1197862036327598     11684   30999370
1197862067327229     11680   30999631
1197862098326860     11667   30999631
1197862129326559     11704   30999699
1197862160326377     11844   30999818
1197862191325890     11674   30999513

(output from packet loss tester)
window_start/window_end is the packet counter
time_start/time_end is absolute time in usecs
window_diff is # of packets missing

The test is run at about 15.5 Kpps / 132 Mbits/second, certainly a lot less than this hardware is capable of running BSD 4.X.

:missing window_start=311510, time_start=1197861726332008, window_end=311638, time_end=1197861726332011, window_diff=128, time_diff=3
:missing window_start=794482, time_start=1197861757331505, window_end=794609, time_end=1197861757331509, window_diff=127, time_diff=4
:missing window_start=1277313, time_start=1197861788331245, window_end=1277444, time_end=1197861788331249, window_diff=131, time_diff=4
:missing window_start=1760104, time_start=1197861819330625, window_end=1760232, time_end=1197861819330629, window_diff=128, time_diff=4
:missing window_start=2242789, time_start=1197861850330170, window_end=2242916, time_end=1197861850330174, window_diff=127, time_diff=4
:missing window_start=2725818, time_start=1197861881329712, window_end=2725946, time_end=1197861881329715, window_diff=128, time_diff=3
:missing window_start=3208594, time_start=1197861912329261, window_end=3208722, time_end=1197861912329264, window_diff=128, time_diff=3
:missing window_start=3691395, time_start=1197861943328802, window_end=3691522, time_end=1197861943328805, window_diff=127, time_diff=3
:missing window_start=4173793, time_start=1197861974328369, window_end=4173921, time_end=1197861974328373, window_diff=128, time_diff=4
:missing window_start=4656236, time_start=1197862005328176, window_end=4656367, time_end=1197862005328179, window_diff=131, time_diff=3
:missing window_start=5139197, time_start=1197862036327576, window_end=5139325, time_end=1197862036327580, window_diff=128, time_diff=4
:missing window_start=5621958, time_start=1197862067327208, window_end=5622085, time_end=1197862067327211, window_diff=127, time_diff=3
:missing window_start=6104597, time_start=1197862098326839, window_end=6104725, time_end=1197862098326843, window_diff=128, time_diff=4
:missing window_start=6587241, time_start=1197862129326514, window_end=6587369, time_end=1197862129326534, window_diff=128, time_diff=20
:missing window_start=7070051, time_start=1197862160326368, window_end=7070183, time_end=1197862160326371, window_diff=132, time_diff=3
:missing window_start=7552828, time_start=1197862191325873, window_end=7552954, time_end=1197862191325876, window_diff=126, time_diff=3
:missing window_start=8035434, time_start=119786325572, window_end=8035560, time_end=119786325576,
Re: Packet loss every 30.999 seconds
On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
> dated November 8, 2007 it looks like I've stumbled across a broken
> driver or kernel routine which stops interrupt processing long enough
> to severely degrade network performance every 30.99 seconds.
>
> Packets appear to make it as far as ether_input() then get lost.

Are you sure this isn't being caused by something the switch is doing, such as MAC/ARP cache clearing or LACP? I'm just speculating, but it would be worthwhile to remove the switch from the picture (crossover cable to the rescue).

I know that at least in the case of fxp(4) and em(4), Jack Vogel does some thorough testing of throughput using a professional/high-end packet generator (some piece of hardware, I forget the name...)

--
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Packet loss every 30.999 seconds
I'm about 99% sure right now. I'll set this up in a lab tomorrow without an ethernet switch. It takes about a day of uptime before the problem shows up.

Sorry for the duplicate messages, I misread a bounce notification.

--
mark

On Dec 17, 2007, at 12:43 AM, Jeremy Chadwick wrote:

> On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007 it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.
>>
>> Packets appear to make it as far as ether_input() then get lost.
>
> Are you sure this isn't being caused by something the switch is doing,
> such as MAC/ARP cache clearing or LACP? I'm just speculating, but it
> would be worthwhile to remove the switch from the picture (crossover
> cable to the rescue).
>
> I know that at least in the case of fxp(4) and em(4), Jack Vogel does
> some thorough testing of throughput using a professional/high-end
> packet generator (some piece of hardware, I forget the name...)
>
> --
> | Jeremy Chadwick                                 jdc at parodius.com |
> | Parodius Networking                        http://www.parodius.com/ |
> | UNIX Systems Administrator                   Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: Packet loss every 30.999 seconds
One more comment on my last email... The patch that I included is not meant as a real fix - it is just a bandaid. The real problem appears to be that a very large number of vnodes (all of them?) are getting synced (i.e. calling ffs_syncvnode()) every time. This should normally only happen for dirty vnodes. I suspect that something is broken with this check:

	if (vp->v_type == VNON || ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
		VI_UNLOCK(vp);
		continue;
	}

...like the i_flag flags aren't ever getting properly cleared (or bv_cnt is always non-zero). ...but I don't have the time to chase this down.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
> dated November 8, 2007 it looks like I've stumbled across a broken
> driver or kernel routine which stops interrupt processing long enough
> to severely degrade network performance every 30.99 seconds.

I noticed this as well some time ago. The problem has to do with the processing (syncing) of vnodes. When the total number of allocated vnodes in the system grows to tens of thousands, the ~31 second periodic sync process takes a long time to run. Try this patch and let people know if it helps your problem. It will periodically wait for one tick (1ms) every 500 vnodes of processing, which will allow other things to run.

Index: ufs/ffs/ffs_vfsops.c
===================================================================
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.290.2.16
diff -c -r1.290.2.16 ffs_vfsops.c
*** ufs/ffs/ffs_vfsops.c	9 Oct 2006 19:47:17 -	1.290.2.16
--- ufs/ffs/ffs_vfsops.c	25 Apr 2007 01:58:15 -
***************
*** 1109,1114 ****
--- 1109,1115 ----
  	int softdep_deps;
  	int softdep_accdeps;
  	struct bufobj *bo;
+ 	int flushed_count = 0;

  	fs = ump->um_fs;
  	if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {	/* XXX */
***************
*** 1174,1179 ****
--- 1175,1184 ----
  			allerror = error;
  		vput(vp);
  		MNT_ILOCK(mp);
+ 		if (flushed_count++ > 500) {
+ 			flushed_count = 0;
+ 			msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
+ 		}
  	}
  	MNT_IUNLOCK(mp);

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
Back to back test with no ethernet switch between two em interfaces, same result. The receiving side has been up > 1 day and exhibits the problem. These are also two different servers. The small gettimeofday() syscall tester also shows the same ~30 second pattern of high latency between syscalls.

Receiver test application reports 3699 missed packets.

Sender netstat -i:

(before test)
em1  1500  00:04:23:cf:51:b7          20    0  15975785    0    0
em1  1500  10.1/24      10.1.0.237     -    -  15975801    -    -
(after test)
em1  1500  00:04:23:cf:51:b7          22    0  25975822    0    0
em1  1500  10.1/24      10.1.0.239     -    -  25975838    -    -

total IP packets sent during test = end - start
25975838 - 15975801 = 10000037 (as expected: 10,000,000 packet test + overhead)

Receiver netstat -i:

(before test)
em1  1500  00:04:23:c4:cc:89    15975785    0        21    0    0
em1  1500  10.1/24    10.1.0.1  15969626    -        19    -    -
(after test)
em1  1500  00:04:23:c4:cc:89    25975822    0        23    0    0
em1  1500  10.1/24    10.1.0.1  25965964    -        21    -    -

total ethernet frames received during test = end - start
25975822 - 15975785 = 10000037 (as expected)

total IP packets processed during test = end - start
25965964 - 15969626 = 9996338 (expecting 10000037)

Missed packets = expected - received
10000037 - 9996338 = 3699

netstat -i accounts for the 3699 missed packets also reported by the application.

Looking closer at the tester output again shows the periodic ~30 second windows of packet loss. There's a second problem here in that packets are just disappearing before they make it to ip_input(), or there's a dropped-packets counter I've not found yet.

I can provide remote access to anyone who wants to take a look; this is very easy to duplicate. The ~1 day uptime before the behavior surfaces is not making this easy to isolate.
--
mark

On Dec 17, 2007, at 12:43 AM, Jeremy Chadwick wrote:

> On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007 it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.
>>
>> Packets appear to make it as far as ether_input() then get lost.
>
> Are you sure this isn't being caused by something the switch is doing,
> such as MAC/ARP cache clearing or LACP? I'm just speculating, but it
> would be worthwhile to remove the switch from the picture (crossover
> cable to the rescue).
>
> I know that at least in the case of fxp(4) and em(4), Jack Vogel does
> some thorough testing of throughput using a professional/high-end
> packet generator (some piece of hardware, I forget the name...)
>
> --
> | Jeremy Chadwick                                 jdc at parodius.com |
> | Parodius Networking                        http://www.parodius.com/ |
> | UNIX Systems Administrator                   Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: Packet loss every 30.999 seconds
Thanks. Have a kernel building now. It takes about a day of uptime after reboot before I'll see the problem.

--
mark

On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote:

>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007 it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.
>
> I noticed this as well some time ago. The problem has to do with the
> processing (syncing) of vnodes. When the total number of allocated
> vnodes in the system grows to tens of thousands, the ~31 second
> periodic sync process takes a long time to run. Try this patch and let
> people know if it helps your problem. It will periodically wait for
> one tick (1ms) every 500 vnodes of processing, which will allow other
> things to run.
>
> Index: ufs/ffs/ffs_vfsops.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
> retrieving revision 1.290.2.16
> diff -c -r1.290.2.16 ffs_vfsops.c
> *** ufs/ffs/ffs_vfsops.c	9 Oct 2006 19:47:17 -	1.290.2.16
> --- ufs/ffs/ffs_vfsops.c	25 Apr 2007 01:58:15 -
> ***************
> *** 1109,1114 ****
> --- 1109,1115 ----
>   	int softdep_deps;
>   	int softdep_accdeps;
>   	struct bufobj *bo;
> + 	int flushed_count = 0;
>
>   	fs = ump->um_fs;
>   	if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {	/* XXX */
> ***************
> *** 1174,1179 ****
> --- 1175,1184 ----
>   			allerror = error;
>   		vput(vp);
>   		MNT_ILOCK(mp);
> + 		if (flushed_count++ > 500) {
> + 			flushed_count = 0;
> + 			msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
> + 		}
>   	}
>   	MNT_IUNLOCK(mp);
>
> -DG
>
> David G. Lawrence
> President
> Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
> The FreeBSD Project - http://www.freebsd.org
> Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, David G Lawrence wrote:
> One more comment on my last email... The patch that I included is not
> meant as a real fix - it is just a bandaid. The real problem appears
> to be that a very large number of vnodes (all of them?) are getting
> synced (i.e. calling ffs_syncvnode()) every time. This should normally
> only happen for dirty vnodes. I suspect that something is broken with
> this check:
>
> 	if (vp->v_type == VNON || ((ip->i_flag &
> 	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
> 	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
> 		VI_UNLOCK(vp);
> 		continue;
> 	}

Isn't it just the O(N) algorithm with N quite large? Under ~5.2, on a 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500 vnodes, which would be explained by the above (and the VI_LOCK() and loop overhead) taking 171 ns per vnode. I would expect it to take more like 20 ns per vnode for UP and 60 for SMP.

The comment before this code shows that the problem is known, and says that a subroutine call cannot be afforded unless there is work to do, but the locking accesses look like subroutine calls, have subroutine calls in their internals, and take longer than simple subroutine calls in the SMP case even when they don't make subroutine calls. (IIRC, on A64 a minimal subroutine call takes 4 cycles while a minimal locked instruction takes 18 cycles; subroutine calls are only slow when their branches are mispredicted.)

Bruce
Re: Packet loss every 30.999 seconds
Bruce Evans wrote:
> On Mon, 17 Dec 2007, David G Lawrence wrote:
>> One more comment on my last email... The patch that I included is not
>> meant as a real fix - it is just a bandaid. The real problem appears
>> to be that a very large number of vnodes (all of them?) are getting
>> synced (i.e. calling ffs_syncvnode()) every time. This should
>> normally only happen for dirty vnodes. I suspect that something is
>> broken with this check:
>>
>> 	if (vp->v_type == VNON || ((ip->i_flag &
>> 	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
>> 	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
>> 		VI_UNLOCK(vp);
>> 		continue;
>> 	}
>
> Isn't it just the O(N) algorithm with N quite large? Under ~5.2, on a
> 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500
> vnodes, which would be explained by the above (and the VI_LOCK() and
> loop overhead) taking 171 ns per vnode. I would expect it to take more
> like 20 ns per vnode for UP and 60 for SMP.
>
> The comment before this code shows that the problem is known, and says
> that a subroutine call cannot be afforded unless there is work to do,
> but the locking accesses look like subroutine calls, have subroutine
> calls in their internals, and take longer than simple subroutine calls
> in the SMP case even when they don't make subroutine calls. (IIRC, on
> A64 a minimal subroutine call takes 4 cycles while a minimal locked
> instruction takes 18 cycles; subroutine calls are only slow when their
> branches are mispredicted.)
>
> Bruce

Right, it's a non-optimal loop when N is very large, and that's a fairly well understood problem. I think what DG was getting at, though, is that this massive flush happens every time the syncer runs, which doesn't seem correct. Sure, maybe you just rsynced 100,000 files 20 seconds ago, so the upcoming flush is going to be expensive. But the next flush 30 seconds after that shouldn't be just as expensive, yet it appears to be so.

This is further supported by the original poster's claim that it takes many hours of uptime before the problem becomes noticeable. If vnodes are never truly getting cleaned, or never getting their flags cleared so that this loop knows that they are clean, then it's feasible that they'll accumulate over time, keep on getting flushed every 30 seconds, keep on bogging down the loop, and so on.

Scott
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, David G Lawrence wrote:
>> While trying to diagnose a packet loss problem in a RELENG_6 snapshot
>> dated November 8, 2007 it looks like I've stumbled across a broken
>> driver or kernel routine which stops interrupt processing long enough
>> to severely degrade network performance every 30.99 seconds.

I see the same behaviour under a heavily modified version of FreeBSD-5.2 (except the period was 2 ms longer and the latency was 7 ms instead of 11 ms when numvnodes was at a certain value). Now with numvnodes = 17500, the latency is 3 ms.

> I noticed this as well some time ago. The problem has to do with the
> processing (syncing) of vnodes. When the total number of allocated
> vnodes in the system grows to tens of thousands, the ~31 second
> periodic sync process takes a long time to run. Try this patch and let
> people know if it helps your problem. It will periodically wait for
> one tick (1ms) every 500 vnodes of processing, which will allow other
> things to run.

However, the syncer should be running at a relatively low priority and not cause packet loss. I don't see any packet loss even in ~5.2, where the network stack (but not drivers) is still Giant-locked. Other too-high latencies showed up:

- syscons LED setting and vt switching gives a latency of 5.5 msec because syscons still uses busy-waiting for setting LEDs :-(. Oops, I do see packet loss -- this causes it under ~5.2 but not under -current. For the bge and/or em drivers, the packet loss shows up in netstat output as a few hundred errors for every LED setting on the receiving machine, while receiving tiny packets at the maximum possible rate of 640 kpps. sysctl is completely Giant-locked and so are the upper layers of the network stack. The bge hardware rx ring size is 256 in -current and 512 in ~5.2. At 640 kpps, 512 packets take 800 us, so bge wants to call the upper layers with a latency of far below 800 us. I don't know exactly where the upper layers block on Giant.

- a user CPU hog process gives a latency of over 200 ms every half a second or so when the hog starts up, and 300-400 ms after the hog has been running for some time. Two user CPU hog processes double the latency. Reducing kern.sched.quantum from 100 ms to 10 ms and/or renicing the hogs don't seem to affect this. Running the hogs at idle priority fixes this. This won't affect packet loss, but it might affect user network processes -- they might need to run at real-time priority to get low enough latency. They might need to do this anyway -- a scheduling quantum of 100 ms should give a latency of 100 ms per CPU hog quite often, though not usually, since the hogs should never be preferred to a higher-priority process.

Previously I've used a less specialized clock-watching program to determine the syscall latency. It showed similar problems for CPU hogs. I just remembered that I found the fix for these under ~5.2 -- remove a local hack that sacrifices latency for reduced context switches between user threads. -current with SCHED_4BSD does this non-hackishly, but seems to have a bug somewhere that gives a latency that is large enough to be noticeable in interactive programs.

Bruce
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, Mark Fullmer wrote: Thanks. Have a kernel building now. It takes about a day of uptime after reboot before I'll see the problem. Yes, run "find / >/dev/null" to see the problem if it is the syncer one. At least the syscall latency problem does seem to be this. Under ~5.2, with the above find and also "while :; do sync; done" (to give latency spikes more often), your program (with some fflush(stdout)'s and args 1 7700) gives:
% 1197976029041677 12696 0
% 1197976033196396 9761 4154719
% 1197976034060031 13360 863635
% 1197976039080632 13749 5020601
% 1197976043195594 8536 4114962
% 1197976044100601 13505 905007
% 1197976049121870 14562 5021269
% 1197976052195631 8192 3073761
% 1197976054141545 14024 1945914
% 1197976059162357 14623 5020812
% 1197976063195735 7830 4033378
% 1197976064182564 14618 986829
% 1197976069202982 14823 5020418
% 1197976074223722 15350 5020740
% 1197976079244311 15726 5020589
% 1197976084264690 15893 5020379
% 1197976089289409 15058 5024719
% 1197976094315433 16209 5026024
% 1197976095197277 8015 881844
% 1197976099335529 16092 4138252
% 1197976104356513 16863 5020984
% 1197976109376236 16373 5019723
% 1197976114396803 16727 5020567
% 1197976119416822 16533 5020019
% 1197976124437790 17288 5020968
% 1197976126200637 10060 1762847
% 1197976127198459 7839 997822
% 1197976129457321 16606 2258862
% 1197976134477582 16654 5020261
This clearly shows the spike every 5 seconds, and the latency creeping up as vfs.numvnodes increases. It started at about 2 and ended at about 64000. The syncer won't be fixed soon, so the fix for dropped packets requires figuring out why the syncer affects networking. Bruce
Re: Packet loss every 30.999 seconds
On Mon, 17 Dec 2007, Scott Long wrote: Bruce Evans wrote: On Mon, 17 Dec 2007, David G Lawrence wrote: One more comment on my last email... The patch that I included is not meant as a real fix - it is just a bandaid. The real problem appears to be that a very large number of vnodes (all of them?) are getting synced (i.e. calling ffs_syncvnode()) every time. This should normally only happen for dirty vnodes. I suspect that something is broken with this check: if (vp->v_type == VNON || ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 && vp->v_bufobj.bo_dirty.bv_cnt == 0)) { VI_UNLOCK(vp); continue; } Isn't it just the O(N) algorithm with N quite large? Under ~5.2, on Right, it's a non-optimal loop when N is very large, and that's a fairly well understood problem. I think what DG was getting at, though, is that this massive flush happens every time the syncer runs, which doesn't seem correct. Sure, maybe you just rsynced 100,000 files 20 seconds ago, so the upcoming flush is going to be expensive. But the next flush 30 seconds after that shouldn't be just as expensive, yet it appears to be so. I'm sure it doesn't cause many bogus flushes. iostat shows zero writes caused by calling this incessantly using "while :; do sync; done". This is further supported by the original poster's claim that it takes many hours of uptime before the problem becomes noticeable. If vnodes are never truly getting cleaned, or never getting their flags cleared so that this loop knows that they are clean, then it's feasible that they'll accumulate over time, keep on getting flushed every 30 seconds, keep on bogging down the loop, and so on. 
Using "find / >/dev/null" to grow the problem and make it bad after a few seconds of uptime, and profiling a single sync(2) call, shows that nothing much is done except the loop containing the above. Under ~5.2, on a 2.2GHz A64 UP in i386 mode: after booting, with about 700 vnodes:
% %    cumulative  self             self     total
% time   seconds   seconds   calls  ns/call  ns/call  name
% 30.8  0.000  0.000  0  100.00%  mcount [4]
% 14.9  0.001  0.000  0  100.00%  mexitcount [5]
% 5.5  0.001  0.000  0  100.00%  cputime [16]
% 5.0  0.001  0.000  6  13312  13312  vfs_msync [18]
% 4.3  0.001  0.000  0  100.00%  user [21]
% 3.5  0.001  0.000  5  11321  11993  ffs_sync [23]
after "find / >/dev/null" was stopped after saturating at 64000 vnodes (desiredvnodes is 70240):
% %    cumulative  self             self     total
% time   seconds   seconds   calls  ns/call  ns/call  name
% 50.7  0.008  0.008  5  1666427  1667246  ffs_sync [5]
% 38.0  0.015  0.006  6  1041217  1041217  vfs_msync [6]
% 3.1  0.015  0.001  0  100.00%  mcount [7]
% 1.5  0.015  0.000  0  100.00%  mexitcount [8]
% 0.6  0.015  0.000  0  100.00%  cputime [22]
% 0.6  0.016  0.000  34  2660  2660  generic_bcopy [24]
% 0.5  0.016  0.000  0  100.00%  user [26]
vfs_msync() is a problem too. It uses an almost identical loop for the case where the vnode is not dirty (but has a different condition for being dirty). ffs_sync() is called 5 times because there are 5 ffs file systems mounted r/w. There is another ffs file system mounted r/o and that combined with a missing r/o optimization might give the extra call to vfs_msync(). With 64000 vnodes, the calls take 1-2 ms each. That is already quite a lot, and there are many calls. Each call only looks at vnodes under the mount point so the number of mounted file systems doesn't affect the total time much. ffs_sync() is taking 125 ns per vnode. That is more than I would have expected. Bruce
Re: Packet loss every 30.999 seconds
> >Right, it's a non-optimal loop when N is very large, and that's a fairly > >well understood problem. I think what DG was getting at, though, is > >that this massive flush happens every time the syncer runs, which > >doesn't seem correct. Sure, maybe you just rsynced 100,000 files 20 > >seconds ago, so the upcoming flush is going to be expensive. But the > >next flush 30 seconds after that shouldn't be just as expensive, yet it > >appears to be so. > > I'm sure it doesn't cause many bogus flushes. iostat shows zero writes > caused by calling this incessantly using "while :; do sync; done". I didn't say it caused any bogus disk I/O. My original problem (after a day or two of uptime) was an occasional large scheduling delay for a process that needed to process VoIP frames in real-time. It was happening every 31 seconds and was causing voice frames to be dropped due to the large latency causing the frame to be outside of the jitter window. I wrote a program that measures the scheduling delay by sleeping for one tick and then comparing the timeofday offset from what was expected. This revealed that every 31 seconds, the process was seeing a 17ms delay in scheduling. Further investigation found that 1) the syncer was the process that was running every 31 seconds and causing the delay (and it was the only one in the system with that timing interval), and that 2) lowering the kern.maxvnodes to something lowish (5000) would mostly mitigate the problem. The patch to limit the number of vnodes to process in the loop before sleeping was then developed and it completely resolved the problem. Since the wait that I added is at the bottom of the loop and the limit is 500 vnodes, this tells me that every 31 seconds, there are a whole lot of vnodes that are being "synced", when there shouldn't have been any (this fact wasn't apparent to me at the time, but when I later realized this, I had no time to investigate further). 
My tests and analysis have all been on an otherwise quiet system (no disk I/O), so the bottom of the ffs_sync vnode loop should not have been reached at all, let alone tens of thousands of times every 31 seconds. All machines were uniprocessor, FreeBSD 6+. I don't know if this problem is present in 5.2. I didn't see ffs_syncvnode in your call graph, so it probably is not. Anyway, someone needs to instrument the vnode loop in ffs_sync and figure out what is going on. As you've pointed out, it is necessary to first read a lot of files (I use tar to /dev/null and make sure it reads at least 100K files) in order to get the vnodes allocated. As I mentioned previously, I suspect that either ip->i_flag is not getting completely cleared in ffs_syncvnode or its children or v_bufobj.bo_dirty.bv_cnt accounting is broken. -DG David G. Lawrence President Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500 The FreeBSD Project - http://www.freebsd.org Pave the road of life with opportunities.
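The measurement DG describes -- sleep for one tick, then compare the clock against the expected wakeup time -- can be sketched in a few lines of C. This is a hypothetical reconstruction using clock_gettime() and nanosleep(), not the original program:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/*
 * Sleep for req_ns nanoseconds and return how much longer than
 * requested the sleep actually took (the scheduling delay), in ns.
 * A quiet system should overshoot by roughly one tick; a 17 ms
 * overshoot every 31 seconds points at something like the syncer
 * monopolizing the CPU.
 */
long long
oversleep_ns(long long req_ns)
{
	struct timespec req, t0, t1;
	long long elapsed;

	req.tv_sec = req_ns / 1000000000LL;
	req.tv_nsec = req_ns % 1000000000LL;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	nanosleep(&req, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	    (t1.tv_nsec - t0.tv_nsec);
	return (elapsed - req_ns);
}
```

Running this once per tick and logging any result above a few milliseconds reproduces the kind of trace that exposed the 31-second delay.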
Re: Packet loss every 30.999 seconds
> Thanks. Have a kernel building now. It takes about a day of uptime > after reboot before I'll see the problem. You may also wish to try to get the problem to occur sooner after boot on a non-patched system by doing a "tar cf /dev/null /" (note: substitute /dev/zero instead of /dev/null, if you use GNU tar, to disable its "optimization"). You can stop it after it has gone through 100K files. Verify by looking at "sysctl vfs.numvnodes". Doing this would help to further prove that lots of allocated vnodes is the prerequisite for the problem. -DG
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: Thanks. Have a kernel building now. It takes about a day of uptime after reboot before I'll see the problem. You may also wish to try to get the problem to occur sooner after boot on a non-patched system by doing a "tar cf /dev/null /" (note: substitute /dev/zero instead of /dev/null, if you use GNU tar, to disable its "optimization"). You can stop it after it has gone through 100K files. Verify by looking at "sysctl vfs.numvnodes". Hmm, I said to use "find /", but that is not so good since it only looks at directories, and directories (and their inodes) are not packed as tightly as files (and their inodes). Optimized tar, or "find / -type f", or "ls -lR /", should work best, by doing not much more than stat()ing lots of files, while full tar wastes time reading file data. Bruce
Re: Packet loss every 30.999 seconds
> On Tue, 18 Dec 2007, David G Lawrence wrote: > > >>Thanks. Have a kernel building now. It takes about a day of uptime > >>after reboot before I'll see the problem. > > > > You may also wish to try to get the problem to occur sooner after boot > >on a non-patched system by doing a "tar cf /dev/null /" (note: substitute > >/dev/zero instead of /dev/null, if you use GNU tar, to disable its > >"optimization"). You can stop it after it has gone through a 100K files. > >Verify by looking at "sysctl vfs.numvnodes". > > Hmm, I said to use "find /", but that is not so good since it only > looks at directories and directories (and their inodes) are not packed > as tightly as files (and their inodes). Optimized tar, or "find / > -type f", or "ls -lR /", should work best, by doing not much more than > stat()ing lots of files, while full tar wastes time reading file data. I have no reason to believe that just reading directories will reproduce the problem with file vnodes. You need to open the files and read them. Nothing else will do. -DG
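DG's point -- that the vnodes only accumulate the relevant state if the files are actually opened and read -- suggests a small userland driver for populating the vnode cache. The following is a hypothetical sketch using nftw(); the function name and the limit parameter are assumptions, not anything from the thread:

```c
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <ftw.h>
#include <unistd.h>

static int files_read;
static int files_limit;

/* nftw() callback: open each regular file and read one byte. */
static int
read_one(const char *path, const struct stat *sb, int typeflag,
    struct FTW *ftwbuf)
{
	char c;
	int fd;

	(void)sb; (void)ftwbuf;
	if (typeflag == FTW_F) {
		fd = open(path, O_RDONLY);
		if (fd >= 0) {
			(void)read(fd, &c, 1);
			close(fd);
			files_read++;
		}
	}
	/* A nonzero return stops the walk once the limit is hit. */
	return (files_read >= files_limit);
}

/*
 * Walk `root`, opening and reading up to `limit` regular files so
 * each one gets a vnode with cached data.  Returns the number of
 * files actually read, or -1 if the walk could not start.
 */
int
touch_files(const char *root, int limit)
{
	files_read = 0;
	files_limit = limit;
	if (nftw(root, read_one, 16, FTW_PHYS) < 0 && files_read == 0)
		return (-1);
	return (files_read);
}
```

Something like touch_files("/", 100000) would play the role of the "tar cf /dev/null /" trick, with vfs.numvnodes checked afterwards.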
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: I didn't say it caused any bogus disk I/O. My original problem (after a day or two of uptime) was an occasional large scheduling delay for a process that needed to process VoIP frames in real-time. It was happening every 31 seconds and was causing voice frames to be dropped due to the large latency causing the frame to be outside of the jitter window. I wrote a program that measures the scheduling delay by sleeping for one tick and then comparing the timeofday offset from what was expected. This revealed that every 31 seconds, the process was seeing a 17ms delay in scheduling. Further investigation found that 1) the I got an almost identical delay (with 64000 vnodes). Now, 17ms isn't much. Delays must have been much longer when CPUs were many times slower and RAM/vnodes were not so many times smaller. High-priority threads just need to be able to preempt the syncer so that they don't lose data (unless really hard real time is supported, which it isn't). This should work starting with about FreeBSD-6 (probably need "options PREEMPT"). It doesn't work in ~5.2 due to Giant locking, but I find Giant locking to rarely matter for UP. Old versions of FreeBSD were only able to preempt to non-threads (interrupt handlers) yet they somehow survived the longer delays. They didn't have Giant locking to get in the way, and presumably avoided packet loss by doing lots in interrupt handlers (hardware isr and netisr). I just remembered that I have seen packet loss even under -current when I leave out or turn off "options PREEMPT". ... and it completely resolved the problem. Since the wait that I added is at the bottom of the loop and the limit is 500 vnodes, this tells me that every 31 seconds, there are a whole lot of vnodes that are being "synced", when there shouldn't have been any (this fact wasn't apparent to me at the time, but when I later realized this, I had no time to investigate further). 
My tests and analysis have all been on an otherwise quiet system (no disk I/O), so the bottom of the ffs_sync vnode loop should not have been reached at all, let alone tens of thousands of times every 31 seconds. All machines were uniprocessor, FreeBSD 6+. I don't know if this problem is present in 5.2. I didn't see ffs_syncvnode in your call graph, so it probably is not. I chopped to a float profile with only top callers. Any significant calls from ffs_sync() would show up as top callers. I still have the data, and the call graph shows much more clearly that there was just one dirty vnode for the whole sync():
% 0.00  0.01  1/1  syscall [3]
% [4]  88.7  0.00  0.01  1  sync [4]
% 0.01  0.00  5/5  ffs_sync [5]
% 0.01  0.00  6/6  vfs_msync [6]
% 0.00  0.00  7/8  vfs_busy [260]
% 0.00  0.00  7/8  vfs_unbusy [263]
% 0.00  0.00  6/7  vn_finished_write [310]
% 0.00  0.00  6/6  vn_start_write [413]
% 0.00  0.00  1/1  vfs_stdnosync [472]
%
% ---
%
% 0.01  0.00  5/5  sync [4]
% [5]  50.7  0.01  0.00  5  ffs_sync [5]
% 0.00  0.00  1/1  ffs_fsync [278]
% 0.00  0.00  1/60  vget [223]
% 0.00  0.00  1/60  ufs_vnoperatespec [78]
% 0.00  0.00  1/26  vrele [76]
It passed the flags test just once to get to the vget(). ffs_syncvnode() doesn't exist in 5.2, and ffs_fsync() is called instead.
%
% ---
%
% 0.01  0.00  6/6  sync [4]
% [6]  38.0  0.01  0.00  6  vfs_msync [6]
%
% ---
% ...
%
% 0.00  0.00  1/1  ffs_sync [5]
% [278]  0.0  0.00  0.00  1  ffs_fsync [278]
% 0.00  0.00  1/1  ffs_update [368]
% 0.00  0.00  1/4  vn_isdisk [304]
This is presumably to sync the 1 dirty vnode. BTW I use noatime a lot, including for all file systems used in the test, so the tree walk didn't dirty any vnodes. A tar to /dev/zero would dirty all vnodes if everything were mounted without this option.
% ...
% %    cumulative  self             self     total
% time   seconds   seconds   calls  ns/call  ns/call  name
% 50.7  0.008  0.008  5  1666427  1667246  ffs_sync [5]
% 38.0  0.015  0.006  6  1041217  1041217  vfs_msync [6]
% 3.1  0.015  0.001  0  100.00%  mcount [7]
% 1.5  0.015  0.000  0  100.00%
Re: Packet loss every 30.999 seconds
> I got an almost identical delay (with 64000 vnodes). > > Now, 17ms isn't much. Says you. On modern systems, trying to run a pseudo real-time application on an otherwise quiescent system, 17ms is just short of an eternity. I agree that the syncer should be preemptable (which is what my bandaid patch attempts to do), but that probably wouldn't have helped my specific problem since my application was a user process, not a kernel thread. All of my systems have options PREEMPTION - that is the default in 6+. It doesn't affect this problem. On the other hand, the syncer shouldn't be consuming this much CPU in the first place. There is obviously a bug here. Of course looking through all of the vnodes in the system for something dirty is stupid in the first place; there should be a separate list for that. ...but a simple fix is what is needed right now. I'm going to have to bow out of this discussion now. I just don't have the time for it. -DG
Re: Packet loss every 30.999 seconds
> > I got an almost identical delay (with 64000 vnodes). > > > > Now, 17ms isn't much. > >Says you. On modern systems, trying to run a pseudo real-time application > on an otherwise quiescent system, 17ms is just short of an eternity. I agree > that the syncer should be preemptable (which is what my bandaid patch > attempts to do), but that probably wouldn't have helped my specific problem > since my application was a user process, not a kernel thread. One more followup (I swear I'm done, really!)... I have a laptop here that runs at 150MHz when it is in the lowest running CPU power save mode. At that speed, this bug causes a delay of more than 300ms and is enough to cause loss of keyboard input. I have to switch into high speed mode before I try to type anything, else I end up with random typos. Very annoying. -DG
Re: Packet loss every 30.999 seconds
A little progress. I have a machine with a KTR enabled kernel running. Another machine is running David's ffs_vfsops.c patch. I left two other machines (GENERIC kernels) running the packet loss test overnight. At ~32480 seconds of uptime the problem starts. This is really close to a 16 bit overflow... See http://www.eng.oar.net/~maf/bsd6/p1.png and http://www.eng.oar.net/~maf/bsd6/p2.png. The missing impulses at 31 second marks are the intervals between test runs. The window of missing packets (timestamps between two packets where a sequence number is missing) is usually less than 4us, although I'm not sure gettimeofday() can be trusted for measuring this. See https://www.eng.oar.net/~maf/bsd6/p3.png Things I'll try tonight:
o check on the patched kernel
o Try KTR debugging enabled before and after an expected high latency period.
o Dump all files to /dev/null to trigger the behavior. I would expect the vnode problem to look a little different on the packet loss graphs over time.
If this leads anywhere I'll add a counter before the msleep() and see how often it's getting there. On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote: I noticed this as well some time ago. The problem has to do with the processing (syncing) of vnodes. When the total number of allocated vnodes in the system grows to tens of thousands, the ~31 second periodic sync process takes a long time to run. Try this patch and let people know if it helps your problem. It will periodically wait for one tick (1ms) every 500 vnodes of processing, which will allow other things to run. 
Index: ufs/ffs/ffs_vfsops.c
===
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.290.2.16
diff -c -r1.290.2.16 ffs_vfsops.c
*** ufs/ffs/ffs_vfsops.c	9 Oct 2006 19:47:17 -	1.290.2.16
--- ufs/ffs/ffs_vfsops.c	25 Apr 2007 01:58:15 -
***
*** 1109,1114 ****
--- 1109,1115 ----
  	int softdep_deps;
  	int softdep_accdeps;
  	struct bufobj *bo;
+ 	int flushed_count = 0;
  	fs = ump->um_fs;
  	if (fs->fs_fmod != 0 && fs->fs_ronly != 0) { /* XXX */
***
*** 1174,1179 ****
--- 1175,1184 ----
  			allerror = error;
  		vput(vp);
  		MNT_ILOCK(mp);
+ 		if (flushed_count++ > 500) {
+ 			flushed_count = 0;
+ 			msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
+ 		}
  	}
  	MNT_IUNLOCK(mp);
  	/*
-DG
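Mark's tester counts lost packets by watching an incrementing per-packet sequence number and opening a "missing window" whenever the number jumps. The core of that bookkeeping can be sketched as below; this is a hypothetical helper for illustration, not code from the actual tool:

```c
#include <stdint.h>

/*
 * Given the last sequence number seen and the one just received,
 * return how many packets went missing in between (0 if none).
 * A nonzero result opens a "missing window"; the timestamps of
 * these two packets bound the window in time.  Unsigned arithmetic
 * handles 32-bit sequence wraparound for free.
 */
uint32_t
seq_gap(uint32_t prev, uint32_t cur)
{
	uint32_t delta = cur - prev;

	return (delta > 0 ? delta - 1 : 0);
}
```

With a receiver calling this per packet and recording gettimeofday() timestamps on each side of a nonzero gap, the window_start/window_end output in the thread falls out directly.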
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, David G Lawrence wrote: I got an almost identical delay (with 64000 vnodes). Now, 17ms isn't much. Says you. On modern systems, trying to run a pseudo real-time application on an otherwise quiescent system, 17ms is just short of an eternity. I agree that the syncer should be preemptable (which is what my bandaid patch attempts to do), but that probably wouldn't have helped my specific problem since my application was a user process, not a kernel thread. FreeBSD isn't a real-time system, and 17ms isn't much for it. I saw lots of syscall delays of nearly 1 second while debugging this. (With another hat, I would say that 17 us was a long time in 1992. 17 us is hundreds of times longer now.) One more followup (I swear I'm done, really!)... I have a laptop here that runs at 150MHz when it is in the lowest running CPU power save mode. At that speed, this bug causes a delay of more than 300ms and is enough to cause loss of keyboard input. I have to switch into high speed mode before I try to type anything, else I end up with random typos. Very annoying. Yes, something is wrong if keystrokes are lost with CPUs that run at 150 kHz (sic) or faster. Debugging shows that the problem is like I said. The loop really does take 125 ns per iteration. This time is actually not very much. The linked list of vnodes could hardly be designed better to maximize cache thrashing. My system has a fairly small L2 cache (512K or 1M), and even a few words from the vnode and the inode don't fit in the L2 cache when there are 64000 vnodes, but the vp and ip are also fairly well designed to maximize cache thrashing, so L2 cache thrashing starts at just a few thousand vnodes. 
My system has fairly low latency main memory, else the problem would be larger:
% Memory latencies in nanoseconds - smaller is better
% (WARNING - may not be correct, check graphs)
% ---
% Host       OS            Mhz   L1 $    L2 $    Main mem   Guesses
% -          -             -     -       -       ---        -
% besplex.b  FreeBSD 7.0-C 2205  1.361   5.6090  42.4       [PC3200 CL2.5 overclocked]
% sledge.fr  FreeBSD 8.0-C 1802  1.666   8.9420  99.8
% freefall.  FreeBSD 7.0-C 2778  0.746   6.6310  155.5
The loop makes the following memory accesses, at least in 5.2:
% loop:
% 	for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
% 		/*
% 		 * If the vnode that we are about to sync is no longer
% 		 * associated with this mount point, start over.
% 		 */
% 		if (vp->v_mount != mp)
% 			goto loop;
%
% 		/*
% 		 * Depend on the mntvnode_slock to keep things stable enough
% 		 * for a quick test. Since there might be hundreds of
% 		 * thousands of vnodes, we cannot afford even a subroutine
% 		 * call unless there's a good chance that we have work to do.
% 		 */
% 		nvp = TAILQ_NEXT(vp, v_nmntvnodes);
Access 1 word at vp offset 0x90. Costs 1 cache line. IIRC, my system has a cache line size of 0x40. Assume this, and that vp is aligned on a cache line boundary. So this access costs the cache line at vp offsets 0x80-0xbf.
% 		VI_LOCK(vp);
Access 1 word at vp offset 0x1c. Costs the cache line at vp offsets 0-0x3f.
% 		if (vp->v_iflag & VI_XLOCK) {
Access 1 word at vp offset 0x24. Cache hit.
% 			VI_UNLOCK(vp);
% 			continue;
% 		}
% 		ip = VTOI(vp);
Access 1 word at vp offset 0xa8. Cache hit.
% 		if (vp->v_type == VNON || ((ip->i_flag &
Access 1 word at vp offset 0xa0. Cache hit. Access 1 word at ip offset 0x18. Assume that ip is aligned, as above. Costs the cache line at ip offsets 0-0x3f.
% 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
% 		    TAILQ_EMPTY(&vp->v_dirtyblkhd))) {
Access 1 word at vp offset 0x48. Costs the cache line at vp offsets 0x40-0x7f.
% 			VI_UNLOCK(vp);
Reaccess 1 word at vp offset 0x1c. Cache hit.
% 			continue;
% 		}
The total cost is 4 cache lines or 256 bytes per vnode. 
So with an L2 cache size of 1MB, the L2 cache will start thrashing at numvnodes = 4096. With thrashing, at my main memory latency of 42.4 nsec, it might take 4*42.4 = 169.6 nsec to read main memory. This is similar to my observed time. Presumably things aren't quite that bad because there is some locality for the 3 lines in each vp. It might be possible to improve this a bit by accessing the lines sequentially and not interleaving the access to ip. Better, repack vp and move the IN* flags from ip to vp (a change that has other advantages), so that everything is in 1 cache line per vp. This isn't consistent with the delay increasing to 300 ms when the CPU is throttled -- memory should
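Bruce's per-vnode cost estimate can be illustrated with a small userland experiment: traverse a pointer-linked list whose nodes are allocated in shuffled order and whose touched fields span several cache lines, mimicking the scattered vnode/inode accesses. This is an illustrative sketch only; the node size, count, and shuffle are assumptions:

```c
#define _POSIX_C_SOURCE 199309L
#include <stdlib.h>
#include <time.h>

/* A node padded so the fields touched span multiple cache lines,
 * roughly mimicking the scattered vp/ip word accesses above. */
struct node {
	struct node *next;
	char pad1[120];
	int flag;
	char pad2[120];
};

/*
 * Build a list of n nodes linked in shuffled (non-sequential) memory
 * order, traverse it touching next and flag, and return the average
 * traversal cost per node in nanoseconds.
 */
double
list_ns_per_node(int n)
{
	struct node **slots = malloc(n * sizeof(*slots));
	struct timespec t0, t1;
	struct node *head, *p;
	long long ns;
	int i, j, sum = 0;

	for (i = 0; i < n; i++) {
		slots[i] = malloc(sizeof(struct node));
		slots[i]->flag = 0;
	}
	/* Fisher-Yates shuffle so list order != address order. */
	srand(1);
	for (i = n - 1; i > 0; i--) {
		struct node *tmp;
		j = rand() % (i + 1);
		tmp = slots[i]; slots[i] = slots[j]; slots[j] = tmp;
	}
	for (i = 0; i < n - 1; i++)
		slots[i]->next = slots[i + 1];
	slots[n - 1]->next = NULL;
	head = slots[0];

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (p = head; p != NULL; p = p->next)
		sum += p->flag;	/* keeps the traversal live */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
	    (t1.tv_nsec - t0.tv_nsec);

	for (i = 0; i < n; i++)
		free(slots[i]);
	free(slots);
	return ((double)ns / n + sum * 0.0);
}
```

Once the working set outgrows the L2 cache, the per-node time jumps from a few nanoseconds toward the main-memory latency, which is the effect being described for the vnode loop.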
Re: Packet loss every 30.999 seconds
On Tue, 18 Dec 2007, Mark Fullmer wrote: A little progress. I have a machine with a KTR enabled kernel running. Another machine is running David's ffs_vfsops.c patch. I left two other machines (GENERIC kernels) running the packet loss test overnight. At ~32480 seconds of uptime the problem starts. This is really Try it with "find / -type f >/dev/null" to duplicate the problem almost instantly. marks are the intervals between test runs. The window of missing packets (timestamps between two packets where a sequence number is missing) is usually less than 4us, although I'm not sure gettimeofday() can be trusted for measuring this. See https://www.eng.oar.net/~maf/bsd6/p3.png gettimeofday() can normally be trusted to better than 1 us for time differences of up to about 1 second. However, gettimeofday() should not be used in any program written after clock_gettime() became standard in 1994. clock_gettime() has a resolution of 1 ns. It isn't quite that accurate on current machines, but I trust it to measure differences of 10 nsec between back to back clock_gettime() calls here. 
Sample output from wollman@'s old clock-watching program converted to clock_gettime():
%%%
2007/12/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 238, max 99730, mean 240.025380, std 77.291436
1th: 239 (1203207 observations)
2th: 240 (556307 observations)
3th: 241 (190211 observations)
4th: 238 (50091 observations)
5th: 242 (20 observations)

2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317
1th: 247 (1274231 observations)
2th: 248 (668611 observations)
3th: 249 (56950 observations)
4th: 250 (23 observations)
5th: 263 (8 observations)

2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400
1th: 264 (1343245 observations)
2th: 263 (626226 observations)
3th: 265 (26860 observations)
4th: 262 (3572 observations)
5th: 268 (8 observations)

2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440
1th: 261 (999391 observations)
2th: 320 (473325 observations)
3th: 262 (373831 observations)
4th: 321 (148126 observations)
5th: 312 (4759 observations)

2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301
1th: 838 (1685662 observations)
2th: 839 (136980 observations)
3th: 559 (72160 observations)
4th: 837 (48902 observations)
5th: 558 (31217 observations)

2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
1th: 4190 (1423885 observations)
2th: 4191 (440158 observations)
3th: 3352 (65261 observations)
4th: 5028 (39202 observations)
5th: 5029 (15456 observations)
%%%
"min" here gives the minimum latency of a clock_gettime() syscall. The improvement from 247 nsec to 240 nsec in the "mean" due to -O2 -mcpu=athlon-xp can be trusted to be measured very accurately since it is an average over more than 100 million trials, and the improvement from 247 nsec to 238 nsec for "min" can be trusted because it is consistent with the improvement in the mean. 
The program had to be converted to use clock_gettime() a few years ago when CPU speeds increased so much that the correct "min" became significantly less than 1. With gettimeofday(), it cannot distinguish between an overhead of 1 ns and an overhead of 1 us. For the ACPI and i8254 timecounters, you can see that the low-level timecounters have a low frequency clock from the large gaps between the observations. There is a gap of 279-280 ns for the acpi timecounter. This is the period of the acpi timecounter's clock (frequency 14318182/4 Hz = period 279.3651 ns). Since we can observe this period to within 1 ns, we must have a basic accuracy of nearly 1 ns, but if we make only 2 observations we are likely to have an inaccuracy of 279 ns due to the granularity of the clock. The TSC has a clock granularity of 6 ns on my CPU, and delivers almost that much accuracy with only 2 observations, but technical problems prevent general use of the TSC. Bruce
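The "min" statistic -- the cheapest back-to-back clock read -- is easy to reproduce in a few lines of C. This is a hypothetical sketch in the spirit of the clock-watching program, not wollman@'s actual code:

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/*
 * Call clock_gettime() back to back `iters` times and return the
 * smallest observed difference in nanoseconds.  This approximates
 * the syscall-plus-timecounter overhead; with a coarse timecounter
 * (ACPI, i8254) the observed gaps instead cluster at multiples of
 * the timecounter's period.
 */
long long
min_clock_gettime_gap_ns(int iters)
{
	struct timespec a, b;
	long long d, min = -1;
	int i;

	for (i = 0; i < iters; i++) {
		clock_gettime(CLOCK_MONOTONIC, &a);
		clock_gettime(CLOCK_MONOTONIC, &b);
		d = (b.tv_sec - a.tv_sec) * 1000000000LL +
		    (b.tv_nsec - a.tv_nsec);
		if (min < 0 || d < min)
			min = d;
	}
	return (min);
}
```

Collecting a histogram of the per-pair gaps, rather than just the minimum, reproduces the "1th:/2th:/..." observation counts shown above.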
Re: Packet loss every 30.999 seconds
> On Tue, 18 Dec 2007, David G Lawrence wrote: > > >>>I got an almost identical delay (with 64000 vnodes). > >>> > >>>Now, 17ms isn't much. > >> > >> Says you. On modern systems, trying to run a pseudo real-time > >> application > >>on an otherwise quiescent system, 17ms is just short of an eternity. I > >>agree > >>that the syncer should be preemptable (which is what my bandaid patch > >>attempts to do), but that probably wouldn't have helped my specific > >>problem > >>since my application was a user process, not a kernel thread. > > FreeBSD isn't a real-time system, and 17ms isn't much for it. I saw lots I never said it was, but that doesn't stop us from using FreeBSD in pseudo real-time applications. This is made possible by fast CPUs and dedicated-task systems where the load is carefully controlled. > of syscall delays of nearly 1 second while debugging this. (With another I can make the delay several minutes by pushing the reset button. > Debugging shows that the problem is like I said. The loop really does > take 125 ns per iteration. This time is actually not very much. The Considering that the CPU clock cycle time is on the order of 300ps, I would say 125ns to do a few checks is pathetic. In any case, it appears that my patch is a no-op, at least for the problem I was trying to solve. This has me confused, however, because at one point the problem was mitigated with it. The patch has gone through several iterations, however, and it could be that it was made to the top of the loop, before any of the checks, in a previous version. Hmmm. -DG
Re: Packet loss every 30.999 seconds
> Try it with "find / -type f >/dev/null" to duplicate the problem almost > instantly. FreeBSD used to have some code that would cause vnodes with no cached pages to be recycled quickly (which would have made a simple find ineffective without reading the files at least a little bit). I guess that got removed when the size of the vnode pool was dramatically increased. -DG
Re: Packet loss every 30.999 seconds
>    In any case, it appears that my patch is a no-op, at least for the
> problem I was trying to solve. This has me confused, however, because at
> one point the problem was mitigated with it. The patch has gone through
> several iterations, however, and it could be that it was made to the top
> of the loop, before any of the checks, in a previous version. Hmmm.

   (Replying to myself.) I just found an earlier version of the patch,
and sure enough, it was to the top of the loop. Unfortunately, that
version caused the system to crash because vp was occasionally invalid
after the wakeup.
   Anyway, let's see if Mark's packet loss problem is indeed related to
this code. If he does the find just after boot and immediately sees the
problem, then I would say that is fairly conclusive. He could also
release the cached vnodes by temporarily setting kern.maxvnodes=1 and
then setting it back to whatever it was previously (probably 6-10). If
the problem then goes away for a while, that would be another good
indicator.

-DG
Re: Packet loss every 30.999 seconds
David G Lawrence wrote:
>    Try it with "find / -type f >/dev/null" to duplicate the problem
> almost instantly. FreeBSD used to have some code that would cause vnodes
> with no cached pages to be recycled quickly (which would have made a
> simple find ineffective without reading the files at least a little
> bit). I guess that got removed when the size of the vnode pool was
> dramatically increased.

You can decrease vfs.wantfreevnodes if caching files without cached data
is not beneficial for your application.
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote:

> > Debugging shows that the problem is like I said. The loop really does
> > take 125 ns per iteration. This time is actually not very much. The
>
>    Considering that the CPU clock cycle time is on the order of 300ps,
> I would say 125ns to do a few checks is pathetic.

As I said, 125 nsec is a short time in this context. It is approximately
the time for a single L2 cache miss on a machine with slow memory like
freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns). As I said, the
code is organized so as to give about 4 L2 cache misses per vnode if
there are more than a few thousand vnodes, so it is doing very well to
take only 125 nsec for a few checks.

>    In any case, it appears that my patch is a no-op, at least for the
> problem I was trying to solve. This has me confused, however, because at
> one point the problem was mitigated with it. The patch has gone through
> several iterations, however, and it could be that it was made to the top
> of the loop, before any of the checks, in a previous version. Hmmm.

The patch should work fine. IIRC, it yields voluntarily so that other
things can run. I committed a similar hack for uiomove(). It was easy to
make syscalls that take many seconds (now tenths of seconds instead of
seconds?), and without yielding or PREEMPTION or multiple CPUs,
everything except interrupts has to wait for these syscalls. Now the main
problem is to figure out why PREEMPTION doesn't work. I'm not working on
this directly since I'm running ~5.2 where nearly-full kernel preemption
doesn't work due to Giant locking.

Bruce
Re: Packet loss every 30.999 seconds
On Dec 19, 2007, at 9:54 AM, Bruce Evans wrote:

> On Tue, 18 Dec 2007, Mark Fullmer wrote:
>
>> A little progress. I have a machine with a KTR enabled kernel running.
>> Another machine is running David's ffs_vfsops.c patch. I left two
>> other machines (GENERIC kernels) running the packet loss test
>> overnight. At ~32480 seconds of uptime the problem starts. This is
>> really
>
> Try it with "find / -type f >/dev/null" to duplicate the problem almost
> instantly.

I was able to verify last night that (cd /; tar -cpf -) > all.tar would
trigger the problem. I'm working on getting a test running with David's
ffs_sync() workaround now; adding a few counters there should get this
narrowed down a little more.

Thanks for the other info on timer resolution, I overlooked
clock_gettime().

--
mark
Re: Packet loss every 30.999 seconds
> >    In any case, it appears that my patch is a no-op, at least for the
> > problem I was trying to solve. This has me confused, however, because
> > at one point the problem was mitigated with it. The patch has gone
> > through several iterations, however, and it could be that it was made
> > to the top of the loop, before any of the checks, in a previous
> > version. Hmmm.
>
> The patch should work fine. IIRC, it yields voluntarily so that other
> things can run. I committed a similar hack for uiomove(). It was

   It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there are thousands of dirty vnodes.
While that condition can certainly happen, it isn't the case that I'm
particularly interested in.

> CPUs, everything except interrupts has to wait for these syscalls. Now
> the main problem is to figure out why PREEMPTION doesn't work. I'm
> not working on this directly since I'm running ~5.2 where nearly-full
> kernel preemption doesn't work due to Giant locking.

   I don't understand how PREEMPTION is supposed to work (I mean to any
significant detail), so I can't really comment on that.

-DG
Re: Packet loss every 30.999 seconds
> > Try it with "find / -type f >/dev/null" to duplicate the problem
> > almost instantly.
>
> I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> trigger the problem. I'm working on getting a test running with
> David's ffs_sync() workaround now, adding a few counters there should
> get this narrowed down a little more.

   Unfortunately, the version of the patch that I sent out isn't going
to help your problem. It needs to yield at the top of the loop, but vp
isn't necessarily valid after the wakeup from the msleep. That's a
problem that I'm having trouble figuring out a solution to - the
solutions that come to mind will all significantly increase the overhead
of the loop.
   As a very inadequate work-around, you might consider lowering
kern.maxvnodes to something like 2 - that might be low enough to not
trigger the problem, but also be high enough to not significantly affect
system I/O performance.

-DG
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote:

>    Try it with "find / -type f >/dev/null" to duplicate the problem
> almost instantly. FreeBSD used to have some code that would cause
> vnodes with no cached pages to be recycled quickly (which would have
> made a simple find ineffective without reading the files at least a
> little bit). I guess that got removed when the size of the vnode pool
> was dramatically increased.

It might still. The data should be cached somewhere, but caching it in
both the buffer cache/VMIO and the vnode/inode is wasteful. I may have
been only caching vnodes for directories. I switched to using a find or
a tar on /home/ncvs/ports since that has a very high density of
directories.

Bruce
Re: Packet loss every 30.999 seconds
On Thu, 20 Dec 2007, Bruce Evans wrote:

> On Wed, 19 Dec 2007, David G Lawrence wrote:
>
>>    Considering that the CPU clock cycle time is on the order of 300ps,
>> I would say 125ns to do a few checks is pathetic.
>
> As I said, 125 nsec is a short time in this context. It is
> approximately the time for a single L2 cache miss on a machine with
> slow memory like freefall (Xeon 2.8 GHz with L2 cache latency of
> 155.5 ns). As I said,

Perfmon counts for the cache misses during sync(1):

==> /tmp/kg1/z0 <==
vfs.numvnodes: 630
# s/kx-dc-accesses      484516
# s/kx-dc-misses         20852    (misses = 4%)

==> /tmp/kg1/z1 <==
vfs.numvnodes: 9246
# s/kx-dc-accesses      884361
# s/kx-dc-misses         89833    (misses = 10%)

==> /tmp/kg1/z2 <==
vfs.numvnodes: 20312
# s/kx-dc-accesses     1389959
# s/kx-dc-misses        178207    (misses = 13%)

==> /tmp/kg1/z3 <==
vfs.numvnodes: 80802
# s/kx-dc-accesses     4122411
# s/kx-dc-misses        658740    (misses = 16%)

==> /tmp/kg1/z4 <==
vfs.numvnodes: 138557
# s/kx-dc-accesses     7150726
# s/kx-dc-misses       1129997    (misses = 16%)

I forgot to only count active vnodes in the above. vfs.freevnodes was
small (< 5%). I set kern.maxvnodes to 20, but vfs.numvnodes saturated at
138557 (probably all that fits in kvm or main memory on i386 with 1GB
RAM).

With 138557 vnodes, a null sync(2) takes 39673 us according to kdump -R.
That is 35.1 ns per miss. This is consistent with lmbench2's estimate of
42.5 ns for main memory latency.

Watching vfs.*vnodes confirmed that vnode caching still works like you
said:

o "find /home/ncvs/ports -type f" only gives a vnode for each directory
o a repeated "find /home/ncvs/ports -type f" is fast because everything
  remains cached by VMIO. FreeBSD performed very badly at this benchmark
  before VMIO existed and was used for directories
o "tar cf /dev/zero /home/ncvs/ports" gives a vnode for files too.

Bruce
Re: Packet loss every 30.999 seconds
On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> > > Try it with "find / -type f >/dev/null" to duplicate the problem
> > > almost instantly.
> >
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar
> > would trigger the problem. I'm working on getting a test running with
> > David's ffs_sync() workaround now, adding a few counters there should
> > get this narrowed down a little more.
>
>    Unfortunately, the version of the patch that I sent out isn't going
> to help your problem. It needs to yield at the top of the loop, but vp
> isn't necessarily valid after the wakeup from the msleep. That's a
> problem that I'm having trouble figuring out a solution to - the
> solutions that come to mind will all significantly increase the
> overhead of the loop.
>    As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to not
> trigger the problem, but also be high enough to not significantly
> affect system I/O performance.

I think the following may be safe. It counts only the clean vnodes
scanned, and does not evaluate vp (which indeed may be reclaimed) after
the sleep.

I never booted with the change.
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..e686b97 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
 	struct ufsmount *ump = VFSTOUFS(mp);
 	struct fs *fs;
 	int error, count, wait, lockreq, allerror = 0;
+	int yield_count;
 	int suspend;
 	int suspended;
 	int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
 	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
 	MNT_ILOCK(mp);
+	yield_count = 0;
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		/*
 		 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,11 @@ loop:
 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
 		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
 			VI_UNLOCK(vp);
+			if (yield_count++ == 500) {
+				yield_count = 0;
+				msleep(&yield_count, MNT_MTX(mp), PZERO,
+				    "ffspause", 1);
+			}
 			continue;
 		}
 		MNT_IUNLOCK(mp);
Re: Packet loss every 30.999 seconds
On Wed, Dec 19, 2007 at 08:11:59PM +0200, Kostik Belousov wrote:
> On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> >    Unfortunately, the version of the patch that I sent out isn't
> > going to help your problem. It needs to yield at the top of the loop,
> > but vp isn't necessarily valid after the wakeup from the msleep.
> > That's a problem that I'm having trouble figuring out a solution to -
> > the solutions that come to mind will all significantly increase the
> > overhead of the loop.
> >    As a very inadequate work-around, you might consider lowering
> > kern.maxvnodes to something like 2 - that might be low enough to not
> > trigger the problem, but also be high enough to not significantly
> > affect system I/O performance.
>
> I think the following may be safe. It counts only the clean vnodes
> scanned, and does not evaluate vp (which indeed may be reclaimed) after
> the sleep.
>
> I never booted with the change.

Or, better, use uio_yield(). See below.
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..5d2535f 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
 	struct ufsmount *ump = VFSTOUFS(mp);
 	struct fs *fs;
 	int error, count, wait, lockreq, allerror = 0;
+	int yield_count;
 	int suspend;
 	int suspended;
 	int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
 	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
 	MNT_ILOCK(mp);
+	yield_count = 0;
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		/*
 		 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
 		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
 			VI_UNLOCK(vp);
+			if (yield_count++ == 500) {
+				MNT_IUNLOCK(mp);
+				yield_count = 0;
+				uio_yield();
+				goto relock_mp;
+			}
 			continue;
 		}
 		MNT_IUNLOCK(mp);
@@ -1247,6 +1255,7 @@ loop:
 		if ((error = ffs_syncvnode(vp, waitfor)) != 0)
 			allerror = error;
 		vput(vp);
+relock_mp:
 		MNT_ILOCK(mp);
 	}
 	MNT_IUNLOCK(mp);
Re: Packet loss every 30.999 seconds
Just to confirm, the patch did not change the behavior. I ran with it
last night and double checked this morning to make sure.

It looks like if you put the check at the top of the loop and the next
node is changed during msleep(), SLIST_NEXT will walk into the trash.
I'm in over my head here.

Setting kern.maxvnodes=1000 does stop both the periodic packet loss and
the high-latency syscalls, so it does look like walking this chain
without yielding the processor is part of the problem I'm seeing.

The other behavior I don't understand is why the em driver is able to
increment if_ipackets but still lose the packet. Dumping the internal
stats with dev.em.1.stats=1:

Dec 19 13:07:46 dytnq-nf1 kernel: em1: Excessive collisions = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Sequence errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Defer count = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Missed Packets = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive No Buffers = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive Length Errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Crc errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Alignment errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Collision/Carrier extension errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: RX overruns = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: watchdog timeouts = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XON Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XON Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XOFF Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XOFF Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Good Packets Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Good Packets Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: TSO Contexts Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: TSO Contexts Failed = 0

With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF in the
application. If packets were dropped they would show up with netstat -s
as "dropped due to full socket buffers". Since the packet never makes it
to ip_input() I no longer have any way to count drops. There will always
be corner cases where interrupts are lost and drops not accounted for if
the adapter hardware can't report them, but right now I've got no way to
estimate any loss.

--
mark

On Dec 19, 2007, at 12:13 PM, David G Lawrence wrote:

>>> Try it with "find / -type f >/dev/null" to duplicate the problem
>>> almost instantly.
>>
>> I was able to verify last night that (cd /; tar -cpf -) > all.tar
>> would trigger the problem. I'm working on getting a test running with
>> David's ffs_sync() workaround now, adding a few counters there should
>> get this narrowed down a little more.
>
>    Unfortunately, the version of the patch that I sent out isn't going
> to help your problem. It needs to yield at the top of the loop, but vp
> isn't necessarily valid after the wakeup from the msleep. That's a
> problem that I'm having trouble figuring out a solution to - the
> solutions that come to mind will all significantly increase the
> overhead of the loop.
>    As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to not
> trigger the problem, but also be high enough to not significantly
> affect system I/O performance.
Re: Packet loss every 30.999 seconds
On Wed, 19 Dec 2007, David G Lawrence wrote:

>> The patch should work fine. IIRC, it yields voluntarily so that other
>> things can run. I committed a similar hack for uiomove(). It was
>
>    It patches the bottom of the loop, which is only reached if the
> vnode is dirty. So it will only help if there are thousands of dirty
> vnodes. While that condition can certainly happen, it isn't the case
> that I'm particularly interested in.

Oops. When it reaches the bottom of the loop, it will probably block on
i/o sometimes, so that the problem is smaller anyway.

>> CPUs, everything except interrupts has to wait for these syscalls. Now
>> the main problem is to figure out why PREEMPTION doesn't work. I'm
>> not working on this directly since I'm running ~5.2 where nearly-full
>> kernel preemption doesn't work due to Giant locking.
>
>    I don't understand how PREEMPTION is supposed to work (I mean to any
> significant detail), so I can't really comment on that.

Me neither, but I will comment anyway :-). I think PREEMPTION should
even preempt kernel threads in favor of (higher priority of course) user
threads that are in the kernel, but doesn't do this now. Even interrupt
threads should have dynamic priorities so that when they become too
hoggish they can be preempted even by user threads, subject to this
priority rule. This is further from happening.

ffs_sync() can hold the mountpoint lock for a long time. That gives
problems preempting it. To move your fix to the top of the loop, I think
you just need to drop the mountpoint lock every few hundred iterations
while yielding. This would help for PREEMPTION too. Dropping the lock
must be safe because it is already done while flushing.

Hmm, the loop is nicely obfuscated and pessimized in current (see
rev.1.234). The fast (modulo no cache misses) path used to be just a
TAILQ_NEXT() to reach the next vnode, but now unnecessarily joins the
slow path at MNT_VNODE_FOREACH(), and MNT_VNODE_FOREACH() hides a
function call.

Bruce
Re: Packet loss every 30.999 seconds
David G Lawrence wrote:
>>>    In any case, it appears that my patch is a no-op, at least for the
>>> problem I was trying to solve. This has me confused, however, because
>>> at one point the problem was mitigated with it. The patch has gone
>>> through several iterations, however, and it could be that it was made
>>> to the top of the loop, before any of the checks, in a previous
>>> version. Hmmm.
>>
>> The patch should work fine. IIRC, it yields voluntarily so that other
>> things can run. I committed a similar hack for uiomove(). It was
>
>    It patches the bottom of the loop, which is only reached if the
> vnode is dirty. So it will only help if there are thousands of dirty
> vnodes. While that condition can certainly happen, it isn't the case
> that I'm particularly interested in.
>
>> CPUs, everything except interrupts has to wait for these syscalls. Now
>> the main problem is to figure out why PREEMPTION doesn't work. I'm
>> not working on this directly since I'm running ~5.2 where nearly-full
>> kernel preemption doesn't work due to Giant locking.
>
>    I don't understand how PREEMPTION is supposed to work (I mean to any
> significant detail), so I can't really comment on that.

It's really very simple. When you do a "wakeup" (or anything else that
puts a thread on a run queue), i.e. use setrunqueue(), then if that
thread has more priority than you do (and in the general case is an
interrupt thread), you immediately call mi_switch() so that it runs
immediately. You are guaranteed to run again when it finishes (you are
not just put back on the run queue at the end).

The critical_enter()/critical_exit() calls disable this from happening
to you if you really must not be interrupted by another thread.

There is an option where it is not just interrupt threads that can jump
in, but I think it's usually disabled.
Re: Packet loss every 30.999 seconds
On Wed, Dec 19, 2007 at 11:44:00AM -0800, Julian Elischer wrote:
> It's really very simple.
>
> When you do a "wakeup" (or anything else that puts a thread on a run
> queue), i.e. use setrunqueue(), then if that thread has more priority
> than you do (and in the general case is an interrupt thread), you
> immediately call mi_switch so that it runs immediately.
> You are guaranteed to run again when it finishes.
> (You are not just put back on the run queue at the end.)

As far as I see it, only the interrupt threads can put the kernel thread
off the CPU. Moreover, the thread being forced out shall be an "idle
user thread". See kern_switch.c, maybe_preempt(), the
#ifndef FULL_PREEMPTION block.
> The critical_enter()/critical_exit() calls disable this from happening
> to you if you really must not be interrupted by another thread.
>
> There is an option where it is not just interrupt threads that can jump
> in, but I think it's usually disabled.

Do you mean FULL_PREEMPTION?
Re: Packet loss every 30.999 seconds
On Wed, Dec 19, 2007 at 12:06:59PM -0500, Mark Fullmer wrote:
> Thanks for the other info on timer resolution, I overlooked
> clock_gettime().

If you have a UP system with a usable TSC (or equivalent) then using
rdtsc() (or equivalent) is a much cheaper way to measure short durations
with high resolution.

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed
behaviour.
Re: Packet loss every 30.999 seconds
Thanks, I'll test this later on today.

On Dec 19, 2007, at 1:11 PM, Kostik Belousov wrote:

>>    Unfortunately, the version of the patch that I sent out isn't going
>> to help your problem. It needs to yield at the top of the loop, but vp
>> isn't necessarily valid after the wakeup from the msleep. That's a
>> problem that I'm having trouble figuring out a solution to - the
>> solutions that come to mind will all significantly increase the
>> overhead of the loop.
>>    As a very inadequate work-around, you might consider lowering
>> kern.maxvnodes to something like 2 - that might be low enough to not
>> trigger the problem, but also be high enough to not significantly
>> affect system I/O performance.
>
> I think the following may be safe. It counts only the clean vnodes
> scanned, and does not evaluate vp (which indeed may be reclaimed) after
> the sleep.
>
> I never booted with the change.
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..e686b97 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
 	struct ufsmount *ump = VFSTOUFS(mp);
 	struct fs *fs;
 	int error, count, wait, lockreq, allerror = 0;
+	int yield_count;
 	int suspend;
 	int suspended;
 	int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
 	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
 	MNT_ILOCK(mp);
+	yield_count = 0;
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		/*
 		 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,11 @@ loop:
 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
 		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
 			VI_UNLOCK(vp);
+			if (yield_count++ == 500) {
+				yield_count = 0;
+				msleep(&yield_count, MNT_MTX(mp), PZERO,
+				    "ffspause", 1);
+			}
 			continue;
 		}
 		MNT_IUNLOCK(mp);
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071219 09:12] wrote:
> > > Try it with "find / -type f >/dev/null" to duplicate the problem
> > > almost instantly.
> >
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar
> > would trigger the problem. I'm working on getting a test running with
> > David's ffs_sync() workaround now, adding a few counters there should
> > get this narrowed down a little more.
>
>    Unfortunately, the version of the patch that I sent out isn't going
> to help your problem. It needs to yield at the top of the loop, but vp
> isn't necessarily valid after the wakeup from the msleep. That's a
> problem that I'm having trouble figuring out a solution to - the
> solutions that come to mind will all significantly increase the
> overhead of the loop.
>    As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to not
> trigger the problem, but also be high enough to not significantly
> affect system I/O performance.

I apologize for not reading the code as I am swamped, but a technique
that Matt Dillon used for bufs might work here.

Can you use a placeholder vnode as a place to restart the scan? You
might have to mark it special so that other threads/things
(getnewvnode()?) don't molest it, but it can provide for a convenient
restart point.

--
- Alfred Perlstein
Re: Packet loss every 30.999 seconds
> > I apologize for not reading the code as I am swamped, but a technique
> > that Matt Dillon used for bufs might work here.
> >
> > Can you use a placeholder vnode as a place to restart the scan? You
> > might have to mark it special so that other threads/things
> > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > restart point.

   That was one of the solutions that I considered and rejected since it
would significantly increase the overhead of the loop.
   The solution provided by Kostik Belousov that uses uio_yield looks
like a fine solution. I intend to try it out on some servers RSN.

-DG
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > Can you use a placeholder vnode as a place to restart the scan? You
> > might have to mark it special so that other threads/things
> > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > restart point.
>
>    That was one of the solutions that I considered and rejected since
> it would significantly increase the overhead of the loop.
>    The solution provided by Kostik Belousov that uses uio_yield looks
> like a fine solution. I intend to try it out on some servers RSN.

Out of curiosity's sake, why would it make the loop slower? One would
only add the placeholder when yielding, not for every iteration.

--
- Alfred Perlstein
RE: Packet loss every 30.999 seconds
I'm just an observer, and I may be confused, but it seems to me that this
is motion in the wrong direction (at least, it's not going to fix the
actual problem). As I understand the problem, once you reach a certain
point, the system slows down *every* 30.999 seconds. Now, it's possible
for the code to cause one slowdown as it cleans up, but why does it need
to clean up so much 31 seconds later?

Why not find/fix the actual bug? Then work on getting the yield right if
it turns out there's an actual problem for it to fix.

If the problem is that too much work is being done at a stretch and it
turns out this is because work is being done erroneously or needlessly,
fixing that should solve the whole problem. Doing the work that doesn't
need to be done more slowly is at best an ugly workaround.

Or am I misunderstanding?

DS
Re: Packet loss every 30.999 seconds
The uio_yield() idea did not work. Still have the same 31 second
interval packet loss.

Is it safe to assume the vp will be valid after a msleep() or
uio_yield()? If so, can we do something a little different?

Currently:

    /* this takes too long when list is large */
    MNT_VNODE_FOREACH(vp, mp, mvp) {
            do work
    }

Why not do this incrementally and call ffs_sync() more often, or break
it out into ffs_isync() (incremental sync):

    static struct vnode *vp;	/* first? */

    if (!vp)
            vp = __mnt_vnode_first(&mvp, mp);
    for (vcount = 0; vp && (vcount != 500); ++vcount) {
            do work
            vp = __mnt_vnode_next(&mvp, mp);
    }

The problem I see with this is a race condition where the list may
change between the incremental calls.

--
mark

On Dec 21, 2007, at 6:43 PM, David G Lawrence wrote:

> > > Unfortunately, the version of the patch that I sent out isn't going to
> > > help your problem. It needs to yield at the top of the loop, but vp
> > > isn't necessarily valid after the wakeup from the msleep. That's a
> > > problem that I'm having trouble figuring out a solution to - the
> > > solutions that come to mind will all significantly increase the
> > > overhead of the loop.
> >
> > I apologize for not reading the code as I am swamped, but a technique
> > that Matt Dillon used for bufs might work here.
> >
> > Can you use a placeholder vnode as a place to restart the scan?
> > You might have to mark it special so that other threads/things
> > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > restart point.
>
> That was one of the solutions that I considered and rejected since it
> would significantly increase the overhead of the loop.
> The solution provided by Kostik Belousov that uses uio_yield looks like
> a fine solution. I intend to try it out on some servers RSN.
>
> -DG
>
> David G. Lawrence
> President
> Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
> The FreeBSD Project - http://www.freebsd.org
> Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
> I'm just an observer, and I may be confused, but it seems to me that this
> is motion in the wrong direction (at least, it's not going to fix the
> actual problem). As I understand the problem, once you reach a certain
> point, the system slows down *every* 30.999 seconds. Now, it's possible
> for the code to cause one slowdown as it cleans up, but why does it need
> to clean up so much 31 seconds later?
>
> Why not find/fix the actual bug? Then work on getting the yield right if
> it turns out there's an actual problem for it to fix.
>
> If the problem is that too much work is being done at a stretch and it
> turns out this is because work is being done erroneously or needlessly,
> fixing that should solve the whole problem. Doing the work that doesn't
> need to be done more slowly is at best an ugly workaround.
>
> Or am I misunderstanding?

Yes, rewriting the syncer is the right solution. It probably cannot be
done quickly enough. If the yield workaround provides mitigation for now,
it should go in.
Re: Packet loss every 30.999 seconds
On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:
> The uio_yield() idea did not work. Still have the same 31 second
> interval packet loss.

What patch have you used?

Let's check whether the syncer is the culprit for you. Please change the
value of syncdelay in sys/kern/vfs_subr.c, around line 238, from 30 to
some other value, e.g., 45. After that, check the interval of the effect
you have observed.

It would be interesting to check whether completely disabling the syncer
eliminates the packet loss, but such a system has to be operated with
extreme caution.

> Is it safe to assume the vp will be valid after a msleep() or
> uio_yield()? If so can we do something a little different:

No.
Re: Packet loss every 30.999 seconds
On Fri, Dec 21, 2007 at 04:24:32PM -0800, Alfred Perlstein wrote:
> * David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > > > Unfortunately, the version of the patch that I sent out isn't going
> > > > to help your problem. It needs to yield at the top of the loop, but
> > > > vp isn't necessarily valid after the wakeup from the msleep. That's
> > > > a problem that I'm having trouble figuring out a solution to - the
> > > > solutions that come to mind will all significantly increase the
> > > > overhead of the loop.
> > >
> > > I apologize for not reading the code as I am swamped, but a technique
> > > that Matt Dillon used for bufs might work here.
> > >
> > > Can you use a placeholder vnode as a place to restart the scan?
> > > You might have to mark it special so that other threads/things
> > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > restart point.
> >
> > That was one of the solutions that I considered and rejected since it
> > would significantly increase the overhead of the loop.
> > The solution provided by Kostik Belousov that uses uio_yield looks like
> > a fine solution. I intend to try it out on some servers RSN.
>
> Out of curiosity's sake, why would it make the loop slower? One would
> only add the placeholder when yielding, not for every iteration.

The marker is already reinserted into the list on every iteration.
Re: Packet loss every 30.999 seconds
On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:

> On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:
> > The uio_yield() idea did not work. Still have the same 31 second
> > interval packet loss.
>
> What patch have you used?

This is hand applied from the diff you sent December 19, 2007 1:24:48 PM EST:

sr1400-ar0.eng:/usr/src/sys/ufs/ffs# diff -c ffs_vfsops.c ffs_vfsops.c.orig
*** ffs_vfsops.c	Fri Dec 21 21:08:39 2007
--- ffs_vfsops.c.orig	Sat Dec 22 00:51:22 2007
***************
*** 1107,1113 ****
  	struct ufsmount *ump = VFSTOUFS(mp);
  	struct fs *fs;
  	int error, count, wait, lockreq, allerror = 0;
- 	int yield_count;
  	int suspend;
  	int suspended;
  	int secondary_writes;
--- 1107,1112 ----
***************
*** 1148,1154 ****
  
  	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
  	MNT_ILOCK(mp);
- 	yield_count = 0;
  	MNT_VNODE_FOREACH(vp, mp, mvp) {
  		/*
  		 * Depend on the mntvnode_slock to keep things stable enough
--- 1147,1152 ----
***************
*** 1166,1177 ****
  		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
  		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
  			VI_UNLOCK(vp);
- 			if (yield_count++ == 100) {
- 				MNT_IUNLOCK(mp);
- 				yield_count = 0;
- 				uio_yield();
- 				goto relock_mp;
- 			}
  			continue;
  		}
  		MNT_IUNLOCK(mp);
--- 1164,1169 ----
***************
*** 1186,1192 ****
  		if ((error = ffs_syncvnode(vp, waitfor)) != 0)
  			allerror = error;
  		vput(vp);
- relock_mp:
  		MNT_ILOCK(mp);
  	}
  	MNT_IUNLOCK(mp);
--- 1178,1183 ----

> Let's check whether the syncer is the culprit for you.
> Please change the value of syncdelay in sys/kern/vfs_subr.c, around
> line 238, from 30 to some other value, e.g., 45. After that, check the
> interval of the effect you have observed.

Changed it to 13. Not sure if SYNCER_MAXDELAY would also need to be
increased if syncdelay were increased.

    static int syncdelay = 13;	/* max time to delay syncing data */

Test:

    ; use vnodes
    % find / -type f -print > /dev/null

    ; verify
    % sysctl vfs.numvnodes
    vfs.numvnodes: 32128

    ; run packet loss test
    now have periodic loss every 13994633us (13.99 seconds).

    ; reduce # of vnodes with sysctl kern.maxvnodes=1000
    test now runs clean.
> It would be interesting to check whether completely disabling the syncer
> eliminates the packet loss, but such a system has to be operated with
> extreme caution.
Re: Packet loss every 30.999 seconds
> > What patch have you used?
>
> This is hand applied from the diff you sent December 19, 2007 1:24:48
> PM EST

Mark, try the previous patch from Kostik - the one that does the one
tick msleep. I think you'll find that that one does work. The likely
problem with the second version is that uio_yield doesn't lower the
priority enough for the other threads to run. Forcing it to msleep for a
tick will eliminate the priority from the consideration.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
On Sat, Dec 22, 2007 at 01:28:31AM -0500, Mark Fullmer wrote:
> On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:
> > Let's check whether the syncer is the culprit for you.
> > Please change the value of syncdelay in sys/kern/vfs_subr.c, around
> > line 238, from 30 to some other value, e.g., 45. After that, check
> > the interval of the effect you have observed.
>
> Changed it to 13. Not sure if SYNCER_MAXDELAY would also need to be
> increased if syncdelay were increased.
>
>     static int syncdelay = 13;	/* max time to delay syncing data */
>
> Test:
>
>     ; use vnodes
>     % find / -type f -print > /dev/null
>
>     ; verify
>     % sysctl vfs.numvnodes
>     vfs.numvnodes: 32128
>
>     ; run packet loss test
>     now have periodic loss every 13994633us (13.99 seconds).
>
>     ; reduce # of vnodes with sysctl kern.maxvnodes=1000
>     test now runs clean.

Definitely the syncer.

> > It would be interesting to check whether completely disabling the
> > syncer eliminates the packet loss, but such a system has to be
> > operated with extreme caution.

Ok, no need to do this.

As Bruce Evans noted, there is a vfs_msync() that does almost the same
traversal of the vnodes. It was missed in the previous patch. Try this
one.
diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 3c2e1ed..6515d6a 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -2967,7 +2967,9 @@ vfs_msync(struct mount *mp, int flags)
 {
 	struct vnode *vp, *mvp;
 	struct vm_object *obj;
+	int yield_count;
 
+	yield_count = 0;
 	MNT_ILOCK(mp);
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		VI_LOCK(vp);
@@ -2996,6 +2998,12 @@ vfs_msync(struct mount *mp, int flags)
 			MNT_ILOCK(mp);
 		} else
 			VI_UNLOCK(vp);
+		if (yield_count++ == 500) {
+			MNT_IUNLOCK(mp);
+			yield_count = 0;
+			uio_yield();
+			MNT_ILOCK(mp);
+		}
 	}
 	MNT_IUNLOCK(mp);
 }
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..9e8b887 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1182,6 +1182,7 @@ ffs_sync(mp, waitfor, td)
 	int secondary_accwrites;
 	int softdep_deps;
 	int softdep_accdeps;
+	int yield_count;
 	struct bufobj *bo;
 
 	fs = ump->um_fs;
@@ -1216,6 +1217,7 @@ loop:
 
 	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
 	MNT_ILOCK(mp);
+	yield_count = 0;
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		/*
 		 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
 		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
 			VI_UNLOCK(vp);
+			if (yield_count++ == 500) {
+				MNT_IUNLOCK(mp);
+				yield_count = 0;
+				uio_yield();
+				MNT_ILOCK(mp);
+			}
 			continue;
 		}
 		MNT_IUNLOCK(mp);
Re: Packet loss every 30.999 seconds
> I'm just an observer, and I may be confused, but it seems to me that this
> is motion in the wrong direction (at least, it's not going to fix the
> actual problem). As I understand the problem, once you reach a certain
> point, the system slows down *every* 30.999 seconds. Now, it's possible
> for the code to cause one slowdown as it cleans up, but why does it need
> to clean up so much 31 seconds later?
>
> Why not find/fix the actual bug? Then work on getting the yield right if
> it turns out there's an actual problem for it to fix.
>
> If the problem is that too much work is being done at a stretch and it
> turns out this is because work is being done erroneously or needlessly,
> fixing that should solve the whole problem. Doing the work that doesn't
> need to be done more slowly is at best an ugly workaround.
>
> Or am I misunderstanding?

It's the syncer that is causing the problem, and it runs every 31
seconds. Historically, the syncer ran every 30 seconds, but things have
changed a bit over time.

The reason that the syncer takes so much time is that ffs_sync is a bit
stupid in how it works - it loops through all of the vnodes on each ffs
mountpoint (typically almost all of the vnodes in the system) to see if
any of them need to be synced out. This was marginally okay when there
were perhaps a thousand vnodes in the system, but when the maximum number
of vnodes was dramatically increased in FreeBSD some years ago (to
typically tens of thousands) and combined with the kernel threads of
FreeBSD 5, this has resulted in some rather bad side effects.

I think the proper solution would be to create an ffs_sync work list
(another TAILQ/LISTQ), probably with the head in the mountpoint struct,
that has on it any vnodes that need to be synced. Unfortunately, such a
change would be extensive, scattered throughout much of the ufs/ffs code.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
> As Bruce Evans noted, there is a vfs_msync() that does almost the same
> traversal of the vnodes. It was missed in the previous patch. Try this
> one.

I forgot to comment on that when Bruce pointed that out. My solution has
been to comment out the call to vfs_msync. :-) It comes into play when
you have files modified through the mmap interface (kind of rare on most
systems). Obviously I have mixed feelings about vfs_msync, but I'm not
suggesting here that we should get rid of it as any sort of solution.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
> > > Can you use a placeholder vnode as a place to restart the scan?
> > > You might have to mark it special so that other threads/things
> > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > restart point.
> >
> > That was one of the solutions that I considered and rejected since it
> > would significantly increase the overhead of the loop.
> > The solution provided by Kostik Belousov that uses uio_yield looks like
> > a fine solution. I intend to try it out on some servers RSN.
>
> Out of curiosity's sake, why would it make the loop slower? One would
> only add the placeholder when yielding, not for every iteration.

Actually, I misread your suggestion and was thinking marker flag, rather
than placeholder vnode. Sorry about that. The current code actually
already uses a marker vnode. It is hidden and obfuscated in the
MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next
functions, so it should be safe from vnode reclamation/free problems.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
> > I'm just an observer, and I may be confused, but it seems to me that
> > this is motion in the wrong direction (at least, it's not going to fix
> > the actual problem). As I understand the problem, once you reach a
> > certain point, the system slows down *every* 30.999 seconds. Now, it's
> > possible for the code to cause one slowdown as it cleans up, but why
> > does it need to clean up so much 31 seconds later?

It is just searching for things to clean up, and doing this pessimally
due to unnecessary cache misses and (more recently) introduction of
overheads for handling the case where the mount point is locked into the
fast path where the mount point is not unlocked. The search every 30
seconds or so is probably more efficient, and is certainly simpler, than
managing the list on every change to every vnode for every file system.
However, it gives a high latency in non-preemptible kernels.

> > Why not find/fix the actual bug? Then work on getting the yield right
> > if it turns out there's an actual problem for it to fix.

Yielding is probably the correct fix for non-preemptible kernels. Some
operations just take a long time, but are low priority so they can be
preempted. This operation is partly under user control, since any user
can call sync(2) and thus generate the latency at will. But this is no
worse than a user generating even larger blocks of latency by reading
huge amounts from /dev/zero. My old latency workaround for the latter
(and other huge i/o's) is still sort of necessary, though it now works
bogusly (hogticks doesn't work since it is reset on context switches to
interrupt handlers; however, any context switch mostly fixes the
problem).
My old latency workaround only reduces the latency to a multiple of
1/HZ, so a default of 200 ms, so it is still supposed to allow latencies
much larger than the ones that cause problems here, but its bogus
current operation tends to give latencies of more like 1/HZ, which is
short enough when HZ has its default misconfiguration to 1000.

I still don't understand the original problem - that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things). Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time to
avoid packet loss.

> > If the problem is that too much work is being done at a stretch and it
> > turns out this is because work is being done erroneously or
> > needlessly, fixing that should solve the whole problem. Doing the work
> > that doesn't need to be done more slowly is at best an ugly
> > workaround.

Lots of necessary work is being done.

> Yes, rewriting the syncer is the right solution. It probably cannot be
> done quickly enough. If the yield workaround provides mitigation for
> now, it should go in.

I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution. Better scheduling would probably take more CPU
and increase the problem.

Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places. There would also be
more places if MNT_RELOAD support were not missing for some file
systems.

Bruce
Re: Packet loss every 30.999 seconds
On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:

> I still don't understand the original problem - that the kernel is not
> even preemptible enough for network interrupts to work (except in 5.2
> where Giant breaks things). Perhaps I misread the problem, and it is
> actually that networking works but userland is unable to run in time to
> avoid packet loss.

The test is done with UDP packets between two servers. The em driver is
incrementing the received packet count correctly, but the packet is not
making it up the network stack. If the application were not servicing
the socket fast enough, I would expect to see the "dropped due to full
socket buffers" (udps_fullsock) counter incrementing, as shown by
netstat -s.

I grab a copy of netstat -s, netstat -i, and netstat -m before and after
testing. Other than the link packets counter, I haven't seen any other
indication of where the packet is getting lost. The em driver has a
debugging stats option which does not indicate receive side overflows.
I'm fairly certain this same behavior can be seen with the fxp driver,
but I'll need to double check.

These are results I sent a few days ago after setting up a test without
an ethernet switch between the sender and receiver. The switch was
originally used to verify the sender was actually transmitting. With
spanning tree, ethernet keepalives, and CDP (Cisco proprietary neighbor
protocol) disabled, and static ARP entries on the sender and receiver, I
can account for all packets making it to the receiver.

## Back-to-back test with no ethernet switch between two em interfaces,
same result. The receiving side has been up > 1 day and exhibits the
problem. These are also two different servers. The small gettimeofday()
syscall tester also shows the same ~30 second pattern of high latency
between syscalls.
Receiver test application reports 3699 missed packets.

Sender netstat -i:

(before test)
em1  1500  00:04:23:cf:51:b7  20  0  15975785  0  0
em1  1500  10.1/24  10.1.0.237  -  15975801  -  -

(after test)
em1  1500  00:04:23:cf:51:b7  22  0  25975822  0  0
em1  1500  10.1/24  10.1.0.239  -  25975838  -  -

total IP packets sent during test = end - start:
25975838 - 15975801 = 10000037 (expected: 10,000,000 packet test + overhead)

Receiver netstat -i:

(before test)
em1  1500  00:04:23:c4:cc:89  15975785  0  21  0  0
em1  1500  10.1/24  10.1.0.1  15969626  -  19  -  -

(after test)
em1  1500  00:04:23:c4:cc:89  25975822  0  23  0  0
em1  1500  10.1/24  10.1.0.1  25965964  -  21  -  -

total ethernet frames received during test = end - start:
25975822 - 15975785 = 10000037 (as expected)

total IP packets processed during test = end - start:
25965964 - 15969626 = 9996338 (expecting 10000037)

Missed packets = expected - received:
10000037 - 9996338 = 3699

netstat -i accounts for the 3699 missed packets also reported by the
application.

Looking closer at the tester output again shows the periodic ~30 second
windows of packet loss. There's a second problem here in that packets
are just disappearing before they make it to ip_input(), or there's a
dropped packets counter I've not found yet.

I can provide remote access to anyone who wants to take a look; this is
very easy to duplicate. The ~1 day uptime before the behavior surfaces
is not making this easy to isolate.
Re: Packet loss every 30.999 seconds
This appears to work. No packet loss with vfs.numvnodes at 32132, 16K
PPS test with 1 million packets. I'll run some additional tests bringing
vfs.numvnodes closer to kern.maxvnodes.

On Dec 22, 2007, at 2:03 AM, Kostik Belousov wrote:

> As Bruce Evans noted, there is a vfs_msync() that does almost the same
> traversal of the vnodes. It was missed in the previous patch. Try this
> one.

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 3c2e1ed..6515d6a 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -2967,7 +2967,9 @@ vfs_msync(struct mount *mp, int flags)
 {
 	struct vnode *vp, *mvp;
 	struct vm_object *obj;
+	int yield_count;
 
+	yield_count = 0;
 	MNT_ILOCK(mp);
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		VI_LOCK(vp);
@@ -2996,6 +2998,12 @@ vfs_msync(struct mount *mp, int flags)
 			MNT_ILOCK(mp);
 		} else
 			VI_UNLOCK(vp);
+		if (yield_count++ == 500) {
+			MNT_IUNLOCK(mp);
+			yield_count = 0;
+			uio_yield();
+			MNT_ILOCK(mp);
+		}
 	}
 	MNT_IUNLOCK(mp);
 }
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..9e8b887 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1182,6 +1182,7 @@ ffs_sync(mp, waitfor, td)
 	int secondary_accwrites;
 	int softdep_deps;
 	int softdep_accdeps;
+	int yield_count;
 	struct bufobj *bo;
 
 	fs = ump->um_fs;
@@ -1216,6 +1217,7 @@ loop:
 
 	softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
 	MNT_ILOCK(mp);
+	yield_count = 0;
 	MNT_VNODE_FOREACH(vp, mp, mvp) {
 		/*
 		 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
 		    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
 		    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
 			VI_UNLOCK(vp);
+			if (yield_count++ == 500) {
+				MNT_IUNLOCK(mp);
+				yield_count = 0;
+				uio_yield();
+				MNT_ILOCK(mp);
+			}
 			continue;
 		}
 		MNT_IUNLOCK(mp);
Re: Packet loss every 30.999 seconds
On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> > Yes, rewriting the syncer is the right solution. It probably cannot be
> > done quickly enough. If the yield workaround provides mitigation for
> > now, it should go in.
>
> I don't think rewriting the syncer just for this is the right solution.
> Rewriting the syncer so that it schedules actual i/o more efficiently
> might involve a solution. Better scheduling would probably take more
> CPU and increase the problem.

I think that we can easily predict what vnode(s) become dirty at the
places where we do vn_start_write().

> Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
> needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
> There are 4 places in vfs and 13 places in 6 file systems:
>
> % ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(xvp, mp, mvp) {
> % ./ufs/ffs/ffs_snapshot.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ffs/ffs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./fs/msdosfs/msdosfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./fs/coda/coda_subr.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./gnu/fs/ext2fs/ext2_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_default.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_subr.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./nfs4client/nfs4_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./nfsclient/nfs_subs.c:	MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./nfsclient/nfs_vfsops.c:	MNT_VNODE_FOREACH(vp, mp, mvp) {
>
> Only file systems that support writing need it (for VOP_SYNC() and for
> MNT_RELOAD), else there would be many more places. There would also be
> more places if MNT_RELOAD support were not missing for some file
> systems.

Ok, since you talked about this first :). I already made the following
patch, but have not published it since I still have not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping the mount
interlock. It should be safe, but better to check. Also, I postponed the
check until it was reported that yielding does solve the original
problem.

diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
 
 	mtx_assert(MNT_MTX(mp), MA_OWNED);
 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+	if ((*mvp)->v_yield++ == 500) {
+		MNT_IUNLOCK(mp);
+		(*mvp)->v_yield = 0;
+		uio_yield();
+		MNT_ILOCK(mp);
+	}
 	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
 	while (vp != NULL && vp->v_type == VMARKER)
 		vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
 		struct socket *vu_socket;	/* v unix domain net (VSOCK) */
 		struct cdev *vu_cdev;		/* v device (VCHR, VBLK) */
 		struct fifoinfo *vu_fifoinfo;	/* v fifo (VFIFO) */
+		int vu_yield;			/* yield count (VMARKER) */
 	} v_un;
 
 /*
@@ -185,6 +186,7 @@ struct vnode {
 #define	v_socket	v_un.vu_socket
 #define	v_rdev		v_un.vu_cdev
 #define	v_fifoinfo	v_un.vu_fifoinfo
+#define	v_yield		v_un.vu_yield
 
 /* XXX: These are temporary to avoid a source sweep at this time */
 #define	v_object	v_bufobj.bo_object
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Kostik Belousov wrote:

> On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
> > On Sat, 22 Dec 2007, Kostik Belousov wrote:
> > > Yes, rewriting the syncer is the right solution. It probably cannot
> > > be done quickly enough. If the yield workaround provides mitigation
> > > for now, it should go in.
> >
> > I don't think rewriting the syncer just for this is the right
> > solution. Rewriting the syncer so that it schedules actual i/o more
> > efficiently might involve a solution. Better scheduling would probably
> > take more CPU and increase the problem.
>
> I think that we can easily predict what vnode(s) become dirty at the
> places where we do vn_start_write().

This works for writes to regular files at most. There are also reads
(for ffs, these set IN_ATIME unless the file system is mounted with
noatime) and directory operations. By grepping for IN_CHANGE, I get 78
places in ffs alone where dirtying of the inode occurs or is scheduled
to occur (ffs = /sys/ufs). The efficiency of "marking" timestamps,
especially for atimes, depends on just setting a flag in normal
operation and picking up coalesced settings of the flag later, often at
sync time by scanning all vnodes.

> > Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
> > needed in 17 places if it isn't done internally in
> > MNT_VNODE_FOREACH(). There are 4 places in vfs and 13 places in 6 file
> > systems:
> > ...
> > Only file systems that support writing need it (for VOP_SYNC() and for
> > MNT_RELOAD), else there would be many more places. There would also be
> > more places if MNT_RELOAD support were not missing for some file
> > systems.
>
> Ok, since you talked about this first :). I already made the following
> patch, but have not published it since I still have not inspected all
> callers of MNT_VNODE_FOREACH() for safety of dropping the mount
> interlock. It should be safe, but better to check. Also, I postponed the
> check until it was reported that yielding does solve the original
> problem.

Good. I'd still like to unobfuscate the function call.
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
 	mtx_assert(MNT_MTX(mp), MA_OWNED);
 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+	if ((*mvp)->v_yield++ == 500) {
+		MNT_IUNLOCK(mp);
+		(*mvp)->v_yield = 0;
+		uio_yield();

Another unobfuscation is to not name this uio_yield().

+		MNT_ILOCK(mp);
+	}
 	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
 	while (vp != NULL && vp->v_type == VMARKER)
 		vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
 		struct socket *vu_socket;	/* v unix domain net (VSOCK) */
 		struct cdev *vu_cdev;		/* v device (VCHR, VBLK) */
 		struct fifoinfo *vu_fifoinfo;	/* v fifo (VFIFO) */
+		int vu_yield;			/* yield count (VMARKER) */
 	} v_un;

 /*

Putting the count in the union seems fragile at best. Even if nothing can access the marker vnode, you need to context-switch its old contents while using it for the count, in case its old contents is used. Vnode-printing routines might still be confused.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
Re: Packet loss every 30.999 seconds
* David G Lawrence <[EMAIL PROTECTED]> [071221 23:31] wrote:
> > > > Can you use a placeholder vnode as a place to restart the scan?
> > > > You might have to mark it special so that other threads/things
> > > > (getnewvnode()?) don't molest it, but it can provide for a
> > > > convenient restart point.
> > >
> > > That was one of the solutions that I considered and rejected since
> > > it would significantly increase the overhead of the loop.
> > > The solution provided by Kostik Belousov that uses uio_yield looks
> > > like a fine solution. I intend to try it out on some servers RSN.
> >
> > Out of curiosity's sake, why would it make the loop slower? One
> > would only add the placeholder when yielding, not for every iteration.
>
> Actually, I misread your suggestion and was thinking marker flag,
> rather than placeholder vnode. Sorry about that. The current code
> actually already uses a marker vnode. It is hidden and obfuscated in
> the MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next
> functions, so it should be safe from vnode reclamation/free problems.

That level of obscuring is a bit worrisome. Yes, I did mean placeholder vnode. Even so, is it of utility or not? Or is it already being used and I'm missing something and should just "utsl" at this point?

--
- Alfred Perlstein
Re: Packet loss every 30.999 seconds
On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> > Ok, since you talked about this first :). I already made the following
> > patch, but did not publish it since I have not yet inspected all
> > callers of MNT_VNODE_FOREACH() for safety of dropping the mount
> > interlock. It shall be safe, but better to check. Also, I postponed
> > the check until it was reported that yielding does solve the original
> > problem.
>
> Good. I'd still like to unobfuscate the function call.

What do you mean there ?

> Putting the count in the union seems fragile at best. Even if nothing
> can access the marker vnode, you need to context-switch its old
> contents while using it for the count, in case its old contents is
> used. Vnode-printing routines might still be confused.

Could you, please, describe what you mean by "context-switch" for the VMARKER ?

Mark, could you, please, retest the patch below in your setup ? I want to put a change, or some edition of it, into the 7.0 release, and we need to move fast to do this.
diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
 	mtx_assert(MNT_MTX(mp), MA_OWNED);
 	KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+	if ((*mvp)->v_yield++ == 500) {
+		MNT_IUNLOCK(mp);
+		(*mvp)->v_yield = 0;
+		uio_yield();
+		MNT_ILOCK(mp);
+	}
 	vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
 	while (vp != NULL && vp->v_type == VMARKER)
 		vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
 		struct socket *vu_socket;	/* v unix domain net (VSOCK) */
 		struct cdev *vu_cdev;		/* v device (VCHR, VBLK) */
 		struct fifoinfo *vu_fifoinfo;	/* v fifo (VFIFO) */
+		int vu_yield;			/* yield count (VMARKER) */
 	} v_un;

 /*
@@ -185,6 +186,7 @@ struct vnode {
 #define	v_socket	v_un.vu_socket
 #define	v_rdev		v_un.vu_cdev
 #define	v_fifoinfo	v_un.vu_fifoinfo
+#define	v_yield		v_un.vu_yield

 /* XXX: These are temporary to avoid a source sweep at this time */
 #define	v_object	v_bufobj.bo_object
Re: Packet loss every 30.999 seconds
On Mon, 24 Dec 2007, Kostik Belousov wrote:

> On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:
>> On Sat, 22 Dec 2007, Kostik Belousov wrote:
>>> Ok, since you talked about this first :). I already made the
>>> following patch, but did not publish it since I have not yet
>>> inspected all callers of MNT_VNODE_FOREACH() for safety of dropping
>>> the mount interlock. It shall be safe, but better to check. Also, I
>>> postponed the check until it was reported that yielding does solve
>>> the original problem.
>>
>> Good. I'd still like to unobfuscate the function call.
>
> What do you mean there ?

Make the loop control and overheads clear by making the function call explicit, maybe by expanding MNT_VNODE_FOREACH() inline after fixing the style bugs in it. Later, fix the code to match the comment again by not making a function call in the usual case. This is harder.

>> Putting the count in the union seems fragile at best. Even if nothing
>> can access the marker vnode, you need to context-switch its old
>> contents while using it for the count, in case its old contents is
>> used. Vnode-printing routines might still be confused.
>
> Could you, please, describe what you mean by "context-switch" for the
> VMARKER ?

Oh, I didn't notice that the marker vnode is out of band (a whole new vnode is malloced for each marker). The context switching would be needed if an ordinary active vnode that uses the union is used as a marker.

Bruce
Re: Packet loss every 30.999 seconds
On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:

> Mark, could you, please, retest the patch below in your setup ? I want
> to put a change, or some edition of it, into the 7.0 release, and we
> need to move fast to do this.

It's building now. The testing will run overnight.

Your patch to ffs_sync() and vfs_msync() stopped the periodic packet loss, but other file system activity such as (cd /; tar -cf - .) > /dev/null will cause dropped packets. Same behavior, packets never make it up to the IP layer.

--
mark
Re: Packet loss every 30.999 seconds
On Mon, Dec 24, 2007 at 08:16:50PM -0500, Mark Fullmer wrote:
> On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:
> > Mark, could you, please, retest the patch below in your setup ? I
> > want to put a change, or some edition of it, into the 7.0 release,
> > and we need to move fast to do this.
>
> It's building now. The testing will run overnight.
>
> Your patch to ffs_sync() and vfs_msync() stopped the periodic packet
> loss, but other file system activity such as (cd /; tar -cf - .) >
> /dev/null will cause dropped packets. Same behavior, packets never
> make it up to the IP layer.

What fs do you use ? If FFS, are softupdates turned on ? Please, show the total time spent in the softdepflush process. Also, try to add the FULL_PREEMPTION kernel config option and report whether it helps.
Re: Packet loss every 30.999 seconds
On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:

> What fs do you use ? If FFS, are softupdates turned on ? Please, show
> the total time spent in the softdepflush process. Also, try to add the
> FULL_PREEMPTION kernel config option and report whether it helps.

FFS with soft updates on all filesystems. With your latest uio_yield() in MNT_VNODE_FOREACH patch it's a little harder to provoke packet loss. Standard nightly crontabs and a tar -cf - / > /dev/null no longer cause drops. A make buildkernel will though.

root 38 0.0 0.0 0 8 ?? DL Mon08PM 0:04.62 [softdepflush]

Building a new kernel with KTR and FULL_PREEMPTION now.

--
mark
Re: Packet loss every 30.999 seconds
Mark Fullmer wrote:
> On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:
> > What fs do you use ? If FFS, are softupdates turned on ? Please,
> > show the total time spent in the softdepflush process. Also, try to
> > add the FULL_PREEMPTION kernel config option and report whether it
> > helps.
>
> FFS with soft updates on all filesystems. With your latest uio_yield()
> in MNT_VNODE_FOREACH patch it's a little harder to provoke packet
> loss. Standard nightly crontabs and a tar -cf - / > /dev/null no
> longer cause drops. A make buildkernel will though.
>
> root 38 0.0 0.0 0 8 ?? DL Mon08PM 0:04.62 [softdepflush]
>
> Building a new kernel with KTR and FULL_PREEMPTION now.

FYI, FULL_PREEMPTION causes performance loss in other situations.

Kris
Re: Packet loss every 30.999 seconds
On Sat, 22 Dec 2007, Mark Fullmer wrote:

> On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:
>> I still don't understand the original problem, that the kernel is not
>> even preemptible enough for network interrupts to work (except in 5.2
>> where Giant breaks things). Perhaps I misread the problem, and it is
>> actually that networking works but userland is unable to run in time
>> to avoid packet loss.
>
> The test is done with UDP packets between two servers. The em driver
> is incrementing the received packet count correctly but the packet is
> not making it up the network stack. If the application was not
> servicing the socket fast enough I would expect to see the "dropped
> due to full socket buffers" (udps_fullsock) counter incrementing, as
> shown by netstat -s.

I couldn't see any sign of PREEMPTION not working in 6.3-PRERELEASE. em seemed to keep up with the maximum rate that I can easily generate (640 kpps with tiny udp packets), though it cannot transmit at more than 400 kpps on the same hardware. This is without any syncer activity to cause glitches. The rest of the system couldn't keep up, and with my normal configuration of net.isr.direct=1, systat -ip (udps_fullsock) showed too many packets being dropped, but all the numbers seemed to add up right. (I didn't do end-to-end packet counts. I'm using ttcp to send and receive packets; the receiver loses so many packets that it rarely terminates properly, and when it does terminate it always shows many dropped.) However, with net.isr.direct=0, packets are dropped with no sign of the problem except a reduced count of good packets in systat -ip.

Packet rate counter     net.isr.direct=1    net.isr.direct=0
------------------------------------------------------------------
netstat -I              639042              643522 (faster later)
systat -ip (total rx)   639042              382567 (dropped many b4 here)
 (UDP total)            639042              382567
 (udps_fullsock)        298911              70340
 (diff of prev 2)       340031              312227 (300+k always dropped)
net.isr.count           small               large (seems to be correct 643k)
net.isr.directed        large (correct?)    no change
net.isr.queued          0                   0
net.isr.drop            0                   0

net.isr.direct=0 is apparently causing dropped packets without even counting them. However, the drop seems to be below the netisr level.

More worryingly, with full 1500-byte packets (1472 data + 28 UDP header), packets can be sent at a rate of 76 kpps (nearly 950 Mbps) with a load of only 80% on the receiver, yet the ttcp receiver still drops about 1000 pps due to "socket buffer full". With net.isr.direct=0 it drops an additional 700 pps due to this. Glitches from sync(2) taking 25 ms increase the loss by about 1000 packets, and using rtprio for the ttcp receiver doesn't seem to help at all.

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application. If packets were dropped they would show up
# with netstat -s as "dropped due to full socket buffers".
#
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops. There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't have an option for this). With the default kern.ipc.maxsockbuf of 256K, this didn't seem to help. 20MB should work better :-) but I didn't try that. I don't understand how fast the socket buffer fills up and would have thought that 256K was enough for tiny packets but not for 1500-byte packets. There seems to be a general problem that 1 Gbps NICs have or should have rings of size >= 256 or 512 so that they aren't forced to drop packets when their interrupt handler has a reasonable but large latency, yet if we actually use this feature then we flood the upper layers with hundreds of packets and fill up socket buffers etc. there.
Bruce
Re: Packet loss every 30.999 seconds
On Fri, 28 Dec 2007, Bruce Evans wrote:

> In previous mail, you (Mark) wrote:
>
> # With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
> # kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
> # in the application. If packets were dropped they would show up
> # with netstat -s as "dropped due to full socket buffers".
> #
> # Since the packet never makes it to ip_input() I no longer have
> # any way to count drops. There will always be corner cases where
> # interrupts are lost and drops not accounted for if the adapter
> # hardware can't report them, but right now I've got no way to
> # estimate any loss.
>
> I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that
> doesn't have an option for this). With the default kern.ipc.maxsockbuf
> of 256K, this didn't seem to help. 20MB should work better :-) but I
> didn't try that.

I've now tried this. With kern.ipc.maxsockbuf=2048 (~20MB) and an SO_RCVBUF of 0x1000000 (16MB), the "socket buffer full" lossage increases from ~300 kpps (~47%) to ~450 kpps (70%) with tiny packets. I think this is caused by most accesses to the larger buffer being cache misses (since the system can't keep up, cache misses make it worse). However, with 1500-byte packets, the larger buffer reduces the lossage from 1 kpps in 76 kpps to precisely zero pps, at a cost of only a small percentage of system overhead (~20% idle to ~18% idle).

The above is with net.isr.direct=1. With net.isr.direct=0, the loss is too small to be obvious and is reported as 0, but I don't trust the report. ttcp's packet counts indicate losses of a few per million with direct=0 but none with direct=1. "while :; do sync; sleep 0.1; done" in the background causes a loss of about 100 pps with direct=0 and a smaller loss with direct=1. Running the ttcp receiver at rtprio 0 doesn't make much difference to the losses.

Bruce
Re: Packet loss every 30.999 seconds
On Fri, 28 Dec 2007, Bruce Evans wrote:

> On Fri, 28 Dec 2007, Bruce Evans wrote:
>> In previous mail, you (Mark) wrote:
>>
>> # With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
>> # kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
>> # in the application. If packets were dropped they would show up
>> # with netstat -s as "dropped due to full socket buffers".
>> #
>> # Since the packet never makes it to ip_input() I no longer have
>> # any way to count drops. There will always be corner cases where
>> # interrupts are lost and drops not accounted for if the adapter
>> # hardware can't report them, but right now I've got no way to
>> # estimate any loss.

I found where drops are recorded for the net.isr.direct=0 case. It is in net.inet.ip.intr_queue_drops. The netisr subsystem just calls IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up. _IF_DROP(ifq) just increments ifq->ifq_drops. The usual case for netisrs is for the queue to be ipintrq for NETISR_IP. The following details don't help:

- drops for input queues don't seem to be displayed by any utilities (except ones for ipintrq are displayed primitively by sysctl net.inet.ip.intr_queue_drops). netstat and systat only display drops for send queues and ip frags.

- the netisr subsystem's drop count doesn't seem to be displayed by any utilities except sysctl. It only counts drops due to there not being a queue; other drops are counted by _IF_DROP() in the per-queue counter. Users have a hard time integrating all these primitively displayed drop counts with other error counters.

- the length of ipintrq defaults to the default ifq length of ipqmaxlen = IPQ_MAXLEN = 50. This is inadequate if there is just one NIC in the system with an rx ring size at or above roughly 50. But 1 Gbps NICs should have an rx ring size of 256 or 512 (I think the size is 256 for em; it is 256 for bge due to bogus configuration of hardware that can handle it being 512).
If the larger hardware rx ring is actually used, then ipintrq drops are almost ensured in the direct=0 case, so using the larger h/w ring is worse than useless (it also increases cache misses). This is for just one NIC. This problem is often limited by handling rx packets in small bursts, at a cost of extra overhead. Interrupt moderation increases it by increasing burst sizes.

This contrasts with the handling of send queues. Send queues are per-interface and most drivers increase the default length from 50 to their ring size (-1 for bogus reasons). I think this is only an optimization, while a similar change for rx queues is important for avoiding packet loss. For send queues, the ifq acts mainly as a primitive implementation of watermarks. I have found that tx queue lengths need to be more like 5000 than 50 or 500 to provide enough buffering when applications are delayed by other applications or just by sleeping until the next clock tick, and I use tx queues about two clock ticks long (at HZ = 100), but now think queue lengths should be restricted to more like 50 since long queues cannot fit in L2 caches (not to mention they are bad for latency).

The length of ipintrq can be changed using sysctl net.inet.ip.intr_queue_maxlen. Changing it from 50 to 1024 turns most or all ipintrq drops into "socket buffer full" drops (640 kpps input packets and 434 kpps socket buffer fulls with direct=0; 640 kpps input packets and 324 kpps socket buffer fulls with direct=1).

Bruce