Packet loss every 30.999 seconds

2007-12-16 Thread Mark Fullmer
While trying to diagnose a packet loss problem in a RELENG_6 snapshot
dated November 8, 2007, it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severely
degrade network performance every 30.99 seconds.

Packets appear to make it as far as ether_input() then get lost.

Test setup:

A - ethernet_switch - B

A sends UDP packets to B through an ethernet switch.  The interface
input packet count and output packet count on the switch match what A
is sending and B should be receiving.  A UDP receiver running on B
sees windows of packet loss with a period of 30.99 seconds.  The lost
packets are counted based on an incrementing sequence number.  On an
isolated network the Ipkts counter on B matches what A is
sending, but the packets never show up in any of the IP/UDP counters
or the program trying to receive them.

This behavior can be seen with both em and fxp interfaces.  The problem
only occurs after the receiving host has been up about a day; a reboot
clears it.  GENERIC kernel, nothing more than the default daemons
running.  The behavior has been seen on three different motherboards so far.

It also appears this is not just lost network interrupts.  Whatever
is spinning in the kernel also impacts syscall latency.  An easy way to
replicate what I'm seeing is to run gettimeofday() in a tight loop
and note when the wall-clock delay between successive calls exceeds some
value (which is dependent on processor speed).  As an example, on a 3.20GHz
CPU a small program will output when the syscall latency is > 5000 usecs.
Note the periodic behavior at 30.99 seconds.  These big jumps in
latency correspond to when packets are being dropped.
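
For reference, a minimal sketch of such a probe (a reconstruction, not the
original tester; the 5000 usec default threshold is the example value
mentioned above, and the output columns mirror the table below):

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Latency probe sketch: call gettimeofday() in a tight loop and report
 * whenever two consecutive calls are separated by more than a threshold
 * in usecs.  Columns: absolute time in usecs, observed latency, usecs
 * since the previous report.
 */
int
main(int argc, char **argv)
{
	long long t_now, t_prev, t_report = 0;
	long threshold = (argc > 1) ? atol(argv[1]) : 5000;
	struct timeval tv;

	gettimeofday(&tv, NULL);
	t_prev = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
	for (;;) {
		gettimeofday(&tv, NULL);
		t_now = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
		if (t_now - t_prev > threshold) {
			printf("%lld %lld %lld\n", t_now, t_now - t_prev,
			    t_report ? t_now - t_report : 0);
			fflush(stdout);
			t_report = t_now;
		}
		t_prev = t_now;
	}
}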

usecs (epoch)      latency     diff

1197861705805078 478199 0
1197861721012298 25926 15207220
1197861726332036 11729 5319738
1197861757331549 11691 30999513
1197861788331266 11878 30999717
1197861819330647 11708 30999381
1197861850330192 11698 30999545
1197861881329733 11667 30999541
1197861900018297 6516 18688564
1197861912329282 11684 12310985
1197861943328849 11699 30999567
1197861974328413 11692 30999564
1197862005328228 11916 30999815
1197862036327598 11684 30999370
1197862067327229 11680 30999631
1197862098326860 11667 30999631
1197862129326559 11704 30999699
1197862160326377 11844 30999818
1197862191325890 11674 30999513

(output from packet loss tester)

window_start/window_end are packet sequence counters
time_start/time_end are absolute times in usecs
window_diff is the number of packets missing
time_diff is the usec gap between the packets bracketing the loss

The test is run at about 15.5 Kpps / 132 Mbit/s, certainly a lot
less than this hardware is capable of when running BSD 4.X.
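
The receive-side loss detection can be sketched roughly as follows (my
reconstruction, not the original tester; it assumes each UDP payload begins
with a 32-bit sequence number in network byte order, the port number is
arbitrary, and packet reordering is ignored):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/time.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct sockaddr_in sin;
	struct timeval tv, prev_tv;
	char buf[1500];
	uint32_t seq, expect = 0;
	int s, have_prev = 0;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(9999);		/* arbitrary test port */
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1)
		err(1, "bind");

	for (;;) {
		if (recv(s, buf, sizeof(buf), 0) < (ssize_t)sizeof(seq))
			continue;
		gettimeofday(&tv, NULL);
		memcpy(&seq, buf, sizeof(seq));
		seq = ntohl(seq);
		/* Report a gap between the previous packet and this one. */
		if (have_prev && seq != expect)
			printf(":missing window_start=%u, time_start=%ld%06ld,"
			    " window_end=%u, time_end=%ld%06ld, window_diff=%u,"
			    " time_diff=%ld\n",
			    expect, (long)prev_tv.tv_sec, (long)prev_tv.tv_usec,
			    seq, (long)tv.tv_sec, (long)tv.tv_usec,
			    seq - expect,
			    (tv.tv_sec - prev_tv.tv_sec) * 1000000L +
			    (tv.tv_usec - prev_tv.tv_usec));
		expect = seq + 1;
		prev_tv = tv;
		have_prev = 1;
	}
}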


:missing window_start=311510, time_start=1197861726332008, window_end=311638, time_end=1197861726332011, window_diff=128, time_diff=3
:missing window_start=794482, time_start=1197861757331505, window_end=794609, time_end=1197861757331509, window_diff=127, time_diff=4
:missing window_start=1277313, time_start=1197861788331245, window_end=1277444, time_end=1197861788331249, window_diff=131, time_diff=4
:missing window_start=1760104, time_start=1197861819330625, window_end=1760232, time_end=1197861819330629, window_diff=128, time_diff=4
:missing window_start=2242789, time_start=1197861850330170, window_end=2242916, time_end=1197861850330174, window_diff=127, time_diff=4
:missing window_start=2725818, time_start=1197861881329712, window_end=2725946, time_end=1197861881329715, window_diff=128, time_diff=3
:missing window_start=3208594, time_start=1197861912329261, window_end=3208722, time_end=1197861912329264, window_diff=128, time_diff=3
:missing window_start=3691395, time_start=1197861943328802, window_end=3691522, time_end=1197861943328805, window_diff=127, time_diff=3
:missing window_start=4173793, time_start=1197861974328369, window_end=4173921, time_end=1197861974328373, window_diff=128, time_diff=4
:missing window_start=4656236, time_start=1197862005328176, window_end=4656367, time_end=1197862005328179, window_diff=131, time_diff=3
:missing window_start=5139197, time_start=1197862036327576, window_end=5139325, time_end=1197862036327580, window_diff=128, time_diff=4
:missing window_start=5621958, time_start=1197862067327208, window_end=5622085, time_end=1197862067327211, window_diff=127, time_diff=3
:missing window_start=6104597, time_start=1197862098326839, window_end=6104725, time_end=1197862098326843, window_diff=128, time_diff=4
:missing window_start=6587241, time_start=1197862129326514, window_end=6587369, time_end=1197862129326534, window_diff=128, time_diff=20
:missing window_start=7070051, time_start=1197862160326368, window_end=7070183, time_end=1197862160326371, window_diff=132, time_diff=3
:missing window_start=7552828, time_start=1197862191325873, window_end=7552954, time_end=1197862191325876, window_diff=126, time_diff=3
:missing window_start=8035434, time_start=119786325572, window_end=8035560, time_end=119786325576,


Re: Packet loss every 30.999 seconds

2007-12-16 Thread Jeremy Chadwick
On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
> While trying to diagnose a packet loss problem in a RELENG_6 snapshot dated
> November 8, 2007 it looks like I've stumbled across a broken driver or
> kernel routine which stops interrupt processing long enough to severely
> degrade network performance every 30.99 seconds.
>
> Packets appear to make it as far as ether_input() then get lost.

Are you sure this isn't being caused by something the switch is doing,
such as MAC/ARP cache clearing or LACP?  I'm just speculating, but it
would be worthwhile to remove the switch from the picture (crossover
cable to the rescue).

I know that at least in the case of fxp(4) and em(4), Jack Vogel does
some thorough testing of throughput using a professional/high-end packet
generator (some piece of hardware, I forget the name...)

-- 
| Jeremy Chadwick                       jdc at parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: Packet loss every 30.999 seconds

2007-12-16 Thread Mark Fullmer

I'm about 99% sure right now.   I'll set this up in a lab tomorrow
without an ethernet switch.  It takes about a day of uptime before
the problem shows up.

Sorry for the duplicate messages, I misread a bounce notification.

--
mark

On Dec 17, 2007, at 12:43 AM, Jeremy Chadwick wrote:


On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
While trying to diagnose a packet loss problem in a RELENG_6
snapshot dated November 8, 2007 it looks like I've stumbled across a broken
driver or kernel routine which stops interrupt processing long enough to
severely degrade network performance every 30.99 seconds.

Packets appear to make it as far as ether_input() then get lost.


Are you sure this isn't being caused by something the switch is doing,
such as MAC/ARP cache clearing or LACP?  I'm just speculating, but it
would be worthwhile to remove the switch from the picture (crossover
cable to the rescue).

I know that at least in the case of fxp(4) and em(4), Jack Vogel does
some thorough testing of throughput using a professional/high-end packet
generator (some piece of hardware, I forget the name...)







Re: Packet loss every 30.999 seconds

2007-12-17 Thread David G Lawrence
   One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

	if (vp->v_type == VNON || ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
		VI_UNLOCK(vp);
		continue;
	}


   ...like the i_flag flags aren't ever getting properly cleared (or bv_cnt
is always non-zero).

   ...but I don't have the time to chase this down.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-17 Thread David G Lawrence
> While trying to diagnose a packet loss problem in a RELENG_6 snapshot  
> dated
> November 8, 2007 it looks like I've stumbled across a broken driver or
> kernel routine which stops interrupt processing long enough to severely
> degrade network performance every 30.99 seconds.

   I noticed this as well some time ago. The problem has to do with the
processing (syncing) of vnodes. When the total number of allocated vnodes
in the system grows to tens of thousands, the ~31 second periodic sync
process takes a long time to run. Try this patch and let people know if
it helps your problem. It will periodically wait for one tick (1ms) every
500 vnodes of processing, which will allow other things to run.

Index: ufs/ffs/ffs_vfsops.c
===
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.290.2.16
diff -c -r1.290.2.16 ffs_vfsops.c
*** ufs/ffs/ffs_vfsops.c	9 Oct 2006 19:47:17 -	1.290.2.16
--- ufs/ffs/ffs_vfsops.c	25 Apr 2007 01:58:15 -
***************
*** 1109,1114 ****
--- 1109,1115 ----
int softdep_deps;
int softdep_accdeps;
struct bufobj *bo;
+   int flushed_count = 0;
  
fs = ump->um_fs;
if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {/* XXX */
***************
*** 1174,1179 ****
--- 1175,1184 ----
allerror = error;
vput(vp);
MNT_ILOCK(mp);
+   if (flushed_count++ > 500) {
+   flushed_count = 0;
+   msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
+   }
}
MNT_IUNLOCK(mp);
/*

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-17 Thread Mark Fullmer

Back to back test with no ethernet switch between two em interfaces,
same result.  The receiving side has been up > 1 day and exhibits
the problem.  These are also two different servers.  The small
gettimeofday() syscall tester also shows the same ~30
second pattern of high latency between syscalls.

Receiver test application reports 3699 missed packets

Sender netstat -i:

(before test)
em1   1500             00:04:23:cf:51:b7        20     0  15975785     0     0
em1   1500  10.1/24    10.1.0.237               -  15975801            -     -

(after test)
em1   1500             00:04:23:cf:51:b7        22     0  25975822     0     0
em1   1500  10.1/24    10.1.0.239               -  25975838            -     -


total IP packets sent during test = end - start
25975838-15975801 = 10000037 (expected: 10,000,000 test packets + overhead)


Receiver netstat -i:

(before test)
em1   1500             00:04:23:c4:cc:89  15975785     0        21     0     0
em1   1500  10.1/24    10.1.0.1            15969626     -        19     -     -

(after test)
em1   1500             00:04:23:c4:cc:89  25975822     0        23     0     0
em1   1500  10.1/24    10.1.0.1            25965964     -        21     -     -


total ethernet frames received during test = end - start
25975822-15975785 = 10000037 (as expected)

total IP packets processed during test = end - start
25965964-15969626 = 9996338 (expecting 10000037)

Missed packets = expected - received
10000037-9996338 = 3699

netstat -i accounts for the 3699 missed packets also reported by the
application

Looking closer at the tester output again shows the periodic
~30 second windows of packet loss.

There's a second problem here in that packets are just disappearing
before they make it to ip_input(), or there's a dropped packets
counter I've not found yet.

I can provide remote access to anyone who wants to take a look; this
is very easy to duplicate.  The ~1 day of uptime needed before the behavior
surfaces is not making this easy to isolate.

--
mark

On Dec 17, 2007, at 12:43 AM, Jeremy Chadwick wrote:


On Mon, Dec 17, 2007 at 12:21:43AM -0500, Mark Fullmer wrote:
While trying to diagnose a packet loss problem in a RELENG_6
snapshot dated November 8, 2007 it looks like I've stumbled across a broken
driver or kernel routine which stops interrupt processing long enough to
severely degrade network performance every 30.99 seconds.

Packets appear to make it as far as ether_input() then get lost.


Are you sure this isn't being caused by something the switch is doing,
such as MAC/ARP cache clearing or LACP?  I'm just speculating, but it
would be worthwhile to remove the switch from the picture (crossover
cable to the rescue).

I know that at least in the case of fxp(4) and em(4), Jack Vogel does
some thorough testing of throughput using a professional/high-end packet
generator (some piece of hardware, I forget the name...)







Re: Packet loss every 30.999 seconds

2007-12-17 Thread Mark Fullmer
Thanks.  Have a kernel building now.  It takes about a day of uptime  
after reboot before I'll see the problem.


--
mark

On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote:


While trying to diagnose a packet loss problem in a RELENG_6 snapshot
dated November 8, 2007 it looks like I've stumbled across a broken
driver or kernel routine which stops interrupt processing long enough to
severely degrade network performance every 30.99 seconds.


   I noticed this as well some time ago. The problem has to do with the
processing (syncing) of vnodes. When the total number of allocated vnodes
in the system grows to tens of thousands, the ~31 second periodic sync
process takes a long time to run. Try this patch and let people know if
it helps your problem. It will periodically wait for one tick (1ms) every
500 vnodes of processing, which will allow other things to run.

Index: ufs/ffs/ffs_vfsops.c
===
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.290.2.16
diff -c -r1.290.2.16 ffs_vfsops.c
*** ufs/ffs/ffs_vfsops.c9 Oct 2006 19:47:17 -   1.290.2.16
--- ufs/ffs/ffs_vfsops.c25 Apr 2007 01:58:15 -
***
*** 1109,1114 
--- 1109,1115 
int softdep_deps;
int softdep_accdeps;
struct bufobj *bo;
+   int flushed_count = 0;

fs = ump->um_fs;
if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {  /* XXX */
***
*** 1174,1179 
--- 1175,1184 
allerror = error;
vput(vp);
MNT_ILOCK(mp);
+   if (flushed_count++ > 500) {
+   flushed_count = 0;
+   msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
+   }
}
MNT_IUNLOCK(mp);
/*

-DG





Re: Packet loss every 30.999 seconds

2007-12-17 Thread Bruce Evans

On Mon, 17 Dec 2007, David G Lawrence wrote:


  One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

   if (vp->v_type == VNON || ((ip->i_flag &
   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
   VI_UNLOCK(vp);
   continue;
   }


Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on
a 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500 vnodes,
which would be explained by the above (and the VI_LOCK() and loop
overhead) taking 171 ns per vnode.  I would expect it to take more like
20 ns per vnode for UP and 60 for SMP.

The comment before this code shows that the problem is known, and says
that a subroutine call cannot be afforded unless there is work to do,
but the locking accesses look like subroutine calls, have subroutine
calls in their internals, and take longer than simple subroutine calls
in the SMP case even when they don't make subroutine calls.  (IIRC, on
A64 a minimal subroutine call takes 4 cycles while a minimal locked
instruction takes 18 cycles; subroutine calls are only slow when their
branches are mispredicted.)

Bruce


Re: Packet loss every 30.999 seconds

2007-12-17 Thread Scott Long

Bruce Evans wrote:

On Mon, 17 Dec 2007, David G Lawrence wrote:


  One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

   if (vp->v_type == VNON || ((ip->i_flag &
   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
   VI_UNLOCK(vp);
   continue;
   }


Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on
a 2.2GHz A64 UP in 32-bit mode, I see a latency of 3 ms for 17500 vnodes,
which would be explained by the above (and the VI_LOCK() and loop
overhead) taking 171 ns per vnode.  I would expect it to take more like
20 ns per vnode for UP and 60 for SMP.

The comment before this code shows that the problem is known, and says
that a subroutine call cannot be afforded unless there is work to do,
but the locking accesses look like subroutine calls, have subroutine
calls in their internals, and take longer than simple subroutine calls
in the SMP case even when they don't make subroutine calls.  (IIRC, on
A64 a minimal subroutine call takes 4 cycles while a minimal locked
instruction takes 18 cycles; subroutine calls are only slow when their
branches are mispredicted.)

Bruce


Right, it's a non-optimal loop when N is very large, and that's a fairly
well understood problem.  I think what DG was getting at, though, is
that this massive flush happens every time the syncer runs, which
doesn't seem correct.  Sure, maybe you just rsynced 100,000 files 20
seconds ago, so the upcoming flush is going to be expensive.  But the
next flush 30 seconds after that shouldn't be just as expensive, yet it
appears to be so.  This is further supported by the original poster's
claim that it takes many hours of uptime before the problem becomes
noticeable.  If vnodes are never truly getting cleaned, or never getting
their flags cleared so that this loop knows that they are clean, then
it's feasible that they'll accumulate over time, keep on getting flushed
every 30 seconds, keep on bogging down the loop, and so on.

Scott


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, David G Lawrence wrote:


While trying to diagnose a packet loss problem in a RELENG_6 snapshot
dated
November 8, 2007 it looks like I've stumbled across a broken driver or
kernel routine which stops interrupt processing long enough to severely
degrade network performance every 30.99 seconds.


I see the same behaviour under a heavily modified version of FreeBSD-5.2
(except the period was 2 ms longer and the latency was 7 ms instead
of 11 ms when numvnodes was at a certain value).  Now with numvnodes =
17500, the latency is 3 ms.


  I noticed this as well some time ago. The problem has to do with the
processing (syncing) of vnodes. When the total number of allocated vnodes
in the system grows to tens of thousands, the ~31 second periodic sync
process takes a long time to run. Try this patch and let people know if
it helps your problem. It will periodically wait for one tick (1ms) every
500 vnodes of processing, which will allow other things to run.


However, the syncer should be running at a relatively low priority and not
cause packet loss.  I don't see any packet loss even in ~5.2 where the
network stack (but not drivers) is still Giant-locked.

Other too-high latencies showed up:
- syscons LED setting and vt switching gives a latency of 5.5 msec because
  syscons still uses busy-waiting for setting LEDs :-(.  Oops, I do see
  packet loss -- this causes it under ~5.2 but not under -current.  For
  the bge and/or em drivers, the packet loss shows up in netstat output
  as a few hundred errors for every LED setting on the receiving machine,
  while receiving tiny packets at the maximum possible rate of 640 kpps.
  sysctl is completely Giant-locked and so are upper layers of the
  network stack.  The bge hardware rx ring size is 256 in -current and
  512 in ~5.2.  At 640 kpps, 512 packets take 800 us so bge wants to
  call the upper layers with a latency of far below 800 us.  I
  don't know exactly where the upper layers block on Giant.
- a user CPU hog process gives a latency of over 200 ms every half a
  second or so when the hog starts up, and 300-400 ms after the
  hog has been running for some time.  Two user CPU hog processes
  double the latency.  Reducing kern.sched.quantum from 100 ms to 10
  ms and/or renicing the hogs doesn't seem to affect this.  Running the
  hogs at idle priority fixes this.  This won't affect packet loss,
  but it might affect user network processes -- they might need to
  run at real time priority to get low enough latency (see the sketch
  after this list).  They might need
  to do this anyway -- a scheduling quantum of 100 ms should give a
  latency of 100 ms per CPU hog quite often, though not usually since
  the hogs should never be preferred to a higher-priority process.
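
A minimal sketch of requesting real-time priority from a user process on
FreeBSD via rtprio(2) (illustrative; the priority value, the lack of a
fallback, and the surrounding program are my own choices):

#include <sys/types.h>
#include <sys/rtprio.h>
#include <err.h>

/*
 * Illustrative only: put the calling process in the real-time priority
 * class so normal-priority CPU hogs cannot delay it.  Typically needs
 * root.  The priority value 0 (highest) is an arbitrary choice here.
 */
int
main(void)
{
	struct rtprio rtp;

	rtp.type = RTP_PRIO_REALTIME;
	rtp.prio = 0;
	if (rtprio(RTP_SET, 0, &rtp) == -1)
		err(1, "rtprio");

	/* ... latency-sensitive work (e.g. a packet-receive loop) here ... */
	return (0);
}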

Previously I've used a less specialized clock-watching program to
determine the syscall latency.  It showed similar problems for CPU
hogs.  I just remembered that I found the fix for these under ~5.2 --
remove a local hack that sacrifices latency for reduced context
switches between user threads.  -current with SCHED_4BSD does this
non-hackishly, but seems to have a bug somewhere that gives a latency
that is large enough to be noticeable in interactive programs.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, Mark Fullmer wrote:

Thanks.  Have a kernel building now.  It takes about a day of uptime after 
reboot before I'll see the problem.


Yes, run "find / >/dev/null" to see the problem if it is the syncer one.

At least the syscall latency problem does seem to be this.  Under ~5.2,
with the above find and also "while :; do sync; done" (to give latency
spikes more often), your program (with some fflush(stdout)'s and args
1 7700) gives:

% 1197976029041677 12696 0
% 1197976033196396 9761 4154719
% 1197976034060031 13360 863635
% 1197976039080632 13749 5020601
% 1197976043195594 8536 4114962
% 1197976044100601 13505 905007
% 1197976049121870 14562 5021269
% 1197976052195631 8192 3073761
% 1197976054141545 14024 1945914
% 1197976059162357 14623 5020812
% 1197976063195735 7830 4033378
% 1197976064182564 14618 986829
% 1197976069202982 14823 5020418
% 1197976074223722 15350 5020740
% 1197976079244311 15726 5020589
% 1197976084264690 15893 5020379
% 1197976089289409 15058 5024719
% 1197976094315433 16209 5026024
% 1197976095197277 8015 881844
% 1197976099335529 16092 4138252
% 1197976104356513 16863 5020984
% 1197976109376236 16373 5019723
% 1197976114396803 16727 5020567
% 1197976119416822 16533 5020019
% 1197976124437790 17288 5020968
% 1197976126200637 10060 1762847
% 1197976127198459 7839 997822
% 1197976129457321 16606 2258862
% 1197976134477582 16654 5020261

This clearly shows the spike every 5 seconds, and the latency creeping
up as vfs.numvnodes increases.  It started at about 2 and ended at
about 64000.

The syncer won't be fixed soon, so the fix for dropped packets requires
figuring out why the syncer affects networking.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Mon, 17 Dec 2007, Scott Long wrote:


Bruce Evans wrote:

On Mon, 17 Dec 2007, David G Lawrence wrote:


  One more comment on my last email... The patch that I included is not
meant as a real fix - it is just a bandaid. The real problem appears to
be that a very large number of vnodes (all of them?) are getting synced
(i.e. calling ffs_syncvnode()) every time. This should normally only
happen for dirty vnodes. I suspect that something is broken with this
check:

   if (vp->v_type == VNON || ((ip->i_flag &
   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
   VI_UNLOCK(vp);
   continue;
   }


Isn't it just the O(N) algorithm with N quite large?  Under ~5.2, on



Right, it's a non-optimal loop when N is very large, and that's a fairly
well understood problem.  I think what DG was getting at, though, is
that this massive flush happens every time the syncer runs, which
doesn't seem correct.  Sure, maybe you just rsynced 100,000 files 20
seconds ago, so the upcoming flush is going to be expensive.  But the
next flush 30 seconds after that shouldn't be just as expensive, yet it
appears to be so.


I'm sure it doesn't cause many bogus flushes.  iostat shows zero writes
caused by calling this incessantly using "while :; do sync; done".


This is further supported by the original poster's
claim that it takes many hours of uptime before the problem becomes
noticeable.  If vnodes are never truly getting cleaned, or never getting
their flags cleared so that this loop knows that they are clean, then
it's feasible that they'll accumulate over time, keep on getting flushed
every 30 seconds, keep on bogging down the loop, and so on.


Using "find / >/dev/null" to grow the problem and make it bad after a
few seconds of uptime, and profiling of a single sync(2) call to show
that nothing much is done except the loop containing the above:

under ~5.2, on a 2.2GHz A64 UP in i386 mode:

after booting, with about 700 vnodes:

%   %    cumulative    self              self      total
%  time    seconds    seconds   calls   ns/call   ns/call   name
%  30.8      0.000      0.000       0   100.00%             mcount [4]
%  14.9      0.001      0.000       0   100.00%             mexitcount [5]
%   5.5      0.001      0.000       0   100.00%             cputime [16]
%   5.0      0.001      0.000       6     13312     13312   vfs_msync [18]
%   4.3      0.001      0.000       0   100.00%             user [21]
%   3.5      0.001      0.000       5     11321     11993   ffs_sync [23]

after "find / >/dev/null" was stopped after saturating at 64000 vnodes
(desiredvnodes is 70240):

%   %    cumulative    self              self      total
%  time    seconds    seconds   calls   ns/call   ns/call   name
%  50.7      0.008      0.008       5   1666427   1667246   ffs_sync [5]
%  38.0      0.015      0.006       6   1041217   1041217   vfs_msync [6]
%   3.1      0.015      0.001       0   100.00%             mcount [7]
%   1.5      0.015      0.000       0   100.00%             mexitcount [8]
%   0.6      0.015      0.000       0   100.00%             cputime [22]
%   0.6      0.016      0.000      34      2660      2660   generic_bcopy [24]
%   0.5      0.016      0.000       0   100.00%             user [26]

vfs_msync() is a problem too.  It uses an almost identical loop for
the case where the vnode is not dirty (but has a different condition
for being dirty).  ffs_sync() is called 5 times because there are 5
ffs file systems mounted r/w.  There is another ffs file system mounted
r/o and that combined with a missing r/o optimization might give the
extra call to vfs_msync().  With 64000 vnodes, the calls take 1-2 ms
each.  That is already quite a lot, and there are many calls.  Each
call only looks at vnodes under the mount point so the number of mounted
file systems doesn't affect the total time much.

ffs_sync() is taking 125 ns per vnode.  That is more than I would have
expected.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread David G Lawrence
> >Right, it's a non-optimal loop when N is very large, and that's a fairly
> >well understood problem.  I think what DG was getting at, though, is
> >that this massive flush happens every time the syncer runs, which
> >doesn't seem correct.  Sure, maybe you just rsynced 100,000 files 20
> >seconds ago, so the upcoming flush is going to be expensive.  But the
> >next flush 30 seconds after that shouldn't be just as expensive, yet it
> >appears to be so.
> 
> I'm sure it doesn't cause many bogus flushes.  iostat shows zero writes
> caused by calling this incessantly using "while :; do sync; done".

   I didn't say it caused any bogus disk I/O. My original problem
(after a day or two of uptime) was an occasional large scheduling delay
for a process that needed to process VoIP frames in real-time. It was
happening every 31 seconds and was causing voice frames to be dropped
due to the large latency causing the frame to be outside of the jitter
window. I wrote a program that measures the scheduling delay by sleeping
for one tick and then comparing the timeofday offset from what was
expected. This revealed that every 31 seconds, the process was seeing
a 17ms delay in scheduling. Further investigation found that 1) the
syncer was the process that was running every 31 seconds and causing
the delay (and it was the only one in the system with that timing
interval), and that 2) lowering the kern.maxvnodes to something lowish
(5000) would mostly mitigate the problem. The patch to limit the number
of vnodes to process in the loop before sleeping was then developed
and it completely resolved the problem. Since the wait that I added
is at the bottom of the loop and the limit is 500 vnodes, this tells
me that every 31 seconds, there are a whole lot of vnodes that are
being "synced", when there shouldn't have been any (this fact wasn't
apparent to me at the time, but when I later realized this, I had
no time to investigate further). My tests and analysis have all been
on an otherwise quiet system (no disk I/O), so the bottom of the
ffs_sync vnode loop should not have been reached at all, let alone
tens of thousands of times every 31 seconds. All machines were
uniprocessor, FreeBSD 6+. I don't know if this problem is present in 5.2.
I didn't see ffs_syncvnode in your call graph, so it probably is not.
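
A sketch of the kind of measurement described above (my reconstruction, not
the original program; the 1000 usec sleep assumes HZ=1000 and the 5000 usec
report threshold is arbitrary):

#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Scheduling-delay probe sketch: sleep for roughly one tick and report
 * how much later than expected the process actually woke up.
 */
int
main(void)
{
	struct timeval before, after;
	long late_us;

	for (;;) {
		gettimeofday(&before, NULL);
		usleep(1000);			/* ~1 tick at HZ=1000 */
		gettimeofday(&after, NULL);
		late_us = (after.tv_sec - before.tv_sec) * 1000000L +
		    (after.tv_usec - before.tv_usec) - 1000;
		if (late_us > 5000)
			printf("woke up %ld usec late\n", late_us);
	}
}
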
   Anyway, someone needs to instrument the vnode loop in ffs_sync and
figure out what is going on. As you've pointed out, it is necessary
to first read a lot of files (I use tar to /dev/null and make sure it
reads at least 100K files) in order to get the vnodes allocated. As
I mentioned previously, I suspect that either ip->i_flag is not getting
completely cleared in ffs_syncvnode or its children or
v_bufobj.bo_dirty.bv_cnt accounting is broken.
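
The kind of instrumentation meant here might look like the following
(illustrative only, not an existing patch; the counter names and sysctl
wiring are invented, and the fragment is meant to be dropped into the
ffs_sync() vnode loop around the check quoted earlier):

/*
 * Illustrative instrumentation sketch for the ffs_sync() vnode loop.
 * Counts how many vnodes are skipped as clean and why the others are
 * considered dirty, readable via sysctl debug.*.
 */
static u_long ffs_sync_skipped;		/* clean vnodes skipped */
static u_long ffs_sync_flag_dirty;	/* dirty because of ip->i_flag */
static u_long ffs_sync_buf_dirty;	/* dirty because of bv_cnt */

SYSCTL_ULONG(_debug, OID_AUTO, ffs_sync_skipped, CTLFLAG_RW,
    &ffs_sync_skipped, 0, "vnodes skipped as clean");
SYSCTL_ULONG(_debug, OID_AUTO, ffs_sync_flag_dirty, CTLFLAG_RW,
    &ffs_sync_flag_dirty, 0, "vnodes synced because of IN_* flags");
SYSCTL_ULONG(_debug, OID_AUTO, ffs_sync_buf_dirty, CTLFLAG_RW,
    &ffs_sync_buf_dirty, 0, "vnodes synced because of dirty buffers");

	/* ...inside the loop, in place of the existing check... */
	if (vp->v_type == VNON || ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
	    vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
		ffs_sync_skipped++;
		VI_UNLOCK(vp);
		continue;
	}
	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE))
		ffs_sync_flag_dirty++;
	else
		ffs_sync_buf_dirty++;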

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-18 Thread David G Lawrence
> Thanks.  Have a kernel building now.  It takes about a day of uptime  
> after reboot before I'll see the problem.

   You may also wish to try to get the problem to occur sooner after boot
on a non-patched system by doing a "tar cf /dev/null /" (note: substitute
/dev/zero instead of /dev/null, if you use GNU tar, to disable its
"optimization"). You can stop it after it has gone through 100K files.
Verify by looking at "sysctl vfs.numvnodes".
   Doing this would help to further prove that lots of allocated vnodes
is the prerequisite for the problem.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


Thanks.  Have a kernel building now.  It takes about a day of uptime
after reboot before I'll see the problem.


  You may also wish to try to get the problem to occur sooner after boot
on a non-patched system by doing a "tar cf /dev/null /" (note: substitute
/dev/zero instead of /dev/null, if you use GNU tar, to disable its
"optimization"). You can stop it after it has gone through a 100K files.
Verify by looking at "sysctl vfs.numvnodes".


Hmm, I said to use "find /", but that is not so good since it only
looks at directories, and directories (and their inodes) are not packed
as tightly as files (and their inodes).  Optimized tar, or "find /
-type f", or "ls -lR /", should work best, by doing not much more than
stat()ing lots of files, while full tar wastes time reading file data.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-18 Thread David G Lawrence
> On Tue, 18 Dec 2007, David G Lawrence wrote:
> 
> >>Thanks.  Have a kernel building now.  It takes about a day of uptime
> >>after reboot before I'll see the problem.
> >
> >  You may also wish to try to get the problem to occur sooner after boot
> >on a non-patched system by doing a "tar cf /dev/null /" (note: substitute
> >/dev/zero instead of /dev/null, if you use GNU tar, to disable its
> >"optimization"). You can stop it after it has gone through a 100K files.
> >Verify by looking at "sysctl vfs.numvnodes".
> 
> Hmm, I said to use "find /", but that is not so good since it only
> looks at directories and directories (and their inodes) are not packed
> as tightly as files (and their inodes).  Optimized tar, or "find /
> -type f", or "ls -lR /", should work best, by doing not much more than
> stat()ing lots of files, while full tar wastes time reading file data.

   I have no reason to believe that just reading directories will
reproduce the problem with file vnodes. You need to open the files
and read them. Nothing else will do.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


  I didn't say it caused any bogus disk I/O. My original problem
(after a day or two of uptime) was an occasional large scheduling delay
for a process that needed to process VoIP frames in real-time. It was
happening every 31 seconds and was causing voice frames to be dropped
due to the large latency causing the frame to be outside of the jitter
window. I wrote a program that measures the scheduling delay by sleeping
for one tick and then comparing the timeofday offset from what was
expected. This revealed that every 31 seconds, the process was seeing
a 17ms delay in scheduling. Further investigation found that 1) the


I got an almost identical delay (with 64000 vnodes).

Now, 17ms isn't much.  Delays must have been much longer when CPUs
were many times slower and RAM/vnodes were not so many times smaller.
High-priority threads just need to be able to preempt the syncer so
that they don't lose data (unless really hard real time is supported,
which it isn't).  This should work starting with about FreeBSD-6
(probably need "options PREEMPT").  It doesn't work in ~5.2 due to Giant
locking, but I find Giant locking to rarely matter for UP.  Old
versions of FreeBSD were only able to preempt to non-threads (interrupt
handlers) yet they somehow survived the longer delays.  They didn't
have Giant locking to get in the way, and presumably avoided packet
loss by doing lots in interrupt handlers (hardware isr and netisr).

I just remembered that I have seen packet loss even under -current
when I leave out or turn off "options PREEMPT".


...
and it completely resolved the problem. Since the wait that I added
is at the bottom of the loop and the limit is 500 vnodes, this tells
me that every 31 seconds, there are a whole lot of vnodes that are
being "synced", when there shouldn't have been any (this fact wasn't
apparent to me at the time, but when I later realized this, I had
no time to investigate further). My tests and analysis have all been
on an otherwise quiet system (no disk I/O), so the bottom of the
ffs_sync vnode loop should not have been reached at all, let alone
tens of thousands of times every 31 seconds. All machines were uni-
processor, FreeBSD 6+. I don't know if this problem is present in 5.2.
I didn't see ffs_syncvnode in your call graph, so it probably is not.


I chopped it to a flat profile with only the top callers.  Any significant
calls from ffs_sync() would show up as top callers.  I still have the
data, and the call graph shows much more clearly that there was just
one dirty vnode for the whole sync():

% 0.00  0.01   1/1   syscall [3]
% [4]  88.7  0.00  0.01   1  sync [4]
% 0.01  0.00   5/5   ffs_sync [5]
% 0.01  0.00   6/6   vfs_msync [6]
% 0.00  0.00   7/8   vfs_busy [260]
% 0.00  0.00   7/8   vfs_unbusy [263]
% 0.00  0.00   6/7   vn_finished_write [310]
% 0.00  0.00   6/6   vn_start_write [413]
% 0.00  0.00   1/1   vfs_stdnosync [472]
% 
% ---
% 
% 0.01  0.00   5/5   sync [4]

% [5]  50.7  0.01  0.00   5  ffs_sync [5]
% 0.00  0.00   1/1   ffs_fsync [278]
% 0.00  0.00   1/60  vget  [223]
% 0.00  0.00   1/60  ufs_vnoperatespec  [78]
% 0.00  0.00   1/26  vrele [76]

It passed the flags test just once to get to the vget().  ffs_syncvnode()
doesn't exist in 5.2, and ffs_fsync() is called instead.

% 
% ---
% 
% 0.01  0.00   6/6   sync [4]

% [6]  38.0  0.01  0.00   6  vfs_msync [6]
% 
% ---

% ...
% 
% 0.00  0.00   1/1   ffs_sync [5]

% [278]  0.0  0.00  0.00   1  ffs_fsync [278]
% 0.00  0.00   1/1   ffs_update [368]
% 0.00  0.00   1/4   vn_isdisk [304]

This is presumably to sync the 1 dirty vnode.

BTW I use noatime a lot, including for all file systems used in the test,
so the tree walk didn't dirty any vnodes.  A tar to /dev/zero would dirty
all vnodes if everything were mounted without this option.

% ...
%   %    cumulative    self              self      total
%  time    seconds    seconds   calls   ns/call   ns/call   name
%  50.7      0.008      0.008       5   1666427   1667246   ffs_sync [5]
%  38.0      0.015      0.006       6   1041217   1041217   vfs_msync [6]
%   3.1      0.015      0.001       0   100.00%             mcount [7]
%   1.5      0.015      0.000       0   100.00%

Re: Packet loss every 30.999 seconds

2007-12-18 Thread David G Lawrence
> I got an almost identical delay (with 64000 vnodes).
> 
> Now, 17ms isn't much.

   Says you. On modern systems, trying to run a pseudo real-time application
on an otherwise quiescent system, 17ms is just short of an eternity. I agree
that the syncer should be preemptable (which is what my bandaid patch
attempts to do), but that probably wouldn't have helped my specific problem
since my application was a user process, not a kernel thread.
   All of my systems have options PREEMPTION - that is the default in
6+. It doesn't affect this problem.
   On the other hand, the syncer shouldn't be consuming this much CPU in
the first place. There is obviously a bug here. Of course looking through
all of the vnodes in the system for something dirty is stupid in the
first place; there should be a separate list for that. ...but a simple
fix is what is needed right now.
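
A purely conceptual sketch of the "separate list" idea (all names are
invented; this is not FreeBSD code, just an illustration of keeping dirty
vnodes on their own per-mount list so the syncer never touches clean ones):

#include <sys/queue.h>

struct xvnode;
TAILQ_HEAD(xdirtylist, xvnode);

struct xmount {
	struct xdirtylist mnt_dirtyvnodes;	/* TAILQ_INIT()ed at mount time */
};

struct xvnode {
	int	v_dirty;			/* already on the dirty list? */
	TAILQ_ENTRY(xvnode) v_dirtylist;	/* list linkage */
};

/* Called when a vnode first becomes dirty. */
static void
mark_dirty(struct xmount *mp, struct xvnode *vp)
{
	if (!vp->v_dirty) {
		vp->v_dirty = 1;
		TAILQ_INSERT_TAIL(&mp->mnt_dirtyvnodes, vp, v_dirtylist);
	}
}

/* The periodic sync then walks only vnodes that actually need work. */
static void
sync_dirty(struct xmount *mp)
{
	struct xvnode *vp, *nvp;

	TAILQ_FOREACH_SAFE(vp, &mp->mnt_dirtyvnodes, v_dirtylist, nvp) {
		/* flush vp here, then drop it from the dirty list */
		TAILQ_REMOVE(&mp->mnt_dirtyvnodes, vp, v_dirtylist);
		vp->v_dirty = 0;
	}
}
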
   I'm going to have to bow out of this discussion now. I just don't have
the time for it.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-18 Thread David G Lawrence
> > I got an almost identical delay (with 64000 vnodes).
> > 
> > Now, 17ms isn't much.
> 
>Says you. On modern systems, trying to run a pseudo real-time application
> on an otherwise quiescent system, 17ms is just short of an eternity. I agree
> that the syncer should be preemptable (which is what my bandaid patch
> attempts to do), but that probably wouldn't have helped my specific problem
> since my application was a user process, not a kernel thread.

   One more followup (I swear I'm done, really!)... I have a laptop here
that runs at 150MHz when it is in the lowest running CPU power save mode.
At that speed, this bug causes a delay of more than 300ms and is enough
to cause loss of keyboard input. I have to switch into high speed mode
before I try to type anything, else I end up with random typos. Very
annoying.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-18 Thread Mark Fullmer

A little progress.

I have a machine with a KTR enabled kernel running.

Another machine is running David's ffs_vfsops.c patch.

I left two other machines (GENERIC kernels) running the packet loss test
overnight.  At ~32480 seconds of uptime the problem starts.  This is really
close to a 16 bit overflow... See http://www.eng.oar.net/~maf/bsd6/p1.png
and http://www.eng.oar.net/~maf/bsd6/p2.png.  The missing impulses at 31
second marks are the intervals between test runs.  The window of missing
packets (timestamps between two packets where a sequence number is missing)
is usually less than 4us, although I'm not sure gettimeofday() can be
trusted for measuring this.  See https://www.eng.oar.net/~maf/bsd6/p3.png


Things I'll try tonight:

  o check on the patched kernel

  o Try KTR debugging enabled before and after an expected high
    latency period.

  o Dump all files to /dev/null to trigger the behavior.

I would expect the vnode problem to look a little different on the
packet loss graphs over time.  If this leads anywhere I'll add a counter
before the msleep() and see how often it's getting there.

On Dec 17, 2007, at 5:24 AM, David G Lawrence wrote:
   I noticed this as well some time ago. The problem has to do with the
processing (syncing) of vnodes. When the total number of allocated vnodes
in the system grows to tens of thousands, the ~31 second periodic sync
process takes a long time to run. Try this patch and let people know if
it helps your problem. It will periodically wait for one tick (1ms) every
500 vnodes of processing, which will allow other things to run.

Index: ufs/ffs/ffs_vfsops.c
===
RCS file: /home/ncvs/src/sys/ufs/ffs/ffs_vfsops.c,v
retrieving revision 1.290.2.16
diff -c -r1.290.2.16 ffs_vfsops.c
*** ufs/ffs/ffs_vfsops.c9 Oct 2006 19:47:17 -   1.290.2.16
--- ufs/ffs/ffs_vfsops.c25 Apr 2007 01:58:15 -
***
*** 1109,1114 
--- 1109,1115 
int softdep_deps;
int softdep_accdeps;
struct bufobj *bo;
+   int flushed_count = 0;

fs = ump->um_fs;
if (fs->fs_fmod != 0 && fs->fs_ronly != 0) {  /* XXX */
***
*** 1174,1179 
--- 1175,1184 
allerror = error;
vput(vp);
MNT_ILOCK(mp);
+   if (flushed_count++ > 500) {
+   flushed_count = 0;
+   msleep(&flushed_count, MNT_MTX(mp), PZERO, "syncw", 1);
+   }
}
MNT_IUNLOCK(mp);
/*

-DG





Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Tue, 18 Dec 2007, David G Lawrence wrote:


I got an almost identical delay (with 64000 vnodes).

Now, 17ms isn't much.


   Says you. On modern systems, trying to run a pseudo real-time application
on an otherwise quiescent system, 17ms is just short of an eternity. I agree
that the syncer should be preemptable (which is what my bandaid patch
attempts to do), but that probably wouldn't have helped my specific problem
since my application was a user process, not a kernel thread.


FreeBSD isn't a real-time system, and 17ms isn't much for it.  I saw lots
of syscall delays of nearly 1 second while debugging this.  (With another
hat, I would say that 17 us was a long time in 1992.  17 us is hundreds of
times longer now.)


  One more followup (I swear I'm done, really!)... I have a laptop here
that runs at 150MHz when it is in the lowest running CPU power save mode.
At that speed, this bug causes a delay of more than 300ms and is enough
to cause loss of keyboard input. I have to switch into high speed mode
before I try to type anything, else I end up with random typos. Very
annoying.


Yes, something is wrong if keystrokes are lost with CPUs that run at
150 kHz (sic) or faster.

Debugging shows that the problem is like I said.  The loop really does
take 125 ns per iteration.  This time is actually not very much.  The
linked list of vnodes could hardly be designed better to maximize
cache thrashing.  My system has a fairly small L2 cache (512K or 1M),
and even a few words from the vnode and the inode don't fit in the L2
cache when there are 64000 vnodes, but the vp and ip are also fairly
well designed to maximize cache thrashing, so L2 cache thrashing starts
at just a few thousand vnodes.

My system has fairly low latency main memory, else the problem would
be larger:

% Memory latencies in nanoseconds - smaller is better
% (WARNING - may not be correct, check graphs)
% ---
% Host      OS            Mhz   L1 $   L2 $   Main mem    Guesses
% --------- -------       ---   ----   ----   --------    -------
% besplex.b FreeBSD 7.0-C  2205 1.361 5.6090   42.4 [PC3200 CL2.5 overclocked]
% sledge.fr FreeBSD 8.0-C  1802 1.666 8.9420   99.8
% freefall. FreeBSD 7.0-C  2778 0.746 6.6310  155.5

The loop makes the following memory accesses, at least in 5.2:

% loop:
%   for (vp = TAILQ_FIRST(&mp->mnt_nvnodelist); vp != NULL; vp = nvp) {
%     /*
%      * If the vnode that we are about to sync is no longer
%      * associated with this mount point, start over.
%      */
%     if (vp->v_mount != mp)
%       goto loop;
%
%     /*
%      * Depend on the mntvnode_slock to keep things stable enough
%      * for a quick test.  Since there might be hundreds of
%      * thousands of vnodes, we cannot afford even a subroutine
%      * call unless there's a good chance that we have work to do.
%      */
%     nvp = TAILQ_NEXT(vp, v_nmntvnodes);

Access 1 word at vp offset 0x90.  Costs 1 cache line.  IIRC, my system has
a cache line size of 0x40.  Assume this, and that vp is aligned on a
cache line boundary.  So this access costs the cache line at vp offsets
0x80-0xbf.

%   VI_LOCK(vp);

Access 1 word at vp offset 0x1c.  Costs the cache line at vp offsets 0-0x3f.

%   if (vp->v_iflag & VI_XLOCK) {

Access 1 word at vp offset 0x24.  Cache hit.

%   VI_UNLOCK(vp);
%   continue;
%   }
%   ip = VTOI(vp);

Access 1 word at vp offset 0xa8.  Cache hit.

%   if (vp->v_type == VNON || ((ip->i_flag &

Access 1 word at vp offset 0xa0.  Cache hit.

Access 1 word at ip offset 0x18.  Assume that ip is aligned, as above.  Costs
the cache line at ip offsets 0-0x3f.

%   (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
%   TAILQ_EMPTY(&vp->v_dirtyblkhd))) {

Access 1 word at vp offset 0x48.  Costs the cache line at vp offsets 0x40-
0x7f.

%   VI_UNLOCK(vp);

Reaccess 1 word at vp offset 0x1c.  Cache hit.

%   continue;
%   }

The total cost is 4 cache lines or 256 bytes per vnode.  So with an L2
cache size of 1MB, the L2 cache will start thrashing at numvnodes =
4096.  With thrashing, and at my main memory latency of 42.4 nsec, it
might take 4*42.4 = 169.6 nsec to read main memory.  This is similar
to my observed time.  Presumably things aren't quite that bad because
there is some locality for the 3 lines in each vp.  It might be possible
to improve this a bit by accessing the lines sequentially and not
interleaving the access to ip.  Better, repack vp and move the IN*
flags from ip to vp (a change that has other advantages), so that
everything is in 1 cache line per vp.
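
To illustrate the repacking idea (kernel-side sketch only, with invented
names -- not the real struct vnode -- and assuming a 64-byte cache line
that the interlock plus these fields actually fit in):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

/*
 * Everything the ffs_sync() fast path reads, grouped so the whole
 * per-vnode test touches one cache line instead of four.
 */
struct vnode_hot {
        struct mtx      v_interlock;    /* VI_LOCK()/VI_UNLOCK() */
        int             v_iflag;        /* VI_XLOCK check */
        int             v_inflag;       /* IN_ACCESS|IN_CHANGE|... mirrored from the inode */
        int             v_type;         /* VNON check */
        int             v_dirtycnt;     /* "any dirty buffers?" check */
        TAILQ_ENTRY(vnode_hot) v_nmntvnodes;    /* mount list linkage */
} __aligned(64);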

This isn't consistent with the delay increasing to 300 ms when the CPU
is throttled -- memory should

Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Tue, 18 Dec 2007, Mark Fullmer wrote:


A little progress.

I have a machine with a KTR enabled kernel running.

Another machine is running David's ffs_vfsops.c's patch.

I left two other machines (GENERIC kernels) running the packet loss test
overnight.  At ~ 32480 seconds of uptime the problem starts.  This is really


Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.


marks are the intervals between test runs.  The window of missing packets
(timestamps between two packets where a sequence number is missing)
is usually less than 4us, although I'm not sure gettimeofday() can be
trusted for measuring this.  See https://www.eng.oar.net/~maf/bsd6/p3.png


gettimeofday() can normally be trusted to better than 1 us for time
differences of up to about 1 second.  However, gettimeofday() should
not be used in any program written after clock_gettime() became standard
in 1994.  clock_gettime() has a resolution of 1 ns.  It isn't quite
that accurate on current machines, but I trust it to measure differences
of 10 nsec between back to back clock_gettime() calls here.  Sample
output from wollman@'s old clock-watching program converted to
clock_gettime():

%%%
2007/12/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 238, max 99730, mean 240.025380, std 77.291436
1th: 239 (1203207 observations)
2th: 240 (556307 observations)
3th: 241 (190211 observations)
4th: 238 (50091 observations)
5th: 242 (20 observations)

2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317
1th: 247 (1274231 observations)
2th: 248 (668611 observations)
3th: 249 (56950 observations)
4th: 250 (23 observations)
5th: 263 (8 observations)

2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400
1th: 264 (1343245 observations)
2th: 263 (626226 observations)
3th: 265 (26860 observations)
4th: 262 (3572 observations)
5th: 268 (8 observations)

2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440
1th: 261 (999391 observations)
2th: 320 (473325 observations)
3th: 262 (373831 observations)
4th: 321 (148126 observations)
5th: 312 (4759 observations)

2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301
1th: 838 (1685662 observations)
2th: 839 (136980 observations)
3th: 559 (72160 observations)
4th: 837 (48902 observations)
5th: 558 (31217 observations)

2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
1th: 4190 (1423885 observations)
2th: 4191 (440158 observations)
3th: 3352 (65261 observations)
4th: 5028 (39202 observations)
5th: 5029 (15456 observations)
%%%

"min" here gives the minimum latency of a clock_gettime() syscall.
The improvement from 247 nsec to 240 nsec in the "mean" due to -O2
-mcpu=athlon-xp can be trusted to be measured very accurately since
it is an average over more than 100 million trials, and the improvement
from 247 nsec to 238 nsec for "min" can be trusted because it is
consistent with the improvement in the mean.

The program had to be converted to use clock_gettime() a few years
ago when CPU speeds increased so much that the correct "min" became
significantly less than 1 us.  With gettimeofday(), it cannot distinguish
between an overhead of 1 ns and an overhead of 1 us.

For the ACPI and i8254 timecounter, you can see that the low-level
timecounters have a low frequency clock from the large gaps between
the observations.  There is a gap of 279-280 ns for the acpi timecounter.
This is the period of the acpi timecounter's clock (frequency
14318182/4 Hz, i.e., a period of 279.3651 ns).  Since we can observe this
period to within 1 ns, we must have a basic accuracy of nearly 1 ns, but
if we make only 2 observations we are likely to have an inaccuracy of 279
ns due to the granularity of the clock.  The TSC has a clock granularity
of 6 ns on my CPU, and delivers almost that much accuracy with only
2 observations, but technical problems prevent general use of the TSC.
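
For anyone who wants to reproduce the measurement, the core of such a
clock-watching loop is just back-to-back clock_gettime() calls.  A minimal
sketch (mine, not wollman@'s program; it only reports min/max and any
outlier above 1 ms):

#include <stdio.h>
#include <time.h>

int
main(void)
{
        struct timespec t0, t1;
        long d, min = -1, max = 0;
        int i;

        for (i = 0; i < 2000000; i++) {
                clock_gettime(CLOCK_REALTIME, &t0);
                clock_gettime(CLOCK_REALTIME, &t1);
                d = (t1.tv_sec - t0.tv_sec) * 1000000000L +
                    (t1.tv_nsec - t0.tv_nsec);
                if (min < 0 || d < min)
                        min = d;
                if (d > max)
                        max = d;
                if (d > 1000000)        /* > 1 ms: we were preempted */
                        printf("outlier: %ld ns at iteration %d\n", d, i);
        }
        printf("min %ld ns, max %ld ns\n", min, max);
        return (0);
}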

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread David G Lawrence
> On Tue, 18 Dec 2007, David G Lawrence wrote:
> 
> >>>I got an almost identical delay (with 64000 vnodes).
> >>>
> >>>Now, 17ms isn't much.
> >>
> >>   Says you. On modern systems, trying to run a pseudo real-time 
> >>   application
> >>on an otherwise quiescent system, 17ms is just short of an eternity. I 
> >>agree
> >>that the syncer should be preemptable (which is what my bandaid patch
> >>attempts to do), but that probably wouldn't have helped my specific 
> >>problem
> >>since my application was a user process, not a kernel thread.
> 
> FreeBSD isn't a real-time system, and 17ms isn't much for it.  I saw lots

   I never said it was, but that doesn't stop us from using FreeBSD in
pseudo real-time applications. This is made possible by fast CPUs and
dedicated-task systems where the load is carefully controlled.

> of syscall delays of nearly 1 second while debugging this.  (With another

   I can make the delay several minutes by pushing the reset button.
 
> Debugging shows that the problem is like I said.  The loop really does
> take 125 ns per iteration.  This time is actually not very much.  The

   Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.

   In any case, it appears that my patch is a no-op, at least for the
problem I was trying to solve. This has me confused, however, because at
one point the problem was mitigated with it. The patch has gone through
several iterations, however, and it could be that it was made to the top
of the loop, before any of the checks, in a previous version. Hmmm.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread David G Lawrence
> Try it with "find / -type f >/dev/null" to duplicate the problem almost
> instantly.

   FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at least a little bit). I guess
that got removed when the size of the vnode pool was dramatically
increased.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread David G Lawrence
>In any case, it appears that my patch is a no-op, at least for the
> problem I was trying to solve. This has me confused, however, because at
> one point the problem was mitigated with it. The patch has gone through
> several iterations, however, and it could be that it was made to the top
> of the loop, before any of the checks, in a previous version. Hmmm.

(replying to myself)

   I just found an earlier version of the patch, and sure enough, it was
to the top of the loop. Unfortunately, that version caused the system to
crash because vp was occasionally invalid after the wakeup.

   Anyway, let's see if Mark's packet loss problem is indeed related to
this code. If he does the find just after boot and immediately sees the
problem, then I would say that is fairly conclusive. He could also release
the cached vnodes by temporarily setting kern.maxvnodes=1 and then
setting it back to whatever it was previously (probably 6-10).
If the problem then goes away for awhile, that would be another good
indicator.
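
From userland that is just a pair of sysctl writes, e.g. (sketch only;
whether the cached vnodes actually get recycled right away is up to the
vnlru machinery):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
        int old, tiny = 1;
        size_t len = sizeof(old);

        /* Read the current limit and drop it to 1 in one call. */
        if (sysctlbyname("kern.maxvnodes", &old, &len, &tiny, sizeof(tiny)) == -1) {
                perror("sysctlbyname");
                return (1);
        }
        printf("kern.maxvnodes was %d, now 1\n", old);
        sleep(5);               /* give the kernel a moment to recycle */

        /* Put the old limit back. */
        if (sysctlbyname("kern.maxvnodes", NULL, NULL, &old, sizeof(old)) == -1) {
                perror("sysctlbyname (restore)");
                return (1);
        }
        printf("kern.maxvnodes restored to %d\n", old);
        return (0);
}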

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Stephan Uphoff

David G Lawrence wrote:

Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.



   FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at least a little bit). I guess
that got removed when the size of the vnode pool was dramatically
increased.
  
You can decrease vfs.wantfreevnodes if caching files without cached data
is not beneficial for your application.


-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


Debugging shows that the problem is like I said.  The loop really does
take 125 ns per iteration.  This time is actually not very much.  The


  Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.


As I said, 125 nsec is a short time in this context.  It is approximately
the time for a single L2 cache miss on a machine with slow memory like
freefall (Xeon 2.8 GHz with a main memory latency of 155.5 ns).  As I said,
the code is organized so as to give about 4 L2 cache misses per vnode
if there are more than a few thousand vnodes, so it is doing very well
to take only 125 nsec for a few checks.


  In any case, it appears that my patch is a no-op, at least for the
problem I was trying to solve. This has me confused, however, because at
one point the problem was mitigated with it. The patch has gone through
several iterations, however, and it could be that it was made to the top
of the loop, before any of the checks, in a previous version. Hmmm.


The patch should work fine.  IIRC, it yields voluntarily so that other
things can run.  I committed a similar hack for uiomove().  It was
easy to make syscalls that take many seconds (now tenths of seconds
instead of seconds?), and without yielding or PREEMPTION or multiple
CPUs, everything except interrupts has to wait for these syscalls.  Now
the main problem is to figure out why PREEMPTION doesn't work.  I'm
not working on this directly since I'm running ~5.2 where nearly-full
kernel preemption doesn't work due to Giant locking.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Mark Fullmer


On Dec 19, 2007, at 9:54 AM, Bruce Evans wrote:


On Tue, 18 Dec 2007, Mark Fullmer wrote:


A little progress.

I have a machine with a KTR enabled kernel running.

Another machine is running David's ffs_vfsops.c's patch.

I left two other machines (GENERIC kernels) running the packet loss test
overnight.  At ~ 32480 seconds of uptime the problem starts.  This is really


Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.


I was able to verify last night that (cd /; tar -cpf -) > all.tar would
trigger the problem.  I'm working on getting a test running with
David's ffs_sync() workaround now; adding a few counters there should
get this narrowed down a little more.

Thanks for the other info on timer resolution, I overlooked
clock_gettime().

--
mark


Re: Packet loss every 30.999 seconds

2007-12-19 Thread David G Lawrence
> >  In any case, it appears that my patch is a no-op, at least for the
> >problem I was trying to solve. This has me confused, however, because at
> >one point the problem was mitigated with it. The patch has gone through
> >several iterations, however, and it could be that it was made to the top
> >of the loop, before any of the checks, in a previous version. Hmmm.
> 
> The patch should work fine.  IIRC, it yields voluntarily so that other
> things can run.  I committed a similar hack for uiomove().  It was

   It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there are thousands of dirty vnodes.
While that condition can certainly happen, it isn't the case that I'm
particularly interested in.

> CPUs, everything except interrupts has to wait for these syscalls.  Now
> the main problem is to figure out why PREEMPTION doesn't work.  I'm
> not working on this directly since I'm running ~5.2 where nearly-full
> kernel preemption doesn't work due to Giant locking.

   I don't understand how PREEMPTION is supposed to work (I mean
to any significant detail), so I can't really comment on that.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread David G Lawrence
> >Try it with "find / -type f >/dev/null" to duplicate the problem  
> >almost
> >instantly.
> 
> I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> trigger the problem.  I'm working getting a test running with
> David's ffs_sync() workaround now, adding a few counters there should
> get this narrowed down a little more.

   Unfortunately, the version of the patch that I sent out isn't going to
help your problem. It needs to yield at the top of the loop, but vp isn't
necessarily valid after the wakeup from the msleep. That's a problem that
I'm having trouble figuring out a solution to - the solutions that come
to mind will all significantly increase the overhead of the loop.
   As a very inadequate work-around, you might consider lowering
kern.maxvnodes to something like 2 - that might be low enough to
not trigger the problem, but also be high enough to not significantly
affect system I/O performance.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


Try it with "find / -type f >/dev/null" to duplicate the problem almost
instantly.


  FreeBSD used to have some code that would cause vnodes with no cached
pages to be recycled quickly (which would have made a simple find
ineffective without reading the files at least a little bit). I guess
that got removed when the size of the vnode pool was dramatically
increased.


It might still.  The data should be cached somewhere, but caching it
in both the buffer cache/VMIO and the vnode/inode is wasteful.

I may have been only caching vnodes for directories.  I switched to
using a find or a tar on /home/ncvs/ports since that has a very high
density of directories.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Thu, 20 Dec 2007, Bruce Evans wrote:


On Wed, 19 Dec 2007, David G Lawrence wrote:

  Considering that the CPU clock cycle time is on the order of 300ps, I
would say 125ns to do a few checks is pathetic.


As I said, 125 nsec is a short time in this context.  It is approximately
the time for a single L2 cache miss on a machine with slow memory like
freefall (Xeon 2.8 GHz with L2 cache latency of 155.5 ns).  As I said,


Perfmon counts for the cache misses during sync(1):

==> /tmp/kg1/z0 <==
vfs.numvnodes: 630
# s/kx-dc-accesses 
484516
# s/kx-dc-misses 
20852

misses = 4%

==> /tmp/kg1/z1 <==
vfs.numvnodes: 9246
# s/kx-dc-accesses 
884361
# s/kx-dc-misses 
89833

misses = 10%

==> /tmp/kg1/z2 <==
vfs.numvnodes: 20312
# s/kx-dc-accesses 
1389959
# s/kx-dc-misses 
178207

misses = 13%

==> /tmp/kg1/z3 <==
vfs.numvnodes: 80802
# s/kx-dc-accesses 
4122411
# s/kx-dc-misses 
658740

misses = 16%

==> /tmp/kg1/z4 <==
vfs.numvnodes: 138557
# s/kx-dc-accesses 
7150726
# s/kx-dc-misses 
1129997

misses = 16%

===

I forgot to only count active vnodes in the above.  vfs.freevnodes was
small (< 5%).

I set kern.maxvnodes to 20, but vfs.numvnodes saturated at 138557
(probably all that fits in kvm or main memory on i386 with 1GB RAM).

With 138557 vnodes, a null sync(2) takes 39673 us according to kdump -R.
That is 35.1 ns per miss.  This is consistent with lmbench2's estimate
of 42.5 ns for main memory latency.

Watching vfs.*vnodes confirmed that vnode caching still works like you
said:
o "find /home/ncvs/ports -type f" only gives a vnode for each directory
o a repeated "find /home/ncvs/ports -type f" is fast because everything
  remains cached by VMIO.  FreeBSD performed very badly at this benchmark
  before VMIO existed and was used for directories
o "tar cf /dev/zero /home/ncvs/ports" gives a vnode for files too.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Kostik Belousov
On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> > >Try it with "find / -type f >/dev/null" to duplicate the problem  
> > >almost
> > >instantly.
> > 
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > trigger the problem.  I'm working getting a test running with
> > David's ffs_sync() workaround now, adding a few counters there should
> > get this narrowed down a little more.
> 
>Unfortunately, the version of the patch that I sent out isn't going to
> help your problem. It needs to yield at the top of the loop, but vp isn't
> necessarily valid after the wakeup from the msleep. That's a problem that
> I'm having trouble figuring out a solution to - the solutions that come
> to mind will all significantly increase the overhead of the loop.
>As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to
> not trigger the problem, but also be high enough to not significantly
> affect system I/O performance.

I think the following may be safe. It counts only the clean scanned vnodes
and does not touch the vp, which may indeed have been reclaimed, after the sleep.

I never booted with the change.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..e686b97 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
struct ufsmount *ump = VFSTOUFS(mp);
struct fs *fs;
int error, count, wait, lockreq, allerror = 0;
+   int yield_count;
int suspend;
int suspended;
int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);
 
+   yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,11 @@ loop:
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   yield_count = 0;
+   msleep(&yield_count, MNT_MTX(mp), PZERO,
+   "ffspause", 1);
+   }
continue;
}
MNT_IUNLOCK(mp);




Re: Packet loss every 30.999 seconds

2007-12-19 Thread Kostik Belousov
On Wed, Dec 19, 2007 at 08:11:59PM +0200, Kostik Belousov wrote:
> On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:
> > > >Try it with "find / -type f >/dev/null" to duplicate the problem  
> > > >almost
> > > >instantly.
> > > 
> > > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > > trigger the problem.  I'm working getting a test running with
> > > David's ffs_sync() workaround now, adding a few counters there should
> > > get this narrowed down a little more.
> > 
> >Unfortunately, the version of the patch that I sent out isn't going to
> > help your problem. It needs to yield at the top of the loop, but vp isn't
> > necessarily valid after the wakeup from the msleep. That's a problem that
> > I'm having trouble figuring out a solution to - the solutions that come
> > to mind will all significantly increase the overhead of the loop.
> >As a very inadequate work-around, you might consider lowering
> > kern.maxvnodes to something like 2 - that might be low enough to
> > not trigger the problem, but also be high enough to not significantly
> > affect system I/O performance.
> 
> I think the following may be safe. It counts only the clean scanned vnodes
> and does not evaluate the vp, that indeed may be reclaimed, after the sleep.
> 
> I never booted with the change.
> 
> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index cbccc62..e686b97 100644

Or, better to use uio_yield(). See below.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..5d2535f 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
struct ufsmount *ump = VFSTOUFS(mp);
struct fs *fs;
int error, count, wait, lockreq, allerror = 0;
+   int yield_count;
int suspend;
int suspended;
int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);
 
+   yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   MNT_IUNLOCK(mp);
+   yield_count = 0;
+   uio_yield();
+   goto relock_mp;
+   }
continue;
}
MNT_IUNLOCK(mp);
@@ -1247,6 +1255,7 @@ loop:
if ((error = ffs_syncvnode(vp, waitfor)) != 0)
allerror = error;
vput(vp);
+   relock_mp:
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);




Re: Packet loss every 30.999 seconds

2007-12-19 Thread Mark Fullmer

Just to confirm the patch did not change the behavior.  I ran with it
last night and double checked this morning to make sure.

It looks like if you put the check at the top of the loop and the next node
is changed during msleep(), SLIST_NEXT will walk into the trash.  I'm
in over my head here.

Setting kern.maxvnodes=1000 does stop both the periodic packet loss and
the high latency syscalls, so it does look like walking this chain
without yielding the processor is part of the problem I'm seeing.

The other behavior I don't understand is why the em driver is able
to increment if_ipackets but still lose the packet.

Dumping the internal stats with dev.em.1.stats=1:

Dec 19 13:07:46 dytnq-nf1 kernel: em1: Excessive collisions = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Sequence errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Defer count = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Missed Packets = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive No Buffers = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive Length Errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Receive errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Crc errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Alignment errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Collision/Carrier extension errors = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: RX overruns = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: watchdog timeouts = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XON Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XON Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XOFF Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: XOFF Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Good Packets Rcvd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: Good Packets Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: TSO Contexts Xmtd = 0
Dec 19 13:07:46 dytnq-nf1 kernel: em1: TSO Contexts Failed = 0

With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
in the application.  If packets were dropped they would show up
with netstat -s as "dropped due to full socket buffers".

Since the packet never makes it to ip_input() I no longer have
any way to count drops.  There will always be corner cases where
interrupts are lost and drops not accounted for if the adapter
hardware can't report them, but right now I've got no way to
estimate any loss.
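
For reference, the relevant part of the receiver setup is nothing more
than this (trimmed sketch, not the actual collector; the port number is
arbitrary):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
        struct sockaddr_in sin;
        int s, rcvbuf = 1024 * 1024;    /* must fit under kern.ipc.maxsockbuf */

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1) {
                perror("socket");
                return (1);
        }
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) == -1)
                perror("setsockopt(SO_RCVBUF)");

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(9000);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) == -1) {
                perror("bind");
                return (1);
        }
        /* ...recvfrom() loop with sequence-number checking goes here... */
        return (0);
}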

--
mark

On Dec 19, 2007, at 12:13 PM, David G Lawrence wrote:


Try it with "find / -type f >/dev/null" to duplicate the problem
almost
instantly.


I was able to verify last night that (cd /; tar -cpf -) > all.tar  
would

trigger the problem.  I'm working getting a test running with
David's ffs_sync() workaround now, adding a few counters there should
get this narrowed down a little more.


   Unfortunately, the version of the patch that I sent out isn't  
going to
help your problem. It needs to yield at the top of the loop, but vp  
isn't
necessarily valid after the wakeup from the msleep. That's a  
problem that
I'm having trouble figuring out a solution to - the solutions that  
come

to mind will all significantly increase the overhead of the loop.
   As a very inadequate work-around, you might consider lowering
kern.maxvnodes to something like 2 - that might be low enough to
not trigger the problem, but also be high enough to not significantly
affect system I/O performance.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866)  
399 8500

The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.




Re: Packet loss every 30.999 seconds

2007-12-19 Thread Bruce Evans

On Wed, 19 Dec 2007, David G Lawrence wrote:


The patch should work fine.  IIRC, it yields voluntarily so that other
things can run.  I committed a similar hack for uiomove().  It was


  It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there are thousands of dirty vnodes.
While that condition can certainly happen, it isn't the case that I'm
particularly interested in.


Oops.

When it reaches the bottom of the loop, it will probably block on i/o
sometimes, so that the problem is smaller anyway.


CPUs, everything except interrupts has to wait for these syscalls.  Now
the main problem is to figure out why PREEMPTION doesn't work.  I'm
not working on this directly since I'm running ~5.2 where nearly-full
kernel preemption doesn't work due to Giant locking.


  I don't understand how PREEMPTION is supposed to work (I mean
to any significant detail), so I can't really comment on that.


Me neither, but I will comment anyway :-).  I think PREEMPTION should
even preempt kernel threads in favor of (higher priority of course)
user threads that are in the kernel, but doesn't do this now.  Even
interrupt threads should have dynamic priorities so that when they
become too hoggish they can be preempted even by user threads subject
to this priority rule.  This is further from happening.

ffs_sync() can hold the mountpoint lock for a long time.  That gives
problems preempting it.  To move your fix to the top of the loop, I
think you just need to drop the mountpoint lock every few hundred
iterations while yielding.  This would help for PREEMPTION too.  Dropping
the lock must be safe because it is already done while flushing.

Hmm, the loop is nicely obfuscated and pessimized in current (see
rev.1.234).  The fast (modulo no cache misses) path used to be just a
TAILQ_NEXT() to reach the next vnode, but now unnecessarily joins the
slow path at MNT_VNODE_FOREACH(), and MNT_VNODE_FOREACH() hides a
function call.

Bruce


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Julian Elischer

David G Lawrence wrote:

 In any case, it appears that my patch is a no-op, at least for the
problem I was trying to solve. This has me confused, however, because at
one point the problem was mitigated with it. The patch has gone through
several iterations, however, and it could be that it was made to the top
of the loop, before any of the checks, in a previous version. Hmmm.

The patch should work fine.  IIRC, it yields voluntarily so that other
things can run.  I committed a similar hack for uiomove().  It was


   It patches the bottom of the loop, which is only reached if the vnode
is dirty. So it will only help if there are thousands of dirty vnodes.
While that condition can certainly happen, it isn't the case that I'm
particularly interested in.


CPUs, everything except interrupts has to wait for these syscalls.  Now
the main problem is to figure out why PREEMPTION doesn't work.  I'm
not working on this directly since I'm running ~5.2 where nearly-full
kernel preemption doesn't work due to Giant locking.


   I don't understand how PREEMPTION is supposed to work (I mean
to any significant detail), so I can't really comment on that.


It's really very simple.

When you do a "wakeup" (or anything else that puts a thread on a run
queue, i.e. uses setrunqueue()), then if that thread has more priority
than you do (and in the general case it is an interrupt thread), you
immediately call mi_switch so that it runs immediately.  You are
guaranteed to run again when it finishes (you are not just put back on
the run queue at the end).

The critical_enter()/critical_exit() calls disable this from happening to
you if you really must not be interrupted by another thread.

There is an option where it is not just interrupt threads that can jump in,
but I think it's usually disabled.
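
If it helps, the decision boils down to something like the toy program
below (grossly simplified, with invented names; the real maybe_preempt()
in kern_switch.c is more restrictive unless FULL_PREEMPTION is set):

#include <stdio.h>

/* Toy stand-ins for kernel concepts -- not FreeBSD's real structures. */
struct toy_thread {
        const char *name;
        int priority;   /* lower value = more important, as in the kernel */
        int critnest;   /* critical_enter() nesting level */
};

/* Should waking 'woken' switch away from 'running' immediately? */
static int
should_preempt(const struct toy_thread *running, const struct toy_thread *woken)
{
        if (running->critnest > 0)      /* inside critical_enter()/exit() */
                return (0);
        return (woken->priority < running->priority);
}

int
main(void)
{
        struct toy_thread user = { "user process", 120, 0 };
        struct toy_thread ithd = { "em1 ithread", 28, 0 };

        printf("wakeup of %s while %s runs: %s\n", ithd.name, user.name,
            should_preempt(&user, &ithd) ? "switch now" : "queue only");
        return (0);
}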




-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-19 Thread Kostik Belousov
On Wed, Dec 19, 2007 at 11:44:00AM -0800, Julian Elischer wrote:
> David G Lawrence wrote:
> >>> In any case, it appears that my patch is a no-op, at least for the
> >>>problem I was trying to solve. This has me confused, however, because at
> >>>one point the problem was mitigated with it. The patch has gone through
> >>>several iterations, however, and it could be that it was made to the top
> >>>of the loop, before any of the checks, in a previous version. Hmmm.
> >>The patch should work fine.  IIRC, it yields voluntarily so that other
> >>things can run.  I committed a similar hack for uiomove().  It was
> >
> >   It patches the bottom of the loop, which is only reached if the vnode
> >is dirty. So it will only help if there are thousands of dirty vnodes.
> >While that condition can certainly happen, it isn't the case that I'm
> >particularly interested in.
> >
> >>CPUs, everything except interrupts has to wait for these syscalls.  Now
> >>the main problem is to figure out why PREEMPTION doesn't work.  I'm
> >>not working on this directly since I'm running ~5.2 where nearly-full
> >>kernel preemption doesn't work due to Giant locking.
> >
> >   I don't understand how PREEMPTION is supposed to work (I mean
> >to any significant detail), so I can't really comment on that.
> 
> It's really very simple.
> 
> When you do a "wakeup" 
> (or anything else that puts a thread on a run queue)
> i.e.  use setrunqueue()
> then if that thread has more priority than you do, (and in the general case
> is an interrupt thread), you immedialty call mi_switch so that it runs 
> imediatly.
> You get guaranteed to run again when it finishes. 
> (you are not just put back on the run queue at the end).
As far as I see it, only the interrupt threads can put a kernel thread off
the CPU. Moreover, the thread being forced out must be an "idle user thread".

See kern_switch.c, maybe_preempt(), the #ifndef FULL_PREEMPTION block.
> 
> the critical_enter()/critical_exit() calls disable this from happening to 
> you if you really must not be interrupted by another thread.
> 
> there is an option where it is not jsut interrupt threads that can jump in,
> but I think it's usually disabled.
Do you mean FULL_PREEMPTION ?




Re: Packet loss every 30.999 seconds

2007-12-20 Thread Peter Jeremy
On Wed, Dec 19, 2007 at 12:06:59PM -0500, Mark Fullmer wrote:
>Thanks for the other info on timer resolution, I overlooked
>clock_gettime().

If you have a UP system with a usable TSC (or equivalent) then
using rdtsc() (or equivalent) is a much cheaper way to measure
short durations with high resolution.
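
E.g. something like this (sketch; assumes x86, the rdtsc() inline from
<machine/cpufunc.h>, and a TSC whose rate doesn't change during the
measurement -- convert ticks to time via the machdep.tsc_freq sysctl):

#include <sys/types.h>
#include <machine/cpufunc.h>    /* rdtsc() */
#include <stdio.h>

int
main(void)
{
        uint64_t before, after;

        before = rdtsc();
        /* ...the short section being measured... */
        after = rdtsc();

        printf("%llu TSC ticks\n", (unsigned long long)(after - before));
        return (0);
}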

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: Packet loss every 30.999 seconds

2007-12-20 Thread Mark Fullmer

Thanks, I'll test this later on today.

On Dec 19, 2007, at 1:11 PM, Kostik Belousov wrote:


On Wed, Dec 19, 2007 at 09:13:31AM -0800, David G Lawrence wrote:

Try it with "find / -type f >/dev/null" to duplicate the problem
almost
instantly.


I was able to verify last night that (cd /; tar -cpf -) > all.tar  
would

trigger the problem.  I'm working getting a test running with
David's ffs_sync() workaround now, adding a few counters there  
should

get this narrowed down a little more.


   Unfortunately, the version of the patch that I sent out isn't  
going to
help your problem. It needs to yield at the top of the loop, but  
vp isn't
necessarily valid after the wakeup from the msleep. That's a  
problem that
I'm having trouble figuring out a solution to - the solutions that  
come

to mind will all significantly increase the overhead of the loop.
   As a very inadequate work-around, you might consider lowering
kern.maxvnodes to something like 2 - that might be low enough to
not trigger the problem, but also be high enough to not significantly
affect system I/O performance.


I think the following may be safe. It counts only the clean scanned  
vnodes
and does not evaluate the vp, that indeed may be reclaimed, after  
the sleep.


I never booted with the change.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..e686b97 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1176,6 +1176,7 @@ ffs_sync(mp, waitfor, td)
struct ufsmount *ump = VFSTOUFS(mp);
struct fs *fs;
int error, count, wait, lockreq, allerror = 0;
+   int yield_count;
int suspend;
int suspended;
int secondary_writes;
@@ -1216,6 +1217,7 @@ loop:
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);

+   yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,11 @@ loop:
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   yield_count = 0;
+   msleep(&yield_count, MNT_MTX(mp), PZERO,
+   "ffspause", 1);
+   }
continue;
}
MNT_IUNLOCK(mp);




Re: Packet loss every 30.999 seconds

2007-12-21 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071219 09:12] wrote:
> > >Try it with "find / -type f >/dev/null" to duplicate the problem  
> > >almost
> > >instantly.
> > 
> > I was able to verify last night that (cd /; tar -cpf -) > all.tar would
> > trigger the problem.  I'm working getting a test running with
> > David's ffs_sync() workaround now, adding a few counters there should
> > get this narrowed down a little more.
> 
>Unfortunately, the version of the patch that I sent out isn't going to
> help your problem. It needs to yield at the top of the loop, but vp isn't
> necessarily valid after the wakeup from the msleep. That's a problem that
> I'm having trouble figuring out a solution to - the solutions that come
> to mind will all significantly increase the overhead of the loop.
>As a very inadequate work-around, you might consider lowering
> kern.maxvnodes to something like 2 - that might be low enough to
> not trigger the problem, but also be high enough to not significantly
> affect system I/O performance.

I apologize for not reading the code as I am swamped, but a technique
that Matt Dillon used for bufs might work here.

Can you use a placeholder vnode as a place to restart the scan?
you might have to mark it special so that other threads/things
(getnewvnode()?) don't molest it, but it can provide for a convenient
restart point.

-- 
- Alfred Perlstein


Re: Packet loss every 30.999 seconds

2007-12-21 Thread David G Lawrence
> >Unfortunately, the version of the patch that I sent out isn't going to
> > help your problem. It needs to yield at the top of the loop, but vp isn't
> > necessarily valid after the wakeup from the msleep. That's a problem that
> > I'm having trouble figuring out a solution to - the solutions that come
> > to mind will all significantly increase the overhead of the loop.
> 
> I apologize for not reading the code as I am swamped, but a technique
> that Matt Dillon used for bufs might work here.
> 
> Can you use a placeholder vnode as a place to restart the scan?
> you might have to mark it special so that other threads/things
> (getnewvnode()?) don't molest it, but it can provide for a convenient
> restart point.

   That was one of the solutions that I considered and rejected since it
would significantly increase the overhead of the loop.
   The solution provided by Kostik Belousov that uses uio_yield looks like
a fine solution. I intend to try it out on some servers RSN.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-21 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > >Unfortunately, the version of the patch that I sent out isn't going to
> > > help your problem. It needs to yield at the top of the loop, but vp isn't
> > > necessarily valid after the wakeup from the msleep. That's a problem that
> > > I'm having trouble figuring out a solution to - the solutions that come
> > > to mind will all significantly increase the overhead of the loop.
> > 
> > I apologize for not reading the code as I am swamped, but a technique
> > that Matt Dillon used for bufs might work here.
> > 
> > Can you use a placeholder vnode as a place to restart the scan?
> > you might have to mark it special so that other threads/things
> > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > restart point.
> 
>That was one of the solutions that I considered and rejected since it
> would significantly increase the overhead of the loop.
>The solution provided by Kostik Belousov that uses uio_yield looks like
> a find solution. I intend to try it out on some servers RSN.

Out of curiosity, why would it make the loop slower?  One
would only add the placeholder when yielding, not for every iteration.



-- 
- Alfred Perlstein


RE: Packet loss every 30.999 seconds

2007-12-21 Thread David Schwartz


I'm just an observer, and I may be confused, but it seems to me that this is
motion in the wrong direction (at least, it's not going to fix the actual
problem). As I understand the problem, once you reach a certain point, the
system slows down *every* 30.999 seconds. Now, it's possible for the code to
cause one slowdown as it cleans up, but why does it need to clean up so much
31 seconds later?

Why not find/fix the actual bug? Then work on getting the yield right if it
turns out there's an actual problem for it to fix.

If the problem is that too much work is being done at a stretch and it turns
out this is because work is being done erroneously or needlessly, fixing
that should solve the whole problem. Doing the work that doesn't need to be
done more slowly is at best an ugly workaround.

Or am I misunderstanding?

DS




Re: Packet loss every 30.999 seconds

2007-12-21 Thread Mark Fullmer
The uio_yield() idea did not work.  Still have the same 31 second interval
packet loss.

Is it safe to assume the vp will be valid after a msleep() or uio_yield()?
If so can we do something a little different:

Currently:

/* this takes too long when list is large */
MNT_VNODE_FOREACH(vp, mp, mvp) {
 do work
}

Why not do this incrementally and call ffs_sync() more often, or
break it out into ffs_isync() (incremental sync)?

static struct vnode *vp;

/* first? */
if (!vp)
  vp = __mnt_vnode_first(&mvp, mp);

for (vcount = 0; vp && (vcount != 500); ++vcount) {
  do work
  vp = __mnt_vnode_next(&mvp, mp);
}

The problem I see with this is a race condition where this list may change
between the incremental calls.

--
mark

On Dec 21, 2007, at 6:43 PM, David G Lawrence wrote:

   Unfortunately, the version of the patch that I sent out isn't going to
help your problem. It needs to yield at the top of the loop, but vp isn't
necessarily valid after the wakeup from the msleep. That's a problem that
I'm having trouble figuring out a solution to - the solutions that come
to mind will all significantly increase the overhead of the loop.


I apologize for not reading the code as I am swamped, but a technique
that Matt Dillon used for bufs might work here.

Can you use a placeholder vnode as a place to restart the scan?
you might have to mark it special so that other threads/things
(getnewvnode()?) don't molest it, but it can provide for a convenient
restart point.


   That was one of the solutions that I considered and rejected since it
would significantly increase the overhead of the loop.
   The solution provided by Kostik Belousov that uses uio_yield looks like
a find solution. I intend to try it out on some servers RSN.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-21 Thread Kostik Belousov
On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:
> 
> 
> I'm just an observer, and I may be confused, but it seems to me that this is
> motion in the wrong direction (at least, it's not going to fix the actual
> problem). As I understand the problem, once you reach a certain point, the
> system slows down *every* 30.999 seconds. Now, it's possible for the code to
> cause one slowdown as it cleans up, but why does it need to clean up so much
> 31 seconds later?
> 
> Why not find/fix the actual bug? Then work on getting the yield right if it
> turns out there's an actual problem for it to fix.
> 
> If the problem is that too much work is being done at a stretch and it turns
> out this is because work is being done erroneously or needlessly, fixing
> that should solve the whole problem. Doing the work that doesn't need to be
> done more slowly is at best an ugly workaround.
> 
> Or am I misunderstanding?

Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
should go in.




Re: Packet loss every 30.999 seconds

2007-12-21 Thread Kostik Belousov
On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:
> The uio_yield() idea did not work.  Still have the same 31 second  
> interval packet loss.
Which patch have you used?

Let's check whether the syncer is the culprit for you.
Please change the value of syncdelay in sys/kern/vfs_subr.c
around line 238 from 30 to some other value, e.g., 45. After that,
check the interval of the effect you have observed.

It would be interesting to check whether completely disabling the syncer
eliminates the packet loss, but such a system has to be operated with
extreme caution.

> 
> Is it safe to assume the vp will be valid after a msleep() or  
> uio_yield()?  If
No.




Re: Packet loss every 30.999 seconds

2007-12-21 Thread Kostik Belousov
On Fri, Dec 21, 2007 at 04:24:32PM -0800, Alfred Perlstein wrote:
> * David G Lawrence <[EMAIL PROTECTED]> [071221 15:42] wrote:
> > > >Unfortunately, the version of the patch that I sent out isn't going 
> > > > to
> > > > help your problem. It needs to yield at the top of the loop, but vp 
> > > > isn't
> > > > necessarily valid after the wakeup from the msleep. That's a problem 
> > > > that
> > > > I'm having trouble figuring out a solution to - the solutions that come
> > > > to mind will all significantly increase the overhead of the loop.
> > > 
> > > I apologize for not reading the code as I am swamped, but a technique
> > > that Matt Dillon used for bufs might work here.
> > > 
> > > Can you use a placeholder vnode as a place to restart the scan?
> > > you might have to mark it special so that other threads/things
> > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > restart point.
> > 
> >That was one of the solutions that I considered and rejected since it
> > would significantly increase the overhead of the loop.
> >The solution provided by Kostik Belousov that uses uio_yield looks like
> > a find solution. I intend to try it out on some servers RSN.
> 
> Out of curiosity's sake, why would it make the loop slower?  one
> would only add the placeholder when yielding, not for every iteration.

The marker is already reinserted into the list on every iteration.




Re: Packet loss every 30.999 seconds

2007-12-21 Thread Mark Fullmer


On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:


On Fri, Dec 21, 2007 at 10:30:51PM -0500, Mark Fullmer wrote:

The uio_yield() idea did not work.  Still have the same 31 second
interval packet loss.

What patch you have used ?


This is hand applied from the diff you sent December 19, 2007 1:24:48 PM EST

sr1400-ar0.eng:/usr/src/sys/ufs/ffs# diff -c ffs_vfsops.c ffs_vfsops.c.orig

*** ffs_vfsops.cFri Dec 21 21:08:39 2007
--- ffs_vfsops.c.orig   Sat Dec 22 00:51:22 2007
***
*** 1107,1113 
struct ufsmount *ump = VFSTOUFS(mp);
struct fs *fs;
int error, count, wait, lockreq, allerror = 0;
-   int yield_count;
int suspend;
int suspended;
int secondary_writes;
--- 1107,1112 
***
*** 1148,1154 
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);

- yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
--- 1147,1152 
***
*** 1166,1177 
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
-   if (yield_count++ == 100) {
-   MNT_IUNLOCK(mp);
-   yield_count = 0;
-   uio_yield();
-   goto relock_mp;
-   }
continue;
}
MNT_IUNLOCK(mp);
--- 1164,1169 
***
*** 1186,1192 
if ((error = ffs_syncvnode(vp, waitfor)) != 0)
allerror = error;
vput(vp);
- relock_mp:
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
--- 1178,1183 





Lets check whether the syncer is the culprit for you.
Please, change the value of the syncdelay at the sys/kern/vfs_subr.c
around the line 238 from 30 to some other value, e.g., 45. After that,
check the interval of the effect you have observed.


Changed it to 13.  Not sure if SYNCER_MAXDELAY also needed to be
increased if syncdelay was increased.

static int syncdelay = 13;  /* max time to delay syncing data */


Test:

; use vnodes
% find / -type f -print > /dev/null

; verify
% sysctl vfs.numvnodes
vfs.numvnodes: 32128

; run packet loss test
now have periodic loss every 13994633us (13.99 seconds).

; reduce # of vnodes with sysctl kern.maxvnodes=1000
test now runs clean.



It would be interesting to check whether completely disabling the syncer
eliminates the packet loss, but such system have to be operated with
extreme caution.






Re: Packet loss every 30.999 seconds

2007-12-21 Thread David G Lawrence
> >What patch you have used ?
> 
> This is hand applied from the diff you sent December 19, 2007 1:24:48  
> PM EST

   Mark, try the previous patch from Kostik - the one that does the one
tick msleep. I think you'll find that that one does work. The likely
problem with the second version is that uio_yield doesn't lower the
priority enough for the other threads to run. Forcing it to msleep for
a tick will eliminate the priority from the consideration.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.


Re: Packet loss every 30.999 seconds

2007-12-21 Thread Kostik Belousov
On Sat, Dec 22, 2007 at 01:28:31AM -0500, Mark Fullmer wrote:
> 
> On Dec 22, 2007, at 12:36 AM, Kostik Belousov wrote:
> >Lets check whether the syncer is the culprit for you.
> >Please, change the value of the syncdelay at the sys/kern/vfs_subr.c
> >around the line 238 from 30 to some other value, e.g., 45. After that,
> >check the interval of the effect you have observed.
> 
> Changed it to 13.  Not sure if SYNCER_MAXDELAY also needed to be
> increased if syncdelay was increased.
> 
> static int syncdelay = 13;  /* max time to delay syncing  
> data */
> 
> Test:
> 
> ; use vnodes
> % find / -type f -print > /dev/null
> 
> ; verify
> % sysctl vfs.numvnodes
> vfs.numvnodes: 32128
> 
> ; run packet loss test
> now have periodic loss every 13994633us (13.99 seconds).
> 
> ; reduce # of vnodes with sysctl kern.maxvnodes=1000
> test now runs clean.
Definitely syncer. 

> >
> >It would be interesting to check whether completely disabling the  
> >syncer
> >eliminates the packet loss, but such system have to be operated with
> >extreme caution.
Ok, no need to do this.

As Bruce Evans noted, there is a vfs_msync() that does almost the same
traversal of the vnodes. It was missed in the previous patch. Try this one.

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 3c2e1ed..6515d6a 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -2967,7 +2967,9 @@ vfs_msync(struct mount *mp, int flags)
 {
struct vnode *vp, *mvp;
struct vm_object *obj;
+   int yield_count;
 
+   yield_count = 0;
MNT_ILOCK(mp);
MNT_VNODE_FOREACH(vp, mp, mvp) {
VI_LOCK(vp);
@@ -2996,6 +2998,12 @@ vfs_msync(struct mount *mp, int flags)
MNT_ILOCK(mp);
} else
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   MNT_IUNLOCK(mp);
+   yield_count = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
}
MNT_IUNLOCK(mp);
 }
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..9e8b887 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1182,6 +1182,7 @@ ffs_sync(mp, waitfor, td)
int secondary_accwrites;
int softdep_deps;
int softdep_accdeps;
+   int yield_count;
struct bufobj *bo;
 
fs = ump->um_fs;
@@ -1216,6 +1217,7 @@ loop:
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);
 
+   yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   MNT_IUNLOCK(mp);
+   yield_count = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
continue;
}
MNT_IUNLOCK(mp);


pgpbeGkWXvOc3.pgp
Description: PGP signature


Re: Packet loss every 30.999 seconds

2007-12-21 Thread David G Lawrence
> I'm just an observer, and I may be confused, but it seems to me that this is
> motion in the wrong direction (at least, it's not going to fix the actual
> problem). As I understand the problem, once you reach a certain point, the
> system slows down *every* 30.999 seconds. Now, it's possible for the code to
> cause one slowdown as it cleans up, but why does it need to clean up so much
> 31 seconds later?
> 
> Why not find/fix the actual bug? Then work on getting the yield right if it
> turns out there's an actual problem for it to fix.
> 
> If the problem is that too much work is being done at a stretch and it turns
> out this is because work is being done erroneously or needlessly, fixing
> that should solve the whole problem. Doing the work that doesn't need to be
> done more slowly is at best an ugly workaround.
> 
> Or am I misunderstanding?

   It's the syncer that is causing the problem, and it runs every 31 seconds.
Historically, the syncer ran every 30 seconds, but things have changed a
bit over time.
   The reason that the syncer takes so much time is that ffs_sync is a bit
stupid in how it works - it loops through all of the vnodes on each ffs
mountpoint (typically almost all of the vnodes in the system) to see if
any of them need to be synced out. This was marginally okay when there
were perhaps a thousand vnodes in the system, but when the maximum number
of vnodes was dramatically increased in FreeBSD some years ago (to
typically tens or hundreds of thousands) and combined with the kernel
threads of FreeBSD 5, this has resulted in some rather bad side effects.
   I think the proper solution would be to create a ffs_sync work list
(another TAILQ/LISTQ), probably with the head in the mountpoint struct,
that has on it any vnodes that need to be synced. Unfortunately, such a
change would be extensive, scattered throughout much of the ufs/ffs code.
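
(Editor's note: the sketch below only illustrates the work-list idea
described above; it is not DGL's actual design nor any committed FreeBSD
code, and all names are invented.  The point is that dirtying a vnode
becomes an O(1) insertion onto a per-mount list, and the syncer then
walks only that short list instead of every vnode on the mount.)

#include <sys/queue.h>
#include <stdio.h>
#include <string.h>

struct fake_vnode {
	int	dirty;
	int	on_worklist;
	TAILQ_ENTRY(fake_vnode) worklist_entry;
};

TAILQ_HEAD(worklist, fake_vnode);

/* Cheap side: called wherever a vnode is first dirtied. */
static void
mark_dirty(struct worklist *wl, struct fake_vnode *vp)
{
	vp->dirty = 1;
	if (!vp->on_worklist) {
		TAILQ_INSERT_TAIL(wl, vp, worklist_entry);
		vp->on_worklist = 1;
	}
}

/* Syncer side: visits only the vnodes that actually need I/O. */
static int
sync_worklist(struct worklist *wl)
{
	struct fake_vnode *vp;
	int synced = 0;

	while ((vp = TAILQ_FIRST(wl)) != NULL) {
		TAILQ_REMOVE(wl, vp, worklist_entry);
		vp->on_worklist = 0;
		vp->dirty = 0;		/* stands in for writing it out */
		synced++;
	}
	return (synced);
}

int
main(void)
{
	struct worklist wl = TAILQ_HEAD_INITIALIZER(wl);
	struct fake_vnode v[4];

	memset(v, 0, sizeof(v));
	mark_dirty(&wl, &v[1]);
	mark_dirty(&wl, &v[3]);
	printf("synced %d of 4 vnodes\n", sync_worklist(&wl));
	return (0);
}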

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-21 Thread David G Lawrence
> As Bruce Evans noted, there is a vfs_msync() that do almost the same
> traversal of the vnodes. It was missed in the previous patch. Try this one.

   I forgot to comment on that when Bruce pointed that out. My solution
has been to comment out the call to vfs_msync. :-) It comes into play
when you have files modified through the mmap interface (kind of rare
on most systems). Obviously I have mixed feelings about vfs_msync, but
I'm not suggesting here that we should get rid of it as any sort of
solution.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-21 Thread David G Lawrence
> > > Can you use a placeholder vnode as a place to restart the scan?
> > > you might have to mark it special so that other threads/things
> > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > restart point.
> > 
> >That was one of the solutions that I considered and rejected since it
> > would significantly increase the overhead of the loop.
> >The solution provided by Kostik Belousov that uses uio_yield looks like
> > a fine solution. I intend to try it out on some servers RSN.
> 
> Out of curiosity's sake, why would it make the loop slower?  one
> would only add the placeholder when yielding, not for every iteration.

   Actually, I misread your suggestion and was thinking marker flag,
rather than placeholder vnode. Sorry about that. The current code
actually already uses a marker vnode. It is hidden and obfuscated in
the MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next
functions, so it should be safe from vnode reclamation/free problems.

-DG

David G. Lawrence
President
Download Technologies, Inc. - http://www.downloadtech.com - (866) 399 8500
The FreeBSD Project - http://www.freebsd.org
Pave the road of life with opportunities.
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Bruce Evans

On Sat, 22 Dec 2007, Kostik Belousov wrote:


On Fri, Dec 21, 2007 at 05:43:09PM -0800, David Schwartz wrote:


I'm just an observer, and I may be confused, but it seems to me that this is
motion in the wrong direction (at least, it's not going to fix the actual
problem). As I understand the problem, once you reach a certain point, the
system slows down *every* 30.999 seconds. Now, it's possible for the code to
cause one slowdown as it cleans up, but why does it need to clean up so much
31 seconds later?


It is just searching for things to clean up, and doing this pessimally due
to unnecessary cache misses and (more recently) introduction of overheads
to handling the case where the mount point is locked into the fast path
where the mount point is not unlocked.

The search every 30 seconds or so is probably more efficient, and is
certainly simpler, than managing the list on every change to every vnode
for every file system.  However, it gives a high latency in non-preemptible
kernels.


Why not find/fix the actual bug? Then work on getting the yield right if it
turns out there's an actual problem for it to fix.


Yielding is probably the correct fix for non-preemptible kernels.  Some
operations just take a long time, but are low priority so they can be
preempted.  This operation is partly under user control, since any user
can call sync(2) and thus generate the latency every  seconds.
But this is no worse than a user generating even larger blocks of latency
by reading huge amounts from /dev/zero.  My old latency workaround for
the latter (and other huge i/o's) is still sort of necessary, though it
now works bogusly (hogticks doesn't work since it is reset on context
switches to interrupt handlers; however, any context switch mostly fixes
the problem).  My old latency workaround only reduces the latency to a
multiple of 1/HZ, so a default of 200 ms, so it still is supposed to allow
latencies much larger than the ones that cause problems here, but its
bogus current operation tends to give latencies of more like 1/HZ which
is short enough when HZ has its default misconfiguration to 1000.

I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.


If the problem is that too much work is being done at a stretch and it turns
out this is because work is being done erroneously or needlessly, fixing
that should solve the whole problem. Doing the work that doesn't need to be
done more slowly is at best an ugly workaround.


Lots of necessary work is being done.


Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
shall go in.


I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.

Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:

% ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(xvp, mp, mvp) {
% ./ufs/ffs/ffs_snapshot.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ffs/ffs_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./ufs/ufs/ufs_quota.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./fs/msdosfs/msdosfs_vfsops.c:MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./fs/coda/coda_subr.c:MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./gnu/fs/ext2fs/ext2_vfsops.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_default.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./kern/vfs_subr.c:MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfs4client/nfs4_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
% ./nfsclient/nfs_subs.c:   MNT_VNODE_FOREACH(vp, mp, nvp) {
% ./nfsclient/nfs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) {

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also
be more places if MNT_RELOAD support were not missing for some file
systems.
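
(Editor's note: for readers following along, MNT_VNODE_FOREACH() is a thin
wrapper, reading roughly as below in sys/sys/mount.h of this era -- quoted
from memory, not verbatim -- which is why a fix done internally, in
__mnt_vnode_first()/__mnt_vnode_next(), covers all 17 call sites at once.)

#define MNT_VNODE_FOREACH(vp, mp, mvp)					\
	for (vp = __mnt_vnode_first(&(mvp), (mp));			\
	    (vp) != NULL; (vp) = __mnt_vnode_next(&(mvp), (mp)))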

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Mark Fullmer


On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:


I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.



The test is done with UDP packets between two servers.  The em
driver is incrementing the received packet count correctly but
the packet is not making it up the network stack.  If
the application was not servicing the socket fast enough I would
expect to see the "dropped due to full socket buffers" (udps_fullsock)
counter incrementing, as shown by netstat -s.
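
(Editor's note: Mark's receiver program is not included in the thread.
The sketch below is a stand-in written to illustrate the counting method
he describes: each datagram carries an incrementing 32-bit sequence
number, and any jump is reported as a window of lost packets.  The port
number, packet layout, and output format are assumptions, not his code.)

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <err.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char **argv)
{
	struct sockaddr_in sin;
	uint32_t seq, expect = 0;
	char buf[1500];
	ssize_t n;
	int s, port;

	port = (argc > 1) ? atoi(argv[1]) : 9999;
	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = htonl(INADDR_ANY);
	if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
		err(1, "bind");
	for (;;) {
		/* First 4 bytes of each datagram: sequence number (network order). */
		if ((n = recv(s, buf, sizeof(buf), 0)) < (ssize_t)sizeof(seq))
			continue;
		memcpy(&seq, buf, sizeof(seq));
		seq = ntohl(seq);
		if (expect != 0 && seq != expect)
			printf("missing: window_start=%u window_end=%u diff=%u\n",
			    expect, seq, seq - expect);
		expect = seq + 1;
	}
}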

I grab a copy of netstat -s, netstat -i, and netstat -m
before and after testing.  Other than the link packets counter,
I haven't seen any other indication of where the packet is getting
lost.  The em driver has a debugging stats option which does not
indicate receive side overflows.

I'm fairly certain this same behavior can be seen with the fxp
driver, but I'll need to double check.

These are results I sent a few days ago after setting up a
test without an ethernet switch between the sender and receiver.

The switch was originally used to verify the sender was actually
transmitting.  With spanning tree, ethernet keepalives, and CDP
(cisco proprietary neighbor protocol) disabled and static ARP entries
on the sender and receiver I can account for all packets making
it to the receiver.

##


Back to back test with no ethernet switch between two em interfaces,
same result.  The receiving side has been up > 1 day and exhibits
the problem.  These are also two different servers.  The small
gettimeofday() syscall tester also shows the same ~30
second pattern of high latency between syscalls.
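
(Editor's note: the small gettimeofday() syscall tester is not shown in
the thread either; a minimal reconstruction along the lines below is
enough to reproduce the effect.  It busy-loops on gettimeofday() and
prints a line whenever two consecutive calls are more than a threshold
apart.  The exact output format of Mark's program is assumed.)

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	struct timeval tv;
	long long now, prev, last_report = 0;
	long threshold;

	threshold = (argc > 1) ? atol(argv[1]) : 5000;	/* usecs */
	gettimeofday(&tv, NULL);
	prev = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
	for (;;) {
		gettimeofday(&tv, NULL);
		now = (long long)tv.tv_sec * 1000000 + tv.tv_usec;
		if (now - prev > threshold) {
			/* absolute usecs, gap, and time since the last report */
			printf("%lld %lld %lld\n", now, now - prev,
			    last_report ? now - last_report : 0);
			last_report = now;
		}
		prev = now;
	}
}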

Receiver test application reports 3699 missed packets

Sender netstat -i:

(before test)
em1   1500                00:04:23:cf:51:b7        20     0  15975785     0     0
em1   1500  10.1/24       10.1.0.2                  37     -  15975801     -     -


(after test)
em1   1500                00:04:23:cf:51:b7        22     0  25975822     0     0
em1   1500  10.1/24       10.1.0.2                  39     -  25975838     -     -


total IP packets sent during test = end - start
25975838 - 15975801 = 10000037 (expected: 10,000,000 test packets + overhead)


Receiver netstat -i:

(before test)
em1   1500                00:04:23:c4:cc:89  15975785     0        21     0     0
em1   1500  10.1/24       10.1.0.1            15969626     -        19     -     -


(after test)
em1   1500                00:04:23:c4:cc:89  25975822     0        23     0     0
em1   1500  10.1/24       10.1.0.1            25965964     -        21     -     -


total ethernet frames received during test = end - start
25975822 - 15975785 = 10000037 (as expected)

total IP packets processed during test = end - start
25965964 - 15969626 = 9996338 (expecting 10000037)

Missed packets = expected - received
10000037 - 9996338 = 3699

netstat -i accounts for the 3699 missed packets also reported by the
application

Looking closer at the tester output again shows the periodic
~30 second windows of packet loss.

There's a second problem here in that packets are just disappearing
before they make it to ip_input(), or there's a dropped packets
counter I've not found yet.

I can provide remote access to anyone who wants to take a look, this
is very easy to duplicate.  The ~ 1 day uptime before the behavior
surfaces is not making this easy to isolate.



___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Mark Fullmer

This appears to work.  No packet loss with vfs.numvnodes
at 32132, 16K PPS test with 1 million packets.

I'll run some additional tests bringing vfs.numvnodes
closer to kern.maxvnodes.

On Dec 22, 2007, at 2:03 AM, Kostik Belousov wrote:



As Bruce Evans noted, there is a vfs_msync() that does almost the same
traversal of the vnodes. It was missed in the previous patch. Try this one.


diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 3c2e1ed..6515d6a 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -2967,7 +2967,9 @@ vfs_msync(struct mount *mp, int flags)
 {
struct vnode *vp, *mvp;
struct vm_object *obj;
+   int yield_count;

+   yield_count = 0;
MNT_ILOCK(mp);
MNT_VNODE_FOREACH(vp, mp, mvp) {
VI_LOCK(vp);
@@ -2996,6 +2998,12 @@ vfs_msync(struct mount *mp, int flags)
MNT_ILOCK(mp);
} else
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   MNT_IUNLOCK(mp);
+   yield_count = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
}
MNT_IUNLOCK(mp);
 }
diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index cbccc62..9e8b887 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -1182,6 +1182,7 @@ ffs_sync(mp, waitfor, td)
int secondary_accwrites;
int softdep_deps;
int softdep_accdeps;
+   int yield_count;
struct bufobj *bo;

fs = ump->um_fs;
@@ -1216,6 +1217,7 @@ loop:
softdep_get_depcounts(mp, &softdep_deps, &softdep_accdeps);
MNT_ILOCK(mp);

+   yield_count = 0;
MNT_VNODE_FOREACH(vp, mp, mvp) {
/*
 * Depend on the mntvnode_slock to keep things stable enough
@@ -1233,6 +1235,12 @@ loop:
(IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) == 0 &&
vp->v_bufobj.bo_dirty.bv_cnt == 0)) {
VI_UNLOCK(vp);
+   if (yield_count++ == 500) {
+   MNT_IUNLOCK(mp);
+   yield_count = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
continue;
}
MNT_IUNLOCK(mp);


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Kostik Belousov
On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> >Yes, rewriting the syncer is the right solution. It probably cannot be done
> >quickly enough. If the yield workaround provides mitigation for now, it
> >shall go in.
> 
> I don't think rewriting the syncer just for this is the right solution.
> Rewriting the syncer so that it schedules actual i/o more efficiently
> might involve a solution.  Better scheduling would probably take more
> CPU and increase the problem.
I think that we can easily predict what vnode(s) become dirty at the
places where we do vn_start_write().

> 
> Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
> needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
> There are 4 places in vfs and 13 places in 6 file systems:
> 
> % ./ufs/ffs/ffs_snapshot.c:   MNT_VNODE_FOREACH(xvp, mp, mvp) {
> % ./ufs/ffs/ffs_snapshot.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ffs/ffs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ffs/ffs_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./ufs/ufs/ufs_quota.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./fs/msdosfs/msdosfs_vfsops.c:  MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./fs/coda/coda_subr.c:  MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./gnu/fs/ext2fs/ext2_vfsops.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./gnu/fs/ext2fs/ext2_vfsops.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_default.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_subr.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./kern/vfs_subr.c:  MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./nfs4client/nfs4_vfsops.c: MNT_VNODE_FOREACH(vp, mp, mvp) {
> % ./nfsclient/nfs_subs.c: MNT_VNODE_FOREACH(vp, mp, nvp) {
> % ./nfsclient/nfs_vfsops.c:   MNT_VNODE_FOREACH(vp, mp, mvp) {
> 
> Only file systems that support writing need it (for VOP_SYNC() and for
> MNT_RELOAD), else there would be many more places.  There would also
> be more places if MNT_RELOAD support were not missing for some file
> systems.

Ok, since you talked about this first :). I already made the following
patch, but did not publish it since I still have not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock.
It shall be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.

diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
mtx_assert(MNT_MTX(mp), MA_OWNED);
 
KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+   if ((*mvp)->v_yield++ == 500) {
+   MNT_IUNLOCK(mp);
+   (*mvp)->v_yield = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
while (vp != NULL && vp->v_type == VMARKER)
vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
struct socket   *vu_socket; /* v unix domain net (VSOCK) */
struct cdev *vu_cdev;   /* v device (VCHR, VBLK) */
struct fifoinfo *vu_fifoinfo;   /* v fifo (VFIFO) */
+   int vu_yield;   /*   yield count (VMARKER) */
} v_un;
 
/*
@@ -185,6 +186,7 @@ struct vnode {
 #definev_socketv_un.vu_socket
 #definev_rdev  v_un.vu_cdev
 #definev_fifoinfo  v_un.vu_fifoinfo
+#definev_yield v_un.vu_yield
 
 /* XXX: These are temporary to avoid a source sweep at this time */
 #define v_object   v_bufobj.bo_object


pgpCKEb1u3wW1.pgp
Description: PGP signature


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Bruce Evans

On Sat, 22 Dec 2007, Kostik Belousov wrote:


On Sun, Dec 23, 2007 at 04:08:09AM +1100, Bruce Evans wrote:

On Sat, 22 Dec 2007, Kostik Belousov wrote:

Yes, rewriting the syncer is the right solution. It probably cannot be done
quickly enough. If the yield workaround provides mitigation for now, it
shall go in.


I don't think rewriting the syncer just for this is the right solution.
Rewriting the syncer so that it schedules actual i/o more efficiently
might involve a solution.  Better scheduling would probably take more
CPU and increase the problem.

I think that we can easily predict what vnode(s) become dirty at the
places where we do vn_start_write().


This works for writes to regular files at most.  There are also reads
(for ffs, these set IN_ATIME unless the file system is mounted with
noatime) and directory operations.  By grepping for IN_CHANGE, I get
78 places in ffs alone where dirtying of the inode occurs or is scheduled
to occur (ffs = /sys/ufs).  The efficiency of "marking" timestamps,
especially for atimes, depends on just setting a flag in normal operation
and picking up coalesced settings of the flag later, often at sync time
by scanning all vnodes.
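
(Editor's note: a tiny self-contained illustration of the "set a flag now,
pick it up at sync time" pattern Bruce describes.  The names mimic the ufs
inode flags that appear in the ffs_sync() hunk quoted earlier, but the
values and structure here are simplified stand-ins, not the real
ufs/ufs/inode.h definitions.)

#include <stdio.h>

#define IN_ACCESS	0x0001	/* access time update requested */
#define IN_CHANGE	0x0002	/* inode change time update requested */
#define IN_UPDATE	0x0004	/* modification time update requested */
#define IN_MODIFIED	0x0008	/* inode has been written to */

struct fake_inode {
	int	i_flag;
};

/* Cheap per-operation side: e.g. reading a file just ORs in a flag. */
static void
mark_atime(struct fake_inode *ip)
{
	ip->i_flag |= IN_ACCESS;
}

/* Expensive side, run by the syncer's scan: does this inode need I/O? */
static int
inode_needs_sync(const struct fake_inode *ip)
{
	return ((ip->i_flag &
	    (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) != 0);
}

int
main(void)
{
	struct fake_inode a = { 0 }, b = { 0 };

	mark_atime(&a);
	printf("a needs sync: %d, b needs sync: %d\n",
	    inode_needs_sync(&a), inode_needs_sync(&b));
	return (0);
}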


Note that MNT_VNODE_FOREACH() is used 17 times, so the yielding fix is
needed in 17 places if it isn't done internally in MNT_VNODE_FOREACH().
There are 4 places in vfs and 13 places in 6 file systems:
...

Only file systems that support writing need it (for VOP_SYNC() and for
MNT_RELOAD), else there would be many more places.  There would also
be more places if MNT_RELOAD support were not missing for some file
systems.


Ok, since you talked about this first :). I already made the following
patch, but did not publish it since I still have not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock.
It shall be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.


Good.  I'd still like to unobfuscate the function call.


diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
mtx_assert(MNT_MTX(mp), MA_OWNED);

KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+   if ((*mvp)->v_yield++ == 500) {
+   MNT_IUNLOCK(mp);
+   (*mvp)->v_yield = 0;
+   uio_yield();


Another unobfuscation is to not name this uio_yield().


+   MNT_ILOCK(mp);
+   }
vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
while (vp != NULL && vp->v_type == VMARKER)
vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
struct socket   *vu_socket; /* v unix domain net (VSOCK) */
struct cdev *vu_cdev;   /* v device (VCHR, VBLK) */
struct fifoinfo *vu_fifoinfo;   /* v fifo (VFIFO) */
+   int vu_yield;   /*   yield count (VMARKER) */
} v_un;

/*


Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-22 Thread Alfred Perlstein
* David G Lawrence <[EMAIL PROTECTED]> [071221 23:31] wrote:
> > > > Can you use a placeholder vnode as a place to restart the scan?
> > > > you might have to mark it special so that other threads/things
> > > > (getnewvnode()?) don't molest it, but it can provide for a convenient
> > > > restart point.
> > > 
> > >That was one of the solutions that I considered and rejected since it
> > > would significantly increase the overhead of the loop.
> > >The solution provided by Kostik Belousov that uses uio_yield looks like
> > > a fine solution. I intend to try it out on some servers RSN.
> > 
> > Out of curiosity's sake, why would it make the loop slower?  one
> > would only add the placeholder when yielding, not for every iteration.
> 
>Actually, I misread your suggestion and was thinking marker flag,
> rather than placeholder vnode. Sorry about that. The current code
> actually already uses a marker vnode. It is hidden and obfuscated in
> the MNT_VNODE_FOREACH macro, further hidden in the __mnt_vnode_first/next
> functions, so it should be safe from vnode reclamation/free problems.

That level of obscuring is a bit worrisome.

Yes, I did mean placeholder vnode.

Even so, is it of utility or not?

Or is it already being used and I'm missing something and should
just "utsl" at this point?

-- 
- Alfred Perlstein
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-24 Thread Kostik Belousov
On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:
> On Sat, 22 Dec 2007, Kostik Belousov wrote:
> >Ok, since you talked about this first :). I already made the following
> >patch, but did not publish it since I still have not inspected all
> >callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock.
> >It shall be safe, but better to check. Also, I postponed the check
> >until it was reported that yielding does solve the original problem.
> 
> Good.  I'd still like to unobfuscate the function call.
What do you mean there ? 

> Putting the count in the union seems fragile at best.  Even if nothing
> can access the marker vnode, you need to context-switch its old contents
> while using it for the count, in case its old contents is used.  Vnode-
> printing routines might still be confused.
Could you, please, describe what you mean by "context-switch" for the
VMARKER?


Mark, could you, please, retest the patch below in your setup ?
I want to put a change or some edition of it into the 7.0 release, and
we need to move fast to do this.

diff --git a/sys/kern/vfs_mount.c b/sys/kern/vfs_mount.c
index 14acc5b..046af82 100644
--- a/sys/kern/vfs_mount.c
+++ b/sys/kern/vfs_mount.c
@@ -1994,6 +1994,12 @@ __mnt_vnode_next(struct vnode **mvp, struct mount *mp)
mtx_assert(MNT_MTX(mp), MA_OWNED);
 
KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch"));
+   if ((*mvp)->v_yield++ == 500) {
+   MNT_IUNLOCK(mp);
+   (*mvp)->v_yield = 0;
+   uio_yield();
+   MNT_ILOCK(mp);
+   }
vp = TAILQ_NEXT(*mvp, v_nmntvnodes);
while (vp != NULL && vp->v_type == VMARKER)
vp = TAILQ_NEXT(vp, v_nmntvnodes);
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index dc70417..6e3119b 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -131,6 +131,7 @@ struct vnode {
struct socket   *vu_socket; /* v unix domain net (VSOCK) */
struct cdev *vu_cdev;   /* v device (VCHR, VBLK) */
struct fifoinfo *vu_fifoinfo;   /* v fifo (VFIFO) */
+   int vu_yield;   /*   yield count (VMARKER) */
} v_un;
 
/*
@@ -185,6 +186,7 @@ struct vnode {
 #definev_socketv_un.vu_socket
 #definev_rdev  v_un.vu_cdev
 #definev_fifoinfo  v_un.vu_fifoinfo
+#definev_yield v_un.vu_yield
 
 /* XXX: These are temporary to avoid a source sweep at this time */
 #define v_object   v_bufobj.bo_object


pgpRF2zEg9a7Q.pgp
Description: PGP signature


Re: Packet loss every 30.999 seconds

2007-12-24 Thread Bruce Evans

On Mon, 24 Dec 2007, Kostik Belousov wrote:


On Sun, Dec 23, 2007 at 10:20:31AM +1100, Bruce Evans wrote:

On Sat, 22 Dec 2007, Kostik Belousov wrote:

Ok, since you talked about this first :). I already made the following
patch, but did not publish it since I still have not inspected all
callers of MNT_VNODE_FOREACH() for safety of dropping the mount interlock.
It shall be safe, but better to check. Also, I postponed the check
until it was reported that yielding does solve the original problem.


Good.  I'd still like to unobfuscate the function call.

What do you mean there ?


Make the loop control and overheads clear by making the function call
explicit, maybe by expanding MNT_VNODE_FOREACH() inline after fixing
the style bugs in it.  Later, fix the code to match the comment again
by not making a function call in the usual case.  This is harder.


Putting the count in the union seems fragile at best.  Even if nothing
can access the marker vnode, you need to context-switch its old contents
while using it for the count, in case its old contents is used.  Vnode-
printing routines might still be confused.

Could you, please, describe what you mean by "context-switch" for the
VMARKER?


Oh, I didn't notice that the marker vnode is out of band (a whole new
vnode is malloced for each marker).  The context switching would be
needed if an ordinary active vnode that uses the union is used as a
marker.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-24 Thread Mark Fullmer


On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:



Mark, could you, please, retest the patch below in your setup ?
I want to put a change or some edition of it into the 7.0 release, and
we need to move fast to do this.


It's building now.  The testing will run overnight.

Your patch to ffs_sync() and vfs_msync() stopped the periodic packet loss,
but other file system activity such as (cd /; tar -cf - .) > /dev/null will
cause dropped packets.  Same behavior, packets never make it up to the
IP layer.

--
mark
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-24 Thread Kostik Belousov
On Mon, Dec 24, 2007 at 08:16:50PM -0500, Mark Fullmer wrote:
> 
> On Dec 24, 2007, at 8:19 AM, Kostik Belousov wrote:
> 
> >
> >Mark, could you, please, retest the patch below in your setup ?
> >I want to put a change or some edition of it into the 7.0 release, and
> >we need to move fast to do this.
> 
> It's building now.  The testing will run overnight.
> 
> Your patch to ffs_sync() and vfs_msync() stopped the periodic packet loss,
> but other file system activity such as (cd /; tar -cf - .) > /dev/null will
> cause dropped packets.  Same behavior, packets never make it up to the
> IP layer.

What fs do you use ? If FFS, are softupdates turned on ? Please, show the
total time spent in the softdepflush process.

Also, try to add the FULL_PREEMPTION kernel config option and report
whether it helps.


pgpoDvYHjDAZK.pgp
Description: PGP signature


Re: Packet loss every 30.999 seconds

2007-12-26 Thread Mark Fullmer


On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:



What fs do you use ? If FFS, are softupdates turned on ? Please, show the
total time spent in the softdepflush process.

Also, try to add the FULL_PREEMPTION kernel config option and report
whether it helps.


FFS with soft updates on all filesystems.

With your latest uio_yield() in MNT_VNODE_FOREACH patch it's a
little harder to provoke packet loss.  Standard nightly
crontabs and a tar -cf - / > /dev/null no longer cause drops.  A
make buildkernel will though.

root  38  0.0  0.0 0 8  ??  DL   Mon08PM   0:04.62 [softdepflush]


Building a new kernel with KTR and FULL_PREEMPTION now.

--
mark
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-26 Thread Kris Kennaway

Mark Fullmer wrote:


On Dec 25, 2007, at 12:27 AM, Kostik Belousov wrote:



What fs do you use ? If FFS, are softupdates turned on ? Please, show the
total time spent in the softdepflush process.

Also, try to add the FULL_PREEMPTION kernel config option and report
whether it helps.


FFS with soft updates on all filesystems.

With your latest uio_yield() in MNT_VNODE_FOREACH patch it's a
little harder to provoke packet loss.  Standard nightly
crontabs and a tar -cf - / > /dev/null no longer cause drops.  A
make buildkernel will though.

root  38  0.0  0.0 0 8  ??  DL   Mon08PM   0:04.62 [softdepflush]


Building a new kernel with KTR and FULL_PREEMPTION now.


FYI FULL_PREEMPTION causes performance loss in other situations.

Kris

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Sat, 22 Dec 2007, Mark Fullmer wrote:


On Dec 22, 2007, at 12:08 PM, Bruce Evans wrote:


I still don't understand the original problem, that the kernel is not
even preemptible enough for network interrupts to work (except in 5.2
where Giant breaks things).  Perhaps I misread the problem, and it is
actually that networking works but userland is unable to run in time
to avoid packet loss.


The test is done with UDP packets between two servers.  The em
driver is incrementing the received packet count correctly but
the packet is not making it up the network stack.  If
the application was not servicing the socket fast enough I would
expect to see the "dropped due to full socket buffers" (udps_fullsock)
counter incrementing, as shown by netstat -s.


I couldn't see any sign of PREEMPTION not working in 6.3-PRERELEASE.
em seemed to keep up with the maximum rate that I can easily generate
(640 kpps with tiny udp packets), though it cannot transmit at more than
400 kpps on the same hardware.  This is without any syncer activity to
cause glitches.  The rest of the system couldn't keep up, and with my
normal configuration of net.isr.direct=1, systat -ip (udps_fullsock)
showed too many packets being dropped, but all the numbers seemed to
add up right.  (I didn't do end-to-end packet counts.  I'm using ttcp
to send and receive packets; the receiver loses so many packets that
it rarely terminates properly, and when it does terminate it always
shows many dropped.)  However, with net.isr.direct=0, packets are dropped
with no sign of the problem except a reduced count of good packets in
systat -ip.

Packet rate counter     net.isr.direct=1   net.isr.direct=0
---------------------   ----------------   ----------------
netstat -I                        639042   643522 (faster later)
systat -ip (total rx)             639042   382567 (dropped many b4 here)
  (UDP total)                     639042   382567
  (udps_fullsock)                 298911   70340
  (diff of prev 2)                340031   312227 (300+k always dropped)
net.isr.count                      small   large (seems to be correct 643k)
net.isr.directed        large (correct?)   no change
net.isr.queued                         0   0
net.isr.drop                           0   0

net.isr.direct=0 is apparently causing dropped packets without even counting
them.  However, the drop seems to be below the netisr level.

More worryingly, with full 1500-byte packets (1472 bytes of data + 28
bytes of IP/UDP header), packets can be sent at a rate of 76 kpps (nearly
950 Mbps) with a load of only 80% on the receiver, yet the ttcp receiver
still drops about 1000 pps due to "socket buffer full".  With
net.isr.direct=0 it drops an additional 700 pps due to this.  Glitches
from sync(2) taking 25 ms increase the loss by about 1000 packets, and
using rtprio for the ttcp receiver doesn't seem to help at all.

In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as "dropped due to full socket buffers".
# 
# Since the packet never makes it to ip_input() I no longer have

# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't
have an option for this).  With the default kern.ipc.maxsockbuf of 256K,
this didn't seem to help.  20MB should work better :-) but I didn't try that.
I don't understand how fast the socket buffer fills up and would have
thought that 256K was enough for tiny packets but not for 1500-byte packets.
There seems to be a general problem that 1Gbps NICs have or should have
rings of size >= 256 or 512 so that they aren't forced to drop packets
when their interrupt handler has a reasonable but larger latency, yet if
we actually use this feature then we flood the upper layers with hundreds
of packets and fill up socket buffers etc. there.
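
(Editor's note: a minimal sketch of the SO_RCVBUF adjustment under
discussion, for anyone repeating the experiment; the 16MB figure is just
an example and error handling is kept to a minimum.)

#include <sys/types.h>
#include <sys/socket.h>
#include <err.h>
#include <stdio.h>

/*
 * Ask for a larger receive buffer on a UDP socket.  The request is
 * limited by kern.ipc.maxsockbuf, so that sysctl must be raised first
 * for large values to be accepted.
 */
static void
set_rcvbuf(int s, int bytes)
{
	socklen_t len = sizeof(bytes);

	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
		warn("setsockopt(SO_RCVBUF, %d)", bytes);
	if (getsockopt(s, SOL_SOCKET, SO_RCVBUF, &bytes, &len) == 0)
		printf("SO_RCVBUF is now %d bytes\n", bytes);
}

int
main(void)
{
	int s;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		err(1, "socket");
	set_rcvbuf(s, 16 * 1024 * 1024);	/* 16MB */
	return (0);
}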

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Fri, 28 Dec 2007, Bruce Evans wrote:


In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as "dropped due to full socket buffers".
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.

I tried using SO_RCVBUF in ttcp (it's an old version of ttcp that doesn't
have an option for this).  With the default kern.ipc.maxsockbuf of 256K,
this didn't seem to help.  20MB should work better :-) but I didn't try that.


I've now tried this.  With kern.ipc.maxsockbuf raised to ~20MB and an
SO_RCVBUF of 0x1000000 (16MB), the "socket buffer full" lossage increases
from ~300 kpps (~47%) to ~450 kpps (~70%) with tiny packets.  I think
this is caused by most accesses to the larger buffer being cache misses
(since the system can't keep up, cache misses make it worse).

However, with 1500-byte packets, the larger buffer reduces the lossage
from 1 kpps (out of 76 kpps) to precisely zero pps, at a cost of only a small
percentage of system overhead (~20% idle to ~18% idle).

The above is with net.isr.direct=1.  With net.isr.direct=0, the loss is
too small to be obvious and is reported as 0, but I don't trust the
report.  ttcp's packet counts indicate losses of a few per million with
direct=0 but none with direct=1.  "while :; do sync; sleep 0.1; done" in the
background causes a loss of about 100 pps with direct=0 and a smaller
loss with direct=1.  Running the ttcp receiver at rtprio 0 doesn't make
much difference to the losses.

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Packet loss every 30.999 seconds

2007-12-27 Thread Bruce Evans

On Fri, 28 Dec 2007, Bruce Evans wrote:


On Fri, 28 Dec 2007, Bruce Evans wrote:


In previous mail, you (Mark) wrote:

# With FreeBSD 4 I was able to run a UDP data collector with rtprio set,
# kern.ipc.maxsockbuf=2048, then use setsockopt() with SO_RCVBUF
# in the application.  If packets were dropped they would show up
# with netstat -s as "dropped due to full socket buffers".
# Since the packet never makes it to ip_input() I no longer have
# any way to count drops.  There will always be corner cases where
# interrupts are lost and drops not accounted for if the adapter
# hardware can't report them, but right now I've got no way to
# estimate any loss.


I found where drops are recorded for the net.isr.direct=0 case.  It is
in net.inet.ip.intr_queue_drops.  The netisr subsystem just calls
IF_HANDOFF(), and IF_HANDOFF() calls _IF_DROP() if the queue fills up.
_IF_DROP(ifq) just increments ifq->ifq_drops.  The usual case for netisrs
is for the queue to be ipintrq for NETISR_IP.  (A sketch of these macros
appears after the list below.)  The following details don't help:

- drops for input queues don't seem to be displayed by any utilities
  (except ones for ipintrq are displayed primitively by
  sysctl net.inet.ip.intr_queue_drops).  netstat and systat only
  display drops for send queues and ip frags.
- the netisr subsystem's drop count doesn't seem to be displayed by any
  utilities except sysctl.  It only counts drops due to there not being
  a queue; other drops are counted by _IF_DROP() in the per-queue counter.
  Users have a hard time integrating all these primitively displayed drop
  counts with other error counters.
- the length of ipintrq defaults to the default ifq length of ipqmaxlen =
  IFQ_MAXLEN = 50.  This is inadequate if there is just one NIC in the
  system that has an rx ring size of >= slightly less than 50.  But 1
  Gbps NICs should have an rx ring size of 256 or 512 (I think the
  size is 256 for em; it is 256 for bge due to bogus configuration of
  hardware that can handle it being 512).  If the larger hardware rx
  ring is actually used, then ipintrq drops are almost ensured in the
  direct=0 case, so using the larger h/w ring is worse than useless
  (it also increases cache misses).  This is for just one NIC.  This
  problem is often limited by handling rx packets in small bursts, at
  a cost of extra overhead.  Interrupt moderation increases it by
  increasing burst sizes.

  This contrasts with the handling of send queues.  Send queues are
  per-interface and most drivers increase the default length from 50
  to their ring size (-1 for bogus reasons).  I think this is only an
  optimization, while a similar change for rx queues is important for
  avoiding packet loss.  For send queues, the ifq acts mainly as a
  primitive implementation of watermarks.  I have found that tx queue
  lengths need to be more like 5000 than 50 or 500 to provide enough
  buffering when applications are delayed by other applications or
  just by sleeping until the next clock tick, and use tx queues of
  length ~2 (a couple of clock ticks at HZ = 100), but now think
  queue lengths should be restricted to more like 50 since long queues
  cannot fit in L2 caches (not to mention they are bad for latency).
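
(Editor's note: the sketch promised above -- from memory, not verbatim
6.x net/if_var.h, so details may differ.  IF_HANDOFF() boils down to:
lock the queue; if _IF_QFULL() then _IF_DROP() and free the mbuf;
otherwise enqueue the mbuf and kick the interface.)

#define	_IF_QFULL(ifq)	((ifq)->ifq_len >= (ifq)->ifq_maxlen)
#define	_IF_DROP(ifq)	((ifq)->ifq_drops++)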

The length of ipintrq can be changed using sysctl
net.inet.ip.intr_queue_maxlen.  Changing it from 50 to 1024 turns most
or all ipintrq drops into "socket buffer full" drops
(640 kpps input packets and 434 kpps socket buffer fulls with direct=0;
 640 kpps input packets and 324 kpps socket buffer fulls with direct=1).

Bruce
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"