Re: panic in RELENG_5 UMA - two new stack traces

2005-07-05 Thread Gary Mulder

Gleb Smirnoff wrote:

G How often does it crash? Does debug.mpsafenet=0 increase stability?
G 
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.


Thanks! We definitely see that the bug is a race, not broken logic. I am
almost sure that you are experiencing the same bug as I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and an
outgoing packet, there is one for the race between an incoming ARP reply and
an outgoing packet. We will probably commit it soon, after more review.


Sorry to say, but it looks like debug.mpsafenet=0 reduced the frequency 
of the problem, but did not eliminate it. The system crashed and hung 
again over the weekend with very little load. There was no kernel panic, 
so no core files.


I can leave 5.4 on this system for a week or so before installing 4.11, 
if you want me to continue doing diagnostics on it.


Gary

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mulder wrote:
G I spent the day yesterday trying to reproduce the crash that I posted 
G last week and you kindly replied to. This is due to the fact that I 
G stupidly managed to overwrite the kernel.debug that I used to generate 
G the stack trace. Sadly I could not cause the system to crash again with 
G the same sb* errors.
G 
G I did however remove both the Berkeley Packet Filter and IPFilter from my
G custom kernel to try and isolate the problem. This has caused the crash 
G to occur in a different and more reproducible form. I have both 
G INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
G which is included at the end of this e-mail.
G 
G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G of both the core files and corresponding kernel.debug so I can hopefully 
G provide you with any info you need.

How often does it crash? Does debug.mpsafenet=0 increase stability?

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gary Mulder

Gleb Smirnoff wrote:

On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mulder wrote:
G I spent the day yesterday trying to reproduce the crash that I posted 
G last week and you kindly replied to. This is due to the fact that I 
G stupidly managed to overwrite the kernel.debug that I used to generate 
G the stack trace. Sadly I could not cause the system to crash again with 
G the same sb* errors.
G 
G I did however remove both the Berkeley Packet Filter and IPFilter from my
G custom kernel to try and isolate the problem. This has caused the crash 
G to occur in a different and more reproducible form. I have both 
G INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
G which is included at the end of this e-mail.
G 
G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G of both the core files and corresponding kernel.debug so I can hopefully 
G provide you with any info you need.


How often does it crash? Does debug.mpsafenet=0 increase stability?


I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
-d scripts, all running in parallel.


debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
instances of the above script and the system has been stable for over an 
hour.
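
The script itself is not reproduced in this part of the thread. As a rough
sketch only, the parallel ping / arp -d load described above might be driven
by something like the following. The host addresses, worker counts and
function names are all assumptions made for illustration, and the real
commands need root privileges and a reachable neighbour on the local network.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(host):
    """Commands for one iteration: send one ping, then delete the ARP entry."""
    return [
        ["ping", "-c", "1", host],  # forces ARP resolution plus an outgoing packet
        ["arp", "-d", host],        # removes the ARP entry again
    ]


def worker(host, iterations, run=subprocess.call):
    """Issue ping / arp -d back to back; return how many commands were issued."""
    issued = 0
    for _ in range(iterations):
        for cmd in build_commands(host):
            run(cmd)
            issued += 1
    return issued


def stress(hosts, workers, iterations, run=subprocess.call):
    """Start `workers` parallel loops, spreading them round-robin over `hosts`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(worker, hosts[i % len(hosts)], iterations, run)
            for i in range(workers)
        ]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # 192.0.2.10 is a documentation address, not a host from the thread.
    stress(["192.0.2.10"], workers=3, iterations=1,
           run=lambda cmd: print(shlex.join(cmd)))
```

Passing `run` as a parameter keeps the sketch safe to try: the default
executes the commands, while the dry run above only prints them.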


As I wanted some background on what debug.mpsafenet=0 does, I did some 
Googling and found a good write up here:


http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2004-08/2280.html

Thanks,
Gary



Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Fri, Jul 01, 2005 at 01:54:59PM -0400, Gary Mulder wrote:
G On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mulder wrote:
G G I spent the day yesterday trying to reproduce the crash that I posted 
G G last week and you kindly replied to. This is due to the fact that I 
G G stupidly managed to overwrite the kernel.debug that I used to generate 
G G the stack trace. Sadly I could not cause the system to crash again with 
G G the same sb* errors.
G G 
G G I did however remove both the Berkeley Packet Filter and IPFilter from my
G G custom kernel to try and isolate the problem. This has caused the crash
G G to occur in a different and more reproducible form. I have both
G G INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
G G which is included at the end of this e-mail.
G G 
G G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G G of both the core files and corresponding kernel.debug so I can hopefully
G G provide you with any info you need.
G 
G How often does it crash? Does debug.mpsafenet=0 increase stability?
G 
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.

Thanks! We definitely see that the bug is a race, not broken logic. I am
almost sure that you are experiencing the same bug as I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and an
outgoing packet, there is one for the race between an incoming ARP reply and
an outgoing packet. We will probably commit it soon, after more review.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gary Mulder

Gleb Smirnoff wrote:
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.


Thanks! We definitely see that the bug is a race, not broken logic. I am
almost sure that you are experiencing the same bug as I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and an
outgoing packet, there is one for the race between an incoming ARP reply and
an outgoing packet. We will probably commit it soon, after more review.


Is this bug specific to arp -d, or do the arp -d tests expose a bug
that might also cause TCP/IP-related crashes with other types of
real-world network traffic?


To rephrase: Does it look like fixing this bug may fix a lot of the 
network-related crashes a number of people have reported?


Thanks,
Gary


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Fri, Jul 01, 2005 at 04:32:38PM -0400, Gary Mulder wrote:
G G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G G -d scripts, all running in parallel.
G G 
G G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G G instances of the above script and the system has been stable for over an
G G hour.
G 
G Thanks! We definitely see that the bug is a race, not broken logic. I am
G almost sure that you are experiencing the same bug as I described at
G the beginning of the thread.
G 
G Although there is no fix available yet for the race between 'arp -d' and an
G outgoing packet, there is one for the race between an incoming ARP reply and
G an outgoing packet. We will probably commit it soon, after more review.
G 
G Is this bug specific to arp -d, or do the arp -d tests expose a bug
G that might also cause TCP/IP-related crashes with other types of
G real-world network traffic?
G 
G To rephrase: Does it look like fixing this bug may fix a lot of the 
G network-related crashes a number of people have reported?

See above in the thread. We have two races: one that can fire at any time
at runtime, which we are going to fix, and another involving 'arp -d', which
is not fixed yet.

I am not sure how many of the reported network-related panics were related to
this race. Let's fix it and see. You can apply the patch to your boxes
and see whether they are more stable at runtime.
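
The kernel code paths involved are not shown in this thread, so the following
is only a toy model of how a race of this general kind could end in the
"Duplicate free of item" panic reported here. Every name in it (Zone, Entry,
read_held, release) is hypothetical. The set-based check stands in for the
kind of debug-time bookkeeping a kernel allocator can do; it is not the UMA
implementation. The interleaving is written out by hand so that the result is
deterministic.

```python
class DuplicateFree(Exception):
    """Raised when an item that is not currently allocated is freed."""


class Zone:
    """Toy allocator that remembers which items are currently allocated."""

    def __init__(self, name):
        self.name = name
        self.allocated = set()
        self.next_id = 0

    def alloc(self):
        item = self.next_id
        self.next_id += 1
        self.allocated.add(item)
        return item

    def free(self, item):
        # Debug-style check: freeing an item twice is reported, not ignored.
        if item not in self.allocated:
            raise DuplicateFree(
                f"Duplicate free of item {item} from zone {self.name}")
        self.allocated.remove(item)


class Entry:
    """Hypothetical table entry that holds one queued packet."""

    def __init__(self):
        self.held = None


def read_held(entry):
    """Step 1 of a code path: take a private copy of the shared pointer."""
    return entry.held


def release(zone, entry, snapshot):
    """Step 2 of a code path: free whatever step 1 saw, then clear the slot."""
    if snapshot is not None:
        zone.free(snapshot)
        entry.held = None


def racy_interleaving(zone):
    """Two unlocked paths both read the pointer before either one clears it."""
    entry = Entry()
    entry.held = zone.alloc()
    a = read_held(entry)     # path A
    b = read_held(entry)     # path B runs before A has finished
    release(zone, entry, a)  # first free succeeds
    release(zone, entry, b)  # second free of the same item raises


def serialized(zone):
    """With the two steps made atomic, the second path finds nothing to free."""
    entry = Entry()
    entry.held = zone.alloc()
    a = read_held(entry)
    release(zone, entry, a)
    b = read_held(entry)     # None, because path A already cleared the slot
    release(zone, entry, b)
    return True
```

The serialized variant is loosely analogous to running both paths under a
single lock, which is roughly what debug.mpsafenet=0 is understood to do for
the network stack. It says nothing about which lock the actual fix uses.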

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-06-28 Thread Gary Mulder

Gleb,

Thank you very much for your reply.

I spent the day yesterday trying to reproduce the crash that I posted 
last week and you kindly replied to. This is due to the fact that I 
stupidly managed to overwrite the kernel.debug that I used to generate 
the stack trace. Sadly I could not cause the system to crash again with 
the same sb* errors.


I did however remove both the Berkeley Packet Filter and IPFilter from my
custom kernel to try and isolate the problem. This has caused the crash 
to occur in a different and more reproducible form. I have both 
INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
which is included at the end of this e-mail.


Below are the latest stack traces (using bge and then fxp NICs), kernel 
conf. and dmesg. Any help would be appreciated. This time I have a copy 
of both the core files and corresponding kernel.debug so I can hopefully 
provide you with any info you need.




d5# uname -a
FreeBSD d5.bidx.com 5.4-RELEASE FreeBSD 5.4-RELEASE #12: Tue Jun 28 
09:19:34 EDT 2005 
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/DB-DUAL-AMD64-RAID5  amd64


Here is a stack trace taken while using the bge NIC driver (which, according
to reports on the freebsd-amd64 list, is unstable under load):


d5# kgdb kernel.debug.20 vmcore.20
[GDB will not be able to debug user-mode threads: 
/usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.

Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".
#0  doadump () at pcpu.h:167
167 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt full
#0  doadump () at pcpu.h:167
No locals.
#1  0x80241dc9 in boot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:410
_ep = (struct eventhandler_entry *) 0x0
_el = (struct eventhandler_list *) 0xff829e00
first_buf_printf = 1
#2  0x8024185b in panic (
fmt=0x803b35a8 "Duplicate free of item %p from zone %p(%s)\n")
at /usr/src/sys/kern/kern_shutdown.c:566
bootopt = 260
newpanic = 0
ap = {{gp_offset = 32, fp_offset = 48,
overflow_arg_area = 0xb3431ad0,
reg_save_area = 0xb34319f0}}
buf = "Duplicate free of item 0xff00d318bb00 from zone 0xff00f3fe46c0(Mbuf)\n", '\0' <repeats 178 times>

#3  0x8031f2e8 in uma_dbg_free (zone=0xff00f3fe46c0,
slab=0xff00d318bf50, item=0xff00d318bb00)
at /usr/src/sys/vm/uma_dbg.c:301
keg = 0xff00f3fde000
freei = 11
#4  0x8031d720 in uma_zfree_arg (zone=0xff00f3fe46c0,
item=0xff00d318bb00, udata=0x0) at /usr/src/sys/vm/uma_core.c:2273
keg = 0xff00f3fde000
cache = 0xff00f3fe4740
bucket = 0x9
bflags = 0
skip = SKIP_DTOR
#5  0x8027f5d1 in m_freem (mb=0x0) at uma.h:304
No locals.
#6  0x801d424e in bge_intr (xsc=0x0)
at /usr/src/sys/dev/bge/if_bge.c:2862
sc = (struct bge_softc *) 0x80843000
status = 0
#7  0x8022c899 in ithread_loop (arg=0xff022300)
at /usr/src/sys/kern/kern_intr.c:547
ih = (struct intrhand *) 0xffa1eb80
p = (struct proc *) 0xff00ec16f8b8
count = 0
warming = 0
warned = 0
__func__ = "ithread_loop"
#8  0x8022b8d3 in fork_exit (
callout=0x8022c7c0 <ithread_loop>, arg=0xff022300,
frame=0xb3431c50) at /usr/src/sys/kern/kern_fork.c:791
p = (struct proc *) 0xff00ec16f8b8
#9  0x8032879e in fork_trampoline ()
at /usr/src/sys/amd64/amd64/exception.S:296
No locals.
#10 0x in ?? ()
No symbol table info available.
#11 0x in ?? ()
No symbol table info available.
#12 0x0001 in ?? ()
No symbol table info available.
#13 0x in ?? ()
No symbol table info available.
#14 0x in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.
#16 0x in ?? ()
No symbol table info available.
#17 0x in ?? ()
No symbol table info available.
#18 0x in ?? ()
No symbol table info available.
#19 0x in ?? ()
No symbol table info available.
#20 0x in ?? ()
No symbol table info available.
#21 0x in ?? ()
No symbol table info available.
#22 0x in ?? ()
No symbol table info available.
#23 0x in ?? ()
No symbol table info available.
#24 0x in ?? ()
No symbol table info available.
#25 0x in ?? ()
No symbol table info available.
#26