Re: panic in RELENG_5 UMA - two new stack traces
Gleb Smirnoff wrote:
>>> How often does it crash? Does debug.mpsafenet=0 increase stability?
>>
>> I can reproduce the crash within 60 seconds of firing off 30+ ping/arp -d
>> scripts, all running in parallel.
>>
>> debug.mpsafenet=0 seems to have solved the problem. I'm running 100+
>> instances of the above script and the system has been stable for over an
>> hour.
>
> Thanks! We definitely see that the bug is a race, not broken logic. I am
> almost sure that you are experiencing the same bug as I described in
> the beginning of the thread.
>
> Although there is no fix available yet for the race between 'arp -d' and
> an outgoing packet, there is one for the race between an incoming ARP
> reply and an outgoing packet. We will probably commit it soon, after
> more review.

Sorry to say, but it looks like debug.mpsafenet=0 reduced the frequency of the problem but did not eliminate it. The system crashed and hung again over the weekend with very little load. There was no kernel panic, so no core files.

I can leave 5.4 on this system for a week or so before installing 4.11, if you want me to continue doing diagnostics on it.

Gary
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: panic in RELENG_5 UMA - two new stack traces
On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
> I spent the day yesterday trying to reproduce the crash that I posted
> last week and you kindly replied to. This is due to the fact that I
> stupidly managed to overwrite the kernel.debug that I used to generate
> the stack trace. Sadly I could not cause the system to crash again with
> the same sb* errors.
>
> I did however remove both the Berkeley Packet Filter and IPFilter from my
> custom kernel to try and isolate the problem. This has caused the crash
> to occur in a different and more reproducible form. I have both
> INVARIANTS and WITNESS enabled, as you can see from my kernel conf.
> which is included at the end of this e-mail.
>
> Below are the latest stack traces (using bge and then fxp NICs), kernel
> conf. and dmesg. Any help would be appreciated. This time I have a copy
> of both the core files and corresponding kernel.debug so I can hopefully
> provide you with any info you need.

How often does it crash? Does debug.mpsafenet=0 increase stability?

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: panic in RELENG_5 UMA - two new stack traces
Gleb Smirnoff wrote:
> On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
>> I spent the day yesterday trying to reproduce the crash that I posted
>> last week and you kindly replied to. This is due to the fact that I
>> stupidly managed to overwrite the kernel.debug that I used to generate
>> the stack trace. Sadly I could not cause the system to crash again with
>> the same sb* errors.
>>
>> I did however remove both the Berkeley Packet Filter and IPFilter from my
>> custom kernel to try and isolate the problem. This has caused the crash
>> to occur in a different and more reproducible form. I have both
>> INVARIANTS and WITNESS enabled, as you can see from my kernel conf.
>> which is included at the end of this e-mail.
>>
>> Below are the latest stack traces (using bge and then fxp NICs), kernel
>> conf. and dmesg. Any help would be appreciated. This time I have a copy
>> of both the core files and corresponding kernel.debug so I can hopefully
>> provide you with any info you need.
>
> How often does it crash? Does debug.mpsafenet=0 increase stability?

I can reproduce the crash within 60 seconds of firing off 30+ ping/arp -d scripts, all running in parallel.

debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ instances of the above script and the system has been stable for over an hour.

As I wanted some background on what debug.mpsafenet=0 does, I did some Googling and found a good write-up here:

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2004-08/2280.html

Thanks,
Gary
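The reproduction described in the message above (30+ ping/arp -d scripts running in parallel) could look roughly like the sketch below. Gary's actual script is not shown in this thread, so everything here is an assumption: the names TARGET, WORKERS, ROUNDS and DRY_RUN are made up for illustration, and the address is a documentation placeholder. By default it only prints the commands it would run; deleting ARP entries for real needs root.

```sh
#!/bin/sh
# Hypothetical stress sketch, NOT the script used in the thread.
# Each worker alternates an outgoing ping with an 'arp -d' of the same host.
TARGET="${TARGET:-192.0.2.1}"   # placeholder (TEST-NET-1); use a host on the local segment
WORKERS="${WORKERS:-30}"        # number of parallel loops
ROUNDS="${ROUNDS:-100}"         # ping/arp -d pairs per worker
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands only, 0 = execute them

# Print the command in dry-run mode, otherwise run it and ignore failures.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@" >/dev/null 2>&1 || true
    fi
}

# One worker: send a packet, then delete the ARP entry so the next
# ping has to resolve the address again.
worker() {
    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        run ping -c 1 "$TARGET"
        run arp -d "$TARGET"
        i=$((i + 1))
    done
}

# Start all workers in the background and wait for them to finish.
stress() {
    w=0
    while [ "$w" -lt "$WORKERS" ]; do
        worker &
        w=$((w + 1))
    done
    wait
}

# Only start when invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    stress
fi
```

Saved under a name of your choice, it would be started with the argument "run"; setting DRY_RUN=0 as root makes it execute the commands instead of printing them. Whether this timing is enough to trigger the panic on a given machine is not something the thread confirms.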
Re: panic in RELENG_5 UMA - two new stack traces
On Fri, Jul 01, 2005 at 01:54:59PM -0400, Gary Mu1der wrote:
>> On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
>>> I spent the day yesterday trying to reproduce the crash that I posted
>>> last week and you kindly replied to. This is due to the fact that I
>>> stupidly managed to overwrite the kernel.debug that I used to generate
>>> the stack trace. Sadly I could not cause the system to crash again with
>>> the same sb* errors.
>>>
>>> I did however remove both the Berkeley Packet Filter and IPFilter from
>>> my custom kernel to try and isolate the problem. This has caused the
>>> crash to occur in a different and more reproducible form. I have both
>>> INVARIANTS and WITNESS enabled, as you can see from my kernel conf.
>>> which is included at the end of this e-mail.
>>>
>>> Below are the latest stack traces (using bge and then fxp NICs), kernel
>>> conf. and dmesg. Any help would be appreciated. This time I have a copy
>>> of both the core files and corresponding kernel.debug so I can
>>> hopefully provide you with any info you need.
>>
>> How often does it crash? Does debug.mpsafenet=0 increase stability?
>
> I can reproduce the crash within 60 seconds of firing off 30+ ping/arp -d
> scripts, all running in parallel.
>
> debug.mpsafenet=0 seems to have solved the problem. I'm running 100+
> instances of the above script and the system has been stable for over an
> hour.

Thanks! We definitely see that the bug is a race, not broken logic. I am almost sure that you are experiencing the same bug as I described in the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and an outgoing packet, there is one for the race between an incoming ARP reply and an outgoing packet. We will probably commit it soon, after more review.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: panic in RELENG_5 UMA - two new stack traces
Gleb Smirnoff wrote:
>> I can reproduce the crash within 60 seconds of firing off 30+ ping/arp -d
>> scripts, all running in parallel.
>>
>> debug.mpsafenet=0 seems to have solved the problem. I'm running 100+
>> instances of the above script and the system has been stable for over an
>> hour.
>
> Thanks! We definitely see that the bug is a race, not broken logic. I am
> almost sure that you are experiencing the same bug as I described in
> the beginning of the thread.
>
> Although there is no fix available yet for the race between 'arp -d' and
> an outgoing packet, there is one for the race between an incoming ARP
> reply and an outgoing packet. We will probably commit it soon, after
> more review.

Is this bug specific to using arp -d, or does it look like the arp -d tests identify a bug that might cause TCP/IP-related crashes with other types of real-world network traffic?

To rephrase: Does it look like fixing this bug may fix a lot of the network-related crashes a number of people have reported?

Thanks,
Gary
Re: panic in RELENG_5 UMA - two new stack traces
On Fri, Jul 01, 2005 at 04:32:38PM -0400, Gary Mu1der wrote:
>>> I can reproduce the crash within 60 seconds of firing off 30+ ping/arp
>>> -d scripts, all running in parallel.
>>>
>>> debug.mpsafenet=0 seems to have solved the problem. I'm running 100+
>>> instances of the above script and the system has been stable for over
>>> an hour.
>>
>> Thanks! We definitely see that the bug is a race, not broken logic. I am
>> almost sure that you are experiencing the same bug as I described in
>> the beginning of the thread.
>>
>> Although there is no fix available yet for the race between 'arp -d' and
>> an outgoing packet, there is one for the race between an incoming ARP
>> reply and an outgoing packet. We will probably commit it soon, after
>> more review.
>
> Is this bug specific to using arp -d, or does it look like the
> arp -d tests identify a bug that might cause TCP/IP-related crashes
> with other types of real-world network traffic?
>
> To rephrase: Does it look like fixing this bug may fix a lot of the
> network-related crashes a number of people have reported?

See above in the thread. We have two races: one that can fire at any time in runtime, and we are going to fix it. The other, with 'arp -d', is not fixed yet.

I am not sure how many reports of network-related panics were related to this race. Let's fix it and see. You can patch your boxes with the patch and see whether they are more stable in runtime.

--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Re: panic in RELENG_5 UMA - two new stack traces
Gleb,

Thank you very much for your reply.

I spent the day yesterday trying to reproduce the crash that I posted last week and you kindly replied to. This is due to the fact that I stupidly managed to overwrite the kernel.debug that I used to generate the stack trace. Sadly I could not cause the system to crash again with the same sb* errors.

I did however remove both the Berkeley Packet Filter and IPFilter from my custom kernel to try and isolate the problem. This has caused the crash to occur in a different and more reproducible form. I have both INVARIANTS and WITNESS enabled, as you can see from my kernel conf. which is included at the end of this e-mail.

Below are the latest stack traces (using bge and then fxp NICs), kernel conf. and dmesg. Any help would be appreciated. This time I have a copy of both the core files and corresponding kernel.debug so I can hopefully provide you with any info you need.

d5# uname -a
FreeBSD d5.bidx.com 5.4-RELEASE FreeBSD 5.4-RELEASE #12: Tue Jun 28 09:19:34 EDT 2005 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/DB-DUAL-AMD64-RAID5 amd64

Here is a stack trace when I am using the bge NIC driver (which has been reported on the freebsd-amd64 list as being unstable under load):

d5# kgdb kernel.debug.20 vmcore.20
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd".
#0  doadump () at pcpu.h:167
167     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt full
#0  doadump () at pcpu.h:167
No locals.
#1  0x80241dc9 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:410
        _ep = (struct eventhandler_entry *) 0x0
        _el = (struct eventhandler_list *) 0xff829e00
        first_buf_printf = 1
#2  0x8024185b in panic (
    fmt=0x803b35a8 "Duplicate free of item %p from zone %p(%s)\n")
    at /usr/src/sys/kern/kern_shutdown.c:566
        bootopt = 260
        newpanic = 0
        ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0xb3431ad0, reg_save_area = 0xb34319f0}}
        buf = "Duplicate free of item 0xff00d318bb00 from zone 0xff00f3fe46c0(Mbuf)\n", '\0' repeats 178 times
#3  0x8031f2e8 in uma_dbg_free (zone=0xff00f3fe46c0, slab=0xff00d318bf50,
    item=0xff00d318bb00) at /usr/src/sys/vm/uma_dbg.c:301
        keg = 0xff00f3fde000
        freei = 11
#4  0x8031d720 in uma_zfree_arg (zone=0xff00f3fe46c0, item=0xff00d318bb00,
    udata=0x0) at /usr/src/sys/vm/uma_core.c:2273
        keg = 0xff00f3fde000
        cache = 0xff00f3fe4740
        bucket = 0x9
        bflags = 0
        skip = SKIP_DTOR
#5  0x8027f5d1 in m_freem (mb=0x0) at uma.h:304
No locals.
#6  0x801d424e in bge_intr (xsc=0x0) at /usr/src/sys/dev/bge/if_bge.c:2862
        sc = (struct bge_softc *) 0x80843000
        status = 0
#7  0x8022c899 in ithread_loop (arg=0xff022300)
    at /usr/src/sys/kern/kern_intr.c:547
        ih = (struct intrhand *) 0xffa1eb80
        p = (struct proc *) 0xff00ec16f8b8
        count = 0
        warming = 0
        warned = 0
        __func__ = "ithread_loop"
#8  0x8022b8d3 in fork_exit (
    callout=0x8022c7c0 ithread_loop, arg=0xff022300, frame=0xb3431c50)
    at /usr/src/sys/kern/kern_fork.c:791
        p = (struct proc *) 0xff00ec16f8b8
#9  0x8032879e in fork_trampoline ()
    at /usr/src/sys/amd64/amd64/exception.S:296
No locals.
#10 0x in ?? ()
No symbol table info available.
#11 0x in ?? ()
No symbol table info available.
#12 0x0001 in ?? ()
No symbol table info available.
#13 0x in ?? ()
No symbol table info available.
#14 0x in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.
#16 0x in ?? ()
No symbol table info available.
#17 0x in ?? ()
No symbol table info available.
#18 0x in ?? ()
No symbol table info available.
#19 0x in ?? ()
No symbol table info available.
#20 0x in ?? ()
No symbol table info available.
#21 0x in ?? ()
No symbol table info available.
#22 0x in ?? ()
No symbol table info available.
#23 0x in ?? ()
No symbol table info available.
#24 0x in ?? ()
No symbol table info available.
#25 0x in ?? ()
No symbol table info available.
#26