Re: panic in RELENG_5 UMA - two new stack traces

2005-07-05 Thread Gary Mu1der

Gleb Smirnoff wrote:

G How often does it crash? Does debug.mpsafenet=0 increase stability?
G 
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.


Thanks! We can now see that the bug is definitely a race, not broken logic.
I am almost sure that you are experiencing the same bug I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and
an outgoing packet, there is one for the race between an incoming ARP reply
and an outgoing packet. We will probably commit it soon, after more review.


Sorry to say, but it looks like debug.mpsafenet=0 reduced the frequency 
of the problem, but did not eliminate it. The system crashed and hung 
again over the weekend with very little load. There was no kernel panic, 
so no core files.


I can leave 5.4 on this system for a week or so before installing 4.11, 
if you want me to continue doing diagnostics on it.


Gary

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
G I spent the day yesterday trying to reproduce the crash that I posted 
G last week and you kindly replied to. This is due to the fact that I 
G stupidly managed to overwrite the kernel.debug that I used to generate 
G the stack trace. Sadly I could not cause the system to crash again with 
G the same sb* errors.
G 
G I did however remove both the Berkeley Packet Filter and IPFilter from my
G custom kernel to try and isolate the problem. This has caused the crash 
G to occur in a different and more reproducible form. I have both 
G INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
G which is included at the end of this e-mail.
G 
G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G of both the core files and corresponding kernel.debug so I can hopefully 
G provide you with any info you need.

How often does it crash? Does debug.mpsafenet=0 increase stability?

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gary Mu1der

Gleb Smirnoff wrote:

On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
G I spent the day yesterday trying to reproduce the crash that I posted 
G last week and you kindly replied to. This is due to the fact that I 
G stupidly managed to overwrite the kernel.debug that I used to generate 
G the stack trace. Sadly I could not cause the system to crash again with 
G the same sb* errors.
G 
G I did however remove both the Berkeley Packet Filter and IPFilter from my
G custom kernel to try and isolate the problem. This has caused the crash 
G to occur in a different and more reproducible form. I have both 
G INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
G which is included at the end of this e-mail.
G 
G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G of both the core files and corresponding kernel.debug so I can hopefully 
G provide you with any info you need.


How often does it crash? Does debug.mpsafenet=0 increase stability?


I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
-d scripts, all running in parallel.


debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
instances of the above script and the system has been stable for over an 
hour.


As I wanted some background on what debug.mpsafenet=0 does, I did some 
Googling and found a good write up here:


http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2004-08/2280.html

Thanks,
Gary



Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Fri, Jul 01, 2005 at 01:54:59PM -0400, Gary Mu1der wrote:
G On Tue, Jun 28, 2005 at 11:24:47AM -0400, Gary Mu1der wrote:
G G I spent the day yesterday trying to reproduce the crash that I posted 
G G last week and you kindly replied to. This is due to the fact that I 
G G stupidly managed to overwrite the kernel.debug that I used to generate 
G G the stack trace. Sadly I could not cause the system to crash again with 
G G the same sb* errors.
G G 
G G I did however remove both the Berkeley Packet Filter and IPFilter from my
G G custom kernel to try and isolate the problem. This has caused the crash
G G to occur in a different and more reproducible form. I have both
G G INVARIANTS and WITNESS enabled, as you can see from my kernel conf.
G G which is included at the end of this e-mail.
G G 
G G Below are the latest stack traces (using bge and then fxp NICs), kernel 
G G conf. and dmesg. Any help would be appreciated. This time I have a copy 
G G of both the core files and corresponding kernel.debug so I can hopefully
G G provide you with any info you need.
G 
G How often does it crash? Does debug.mpsafenet=0 increase stability?
G 
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.

Thanks! We can now see that the bug is definitely a race, not broken logic.
I am almost sure that you are experiencing the same bug I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and
an outgoing packet, there is one for the race between an incoming ARP reply
and an outgoing packet. We will probably commit it soon, after more review.
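
The race between the ARP reply path and the output path comes down to two code paths each believing they own the same held packet. A minimal userland sketch of one way to make that handoff exclusive follows. Everything in it is an assumption for illustration: `struct pkt`, `struct arp_entry` and the function names are hypothetical stand-ins, not the kernel's `struct mbuf` or `struct llinfo_arp`, and this is not the patch under review.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a queued packet (not the kernel's struct mbuf). */
struct pkt {
	int id;
};

/* Hypothetical stand-in for an ARP entry with one held packet. */
struct arp_entry {
	pthread_mutex_t lock;	/* assumed lock protecting 'hold' */
	struct pkt *hold;	/* packet waiting for address resolution */
};

/*
 * Detach the held packet while holding the lock. Whoever gets a non-NULL
 * result is the only owner, so the packet can be freed exactly once.
 */
struct pkt *
entry_take_hold(struct arp_entry *e)
{
	struct pkt *p;

	assert(e != NULL);
	pthread_mutex_lock(&e->lock);
	p = e->hold;
	e->hold = NULL;
	pthread_mutex_unlock(&e->lock);
	return (p);
}

/*
 * Output path: queue a new packet and drop the previously held one.
 * Returns 1 if an old packet was freed, 0 otherwise.
 */
int
output_path_replace(struct arp_entry *e, struct pkt *newp)
{
	struct pkt *old;

	pthread_mutex_lock(&e->lock);
	old = e->hold;
	e->hold = newp;
	pthread_mutex_unlock(&e->lock);
	if (old == NULL)
		return (0);
	free(old);
	return (1);
}

/*
 * Input path: a reply arrived, so take the held packet and "send" it.
 * free() stands in for the driver releasing the packet after transmit.
 * Returns 1 if a packet was sent, 0 if nothing was held.
 */
int
input_path_send(struct arp_entry *e)
{
	struct pkt *p = entry_take_hold(e);

	if (p == NULL)
		return (0);
	free(p);
	return (1);
}
```

Because the pointer is swapped out under the lock, the input and output paths can never both free the same packet, whichever one wins.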

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gary Mu1der

Gleb Smirnoff wrote:
G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G -d scripts, all running in parallel.
G 
G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G instances of the above script and the system has been stable for over an 
G hour.


Thanks! We can now see that the bug is definitely a race, not broken logic.
I am almost sure that you are experiencing the same bug I described at
the beginning of the thread.

Although there is no fix available yet for the race between 'arp -d' and
an outgoing packet, there is one for the race between an incoming ARP reply
and an outgoing packet. We will probably commit it soon, after more review.


Is this bug specific to only using arp -d, or does it look like the 
arp -d tests identify a bug that might cause TCP/IP related crashes 
with other types of real-world network traffic?


To rephrase: Does it look like fixing this bug may fix a lot of the 
network-related crashes a number of people have reported?


Thanks,
Gary


Re: panic in RELENG_5 UMA - two new stack traces

2005-07-01 Thread Gleb Smirnoff
On Fri, Jul 01, 2005 at 04:32:38PM -0400, Gary Mu1der wrote:
G G I can reproduce the crash within 60 seconds of firing off 30+ ping/arp 
G G -d scripts, all running in parallel.
G G 
G G debug.mpsafenet=0 seems to have solved the problem. I'm running 100+ 
G G instances of the above script and the system has been stable for over an
G G hour.
G 
G Thanks! We can now see that the bug is definitely a race, not broken logic.
G I am almost sure that you are experiencing the same bug I described at
G the beginning of the thread.
G 
G Although there is no fix available yet for the race between 'arp -d' and
G an outgoing packet, there is one for the race between an incoming ARP reply
G and an outgoing packet. We will probably commit it soon, after more review.
G 
G Is this bug specific to only using arp -d, or does it look like the 
G arp -d tests identify a bug that might cause TCP/IP related crashes 
G with other types of real-world network traffic?
G 
G To rephrase: Does it look like fixing this bug may fix a lot of the 
G network-related crashes a number of people have reported?

See above in the thread. We have two races: one that can fire at any time
at runtime, which we are going to fix, and another triggered by 'arp -d',
which is not fixed yet.

I am not sure how many of the reported network-related panics were related
to this race. Let's fix it and see. You can apply the patch to your boxes
and see whether they are more stable at runtime.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA - two new stack traces

2005-06-28 Thread Gary Mu1der

Gleb,

Thank you very much for your reply.

I spent the day yesterday trying to reproduce the crash that I posted 
last week and you kindly replied to. This is due to the fact that I 
stupidly managed to overwrite the kernel.debug that I used to generate 
the stack trace. Sadly I could not cause the system to crash again with 
the same sb* errors.


I did however remove both the Berkeley Packet Filter and IPFilter from my
custom kernel to try and isolate the problem. This has caused the crash 
to occur in a different and more reproducible form. I have both 
INVARIANTS and WITNESS enabled, as you can see from my kernel conf. 
which is included at the end of this e-mail.


Below are the latest stack traces (using bge and then fxp NICs), kernel 
conf. and dmesg. Any help would be appreciated. This time I have a copy 
of both the core files and corresponding kernel.debug so I can hopefully 
provide you with any info you need.




d5# uname -a
FreeBSD d5.bidx.com 5.4-RELEASE FreeBSD 5.4-RELEASE #12: Tue Jun 28 
09:19:34 EDT 2005 
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/DB-DUAL-AMD64-RAID5  amd64


Here is a stack trace when I am using the bge NIC driver (which I've had 
reports on the freebsd-amd64 list as being unstable under load):


d5# kgdb kernel.debug.20 vmcore.20
[GDB will not be able to debug user-mode threads: 
/usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain 
conditions.

Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as amd64-marcel-freebsd.
#0  doadump () at pcpu.h:167
167 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt full
#0  doadump () at pcpu.h:167
No locals.
#1  0x80241dc9 in boot (howto=260)
at /usr/src/sys/kern/kern_shutdown.c:410
_ep = (struct eventhandler_entry *) 0x0
_el = (struct eventhandler_list *) 0xff829e00
first_buf_printf = 1
#2  0x8024185b in panic (
fmt=0x803b35a8 "Duplicate free of item %p from zone %p(%s)\n")
at /usr/src/sys/kern/kern_shutdown.c:566
bootopt = 260
newpanic = 0
ap = {{gp_offset = 32, fp_offset = 48,
overflow_arg_area = 0xb3431ad0,
reg_save_area = 0xb34319f0}}
buf = "Duplicate free of item 0xff00d318bb00 from zone 0xff00f3fe46c0(Mbuf)\n", '\0' <repeats 178 times>

#3  0x8031f2e8 in uma_dbg_free (zone=0xff00f3fe46c0,
slab=0xff00d318bf50, item=0xff00d318bb00)
at /usr/src/sys/vm/uma_dbg.c:301
keg = 0xff00f3fde000
freei = 11
#4  0x8031d720 in uma_zfree_arg (zone=0xff00f3fe46c0,
item=0xff00d318bb00, udata=0x0) at /usr/src/sys/vm/uma_core.c:2273
keg = 0xff00f3fde000
cache = 0xff00f3fe4740
bucket = 0x9
bflags = 0
skip = SKIP_DTOR
#5  0x8027f5d1 in m_freem (mb=0x0) at uma.h:304
No locals.
#6  0x801d424e in bge_intr (xsc=0x0)
at /usr/src/sys/dev/bge/if_bge.c:2862
sc = (struct bge_softc *) 0x80843000
status = 0
#7  0x8022c899 in ithread_loop (arg=0xff022300)
at /usr/src/sys/kern/kern_intr.c:547
ih = (struct intrhand *) 0xffa1eb80
p = (struct proc *) 0xff00ec16f8b8
count = 0
warming = 0
warned = 0
__func__ = ithread_loop
#8  0x8022b8d3 in fork_exit (
callout=0x8022c7c0 ithread_loop, arg=0xff022300,
frame=0xb3431c50) at /usr/src/sys/kern/kern_fork.c:791
p = (struct proc *) 0xff00ec16f8b8
#9  0x8032879e in fork_trampoline ()
at /usr/src/sys/amd64/amd64/exception.S:296
No locals.
#10 0x in ?? ()
No symbol table info available.
#11 0x in ?? ()
No symbol table info available.
#12 0x0001 in ?? ()
No symbol table info available.
#13 0x in ?? ()
No symbol table info available.
#14 0x in ?? ()
No symbol table info available.
#15 0x in ?? ()
No symbol table info available.
#16 0x in ?? ()
No symbol table info available.
#17 0x in ?? ()
No symbol table info available.
#18 0x in ?? ()
No symbol table info available.
#19 0x in ?? ()
No symbol table info available.
#20 0x in ?? ()
No symbol table info available.
#21 0x in ?? ()
No symbol table info available.
#22 0x in ?? ()
No symbol table info available.
#23 0x in ?? ()
No symbol table info available.
#24 0x in ?? ()
No symbol table info available.
#25 0x in ?? ()
No symbol table info available.
#26 

Re: panic in RELENG_5 UMA

2005-06-27 Thread Gleb Smirnoff
On Fri, Jun 24, 2005 at 03:28:34PM -0400, Gary Mu1der wrote:
G Can someone confirm that the following stack trace is showing the same 
G problem, or not?
G I can reproduce the problem with the custom kernel config included below 
G (which is basically GENERIC stripped of devices I don't have or need and 
G IPFILTER added), but not with a stock GENERIC kernel.
G 
G To cause the crash I'm running 20-30 instances of the following script:
G 
G d5# cat arping.sh
G #!/bin/sh
G 
G while :
G do
G arp -d 192.168.4.$1 > /dev/null 2>&1;
G ping -c 1 -t 1 192.168.4.$1 > /dev/null 2>&1;
G done

When running without INVARIANTS, it is much more difficult to analyze panics.
If this script panics your kernel, then it is very likely
the same problem.

Can you please provide the following info:

G (kgdb) bt
G #0  doadump () at pcpu.h:167
G #1  0x in ?? ()
G #2  0x802557b7 in boot (howto=260) at
G /usr/src/sys/kern/kern_shutdown.c:410
G #3  0x80255fef in panic (fmt=0xff00b5907500  ?6?)
G at /usr/src/sys/kern/kern_shutdown.c:566

(kgdb) p *panicstr

G #4  0x8029ad2a in sbdrop_locked (sb=0xb6274860, len=1146)
G at /usr/src/sys/kern/uipc_socket2.c:1149

(kgdb) frame 4
(kgdb) ls
(kgdb) p *m

G #5  0x8029afe2 in sbflush_locked (sb=0xb6274860)
G at /usr/src/sys/kern/uipc_socket2.c:1116
G #6  0x8029b049 in sbrelease_locked (sb=0xb6274860,
G so=0xff00a0a2a8a0)
G at /usr/src/sys/kern/uipc_socket2.c:564
G #7  0x8029b0d5 in sbrelease (sb=0xb6274860,
G so=0xff00a0a2a8a0)
G at /usr/src/sys/kern/uipc_socket2.c:577
G #8  0x80297b03 in sorflush (so=0xff00a0a2a8a0)
G at /usr/src/sys/kern/uipc_socket.c:1483
G #9  0x80297e42 in sofree (so=0xff00a0a2a8a0) at
G /usr/src/sys/kern/uipc_socket.c:407
G #10 0x80298467 in soclose (so=0xff00a0a2a8a0) at
G /usr/src/sys/kern/uipc_socket.c:485
G #11 0x802847b5 in soo_close (fp=0xff009ca95b60, td=0x0)
G at /usr/src/sys/kern/sys_socket.c:299
G #12 0x8022c2c0 in fdrop_locked (fp=0xff009ca95b60,
G td=0xff00b5907500)
G at file.h:288

(kgdb) frame 12
(kgdb) p *td
(kgdb) p *td->td_proc

G #13 0x8022c40a in closef (fp=0xff009ca95b60,

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA

2005-06-24 Thread Gary Mu1der

All,

Can someone confirm that the following stack trace is showing the same 
problem, or not?


I can reproduce the problem with the custom kernel config included below 
(which is basically GENERIC stripped of devices I don't have or need and 
IPFILTER added), but not with a stock GENERIC kernel.


To cause the crash I'm running 20-30 instances of the following script:

d5# cat arping.sh
#!/bin/sh

while :
do
arp -d 192.168.4.$1 > /dev/null 2>&1;
ping -c 1 -t 1 192.168.4.$1 > /dev/null 2>&1;
done

d5# uname -a
FreeBSD d5.bidx.com 5.4-RELEASE FreeBSD 5.4-RELEASE #6: Thu Jun 23
13:45:20 EDT 2005
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/DB-DUAL-AMD64-RAID5  amd64

d5# kgdb /usr/obj/usr/src/sys/DB-DUAL-AMD64-RAID5/kernel.debug ./vmcore.5
[GDB will not be able to debug user-mode threads:
/usr/lib/libthread_db.so: Undefined symbol ps_pglobal_lookup]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain
conditions.
Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as amd64-marcel-freebsd.
#0  doadump () at pcpu.h:167
167 pcpu.h: No such file or directory.
in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:167
#1  0x in ?? ()
#2  0x802557b7 in boot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:410
#3  0x80255fef in panic (fmt=0xff00b5907500  ë6µ)
at /usr/src/sys/kern/kern_shutdown.c:566
#4  0x8029ad2a in sbdrop_locked (sb=0xb6274860, len=1146)
at /usr/src/sys/kern/uipc_socket2.c:1149
#5  0x8029afe2 in sbflush_locked (sb=0xb6274860)
at /usr/src/sys/kern/uipc_socket2.c:1116
#6  0x8029b049 in sbrelease_locked (sb=0xb6274860,
so=0xff00a0a2a8a0)
at /usr/src/sys/kern/uipc_socket2.c:564
#7  0x8029b0d5 in sbrelease (sb=0xb6274860,
so=0xff00a0a2a8a0)
at /usr/src/sys/kern/uipc_socket2.c:577
#8  0x80297b03 in sorflush (so=0xff00a0a2a8a0)
at /usr/src/sys/kern/uipc_socket.c:1483
#9  0x80297e42 in sofree (so=0xff00a0a2a8a0) at
/usr/src/sys/kern/uipc_socket.c:407
#10 0x80298467 in soclose (so=0xff00a0a2a8a0) at
/usr/src/sys/kern/uipc_socket.c:485
#11 0x802847b5 in soo_close (fp=0xff009ca95b60, td=0x0)
at /usr/src/sys/kern/sys_socket.c:299
#12 0x8022c2c0 in fdrop_locked (fp=0xff009ca95b60,
td=0xff00b5907500)
at file.h:288
#13 0x8022c40a in closef (fp=0xff009ca95b60,
td=0xff00b5907500)
at /usr/src/sys/kern/kern_descrip.c:1920
#14 0x8022e5be in fdfree (td=0xff00b5907500)
at /usr/src/sys/kern/kern_descrip.c:1624
#15 0x80238bd0 in exit1 (td=0xff00b5907500, rv=0)
at /usr/src/sys/kern/kern_exit.c:236
#16 0x8023a04e in sys_exit (td=0x0, uap=0x0) at
/usr/src/sys/kern/kern_exit.c:93
#17 0x8035cd8c in syscall (frame=
  {tf_rdi = 0, tf_rsi = 5263360, tf_rdx = 0, tf_rcx = 34366596768,
tf_r8 = 0, tf_r9 = 140737488350136, tf_rax = 1, tf_rbx = 0, tf_rbp = 3,
tf_r10 = -1099499764224, tf_r11 = 515, tf_r12 = 140737488350376,
tf_r13 = 0, tf_r14 = 0, tf_r15 = 0, tf_trapno = 12,
tf_addr = 34368259080, tf_flags = 0, tf_err = 2, tf_rip = 34366590280,
tf_cs = 43, tf_rflags = 514, tf_rsp = 140737488350296, tf_ss = 35}) at
/usr/src/sys/amd64/amd64/trap.c:771
#18 0x80349f88 in Xfast_syscall () at
/usr/src/sys/amd64/amd64/exception.S:248
#19 0x in ?? ()
#20 0x00505000 in ?? ()
#21 0x in ?? ()
#22 0x00080068a6a0 in ?? ()
#23 0x in ?? ()
#24 0x7fffebb8 in ?? ()
#25 0x0001 in ?? ()
#26 0x in ?? ()
#27 0x0003 in ?? ()
#28 0xffb50600 in ?? ()
#29 0x0203 in ?? ()
#30 0x7fffeca8 in ?? ()
#31 0x in ?? ()
#32 0x in ?? ()
#33 0x in ?? ()
#34 0x000c in ?? ()
#35 0x000800820408 in ?? ()
#36 0x in ?? ()
#37 0x0002 in ?? ()
#38 0x000800688d48 in ?? ()
#39 0x002b in ?? ()
#40 0x0202 in ?? ()
#41 0x7fffec58 in ?? ()
#42 0x0023 in ?? ()
#43 0x7fffe968 in ?? ()
#44 0x0023 in ?? ()
#45 0x in ?? ()
#46 0x in ?? ()
#47 0x in ?? ()
#48 0x in ?? ()
#49 0x in ?? ()
#50 0x in ?? ()
#51 0x in ?? ()
#52 0x in ?? ()
#53 0xa14b4000 in ?? ()
#54 0xb6274c40 in ?? ()
#55 0x0101 in ?? ()
#56 0x in ?? ()
#57 0xff00b536eba0 in ?? ()
#58 0xff00ec19a780 in ?? ()
#59 0xb6274b58 in 

Re: panic in RELENG_5 UMA

2005-06-24 Thread Gary Mu1der
Sorry, I forgot to add that this is a Tyan Thunder K8SPRO w/dual AMD 
Opteron Processors, model no. 246, 4GB of RAM and an Adaptec 2200S RAID 
controller.


The NIC being used is the onboard Broadcom Gigabit Ethernet (bge).

Thanks,
Gary




Re: panic in RELENG_5 UMA

2005-06-23 Thread Gleb Smirnoff
On Wed, Jun 22, 2005 at 03:03:53PM +0200, Andre Oppermann wrote:
A  Fixing this one is harder. We take la from unlocked rtentry obtained via
A  rt_check(), or from arplookup(). The latter drops lock on rtentry, too.
A  Then we do some work and use this la. It may have already been freed in
A  arp_rtrequest(), the RTM_DELETE case.
A  
A  I see two approaches here:
A  
A  1) Protecting llinfo with route lock. In this case we need rt_check()
A  to return locked *rt (just reference won't help). We also need
A  arplookup() to return locked rt. And do not unlock it within all
A  arpresolve() and a big part of in_arpinput() functions.
A 
A I think for 5-stable this is the way to go.

What about fixing it step by step? The patch attached to my previous message
fixes the panic reported by Jeremie, I suppose. It is a race between the
output path and the input path, and it can occur at any time at runtime.

The race that is not fixed by my patch (discussed above) is between the output
path and the RTM_DELETE message. It is less critical: it can occur only when
an administrator runs arp -d.

Can you please review my patch? I think we should commit it first, and then
work on the second race.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA

2005-06-23 Thread Jeremie Le Hen
Gleb,

 What about fixing it step by step? The patch attached to my previous message
 fixes the panic report by Jeremie, I suppose. It is race between output
 path and input path, that can occur anytime in runtime.

FYI, I compiled my kernel with your patch and I have had no panic since
then.  Note that my previous uptime was multiple tens of days and I
haven't done stress tests.  But anyway I think your massively parallel
arp -d/ping tests are far more significant than my box which only
communicates with a couple of settled machines.

Regards,
-- 
Jeremie Le Hen
 jeremie at le-hen dot org  ttz at chchile dot org 


Re: panic in RELENG_5 UMA

2005-06-22 Thread Andre Oppermann
Gleb Smirnoff wrote:
 
 [ cc'ing parties involved in this part of code]
 
 On Tue, Jun 21, 2005 at 01:07:01PM +0400, Gleb Smirnoff wrote:
 T On Tue, Jun 21, 2005 at 09:04:27AM +0200, Jeremie Le Hen wrote:
 T J #25 0xc05a0a0b in m_freem (mb=0x0) at uma.h:304
 T J No locals.
 T J #26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
 T J dst=0xd6d3fa94, desten=0xd6d3fa2c /??]??w??)
 T J at ../../../netinet/if_ether.c:442
 T J la = (struct llinfo_arp *) 0xc1a75a00
 T J sdl = (struct sockaddr_dl *) 0xc2128910
 T J error = -1038972656
 T J rt = (struct rtentry *) 0xc1d44000
 T
 T IMHO, this looks like a race. The route is not locked, when
 T its llinfo is edited.
 T
 T Probably the mbuf was freed when arp reply arrived and la_hold was sent.
 T Look into in_arpinput() near 736:
 T
 T (*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
 T la->la_hold = 0;
 T
 T Yeah, I have just triggered another panic running 15 instances of this
 T script on an SMP box:
 T
 T (
 T while (true); do
 T  arp -d 81.19.64.111 > /dev/null 2>&1;
 T  ping -c 1 -t 1 81.19.64.111 > /dev/null 2>&1;
 T done
 T ) 
 T
 T But my duplicate free is in fxp_txeof(). This means that output thread has
 T won the race.
 
 I suppose that the attached patch closes your race. However, there is still
 a race between RTM_DELETE and the output path. The above script still panics
 the kernel, but in a different place. Output path works with already freed llinfo:
 
 #28 0xc0507000 in m_freem (mb=0x0) at mbuf.h:410
 #29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000, 
 dst=0xe720bb28,
 desten=0xe720bacc uøbÀ+\001) at /usr/src/sys/netinet/if_ether.c:443
 #30 0xc0538078 in ether_output (ifp=0xc2012800, m=0xc25a8000, dst=0xe720bb28, 
 rt0=0xc22fcdec)
 at /usr/src/sys/net/if_ethersubr.c:173
 #31 0xc054b5b4 in ip_output (m=0xc25a8000, opt=0xc25a80ac, ro=0xe720bb24, 
 flags=0x20, imo=0x0, inp=0xc25eb5a0)
 at /usr/src/sys/netinet/ip_output.c:772
 #32 0xc054d36b in rip_output (m=0xc25a8000, so=0x0, dst=0x0) at 
 /usr/src/sys/netinet/raw_ip.c:320
 #33 0xc054de7b in rip_send (so=0xc248c914, flags=0x0, m=0xc25a8000, 
 nam=0xc218d410, control=0x0, td=0xc224d7d0)
 at /usr/src/sys/netinet/raw_ip.c:785
 #34 0xc050a30f in sosend (so=0xc248c914, addr=0xc218d410, uio=0xe720bc3c, 
 top=0xc25a8000, control=0x0, flags=0x0,
 td=0xc224d7d0) at /usr/src/sys/kern/uipc_socket.c:827
 
 (kgdb) frame 29
 #29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000, 
 dst=0xe720bb28,
 desten=0xe720bacc uøbÀ+\001) at /usr/src/sys/netinet/if_ether.c:443
 443 m_freem(la->la_hold);
 (kgdb) p *la
 $3 = {
   la_le = {
 le_next = 0xdeadc0de,
 le_prev = 0xdeadc0de
   },
   la_rt = 0xdeadc0de,
   la_hold = 0xdeadc0de,
   la_preempt = 0xc0de,
   la_asked = 0xdead
 }
 
 Fixing this one is harder. We take la from unlocked rtentry obtained via
 rt_check(), or from arplookup(). The latter drops lock on rtentry, too.
 Then we do some work and use this la. It may have already been freed in
 arp_rtrequest(), the RTM_DELETE case.
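
The 0xdeadc0de values in the `p *la` dump above are what one would expect if a debugging allocator overwrote the structure when it was freed; the 0xc0de and 0xdead values are consistent with two 16-bit fields sharing one such 32-bit word on a little-endian machine. As a rough userland sketch of that idea, the helpers below fill a buffer with the pattern and test for it. The names `junk_fill` and `looks_freed` are hypothetical and are not kernel or UMA functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	JUNK_PATTERN	0xdeadc0deU	/* fill value seen in the dump */

/* Overwrite every complete 32-bit word of buf with the junk pattern. */
void
junk_fill(void *buf, size_t len)
{
	const uint32_t junk = JUNK_PATTERN;
	unsigned char *p = buf;
	size_t i;

	assert(buf != NULL);
	for (i = 0; i + sizeof(junk) <= len; i += sizeof(junk))
		memcpy(p + i, &junk, sizeof(junk));
}

/*
 * Return 1 if every complete 32-bit word of buf holds the junk pattern,
 * which suggests the object was freed before it was inspected.
 * Return 0 otherwise, including for buffers shorter than one word.
 */
int
looks_freed(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t word;
	size_t i;

	if (buf == NULL || len < sizeof(word))
		return (0);
	for (i = 0; i + sizeof(word) <= len; i += sizeof(word)) {
		memcpy(&word, p + i, sizeof(word));
		if (word != JUNK_PATTERN)
			return (0);
	}
	return (1);
}
```

A structure that passes `looks_freed()` is almost certainly stale, so any code path still holding a pointer to it is using memory after free.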
 
 I see two approaches here:
 
 1) Protecting llinfo with route lock. In this case we need rt_check()
 to return locked *rt (just reference won't help). We also need
 arplookup() to return locked rt. And do not unlock it within all
 arpresolve() and a big part of in_arpinput() functions.

I think for 5-stable this is the way to go.

 2) Add mutex to llinfo_arp. I'm afraid this will hurt performance.

The new ARP stuff should fix these issues, however it is not ready yet.
At the moment it looks like it won't make it right away into 6.0 but go
into 7-current and then MFC'd back for 6.1R.
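
To make approach 1 concrete, here is a small userland sketch of a lookup that returns its entry with the lock still held, so that a concurrent delete cannot free the link-layer info while the caller is using it. This is only an illustration under stated assumptions: `struct route`, `route_lookup_locked()`, `route_unlock()` and `route_try_delete()` are hypothetical names, pthread mutexes stand in for the kernel's route lock, and a real delete path would block on the lock rather than use trylock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a route entry carrying link-layer info. */
struct route {
	pthread_mutex_t lock;	/* assumed per-route lock */
	void *llinfo;		/* NULL once the entry has been deleted */
};

/*
 * Look up the route and return it with the lock held.
 * Returns NULL (and holds no lock) if the entry was already deleted.
 */
struct route *
route_lookup_locked(struct route *rt)
{
	assert(rt != NULL);
	pthread_mutex_lock(&rt->lock);
	if (rt->llinfo == NULL) {
		pthread_mutex_unlock(&rt->lock);
		return (NULL);
	}
	return (rt);
}

/* Release a route obtained from route_lookup_locked(). */
void
route_unlock(struct route *rt)
{
	pthread_mutex_unlock(&rt->lock);
}

/*
 * Delete path: clear llinfo only if nobody holds the route lock.
 * Returns 0 on success or EBUSY while a user still holds the lock.
 */
int
route_try_delete(struct route *rt)
{
	int error;

	error = pthread_mutex_trylock(&rt->lock);
	if (error != 0)
		return (error);
	rt->llinfo = NULL;
	pthread_mutex_unlock(&rt->lock);
	return (0);
}
```

The cost of this design is that the lock is held across everything the caller does with the entry, which is why the thread weighs it against a separate mutex on the llinfo itself.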

-- 
Andre


Re: panic in RELENG_5 UMA

2005-06-22 Thread Gleb Smirnoff
On Wed, Jun 22, 2005 at 03:03:53PM +0200, Andre Oppermann wrote:
A  Fixing this one is harder. We take la from unlocked rtentry obtained via
A  rt_check(), or from arplookup(). The latter drops lock on rtentry, too.
A  Then we do some work and use this la. It may have already been freed in
A  arp_rtrequest(), the RTM_DELETE case.
A  
A  I see two approaches here:
A  
A  1) Protecting llinfo with route lock. In this case we need rt_check()
A  to return locked *rt (just reference won't help). We also need
A  arplookup() to return locked rt. And do not unlock it within all
A  arpresolve() and a big part of in_arpinput() functions.
A 
A I think for 5-stable this is the way to go.

I have started working on this. Making arplookup() to return locked rt
looks possible. There are two more questions:

- is it possible to make rt_check() to return locked *rt? This requires
  editing nd6.c, and if_*subr.c. We can't MFC this to RELENG_5.
  Probably, at first step I'll try to avoid changing rt_check and see
  whether changing arplookup() is enough to avoid panics.

- Is the following statement always true?
la->la_rt->rt_llinfo == la
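
One way to probe the second question empirically would be to assert the back-pointer relationship wherever an llinfo is used. The sketch below shows the shape of such a check in plain C. The `_sketch` structures and the `llinfo_*` helpers are hypothetical, reduced to the two pointers in question, and are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct rtentry_sketch;

/* Reduced stand-ins holding only the two pointers under discussion. */
struct llinfo_sketch {
	struct rtentry_sketch *la_rt;		/* back pointer to the route */
};

struct rtentry_sketch {
	struct llinfo_sketch *rt_llinfo;	/* forward pointer to llinfo */
};

/* Return 1 if the route and its llinfo point at each other, else 0. */
int
llinfo_backptr_ok(const struct llinfo_sketch *la)
{
	return (la != NULL && la->la_rt != NULL &&
	    la->la_rt->rt_llinfo == la);
}

/* Link a route and an llinfo in both directions. */
void
llinfo_attach(struct rtentry_sketch *rt, struct llinfo_sketch *la)
{
	assert(rt != NULL && la != NULL);
	rt->rt_llinfo = la;
	la->la_rt = rt;
}

/* Unlink both directions, as a delete path would need to do. */
void
llinfo_detach(struct rtentry_sketch *rt)
{
	assert(rt != NULL);
	if (rt->rt_llinfo != NULL)
		rt->rt_llinfo->la_rt = NULL;
	rt->rt_llinfo = NULL;
}
```

If a check like `llinfo_backptr_ok()` ever failed on a live entry, the statement would not always be true and the locking design could not rely on it.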

A  2) Add mutex to llinfo_arp. I'm afraid this will hurt performance.
A 
A The new ARP stuff should fix these issues, however it is not ready yet.
A At the moment it looks like it won't make it right away into 6.0 but go
A into 7-current and then MFC'd back for 6.1R.

Yeah. I've already compiled a kernel with it. It is bootable and working,
but I haven't yet run hard tests. I'll work on locking now and perform
testing. In general it looks much better than what we have now. The locking
is going to be simple and straightforward. Thanks for nice code! Do you
mind if I pull it into a perforce branch to work on it together?

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA

2005-06-22 Thread Andre Oppermann
Gleb Smirnoff wrote:
 
 On Wed, Jun 22, 2005 at 03:03:53PM +0200, Andre Oppermann wrote:
 A  Fixing this one is harder. We take la from unlocked rtentry obtained via
 A  rt_check(), or from arplookup(). The latter drops lock on rtentry, too.
 A  Then we do some work and use this la. It may have already been freed in
 A  arp_rtrequest(), the RTM_DELETE case.
 A 
 A  I see two approaches here:
 A 
 A  1) Protecting llinfo with route lock. In this case we need rt_check()
 A  to return locked *rt (just reference won't help). We also need
 A  arplookup() to return locked rt. And do not unlock it within all
 A  arpresolve() and a big part of in_arpinput() functions.
 A
 A I think for 5-stable this is the way to go.
 
 I have started working on this. Making arplookup() to return locked rt
 looks possible. There are two more questions:
 
 - is it possible to make rt_check() to return locked *rt? This requires
   editing nd6.c, and if_*subr.c. We can't MFC this to RELENG_5.
   Probably, at first step I'll try to avoid changing rt_check and see
   whether changing arplookup() is enough to avoid panics.

Actually I don't know if rt_check() can return a locked *rt.

 - Is the following statement always true?
 la->la_rt->rt_llinfo == la

Good question.  I'll look into Design and Implementation of 4.4BSD and
FreeBSD 5 when I get home.

 A  2) Add mutex to llinfo_arp. I'm afraid this will hurt performance.
 A
 A The new ARP stuff should fix these issues, however it is not ready yet.
 A At the moment it looks like it won't make it right away into 6.0 but go
 A into 7-current and then MFC'd back for 6.1R.
 
 Yeah. I've already compiled a kernel with it. It is bootable and working,
 but I haven't yet run hard tests. I'll work on locking now and perform
 testing. In general it looks much better than what we have now. The locking
 is going to be simple and straightforward. Thanks for nice code! Do you
 mind if I pull it into a perforce branch to work on it together?

Better wait a bit before you pull it into perforce.  First we have to
move Qing along and second I'd like to do one more iteration with him
over the code.  There are a couple of rough edges and style issues I'd
like to carve out first.  And then there is the tab-space problem which
makes it a pain importing.  We need to fix Qing's editor as the very
first thing. ;-)

-- 
Andre


panic in RELENG_5 UMA

2005-06-21 Thread Jeremie Le Hen
Hi list,

I caught a panic this night on my RELENG_5.  The kernel was compiled on
2005/05/21.  Please, feel free to ask for further information (and
include me explicitly in the recipients list since I'm not subscribed
to this list).

kgdb stacktrace:
%%%
#22 0xc0566d1d in panic (
fmt=0xc0728d5d "Duplicate free of item %p from zone %p(%s)\n")
at ../../../kern/kern_shutdown.c:550
td = (struct thread *) 0xc205ec00
bootopt = 256
newpanic = 1
ap = 0xd6d3f968 
buf = "Duplicate free of item 0xc1be8800 from zone 0xc1045ae0(Mbuf)\n",
'\0' <repeats 194 times>
#23 0xc069e280 in uma_dbg_free (zone=0xc1045ae0, slab=0xc1be8fa8,
item=0xc1be8800) at ../../../vm/uma_dbg.c:301
keg = 0xc101f3c0
slabref = 0x0
freei = 8
#24 0xc069cc39 in uma_zfree_arg (zone=0xc1045ae0, item=0xc1be8800, udata=0x0)
at ../../../vm/uma_core.c:2273
keg = 0xc101f3c0
cache = 0xc1045b18
bucket = 0xc1be2000
bflags = 0
cpu = 0
skip = SKIP_DTOR
#25 0xc05a0a0b in m_freem (mb=0x0) at uma.h:304
No locals.
#26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
dst=0xd6d3fa94, desten=0xd6d3fa2c /æ]ÀäµwÀ)
at ../../../netinet/if_ether.c:442
la = (struct llinfo_arp *) 0xc1a75a00
sdl = (struct sockaddr_dl *) 0xc2128910
error = -1038972656
rt = (struct rtentry *) 0xc1d44000
#27 0xc05dac65 in ether_output (ifp=0xc1a5b000, m=0xc1be7200, dst=0xd6d3fa94,
rt0=0x0) at ../../../net/if_ethersubr.c:165
type = -10541
error = 50
hdrcmplt = 0
esrc = K\000\000\000\214z
edst = /æ]Àäµ
eh = (struct ether_header *) 0x32
loop_copy = 0
#28 0xc060150c in ip_output (m=0xc1be7200, opt=0xc1be7240, ro=0xd6d3fa90,
flags=0, imo=0x0, inp=0xc40f7a8c) at ../../../netinet/ip_output.c:770
ip = (struct ip *) 0xc1be7240
ifp = (struct ifnet *) 0xc1a5b000
m0 = (struct mbuf *) 0xc1be7240
hlen = 20
len = 1
error = 0
dst = (struct sockaddr_in *) 0xd6d3fa94
ia = (struct in_ifaddr *) 0xc1c2b300
isbroadcast = 0
sw_csum = 1
iproute = {ro_rt = 0xc1d44000, ro_dst = {sa_len = 16 '\020',
sa_family = 2 '\002',
sa_data = \000\000Àš\001²\000\000\000\000\000\000\000}}
odst = {s_addr = 1}
fwd_tag = (struct m_tag *) 0x0
__func__ = ip_output
#29 0xc060aba1 in tcp_output (tp=0xc1d75534)
at ../../../netinet/tcp_output.c:1119
so = (struct socket *) 0xc2afe000
len = 144
recwin = 66608
sendwin = -1044483500
flags = 24
error = -1044483500
m = (struct mbuf *) 0xc1be7200
ip = (struct ip *) 0xc1be7240
th = (struct tcphdr *) 0xc1be7254
opt = 
\001\001\b\n\002äm\003õJÁ+\001\000\000žà¯ÂÐà¯Â\000à¯Â\204ûÓÖ\203
~ZÀÐà¯Â
ipoptlen = 0
optlen = 12
hdrlen = 52
idle = 1
sendalot = 0
i = 299
sack_rxmit = 0
sack_bytes_rxmt = 0
p = (struct sackhole *) 0x0
tao = {tao_cc = 767, tao_ccsent = 3228670914, tao_mssopt = 64356}
__func__ = tcp_output
#30 0xc061167c in tcp_usr_send (so=0xc2afe000, flags=0, m=0xc1be7600, nam=0x0,
control=0x0, td=0xc205ec00) at ../../../netinet/tcp_usrreq.c:699
error = 0
inp = (struct inpcb *) 0xc40f7a8c
tp = (struct tcpcb *) 0xc1d75534
#31 0xc05a41e8 in sosend (so=0xc2afe000, addr=0x0, uio=0xd6d3fc70,
top=0xc1be7600, control=0x0, flags=0, td=0xc205ec00)
at ../../../kern/uipc_socket.c:835
mp = (struct mbuf **) 0xc1be7600
m = (struct mbuf *) 0xc1be7600
space = 33160
len = 144
resid = 0
clen = -1044482560
error = 0
dontroute = 0
atomic = 0
#32 0xc05928bf in soo_write (fp=0x0, uio=0xd6d3fc70, active_cred=0xc4211e80,
flags=0, td=0xc205ec00) at ../../../kern/sys_socket.c:118
so = (struct socket *) 0xc2afe000
error = 144
#33 0xc058bc0b in dofilewrite (td=0xc205ec00, fp=0xc2aff83c, fd=0, buf=0x0,
nbyte=3228877920, offset=Unhandled dwarf expression opcode 0x93
) at file.h:245
auio = {uio_iov = 0xd6d3fc68, uio_iovcnt = 1, uio_offset = 143,
  uio_resid = 0, uio_segflg = UIO_USERSPACE, uio_rw = UIO_WRITE,
  uio_td = 0xc205ec00}
aiov = {iov_base = 0x807d090, iov_len = 0}
cnt = 144
error = -1066089376
ktruio = (struct uio *) 0x0
#34 0xc058ba74 in write (td=0xc205ec00, uap=0xd6d3fd04)
at ../../../kern/sys_generic.c:300
fp = (struct file *) 0xc2aff83c
error = 0
#35 0xc06d2a12 in syscall (frame=
  {tf_fs = -1078001617, tf_es = 47, tf_ds = -1078001617, tf_edi = 134671528,
 tf_esi = 144, tf_ebp = -1077943016, tf_isp = -690750108, tf_ebx = 671922152,
 tf_edx = 134671528, tf_ecx = 4, tf_eax = 4, tf_trapno = 12, tf_err = 2,
 tf_eip = 673631499, tf_cs = 31, tf_eflags = 518, tf_esp = -1077943044,
 tf_ss = 47})
at 

Re: panic in RELENG_5 UMA

2005-06-21 Thread Jeremie Le Hen
Hi,

 I caught a panic this night on my RELENG_5.  The kernel was compiled on
 2005/05/21.  Please, feel free to ask for further informations (and
 include me explicitely in the recipients list since I'm not subscribed
 to this list).
 
 kgdb stacktrace:
 %%%
 [snip]
 %%%

I was a little bit sleepy earlier this morning.  I forgot to tell that
my kernel is compiled with INVARIANTS and PREEMPTION.

%%%
(kgdb) up 26
#26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
dst=0xd6d3fa94, desten=0xd6d3fa2c /æ]ÀäµwÀ)
at ../../../netinet/if_ether.c:442
442         m_freem(la->la_hold);
(kgdb) l
437  * There is an arptab entry, but no ethernet address
438  * response yet.  Replace the held mbuf with this
439  * latest one.
440  */
441 if (la->la_hold)
442         m_freem(la->la_hold);
443 la->la_hold = m;
444 if (rt->rt_expire) {
445         RT_LOCK(rt);
446         rt->rt_flags &= ~RTF_REJECT;
(kgdb) print *la
$1 = {la_le = {le_next = 0xc1e74400, le_prev = 0xc077aa68},
  la_rt = 0xc1d44000, la_hold = 0x0, la_preempt = 5, la_asked = 0}
%%%

-- 
Jeremie Le Hen
 jeremie at le-hen dot org  ttz at chchile dot org 
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: panic in RELENG_5 UMA

2005-06-21 Thread Gleb Smirnoff
On Tue, Jun 21, 2005 at 09:04:27AM +0200, Jeremie Le Hen wrote:
J #25 0xc05a0a0b in m_freem (mb=0x0) at uma.h:304
J No locals.
J #26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
J dst=0xd6d3fa94, desten=0xd6d3fa2c /??]??w??)
J at ../../../netinet/if_ether.c:442
J la = (struct llinfo_arp *) 0xc1a75a00
J sdl = (struct sockaddr_dl *) 0xc2128910
J error = -1038972656
J rt = (struct rtentry *) 0xc1d44000

IMHO, this looks like a race. The route is not locked when
its llinfo is edited.

Probably the mbuf was freed when the ARP reply arrived and la_hold was sent.
Look into in_arpinput() near line 736:

(*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
la->la_hold = 0;

Yeah, I have just triggered another panic running 15 instances of this script
on an SMP box:

(
while (true); do
arp -d 81.19.64.111 > /dev/null 2>&1;
ping -c 1 -t 1 81.19.64.111 > /dev/null 2>&1;
done
) &

But my duplicate free is in fxp_txeof(). This means that the output thread
has won the race.
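
The interleaving described above can be replayed step by step in userland.
This is only a sketch under stated assumptions: toy_mbuf, toy_llinfo and
toy_free() are hypothetical stand-ins for struct mbuf, struct llinfo_arp and
m_freem(), and the two paths are run by hand in a fixed order instead of by
real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mbuf: it only counts frees. */
struct toy_mbuf {
	int free_count;
};

/* Hypothetical stand-in for struct llinfo_arp. */
struct toy_llinfo {
	struct toy_mbuf *hold;	/* plays the role of la_hold */
};

static void
toy_free(struct toy_mbuf *m)
{
	if (m != NULL)
		m->free_count++;
}

/* Unlocked order: the output path is interrupted after reading la_hold. */
static bool
replay_unlocked_race(void)
{
	struct toy_mbuf old = { 0 }, fresh = { 0 };
	struct toy_llinfo la = { &old };
	struct toy_mbuf *seen;

	seen = la.hold;		/* output path: "if (la->la_hold)" is true */

	toy_free(la.hold);	/* input path: held packet is sent and freed */
	la.hold = NULL;		/* input path: "la->la_hold = 0" */

	toy_free(seen);		/* output path resumes with a stale pointer */
	la.hold = &fresh;

	return (old.free_count == 2);	/* true: duplicate free */
}

/* Serialized order: each path finishes before the other one starts. */
static bool
replay_serialized(void)
{
	struct toy_mbuf old = { 0 }, fresh = { 0 };
	struct toy_llinfo la = { &old };

	toy_free(la.hold);	/* input path runs to completion */
	la.hold = NULL;

	if (la.hold != NULL)	/* output path re-reads the slot */
		toy_free(la.hold);
	la.hold = &fresh;

	return (old.free_count == 1);	/* true: freed exactly once */
}
```

replay_unlocked_race() ends with the same buffer freed twice, while
replay_serialized() keeps the two critical sections apart, which is what
holding the route lock around the la_hold accesses is meant to achieve.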

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA

2005-06-21 Thread Jeremie Le Hen
Hi Gleb,

 IMHO, this looks like a race. The route is not locked, when
 its llinfo is edited.
 
 Probably the mbuf was freed when arp reply arrived and la_hold was send.
 Look into in_arpinput() near 736:
 
 (*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
 la->la_hold = 0;
 
 Yeah, I have just triggered another panic running 15 instances of this
 script on SMP box:
 
 (
 while (true); do
   arp -d 81.19.64.111 > /dev/null 2>&1;
   ping -c 1 -t 1 81.19.64.111 > /dev/null 2>&1;
 done
 ) &
 
 But my duplicate free is in fxp_txeof(). This means that output thread has
 won the race.

This explanation sounds good, but my box is a UP machine with PREEMPTION.
Is the race also supposed to be possible in this case?

Regards,
-- 
Jeremie Le Hen
 jeremie at le-hen dot org  ttz at chchile dot org 


Re: panic in RELENG_5 UMA

2005-06-21 Thread Gleb Smirnoff
On Tue, Jun 21, 2005 at 11:28:36AM +0200, Jeremie Le Hen wrote:
J  IMHO, this looks like a race. The route is not locked, when
J  its llinfo is edited.
J  
J  Probably the mbuf was freed when arp reply arrived and la_hold was send.
J  Look into in_arpinput() near 736:
J  
J  (*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
J  la->la_hold = 0;
J  
J  Yeah, I have just triggered another panic running 15 instances of this
J  script on SMP box:
J  
J  (
J  while (true); do
J    arp -d 81.19.64.111 > /dev/null 2>&1;
J    ping -c 1 -t 1 81.19.64.111 > /dev/null 2>&1;
J  done
J  ) &
J  
J  But my duplicate free is in fxp_txeof(). This means that output thread has
J  won the race.
J 
J This explanation sounds good but my box is an UP with PREEMPTION.
J Is is supposed to be also possible in this case ?

I guess yes, because of preemption.

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE


Re: panic in RELENG_5 UMA

2005-06-21 Thread Gleb Smirnoff
[ cc'ing parties involved in this part of code]

On Tue, Jun 21, 2005 at 01:07:01PM +0400, Gleb Smirnoff wrote:
T On Tue, Jun 21, 2005 at 09:04:27AM +0200, Jeremie Le Hen wrote:
T J #25 0xc05a0a0b in m_freem (mb=0x0) at uma.h:304
T J No locals.
T J #26 0xc05ee0d5 in arpresolve (ifp=0xc1a5b000, rt0=0xc1d44000, m=0xc1be7200,
T J dst=0xd6d3fa94, desten=0xd6d3fa2c /??]??w??)
T J at ../../../netinet/if_ether.c:442
T J la = (struct llinfo_arp *) 0xc1a75a00
T J sdl = (struct sockaddr_dl *) 0xc2128910
T J error = -1038972656
T J rt = (struct rtentry *) 0xc1d44000
T 
T IMHO, this looks like a race. The route is not locked, when
T its llinfo is edited.
T 
T Probably the mbuf was freed when arp reply arrived and la_hold was send.
T Look into in_arpinput() near 736:
T 
T (*ifp->if_output)(ifp, la->la_hold, rt_key(rt), rt);
T la->la_hold = 0;
T 
T Yeah, I have just triggered another panic running 15 instances of this
T script on SMP box:
T 
T (
T while (true); do
T  arp -d 81.19.64.111 > /dev/null 2>&1;
T  ping -c 1 -t 1 81.19.64.111 > /dev/null 2>&1;
T done
T ) &
T 
T But my duplicate free is in fxp_txeof(). This means that output thread has
T won the race.

I think the attached patch closes your race. However, there is still a
race between RTM_DELETE and the output path. The above script still panics
the kernel, but in a different way: the output path works with an already
freed llinfo:

#28 0xc0507000 in m_freem (mb=0x0) at mbuf.h:410
#29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000,
    dst=0xe720bb28, desten=0xe720bacc uЬbю+\001)
    at /usr/src/sys/netinet/if_ether.c:443
#30 0xc0538078 in ether_output (ifp=0xc2012800, m=0xc25a8000, dst=0xe720bb28,
    rt0=0xc22fcdec) at /usr/src/sys/net/if_ethersubr.c:173
#31 0xc054b5b4 in ip_output (m=0xc25a8000, opt=0xc25a80ac, ro=0xe720bb24,
    flags=0x20, imo=0x0, inp=0xc25eb5a0)
    at /usr/src/sys/netinet/ip_output.c:772
#32 0xc054d36b in rip_output (m=0xc25a8000, so=0x0, dst=0x0)
    at /usr/src/sys/netinet/raw_ip.c:320
#33 0xc054de7b in rip_send (so=0xc248c914, flags=0x0, m=0xc25a8000,
    nam=0xc218d410, control=0x0, td=0xc224d7d0)
    at /usr/src/sys/netinet/raw_ip.c:785
#34 0xc050a30f in sosend (so=0xc248c914, addr=0xc218d410, uio=0xe720bc3c,
    top=0xc25a8000, control=0x0, flags=0x0, td=0xc224d7d0)
    at /usr/src/sys/kern/uipc_socket.c:827

(kgdb) frame 29
#29 0xc053fde3 in arpresolve (ifp=0xc2012800, rt0=0xc22fcdec, m=0xc25a8000,
    dst=0xe720bb28, desten=0xe720bacc uЬbю+\001)
    at /usr/src/sys/netinet/if_ether.c:443
443 m_freem(la->la_hold);
(kgdb) p *la
$3 = {
  la_le = {
le_next = 0xdeadc0de, 
le_prev = 0xdeadc0de
  }, 
  la_rt = 0xdeadc0de, 
  la_hold = 0xdeadc0de, 
  la_preempt = 0xc0de, 
  la_asked = 0xdead
}

Fixing this one is harder. We take la from an unlocked rtentry obtained via
rt_check(), or from arplookup(). The latter drops the lock on the rtentry,
too. Then we do some work and use this la. By then it may already have been
freed in arp_rtrequest(), in the RTM_DELETE case.
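
The 0xdeadc0de values in the dump above are what one would expect if the
kernel's debugging code (INVARIANTS is enabled on the panicking box) fills
freed memory with a poison pattern. A minimal userland sketch of that idea
follows; toy_entry, toy_poison() and toy_looks_freed() are hypothetical names
for illustration and are not the kernel's own routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POISON 0xdeadc0deU	/* same pattern as in the dump above */

/* Hypothetical miniature of an ARP entry: four 32-bit words. */
struct toy_entry {
	uint32_t next;
	uint32_t prev;
	uint32_t rt;
	uint32_t hold;
};

/* Overwrite a released object with the poison pattern, word by word. */
static void
toy_poison(void *p, size_t len)
{
	uint32_t *w = p;
	size_t i;

	for (i = 0; i < len / sizeof(*w); i++)
		w[i] = TOY_POISON;
}

/* Returns nonzero if the entry looks like it was already released. */
static int
toy_looks_freed(const struct toy_entry *e)
{
	return (e->rt == TOY_POISON && e->hold == TOY_POISON);
}
```

A reader of a poisoned entry sees the pattern in every field, so a pointer
such as la_hold is non-NULL and gets passed on to the free routine. On a
little-endian machine two adjacent 16-bit fields laid over one poisoned word
read as 0xc0de and 0xdead, which matches la_preempt and la_asked above.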

I see two approaches here:

1) Protect llinfo with the route lock. In this case we need rt_check()
to return a locked *rt (just a reference won't help). We also need
arplookup() to return a locked rt, and we must not unlock it for the
whole of arpresolve() and a large part of in_arpinput().

2) Add a mutex to llinfo_arp. I'm afraid this will hurt performance.
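
Approach 1 can be sketched in userland with a pthread mutex standing in for
the route lock. Everything here is an assumption made for illustration:
toy_route, toy_lookup_locked(), toy_release() and toy_replace_hold() are
hypothetical names, and the one-entry table replaces the real routing table.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an rtentry together with its llinfo. */
struct toy_route {
	pthread_mutex_t lock;	/* plays the role of the route lock */
	void *hold;		/* plays the role of la_hold */
	int valid;		/* cleared when the entry is deleted */
};

static struct toy_route toy_table[1] = {
	{ PTHREAD_MUTEX_INITIALIZER, NULL, 1 }
};

/* Return the entry with its lock held, or NULL if there is none. */
static struct toy_route *
toy_lookup_locked(int idx)
{
	struct toy_route *rt;

	if (idx != 0)
		return (NULL);
	rt = &toy_table[idx];
	pthread_mutex_lock(&rt->lock);
	if (!rt->valid) {
		pthread_mutex_unlock(&rt->lock);
		return (NULL);
	}
	return (rt);
}

static void
toy_release(struct toy_route *rt)
{
	pthread_mutex_unlock(&rt->lock);
}

/*
 * Swap the held packet while the lock is held.  The return value is the
 * buffer the caller now owns and must free: the previous held packet, or
 * m itself when no entry was found.
 */
static void *
toy_replace_hold(int idx, void *m)
{
	struct toy_route *rt;
	void *old;

	rt = toy_lookup_locked(idx);
	if (rt == NULL)
		return (m);
	old = rt->hold;
	rt->hold = m;
	toy_release(rt);
	return (old);
}
```

Because the lookup hands back a locked entry, a concurrent delete or ARP
reply cannot change or free the entry between the lookup and the swap. The
old packet is freed by the caller after the lock has been dropped.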

-- 
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
Index: if_ether.c
===
RCS file: /home/ncvs/src/sys/netinet/if_ether.c,v
retrieving revision 1.137
diff -u -r1.137 if_ether.c
--- if_ether.c  5 Jun 2005 03:13:12 -   1.137
+++ if_ether.c  21 Jun 2005 10:36:08 -
@@ -438,11 +438,11 @@
 * response yet.  Replace the held mbuf with this
 * latest one.
 */
+   RT_LOCK(rt);
if (la->la_hold)
m_freem(la->la_hold);
la->la_hold = m;
if (rt->rt_expire) {
-   RT_LOCK(rt);
rt->rt_flags &= ~RTF_REJECT;
if (la->la_asked == 0 || rt->rt_expire != time_second) {
rt-rt_expire = time_second;
@@ -459,8 +459,8 @@
}
 
}
-   RT_UNLOCK(rt);
}
+   RT_UNLOCK(rt);
return (EWOULDBLOCK);
 }
 
@@ -642,6 +642,8 @@
goto reply;
la = arplookup(isaddr.s_addr, itaddr.s_addr == myaddr.s_addr, 0);
if (la && (rt = la->la_rt) && (sdl = SDL(rt->rt_gateway))) {
+   struct mbuf *hold;
+
/* the following is not an error when doing bridging */
if (!bridged && rt->rt_ifp != ifp
 #ifdef DEV_CARP
@@ -729,11 +731,13 @@
if (rt->rt_expire)
rt->rt_expire = time_second + arpt_keep;
rt->rt_flags &= ~RTF_REJECT;
-   RT_UNLOCK(rt);
la->la_asked = 0;
la-la_preempt