Re: Resending: RT patches expose netdev race [was Re: [RFC] [patch 2/2] powerpc 2.6.21-rt1: fix kernel hang and/or panic

2007-05-17 Thread Linas Vepstas
On Thu, May 17, 2007 at 10:49:45AM +1000, Benjamin Herrenschmidt wrote:
 
> > I do not know why sk_buff->head would be null, or
> > would be set in a racy kind of way, or why the rt patches
> > would cause this. But the evidence implicates that.
>
> Would it be possible that a locking bug in spidernet would cause it
> under some circumstances to get a stale skb pointer?

The skb pointer should be brand-spanking new/fresh.
It is passed to spidernet by the netdev->hard_start_xmit
callback:

netdev->hard_start_xmit = spider_net_xmit;

I'd expect that anything that hard_start_xmit() passed to 
a device driver should have a fully valid skb.  Locking
problems in spidernet could cause it to work with the wrong 
skb; however, in this case, the skb pointer is passed 
unmodified, directly to the spot where it fails.

Maybe there is some "make the ip header fresh and clean on the skb"
call that should have been made; if so, I don't know what it is.

--linas
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Resending: RT patches expose netdev race [was Re: [RFC] [patch 2/2] powerpc 2.6.21-rt1: fix kernel hang and/or panic

2007-05-16 Thread Linas Vepstas
(resending, Owa-san was cut from cc list!??)

Hi,

On Tue, May 15, 2007 at 08:09:02PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2007-05-15 at 17:47 +0900, Tsutomu OWA wrote:
> > I encountered the following error when doing netperf from other machine
> > to Celleb running RT kernel.  PREEMPT_NONE kernel works just fine as well.
>
> Hrm... sounds a bit weird. I wonder if there's a locking bug in the
> driver in the first place.
>
> Linas, what's your take?

Heh. I almost deleted the entire email thread cause it
didn't say spidernet in the subject line. :-)
Seriously, I really almost did.

Since this is a long email, let me put a summary up front:
I think the RT/preemption patches are exposing some sort
of race in the ip header handling code. The rest of the
note is forensics pointing to this.



Reading the patch, it looks like all it did was to move
around the locks, without changing the semantics. Two
comments about that:

-- The current spidernet locks are very fine-grained;
   this makes the whole thing function more smoothly.
   The patch would make them coarse-grained; I don't
   like that.

-- Moving around locks like that changes the timing
   completely, and changing the timing makes races
   come and go. The races seem to vanish, but that's
   only because you are getting lucky.

Since I'm sick-n-tired of dealing with spidernet, I thought
I'd give this one a little extra attention.

The crash is a null pointer deref. The spidernet doesn't
use locks to protect null pointers. The spidernet mostly
doesn't play with pointers at all; they're mostly static.
So this crash is unusual from the get-go.

> Instruction dump:
> 6000 81790088 901f000c 913f0018 913f0008 917f0004 48132e8d 6000
> a019009e 2f800800 409e0038 e9390038 88690009 2f830006 419e0010 2f830011

The crashing instruction is 88690009, which is quite distinctive:
  lbz r3,9(r9)

load byte ... at an offset of 9 bytes!? spidernet does
nothing with bytes, so it's another reason it's not spidernet.
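
As a cross-check of that decoding, here is a minimal Python sketch that
splits the word 88690009 into its PowerPC D-form fields. The names
decode_d_form and effective_address and the small opcode table are
illustrative assumptions, not anything taken from the kernel or the
driver. It also assumes that r9 held zero at the time of the fault,
which would be consistent with the faulting data address of 0x0009.

```python
# Primary opcodes for a few PowerPC D-form load/store instructions.
# Only the handful needed for this illustration are listed.
D_FORM_MNEMONICS = {32: "lwz", 34: "lbz", 36: "stw", 40: "lhz"}


def decode_d_form(word):
    """Split a 32-bit instruction word into (mnemonic, rt, ra, d)."""
    opcode = (word >> 26) & 0x3F   # top 6 bits: primary opcode
    rt = (word >> 21) & 0x1F       # next 5 bits: target/source register
    ra = (word >> 16) & 0x1F       # next 5 bits: base register
    d = word & 0xFFFF              # low 16 bits: displacement
    if d & 0x8000:                 # displacement is a signed 16-bit value
        d -= 0x10000
    mnemonic = D_FORM_MNEMONICS.get(opcode, "op%d" % opcode)
    return mnemonic, rt, ra, d


def effective_address(base, d):
    """Address touched by a D-form access: base register value plus d."""
    return (base + d) & 0xFFFFFFFFFFFFFFFF


mnemonic, rt, ra, d = decode_d_form(0x88690009)
print("%s r%d,%d(r%d)" % (mnemonic, rt, d, ra))   # prints: lbz r3,9(r9)

# Assumption: r9 was zero (a null base pointer) when the load executed.
print(hex(effective_address(0, d)))               # prints: 0x9
```

The same helper decodes a019009e from the dump as lhz r0,158(r25), a
halfword load a few instructions ahead of the failing byte load.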

Below follows a manual disassembly. The guilty party appears
to be the skb, and specifically, skb->head has not been set.
You'll have to read the details below to see why.

I do not know why sk_buff->head would be null, or
would be set in a racy kind of way, or why the rt patches
would cause this. But the evidence implicates that.

--linas

Long stuff below. For the record:

  Unable to handle kernel paging request for data at address 0x0009
  Faulting instruction address: 0xc0295434
  Oops: Kernel access of bad area, sig: 11 [#1]
  PREEMPT SMP NR_CPUS=2 NUMA 
  Modules linked in:
  NIP: C0295434 LR: C0295420 CTR: 
  REGS: c95d6e30 TRAP: 0300   Not tainted  (2.6.21-rc5-rt7)
  MSR: 80009032 <EE,ME,IR,DR>  CR: 24000482  XER: 2000
  DAR: 0009, DSISR: 4000
  TASK = c1e7c440[626] 'netserver' THREAD: c95d4000 CPU: 0
  GPR00: 0800 C95D70B0 C05D77B8 0001 
  GPR04: 0001  C95D7080  
  GPR08: C95D7030  C95D7040  
  GPR12: FC69925300080D5D C04DE680  00422208 
  GPR16: 0040 00420D10  C95D7C88 
  GPR20: C1E7C440  0001 C8ACEAE0 
  GPR24: 0020 C0E50C80 81F84C5E C1C00BE0 
  GPR28: C1C05430 C1C00B80 C0570F30 C1FD1720 
  NIP [C0295434] .spider_net_xmit+0x1dc/0x448
  LR [C0295420] .spider_net_xmit+0x1c8/0x448
  Call Trace:
  [C95D70B0] [C0295420] .spider_net_xmit+0x1c8/0x448 (unreliable)
  [C95D7160] [C0327EE8] .dev_hard_start_xmit+0x238/0x300
  [C95D7200] [C033A7F4] .__qdisc_run+0xdc/0x2a4
  [C95D72B0] [C032A948] .dev_queue_xmit+0x1b0/0x2fc
  [C95D7350] [C034B470] .ip_output+0x280/0x2d8
  [C95D73F0] [C034C6CC] .ip_queue_xmit+0x448/0x4d8
  [C95D74F0] [C035F6D8] .tcp_transmit_skb+0x850/0x8c0
  [C95D75C0] [C035C394] .__tcp_ack_snd_check+0x84/0xc0
  [C95D7650] [C035E114] .tcp_rcv_established+0x4f0/0x8ac
  [C95D7700] [C0365B24] .tcp_v4_do_rcv+0x5c/0x448
  [C95D77D0] [C031C2C4] .release_sock+0x94/0x11c
  [C95D7870] [C0354E7C] .tcp_recvmsg+0x374/0x8d8
  [C95D7960] [C031B8A0] .sock_common_recvmsg+0x5c/0x84
  [C95D79F0] [C031921C] .sock_recvmsg+0x110/0x15c
  [C95D7C00] [C031AA50] .sys_recvfrom+0xf0/0x174
  [C95D7D90] [C0339368] .compat_sys_socketcall+0x178/0x214
  [C95D7E30] [C0008634] syscall_exit+0x0/0x40
  Instruction dump:
  6000 81790088 901f000c 913f0018 913f0008 917f0004 48132e8d 6000 
  a019009e 2f800800 409e0038 e9390038 88690009 2f830006 419e0010 

Re: Resending: RT patches expose netdev race [was Re: [RFC] [patch 2/2] powerpc 2.6.21-rt1: fix kernel hang and/or panic

2007-05-16 Thread Benjamin Herrenschmidt

> I do not know why sk_buff->head would be null, or
> would be set in a racy kind of way, or why the rt patches
> would cause this. But the evidence implicates that.

Would it be possible that a locking bug in spidernet would cause it
under some circumstances to get a stale skb pointer?

Ben.

