On Jan 1, 2008 1:59 PM, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> On Jan 1, 2008 1:04 PM, Herbert Xu <[EMAIL PROTECTED]> wrote:
> > On Mon, Dec 31, 2007 at 09:15:19PM +0100, Torsten Kaiser wrote:
> > >
> > > I then tried to "fix" it with this suspect.
> > > I changed "skb_release_all(dst);" back to "skb_release_data(dst);" in
> > > skb_morph() (net/core/skbuff.c).

I can't explain why this seems to fix 2.6.24-rc3-mm2 for me, but at least
in 2.6.24-rc6-mm1 it does not seem to be involved.

> > Check /proc/net/snmp to see if you're getting any fragments, if not
> > then skb_morph shouldn't even be getting called.
>
> OK, thanks for that hint.
> I look at this after my next tests.

During normal work I did not see the frag counters increase.
I used ping -s 10000 to create some frags; that worked perfectly.
I used netio -b 63k -u [target] to create around half a million frags;
that worked too.

And what really is strange is that I changed skb_morph into this:

struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
{
        printk(KERN_ERR "morph %p:%p",dst,src);
        WARN_ON(1);
        skb_release_all(dst);
        return __skb_clone(dst, src);
}

... that warning was not triggered once.

> > > I'm now at 205 of 210 packages completed without a further hang. I
> > > also do not see an obvious memory leak.
> >
> > In any case, I suspect the cause of your problem is that somebody
> > somewhere is doing a double-free on an skb.
> >
> > Since you're the only person who can reproduce this, we really need
> > your help to track this down. Since bisecting the mm tree is not
> > practical, you could start by checking whether the bug is in mm only
> > or whether it affects rc6 too.

The problem with bisecting this is that I can't seem to trigger it on
demand. Today I was just about to give up on triggering it in -rc6-mm1
by doing package compiles when it did happen again. But that was after
more than 4 hours...

> I will try -rc6-mm1 and vanilla -rc6 and report back.

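The fragment check suggested above can also be scripted. A minimal sketch, assuming the usual two-line layout of /proc/net/snmp (a header line "Ip: <names>" followed by a value line "Ip: <values>"). The function names snmp_ip_counter() and read_ip_counter() are made up for this illustration and are not part of any kernel or libc API:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up one counter of the "Ip:" group in text laid out like
 * /proc/net/snmp. Returns the counter value, or -1 if the group or the
 * column name cannot be found. (Hypothetical helper, illustration only.)
 */
long long snmp_ip_counter(const char *text, const char *name)
{
    assert(text != NULL && name != NULL);

    /* Find the first line that starts with "Ip: " (the header line). */
    const char *hdr = text;
    while (hdr && strncmp(hdr, "Ip: ", 4) != 0) {
        hdr = strchr(hdr, '\n');
        if (hdr)
            hdr++;
    }
    if (!hdr)
        return -1;

    /* The value line must follow directly and carry the same prefix. */
    const char *val = strchr(hdr, '\n');
    if (!val || strncmp(val + 1, "Ip: ", 4) != 0)
        return -1;
    hdr += 4;
    val += 1 + 4;

    /* Step through both lines column by column. */
    size_t want = strlen(name);
    for (;;) {
        size_t hlen = strcspn(hdr, " \n");
        size_t vlen = strcspn(val, " \n");
        if (hlen == 0 || vlen == 0)
            return -1;
        if (hlen == want && memcmp(hdr, name, want) == 0)
            return strtoll(val, NULL, 10);
        hdr += hlen;
        val += vlen;
        if (*hdr != ' ' || *val != ' ')
            return -1;          /* ran out of columns */
        hdr++;
        val++;
    }
}

/* Read the file at 'path' (normally /proc/net/snmp) and look up 'name'. */
long long read_ip_counter(const char *path, const char *name)
{
    char buf[8192];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return snmp_ip_counter(buf, name);
}
```

Calling read_ip_counter("/proc/net/snmp", "ReasmReqds") before and after a test run should show whether any fragments arrived for reassembly. ReasmOKs and FragCreates can be read the same way.
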
As noted above, my WARN_ON(1) in skb_morph did not trigger once before
the system died with this OOPS:

[18663.909931] Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
[18663.915489]  [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652] PGD 73442067 PUD 7480e067 PMD 0
[18663.918652] Oops: 0000 [1] SMP
[18663.918652] last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[18663.918652] CPU 1
[18663.918652] Modules linked in: radeon drm nfsd exportfs w83792d ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg videobuf_core btcx_risc tveeprom usbhid videodev v4l2_common v4l1_compat hid sg pata_amd i2c_nforce2
[18663.918652] Pid: 0, comm: swapper Not tainted 2.6.24-rc6-mm1 #13
[18663.918652] RIP: 0010:[<ffffffff8055f2e8>]  [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652] RSP: 0018:ffff81007ff4fb60  EFLAGS: 00010286
[18663.918652] RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000000000
[18663.918652] RDX: ffff8100141a40b0 RSI: ffff81007ff4fbc0 RDI: 0000000000000000
[18663.918652] RBP: ffff81007ff4fbb0 R08: 0000000000000002 R09: 0000000000000000
[18663.918652] R10: ffffffff805b2afb R11: 000000000520cde8 R12: 00000000c05a019a
[18663.918652] R13: 000000000f26378b R14: ffff810066469d38 R15: ffff81004b4e4000
[18663.918652] FS:  00007f58ac9a0700(0000) GS:ffff81007ff12580(0000) knlGS:0000000000000000
[18663.918652] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[18663.918652] CR2: 0000000000000000 CR3: 0000000073441000 CR4: 00000000000006e0
[18663.918652] DR0: 00007fffe1e55cbc DR1: 0000000000000000 DR2: 0000000000000000
[18663.918652] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[18663.918652] Process swapper (pid: 0, threadinfo ffff81011ff2c000, task ffff81007ff4a000)
[18663.918652] Stack:  ffff810066469d38 ffff81004b4e4148 ffffffff805b1ab0 ffff81007ff4fbc0
[18663.918652]  00000000805b2afb ffff81004b4e4000 ffff81004b4e4298 ffff810066469d00
[18663.918652]  ffff810066469d38 0000000000000000 ffff81007ff4fbf0 ffffffff805b2b41
[18663.918652] Call Trace:
[18663.918652]  <IRQ>  [<ffffffff805b1ab0>] xs_tcp_data_recv+0x0/0x560
[18663.918652]  [<ffffffff805b2b41>] xs_tcp_data_ready+0x71/0x90
[18663.918652]  [<ffffffff80568bec>] __tcp_ack_snd_check+0x5c/0xa0
[18663.918652]  [<ffffffff8056a458>] tcp_rcv_established+0x3c8/0x800
[18663.918652]  [<ffffffff80571451>] tcp_v4_do_rcv+0x2e1/0x4e0
[18663.918652]  [<ffffffff80573cb1>] tcp_v4_rcv+0x721/0x850
[18663.918652]  [<ffffffff80553d63>] ip_local_deliver_finish+0xd3/0x250
[18663.918652]  [<ffffffff8055433b>] ip_local_deliver+0x3b/0x90
[18663.918652]  [<ffffffff80553988>] ip_rcv_finish+0x118/0x420
[18663.918652]  [<ffffffff8022e313>] enqueue_task_fair+0x73/0xd0
[18663.918652]  [<ffffffff80554236>] ip_rcv+0x226/0x2f0
[18663.918652]  [<ffffffff80537576>] netif_receive_skb+0x1d6/0x280
[18663.918652]  [<ffffffff8053a1ea>] process_backlog+0x8a/0xf0
[18663.918652]  [<ffffffff80539e84>] net_rx_action+0xb4/0x130
[18663.918652]  [<ffffffff8023d624>] __do_softirq+0x84/0x110
[18663.918652]  [<ffffffff8020c82c>] call_softirq+0x1c/0x30
[18663.918652]  [<ffffffff8020eaa5>] do_softirq+0x65/0xc0
[18663.918652]  [<ffffffff8023d595>] irq_exit+0x95/0xa0
[18663.918652]  [<ffffffff8020ebbf>] do_IRQ+0x8f/0x100
[18663.918652]  [<ffffffff8020a4b0>] default_idle+0x0/0x80
[18663.918652]  [<ffffffff8020bb26>] ret_from_intr+0x0/0xf
[18663.918652]  <EOI>  [<ffffffff80252310>] __atomic_notifier_call_chain+0x0/0xa0
[18663.918652]  [<ffffffff8020a4f3>] default_idle+0x43/0x80
[18663.918652]  [<ffffffff8020a4f1>] default_idle+0x41/0x80
[18663.918652]  [<ffffffff8020a4b0>] default_idle+0x0/0x80
[18663.918652]  [<ffffffff8020a59c>] cpu_idle+0x6c/0xa0
[18663.918652]  [<ffffffff808109b8>] start_secondary+0x2f8/0x420
[18663.918652]
[18663.918652]
[18663.918652] Code: 48 8b 3b 0f 18 0f 74 75 8b 93 a0 00 00 00 45 89 ec 44 2b 63
[18663.918652] RIP  [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652]  RSP <ffff81007ff4fb60>
[18663.918652] CR2: 0000000000000000
[18663.918680] ---[ end trace 1dc6b1bf3734ac14 ]---

(gdb) list *0xffffffff8055f2e8
0xffffffff8055f2e8 is in tcp_read_sock (net/ipv4/tcp.c:1173).
1168    static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
1169    {
1170            struct sk_buff *skb;
1171            u32 offset;
1172
1173            skb_queue_walk(&sk->sk_receive_queue, skb) {
1174                    offset = seq - TCP_SKB_CB(skb)->seq;
1175                    if (tcp_hdr(skb)->syn)
1176                            offset--;
1177                    if (offset < skb->len || tcp_hdr(skb)->fin) {
(gdb) list *0xffffffff805b2b41
0xffffffff805b2b41 is in xs_tcp_data_ready (net/sunrpc/xprtsock.c:1079).
1074                    goto out;
1075
1076            /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
1077            rd_desc.arg.data = xprt;
1078            rd_desc.count = 65536;
1079            tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
1080    out:
1081            read_unlock(&sk->sk_callback_lock);
1082    }
1083

I will see what vanilla -rc6 will do...

Torsten
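
P.S. The gdb listing puts the fault on the skb_queue_walk() line, with RBX and CR2 both zero. One reading consistent with that, though not confirmed anywhere in this thread, is that the walk followed a NULL link, for example through a buffer that was already unlinked or freed. The following is only a userspace sketch of that failure mode. struct node and all function names are hypothetical stand-ins, not the kernel's struct sk_buff or its queue API, and the NULL check is an addition for illustration that the kernel macro does not perform:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued buffer; the head is a bare node. */
struct node {
    struct node *next;
    struct node *prev;
};

/* An empty circular queue points back at its own head. */
void queue_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Append 'n' at the tail of the circular queue. */
void queue_tail(struct node *head, struct node *n)
{
    assert(head->prev != NULL);
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/* Unlink 'n' and clear its links, so stale users see NULL pointers. */
void queue_unlink(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = NULL;
    n->prev = NULL;
}

/*
 * Walk from head->next until the head is reached again. Returns the number
 * of nodes visited, or -1 if a NULL link is met. An unchecked walk would
 * dereference that NULL pointer on the next step instead.
 */
int queue_walk_checked(const struct node *head)
{
    int count = 0;
    const struct node *n;

    for (n = head->next; n != head; n = n->next) {
        if (n == NULL)
            return -1;      /* broken link: queue is corrupted */
        count++;
    }
    return count;
}
```

If a node that was already unlinked is still reachable from the queue (which is what a double-free or a stale pointer could leave behind), the checked walk returns -1 at the point where an unchecked walk would fault.
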