Re: 2.6.24-rc6-mm1

Torsten Kaiser Tue, 01 Jan 2008 10:29:39 -0800

On Jan 1, 2008 1:59 PM, Torsten Kaiser <[EMAIL PROTECTED]> wrote:
> On Jan 1, 2008 1:04 PM, Herbert Xu <[EMAIL PROTECTED]> wrote:
> > On Mon, Dec 31, 2007 at 09:15:19PM +0100, Torsten Kaiser wrote:
> > >
> > > I then tried to "fix" it with this suspect.
> > > I changed "skb_release_all(dst);" back to "skb_release_data(dst);" in
> > > skb_morph() (net/core/skbuff.c).


I can't explain why this seemed to fix 2.6.24-rc3-mm2 for me, but at
least in 2.6.24-rc6-mm1 skb_morph() does not seem to be involved.

> > Check /proc/net/snmp to see if you're getting any fragments, if not
> > then skb_morph shouldn't even be getting called.
>
> OK, thanks for that hint.
> I look at this after my next tests.

During normal work I did not see the frag counters increase.
I used ping -s 10000 to create some frags, and that worked perfectly.
I used netio -b 63k -u [target] to create around half a million frags,
and that worked too.
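
For anyone repeating this check: the frag counters are columns such as
ReasmReqds, ReasmOKs and FragCreates on the "Ip:" lines of /proc/net/snmp.
Below is a minimal sketch of looking one of them up from C, assuming the
usual two-line layout (a line of names followed by a line of values).
snmp_lookup() is a hypothetical helper name chosen for illustration, not
something from the kernel tree.

```c
#include <stdlib.h>
#include <string.h>

/* Skip blanks and tabs, but stop at end of line. */
static const char *skip_spaces(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Length of the token at p; 0 at end of line or end of text. */
static size_t token_len(const char *p)
{
    size_t n = 0;

    while (p[n] && p[n] != ' ' && p[n] != '\t' && p[n] != '\n')
        n++;
    return n;
}

/*
 * Hypothetical helper: find counter `name` for protocol `proto` in
 * snmp-style text, i.e. a "<proto>: Name1 Name2 ..." line directly
 * followed by a "<proto>: v1 v2 ..." line.
 * Returns 0 and stores the value in *out, or -1 if not found.
 */
int snmp_lookup(const char *text, const char *proto, const char *name,
                long long *out)
{
    size_t plen = strlen(proto), nlen = strlen(name);
    const char *hdr = text;

    while (hdr && *hdr) {
        const char *val = strchr(hdr, '\n');

        if (!val)
            return -1;              /* no value line follows */
        val++;
        if (strncmp(hdr, proto, plen) == 0 && hdr[plen] == ':' &&
            strncmp(val, proto, plen) == 0 && val[plen] == ':') {
            const char *h = hdr + plen + 1, *v = val + plen + 1;

            /* Step through names and values in parallel. */
            for (;;) {
                h = skip_spaces(h);
                v = skip_spaces(v);
                size_t hl = token_len(h), vl = token_len(v);

                if (hl == 0 || vl == 0)
                    break;
                if (hl == nlen && strncmp(h, name, nlen) == 0) {
                    *out = strtoll(v, NULL, 10);
                    return 0;
                }
                h += hl;
                v += vl;
            }
            return -1;              /* protocol found, no such column */
        }
        hdr = val;                  /* try the next pair of lines */
    }
    return -1;
}
```

To use it, read /proc/net/snmp into a NUL-terminated buffer and call
snmp_lookup(buf, "Ip", "ReasmReqds", &v) before and after a test run. If
the value does not change, no fragments were reassembled in between.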

And what is really strange: I changed skb_morph into this:
struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src)
{
        printk(KERN_ERR "morph %p:%p",dst,src);
        WARN_ON(1);
        skb_release_all(dst);
        return __skb_clone(dst, src);
}
... that warning was not triggered once.

> > > I'm now at 205 of 210 packages completed without a further hang. I
> > > also do not see an obvious memory leak.
> >
> > In any case, I suspect the cause of your problem is that somebody
> > somewhere is doing a double-free on an skb.
> >
> > Since you're the only person who can reproduce this, we really need
> > your help to track this down.  Since bisecting the mm tree is not
> > practical, you could start by checking whether the bug is in mm only
> > or whether it affects rc6 too.

The problem with bisecting this is that I can't seem to trigger it on
demand. Today I was just about to give up on triggering it in -rc6-mm1
by doing package compiles when it did happen again. But that was after
more than 4 hours...

> I will try -rc6-mm1 and vanilla -rc6 and report back.

As noted above, my WARN_ON(1) in skb_morph did not trigger once before
the system died with this OOPS:
[18663.909931] Unable to handle kernel NULL pointer dereference at
0000000000000000 RIP:
[18663.915489]  [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652] PGD 73442067 PUD 7480e067 PMD 0
[18663.918652] Oops: 0000 [1] SMP
[18663.918652] last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
[18663.918652] CPU 1
[18663.918652] Modules linked in: radeon drm nfsd exportfs w83792d
ipv6 tuner tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx
tea5761 tvaudio msp3400 bttv ir_common compat_ioctl32 videobuf_dma_sg
videobuf_core btcx_risc tveeprom usbhid videodev v4l2_common
v4l1_compat hid sg pata_amd i2c_nforce2
[18663.918652] Pid: 0, comm: swapper Not tainted 2.6.24-rc6-mm1 #13
[18663.918652] RIP: 0010:[<ffffffff8055f2e8>]  [<ffffffff8055f2e8>]
tcp_read_sock+0x58/0x1b0
[18663.918652] RSP: 0018:ffff81007ff4fb60  EFLAGS: 00010286
[18663.918652] RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000000000
[18663.918652] RDX: ffff8100141a40b0 RSI: ffff81007ff4fbc0 RDI: 0000000000000000
[18663.918652] RBP: ffff81007ff4fbb0 R08: 0000000000000002 R09: 0000000000000000
[18663.918652] R10: ffffffff805b2afb R11: 000000000520cde8 R12: 00000000c05a019a
[18663.918652] R13: 000000000f26378b R14: ffff810066469d38 R15: ffff81004b4e4000
[18663.918652] FS:  00007f58ac9a0700(0000) GS:ffff81007ff12580(0000)
knlGS:0000000000000000
[18663.918652] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[18663.918652] CR2: 0000000000000000 CR3: 0000000073441000 CR4: 00000000000006e0
[18663.918652] DR0: 00007fffe1e55cbc DR1: 0000000000000000 DR2: 0000000000000000
[18663.918652] DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400
[18663.918652] Process swapper (pid: 0, threadinfo ffff81011ff2c000,
task ffff81007ff4a000)
[18663.918652] Stack:  ffff810066469d38 ffff81004b4e4148
ffffffff805b1ab0 ffff81007ff4fbc0
[18663.918652]  00000000805b2afb ffff81004b4e4000 ffff81004b4e4298
ffff810066469d00
[18663.918652]  ffff810066469d38 0000000000000000 ffff81007ff4fbf0
ffffffff805b2b41
[18663.918652] Call Trace:
[18663.918652]  <IRQ>  [<ffffffff805b1ab0>] xs_tcp_data_recv+0x0/0x560
[18663.918652]  [<ffffffff805b2b41>] xs_tcp_data_ready+0x71/0x90
[18663.918652]  [<ffffffff80568bec>] __tcp_ack_snd_check+0x5c/0xa0
[18663.918652]  [<ffffffff8056a458>] tcp_rcv_established+0x3c8/0x800
[18663.918652]  [<ffffffff80571451>] tcp_v4_do_rcv+0x2e1/0x4e0
[18663.918652]  [<ffffffff80573cb1>] tcp_v4_rcv+0x721/0x850
[18663.918652]  [<ffffffff80553d63>] ip_local_deliver_finish+0xd3/0x250
[18663.918652]  [<ffffffff8055433b>] ip_local_deliver+0x3b/0x90
[18663.918652]  [<ffffffff80553988>] ip_rcv_finish+0x118/0x420
[18663.918652]  [<ffffffff8022e313>] enqueue_task_fair+0x73/0xd0
[18663.918652]  [<ffffffff80554236>] ip_rcv+0x226/0x2f0
[18663.918652]  [<ffffffff80537576>] netif_receive_skb+0x1d6/0x280
[18663.918652]  [<ffffffff8053a1ea>] process_backlog+0x8a/0xf0
[18663.918652]  [<ffffffff80539e84>] net_rx_action+0xb4/0x130
[18663.918652]  [<ffffffff8023d624>] __do_softirq+0x84/0x110
[18663.918652]  [<ffffffff8020c82c>] call_softirq+0x1c/0x30
[18663.918652]  [<ffffffff8020eaa5>] do_softirq+0x65/0xc0
[18663.918652]  [<ffffffff8023d595>] irq_exit+0x95/0xa0
[18663.918652]  [<ffffffff8020ebbf>] do_IRQ+0x8f/0x100
[18663.918652]  [<ffffffff8020a4b0>] default_idle+0x0/0x80
[18663.918652]  [<ffffffff8020bb26>] ret_from_intr+0x0/0xf
[18663.918652]  <EOI>  [<ffffffff80252310>]
__atomic_notifier_call_chain+0x0/0xa0
[18663.918652]  [<ffffffff8020a4f3>] default_idle+0x43/0x80
[18663.918652]  [<ffffffff8020a4f1>] default_idle+0x41/0x80
[18663.918652]  [<ffffffff8020a4b0>] default_idle+0x0/0x80
[18663.918652]  [<ffffffff8020a59c>] cpu_idle+0x6c/0xa0
[18663.918652]  [<ffffffff808109b8>] start_secondary+0x2f8/0x420
[18663.918652]
[18663.918652]
[18663.918652] Code: 48 8b 3b 0f 18 0f 74 75 8b 93 a0 00 00 00 45 89 ec 44 2b 63
[18663.918652] RIP  [<ffffffff8055f2e8>] tcp_read_sock+0x58/0x1b0
[18663.918652]  RSP <ffff81007ff4fb60>
[18663.918652] CR2: 0000000000000000
[18663.918680] ---[ end trace 1dc6b1bf3734ac14 ]---
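
The RIP line and the call trace entries above all have the form
"[<address>] symbol+0xoffset/0xsize". The following is a small sketch of
splitting such an entry into its parts, so that the address (or the
function start, address minus offset) can be handed to a debugger.
struct trace_entry, parse_trace_entry() and trace_entry_base() are
hypothetical names made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* One decoded "[<addr>] sym+0xoff/0xsize" entry (hypothetical type). */
struct trace_entry {
    unsigned long long addr;    /* absolute address inside the function */
    char sym[64];               /* symbol name */
    unsigned long long off;     /* offset of addr from the symbol start */
    unsigned long long size;    /* total size of the function */
};

/*
 * Parse the first "[<addr>] sym+0xoff/0xsize" found in line `s`.
 * Anything before it (timestamp, <IRQ> marker) is skipped.
 * Returns 0 on success, -1 if the line holds no such entry.
 */
int parse_trace_entry(const char *s, struct trace_entry *e)
{
    const char *p = strstr(s, "[<");

    if (!p)
        return -1;
    if (sscanf(p, "[<%llx>] %63[^+]+0x%llx/0x%llx",
               &e->addr, e->sym, &e->off, &e->size) != 4)
        return -1;
    return 0;
}

/* Start address of the function the entry points into. */
unsigned long long trace_entry_base(const struct trace_entry *e)
{
    return e->addr - e->off;
}
```

For the RIP line of this oops, this yields the symbol tcp_read_sock,
offset 0x58 and size 0x1b0, so the fault is 0x58 bytes into a function
of 0x1b0 bytes that starts at 0xffffffff8055f290.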

(gdb) list *0xffffffff8055f2e8
0xffffffff8055f2e8 is in tcp_read_sock (net/ipv4/tcp.c:1173).
1168    static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
1169    {
1170            struct sk_buff *skb;
1171            u32 offset;
1172
1173            skb_queue_walk(&sk->sk_receive_queue, skb) {
1174                    offset = seq - TCP_SKB_CB(skb)->seq;
1175                    if (tcp_hdr(skb)->syn)
1176                            offset--;
1177                    if (offset < skb->len || tcp_hdr(skb)->fin) {
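
One possible reading of the oops, which is an assumption and not
something confirmed in this thread: RBX and CR2 are both zero and the
Code: bytes start with 48 8b 3b (mov (%rbx),%rdi), which would fit the
queue walk at line 1173 loading skb->next while skb itself is NULL. That
can only happen if an element on the circular receive queue has a NULL
next pointer. The user-space sketch below models that situation with a
hypothetical struct node standing in for the list links of struct
sk_buff; none of these names are kernel APIs.

```c
#include <stddef.h>

/* Hypothetical stand-in for an skb: only the list links are modelled. */
struct node {
    struct node *next;
    struct node *prev;
};

/* Initialise an empty circular list: the head points at itself. */
void queue_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Append n at the tail of the list anchored at head. */
void queue_add_tail(struct node *head, struct node *n)
{
    n->next = head;
    n->prev = head->prev;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk the list defensively. Returns the number of nodes visited if
 * the walk arrives back at head, or -1 if a NULL link is met first,
 * which is where an unchecked walk would dereference NULL.
 * Note: this does not detect a corrupted list that loops without
 * passing through head.
 */
int queue_walk_checked(const struct node *head)
{
    int count = 0;
    const struct node *n;

    for (n = head->next; n != head; n = n->next) {
        if (n == NULL)
            return -1;
        count++;
    }
    return count;
}
```

An intact list always leads back to its head, so a walk that ends on
NULL points at list corruption, for example an element that was freed or
unlinked while still being on the queue.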

(gdb) list *0xffffffff805b2b41
0xffffffff805b2b41 is in xs_tcp_data_ready (net/sunrpc/xprtsock.c:1079).
1074                    goto out;
1075
1076            /* We use rd_desc to pass struct xprt to xs_tcp_data_recv */
1077            rd_desc.arg.data = xprt;
1078            rd_desc.count = 65536;
1079            tcp_read_sock(sk, &rd_desc, xs_tcp_data_recv);
1080    out:
1081            read_unlock(&sk->sk_callback_lock);
1082    }
1083

I will see what vanilla -rc6 will do...

Torsten