Hi Jon,

I think I found the problem, which ultimately may only exist on our end (see 
below for an explanation, and let me know if you agree).

The fellow that was maintaining our O/S previously (no longer with the company) 
had made some patches to the 4.4.0 kernel TIPC code, and indeed one of them is 
in the offending tipc_sk_rcv() function.  

Specifically, note this segment of code from our kernel source tree:

                       /* Send pending response/rejected messages, if any */
                       while (!skb_queue_empty(&sk->sk_write_queue)) {
                               skb = skb_dequeue(&sk->sk_write_queue);
                               dnode = msg_destnode(buf_msg(skb));
                               tipc_node_xmit_skb(net, skb, dnode, dport);
                       }

Whereas the latest and greatest official longterm 4.9.11 kernel has:

         /* Send pending response/rejected messages, if any */
         while ((skb = __skb_dequeue(&xmitq))) {
            dnode = msg_destnode(buf_msg(skb));
            tipc_node_xmit_skb(net, skb, dnode, dport);
         }

The code path that triggers the oops (in our source code) is from:

dnode = msg_destnode(buf_msg(skb));

where msg_destnode() calls msg_word() which calls:

ntohl(m->hdr[pos]);

which is precisely where the oops occurred.
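
To make that call chain concrete, here is a rough user-space model of it. This is only a sketch: the model_* names and MODEL_HDR_WORDS are made up for illustration, and word index 7 for the destination node is simply how I read msg.h in our tree. One more observation, which I have not verified against the sk_buff layout for our config: RAX is 0 and CR2 is 0xd8 in the oops, and the bytes just before the faulting instruction look like a queue-empty comparison followed by a call. That would be consistent with skb itself coming back NULL, so the fault may really be the skb->data load in buf_msg(), which gdb attributes to the same inlined line.

```c
#include <arpa/inet.h>  /* htonl(), ntohl() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_HDR_WORDS 15  /* assumed header size in 32-bit words */

/* User-space stand-ins for illustration only, NOT the kernel definitions. */
struct model_msg { uint32_t hdr[MODEL_HDR_WORDS]; };
struct model_skb { unsigned char *data; };

/* Mirrors buf_msg(): this load faults if skb is NULL. */
static struct model_msg *model_buf_msg(struct model_skb *skb)
{
	return (struct model_msg *)skb->data;
}

/* Mirrors msg_word(): header words are stored in network byte order. */
static uint32_t model_msg_word(const struct model_msg *m, uint32_t pos)
{
	assert(pos < MODEL_HDR_WORDS);
	return ntohl(m->hdr[pos]);
}

/* Mirrors msg_destnode(); word 7 is an assumption from my reading of msg.h. */
static uint32_t model_msg_destnode(const struct model_msg *m)
{
	return model_msg_word(m, 7);
}

/* Guarded variant: returns 0 and fills *dnode, or -1 for a NULL buffer. */
static int model_destnode_checked(struct model_skb *skb, uint32_t *dnode)
{
	if (skb == NULL || skb->data == NULL)
		return -1;
	*dnode = model_msg_destnode(model_buf_msg(skb));
	return 0;
}
```

The unguarded chain has no way to survive a NULL skb or a stale data pointer, which is why the loop that feeds it matters so much.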

I'm not exactly sure where he got that code change - my guess is he posted a 
question on the tipc-discussion list and got a suggestion to try a code 
snippet, but in the end the actual changes (that were officially released at 
kernel.org) differed, as per above.  Indeed, on Google I can see some threads 
discussing a 'deadly embrace' deadlock (for example 
http://www.spinics.net/lists/netdev/msg382379.html) between yourself and him.  
Another possibility is that the offending source code in question was indeed 
released sometime after 4.4.0, but has since been modified/fixed, thus 
explaining the discrepancy.

If either of these possibilities is what actually happened, then this may not 
be a bug you need to worry about.  Granted, the same msg_destnode() call still 
exists in the current (4.9.11 and 4.10) code, but the semantics of the 
enclosing while loop are different, and perhaps that eliminates the issue.  
Thoughts?
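
For what it's worth, here is how I picture the difference between the two loop shapes, as a small user-space sketch. Everything named model_* or drain_* is hypothetical, and the other_context_drains flag just simulates some other context emptying the queue between the emptiness test and the dequeue. I have not confirmed that such a race can actually happen on sk_write_queue in our tree. In the 4.9.11 code xmitq looks like a queue private to the function, and the loop tests the dequeued pointer itself, so the body should never see NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for sk_buff and sk_buff_head, for illustration only. */
struct model_skb { struct model_skb *next; };
struct model_queue { struct model_skb *head; struct model_skb *tail; };

static void model_enqueue(struct model_queue *q, struct model_skb *skb)
{
	assert(skb != NULL);
	skb->next = NULL;
	if (q->tail)
		q->tail->next = skb;
	else
		q->head = skb;
	q->tail = skb;
}

/* Like skb_dequeue(): returns NULL when the queue is empty. */
static struct model_skb *model_dequeue(struct model_queue *q)
{
	struct model_skb *skb = q->head;

	if (skb) {
		q->head = skb->next;
		if (!q->head)
			q->tail = NULL;
		skb->next = NULL;
	}
	return skb;
}

static int model_queue_empty(const struct model_queue *q)
{
	return q->head == NULL;
}

/* Shape of our patched 4.4.0 loop: test for empty, then dequeue.
 * Returns 1 if the loop body was handed a NULL buffer, 0 otherwise. */
static int drain_check_then_dequeue(struct model_queue *q,
				    int other_context_drains)
{
	while (!model_queue_empty(q)) {
		struct model_skb *skb;

		if (other_context_drains)
			while (model_dequeue(q))
				;	/* simulated concurrent consumer */
		skb = model_dequeue(q);
		if (skb == NULL)
			return 1;	/* the kernel code would dereference skb here */
	}
	return 0;
}

/* Shape of the 4.9.11 loop: dequeue and test the result in one step.
 * Returns the number of buffers processed. */
static int drain_dequeue_and_test(struct model_queue *q)
{
	struct model_skb *skb;
	int sent = 0;

	while ((skb = model_dequeue(q)))
		sent++;
	return sent;
}
```

If this picture is right, the second shape is safe regardless of who else touches the queue, whereas the first depends on nothing changing between the test and the dequeue.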

Peter





-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com] 
Sent: February-22-17 3:01 PM
To: Butler, Peter <pbut...@sonusnet.com>; tipc-discussion@lists.sourceforge.net
Subject: RE: TIPC Oops in tipc_sk_recv



> -----Original Message-----
> From: Butler, Peter [mailto:pbut...@sonusnet.com]
> Sent: Wednesday, February 22, 2017 02:15 PM
> To: Jon Maloy <jon.ma...@ericsson.com>; tipc- 
> discuss...@lists.sourceforge.net
> Cc: Butler, Peter <pbut...@sonusnet.com>
> Subject: RE: TIPC Oops in tipc_sk_recv
> 
> For the "Source file is more recent than executable" message, could 
> this simply be due to the fact that I copied the kernel source to the 
> lab and then ran the gdb commands as shown?  As such, the newly copied 
> files would have a newer timestamp than the kernel/tipc.ko files.  
> (The kernel is actually built on a separate build machine, not on the 
> test lab machine.)

If you are certain that the build was made from the same source, this is a 
false alarm, caused by the timestamp as you suggest.

///jon

> 
> Or could I get that message for another reason?
> 
> 
> 
> -----Original Message-----
> From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> Sent: February-22-17 2:11 PM
> To: Butler, Peter <pbut...@sonusnet.com>; tipc- 
> discuss...@lists.sourceforge.net
> Subject: RE: TIPC Oops in tipc_sk_recv
> 
> 
> 
> > -----Original Message-----
> > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > Sent: Wednesday, February 22, 2017 01:04 PM
> > To: Jon Maloy <jon.ma...@ericsson.com>; tipc- 
> > discuss...@lists.sourceforge.net
> > Cc: Butler, Peter <pbut...@sonusnet.com>
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> > I took a stab at it this way - not sure if I am doing this correctly or not.
> >
> > [root@myVMslot12 ~]# gdb /boot/vmlinuz-4.4.0 /proc/kcore
> > GNU gdb (GDB) Fedora (7.3.50.20110722-13.fc16)
> > Copyright (C) 2011 Free Software Foundation, Inc.
> > License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> > This is free software: you are free to change and redistribute it.
> > There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> > and "show warranty" for details.
> > This GDB was configured as "x86_64-redhat-linux-gnu".
> > For bug reporting instructions, please see:
> > <http://www.gnu.org/software/gdb/bugs/>...
> > BFD: /boot/vmlinuz-4.4.0: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
> > BFD: /boot/vmlinuz-4.4.0: Warning: Ignoring section flag IMAGE_SCN_MEM_NOT_PAGED in section .bss
> > Reading symbols from /boot/vmlinuz-4.4.0...(no debugging symbols found)...done.
> >
> > warning: core file may not match specified executable file.
> > [New process 1]
> > Core was generated by `BOOT_IMAGE=/vmlinuz-4.4.0 root=UUID=b419f9ff-80ce-459e-855c-614d86a48105 ro rd.'.
> > #0  0x0000000000000000 in ?? ()
> > (gdb) file /lib/modules/4.4.0/kernel/net/tipc/tipc.ko
> > warning: core file may not match specified executable file.
> > Reading symbols from /lib/modules/4.4.0/kernel/net/tipc/tipc.ko...done.
> > (gdb) list *(tipc_sk_rcv+0x238)
> > 0x14898 is in tipc_sk_rcv (net/tipc/msg.h:131).
> > warning: Source file is more recent than executable.
> 
> Seems like you didn't rebuild after you updated the source file?
> Try again just to make sure.
> 
> > 126             return (struct tipc_msg *)skb->data;
> > 127     }
> > 128
> > 129     static inline u32 msg_word(struct tipc_msg *m, u32 pos)
> > 130     {
> > 131             return ntohl(m->hdr[pos]);
> 
> If this is correct, you are receiving a corrupt buffer where the data 
> pointer is invalid. This is typical if the buffer already has been released.
> 
> ///jon
> 
> > 132     }
> > 133
> > 134     static inline void msg_set_word(struct tipc_msg *m, u32 w, u32 val)
> > 135     {
> >
> >
> >
> >
> > -----Original Message-----
> > From: Butler, Peter
> > Sent: February-22-17 12:45 PM
> > To: Jon Maloy <jon.ma...@ericsson.com>; tipc- 
> > discuss...@lists.sourceforge.net
> > Cc: Butler, Peter <pbut...@sonusnet.com>
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> > Hi Jon
> >
> > Thanks for the info.
> >
> > One thing I should clarify.  Although we are running the 4.4.0 
> > kernel, we had backported a number of post-4.4.0 TIPC patches into 
> > our 4.4.0 kernel.  As such, the offset in question 
> > (tipc_sk_rcv+0x238) will not match that in the vanilla 4.4.0 source.
> >
> > Should I post the entire socket.c file to this list for your review?
> > Or is there an easy way for me to do a similar listing using our 
> > actual tipc.ko file here in the lab?
> >
> > Peter
> >
> >
> >
> >
> > -----Original Message-----
> > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > Sent: February-22-17 12:29 PM
> > To: Butler, Peter <pbut...@sonusnet.com>; tipc- 
> > discuss...@lists.sourceforge.net
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> > Hi Peter,
> > Very hard to make any suggestions on how to reproduce this. What I 
> > can see is that it is a STREAM message being sent from a node local 
> > socket, i.e., it doesn't go via any interface. The crash seems to 
> > happen when the receiving socket is owned by the user, and while we 
> > are instead adding the message to the backlog queue:
> >
> > Reading symbols from net/tipc/tipc.ko...done.
> > (gdb) list *(tipc_sk_rcv+0x238)
> > 0x13d78 is in tipc_sk_rcv (./arch/x86/include/asm/atomic.h:214).
> > 209     static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u)
> > 210     {
> > 211             int c, old;
> > 212             c = atomic_read(v);
> > 213             for (;;) {
> > 214                     if (unlikely(c == (u)))
> > 215                             break;
> > 216                     old = atomic_cmpxchg((v), c, c + (a));
> > 217                     if (likely(old == c))
> > 218                             break;
> >
> > This is about what I can get out of it at the moment. Maybe you 
> > should try a high-load test between two local sockets (try the 
> > benchmark demo from
> > tipcutils) and see what you can achieve.
> >
> > BR
> > ///jon
> >
> >
> > > -----Original Message-----
> > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > Sent: Wednesday, February 22, 2017 10:40 AM
> > > To: Jon Maloy <jon.ma...@ericsson.com>; tipc- 
> > > discuss...@lists.sourceforge.net
> > > Cc: Butler, Peter <pbut...@sonusnet.com>
> > > Subject: RE: TIPC Oops in tipc_sk_recv
> > >
> > > If you have any suggestions as to procedures/tricks you think 
> > > might trigger this bug I can certainly attempt to do so in the lab.
> > > Obviously we can't attempt to reproduce it on the customer's 
> > > (live) system.
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Butler, Peter
> > > Sent: February-21-17 3:39 PM
> > > To: Jon Maloy <jon.ma...@ericsson.com>; tipc- 
> > > discuss...@lists.sourceforge.net
> > > Cc: Butler, Peter <pbut...@sonusnet.com>
> > > Subject: RE: TIPC Oops in tipc_sk_recv
> > >
> > > Unfortunately this occurred on a customer system so it is not 
> > > readily reproducible.  We have not seen this occur in our lab.
> > >
> > > For what it's worth, it occurred while the process was in 
> > > TASK_UNINTERRUPTIBLE.  As such, the kernel could not actually kill 
> > > off the associated process despite the Oops, and the process 
> > > remained forever frozen in the 'D' state and the card had to be rebooted.
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > Sent: February-21-17 3:36 PM
> > > To: Butler, Peter <pbut...@sonusnet.com>; tipc- 
> > > discuss...@lists.sourceforge.net
> > > Subject: RE: TIPC Oops in tipc_sk_recv
> > >
> > > Hi Peter,
> > > I don't think this is any known bug. Is it repeatable?
> > >
> > > ///jon
> > >
> > > > -----Original Message-----
> > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > Sent: Tuesday, February 21, 2017 12:14 PM
> > > > To: tipc-discussion@lists.sourceforge.net
> > > > Cc: Butler, Peter <pbut...@sonusnet.com>
> > > > Subject: [tipc-discussion] TIPC Oops in tipc_sk_recv
> > > >
> > > > This was with kernel 4.4.0, however I don't see any fix 
> > > > specifically related to this in any subsequent 4.4.x kernel...
> > > >
> > > > > BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
> > > > > IP: [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > > PGD 34f4c0067 PUD 34ed95067 PMD 0
> > > > > Oops: 0000 [#1] SMP
> > > > > Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel iTCO_wdt 8021q garp xt_physdev br_netfilter bridge stp llc nf_conntrack_ipv4 ipmiq_drv(O) nf_defrag_ipv4 sio_mmc(O) ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack event_drv(O) ip6table_filter lockd ip6_tables pt_timer_info(O) ddi(O) grace usb_storage ixgbe igb iTCO_vendor_support i2c_algo_bit ptp i2c_i801 pps_core lpc_ich i2c_core intel_ips mfd_core pcspkr ioatdma sunrpc dca tpm_tis mdio tpm [last unloaded: iTCO_wdt]
> > > > > CPU: 2 PID: 12144 Comm: dinamo Tainted: G           O    4.4.0 #23
> > > > > Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014
> > > > > task: ffff880036ad8000 ti: ffff880036900000 task.ti: ffff880036900000
> > > > > RIP: 0010:[<ffffffffa0148868>]  [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > > RSP: 0018:ffff880036903bb8  EFLAGS: 00010292
> > > > > RAX: 0000000000000000 RBX: ffff88034def3970 RCX: 0000000000000001
> > > > > RDX: 0000000000000101 RSI: 0000000000000292 RDI: ffff88034def3984
> > > > > RBP: ffff880036903c28 R08: 0000000000000101 R09: 0000000000000004
> > > > > R10: 0000000000000001 R11: 0000000000000000 R12: ffff880036903d28
> > > > > R13: 00000000bd1fd8b2 R14: ffff88034def3840 R15: ffff880036903d3c
> > > > > FS:  00007f1e86299740(0000) GS:ffff88035fc40000(0000) knlGS:0000000000000000
> > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > CR2: 00000000000000d8 CR3: 0000000036835000 CR4: 00000000000006e0
> > > > > Stack:
> > > > >  000000000000009b ffff880036903d28 0000000000000018 ffff88034def38c8
> > > > >  ffffffff81ce6240 ffff8802b9bdba00 ffff880036903ca8 ffffffffa013bd7e
> > > > >  ffff8802b99d5ee8 ffff880036903c60 0000000000000000 ffff88003693cb00
> > > > > Call Trace:
> > > > >  [<ffffffffa013bd7e>] ? tipc_msg_build+0xde/0x4f0 [tipc]
> > > > >  [<ffffffffa014358f>] tipc_node_xmit+0x11f/0x150 [tipc]
> > > > >  [<ffffffffa01470ba>] __tipc_send_stream+0x16a/0x300 [tipc]
> > > > >  [<ffffffff81625eb5>] ? tcp_sendmsg+0x4d5/0xb00
> > > > >  [<ffffffffa0147292>] tipc_send_stream+0x42/0x70 [tipc]
> > > > >  [<ffffffff815bcf77>] sock_sendmsg+0x47/0x50
> > > > >  [<ffffffff815bd03f>] sock_write_iter+0x7f/0xd0
> > > > >  [<ffffffff811d799a>] __vfs_write+0xaa/0xe0
> > > > >  [<ffffffff811d8b16>] vfs_write+0xb6/0x1a0
> > > > >  [<ffffffff811d8e3f>] SyS_write+0x4f/0xb0
> > > > >  [<ffffffff816de6d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
> > > > > Code: 89 de 4c 89 f7 e8 29 d3 ff ff 48 8b 7d a8 e8 60 59 59 e1 49 8d 9e 30 01 00 00 49 3b 9e 30 01 00 00 74 30 48 89 df e8 b8 b6 47 e1 <48> 8b 90 d8 00 00 00 48 8b 7d b0 44 89 e9 48 89 c6 48 89 45 c0
> > > > > RIP  [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > >  RSP <ffff880036903bb8>
> > > > > CR2: 00000000000000d8
> > > > > ---[ end trace 1c2d69738941d565 ]---
> > > >
> > > >
> > > > ----------------------------------------------------------------
> > > > --
> > > > --
> > > > --
> > > > -------- Check out the vibrant tech community on one of the 
> > > > world's most engaging tech sites, SlashDot.org!
> > > > http://sdm.link/slashdot
> > > > _______________________________________________
> > > > tipc-discussion mailing list
> > > > tipc-discussion@lists.sourceforge.net
> > > > https://lists.sourceforge.net/lists/listinfo/tipc-discussion

