Hi Partha,

In our situation we do not need to support the delivery of RPMs in any way.  
Literally the only thing we are changing on the target systems is the tipc.ko 
file.  That is, the original 4.4.0 kernel and all other 4.4.0-specific kernel 
modules will be left untouched.

I am doing this by actually removing the include/uapi/linux/tipc* and net/tipc/* 
files from within our 4.4.0 kernel source tree, and replacing them with the 
files from kernel 4.9.11.  (Note that kernel 4.9.11 actually has a couple more 
TIPC-related files than the 4.4.0 kernel.)  To accomplish this I had to make a few 
changes (as per the email thread between Jon and myself) to get it to compile.

Then, when I kick off a 'make' (no 'make clean' is performed) at the top level 
of the kernel source tree the build process detects that everything 
TIPC-related requires building, and a new tipc.ko is generated.  This tipc.ko 
is literally taken and installed onto the existing 4.4.0 systems without any 
other changes (e.g. no new bzImage is installed - the original kernel file is 
left untouched).

We're not concerned about maintainability for now, as we plan on doing a full 
upgrade of the entire kernel at some point in the next few months.  The hybrid 
of a 4.4.0 kernel running a TIPC source from 4.9.11 is only a stop-gap measure 
for an emergency fix needed asap.

If you can foresee any issues with our short-term plan here let me know.  As it 
stands I have the module built and running - but that of course doesn't mean 
that run-time issues won't occur.

/Peter

From: Parthasarathy Bhuvaragan [mailto:parthasarathy.bhuvara...@ericsson.com]
Sent: February-24-17 5:21 AM
To: Butler, Peter <pbut...@sonusnet.com>
Cc: Jon Maloy <jon.ma...@ericsson.com>; tipc-discussion@lists.sourceforge.net
Subject: Re: TIPC Oops in tipc_sk_recv


Hi Peter,



The backporting strategy varies depending on:

1. Supporting upgrades of rpm's. Ex: can you deliver a new tipc rpm and update 
it on an existing kernel.

2. Delivering / Upgrading the entire kernel. No individual rpm updates are 
delivered.



If it's option 2, then you may be allowed to update the tipc ABI, i.e. include the 
commits which touch include/uapi/linux/tipc*.

I have to support option 1, so I cannot include any commit which touches files 
outside net/tipc/ without manual intervention.



The way I do it is to use git to walk all the commits from a specific 
version up to, say, v4.9, and follow these rules:

1. Skip commits which are not tipc specific, i.e. those introduced as a part of 
core net cleanup. They usually break the ABI.

2. If you skip a commit, the subsequent commit needs to be amended to apply 
cleanly.

3. When cherry-picking commits, use option "-x" to record the upstream commit 
id. This way you can do git-blame and find out the history.

This is a slow process, but you will be sure of the commits you pick and their 
history.



If you copy the tipc source from a later kernel to say v4.4.x, then you lose 
the history. This will hinder maintainability in the long run.



/Partha

________________________________
From: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
Sent: Thursday, February 23, 2017 9:29 PM
To: Jon Maloy; 
tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>;
 Parthasarathy Bhuvaragan
Cc: Butler, Peter
Subject: RE: TIPC Oops in tipc_sk_recv

I have made the following final change: this change works around the different 
function signature for udp_tunnel6_xmit_skb() in udp_media.c (function is 
defined in net/ipv6/ip6_udp_tunnel.c):

Change:
      err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb,
                  ndst->dev, &src->ipv6,
                  &dst->ipv6, 0, ttl, 0, src->port,
                  dst->port, false);

To be:
      err = udp_tunnel6_xmit_skb(ndst, ub->ubsock->sk, skb,
                  ndst->dev, &src->ipv6,
                  &dst->ipv6, 0, ttl, src->port,
                  dst->port, false);

That is, simply remove the '0' parameter (which comes immediately after the ttl 
parameter).  In 4.9.11 this is a variable called 'label' and is being passed as 
'0', while in 4.4.0 it appears to be explicitly set to 0 directly within the 
udp_tunnel6_xmit_skb() function anyway.
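
A thin wrapper is one way to keep this difference in a single place rather 
than in the call itself.  The sketch below is a userspace model only, not 
kernel code: TUNNEL6_HAS_LABEL, tunnel6_xmit() and compat_tunnel6_xmit() are 
made-up names, the real function takes more arguments than shown, and 'prio' 
is my reading of what the first '0' in the call is.  Each stand-in returns the 
ttl it received, so an argument shifted by one position (easy to do here, 
since the call has two literal '0' arguments) would show up.

```c
#include <assert.h>	/* for the check mentioned in the note below */

/* Hypothetical build switch (not a real kernel macro): define it when the
 * target tree's udp_tunnel6_xmit_skb() accepts the 'label' argument. */
/* #define TUNNEL6_HAS_LABEL */

/* Userspace stand-ins for the two signatures.  Each returns the ttl it saw,
 * so a caller can verify that no argument slipped into the wrong slot. */
#ifdef TUNNEL6_HAS_LABEL
static int tunnel6_xmit(int prio, int ttl, int label, int sport, int dport)
{
	(void)prio; (void)label; (void)sport; (void)dport;
	return ttl;
}
#else
static int tunnel6_xmit(int prio, int ttl, int sport, int dport)
{
	(void)prio; (void)sport; (void)dport;
	return ttl;
}
#endif

/* Single call site for the rest of the code, whichever tree is built. */
static int compat_tunnel6_xmit(int prio, int ttl, int sport, int dport)
{
#ifdef TUNNEL6_HAS_LABEL
	return tunnel6_xmit(prio, ttl, 0 /* label */, sport, dport);
#else
	return tunnel6_xmit(prio, ttl, sport, dport);
#endif
}
```

With or without the switch defined, 
assert(compat_tunnel6_xmit(0, 64, 6118, 6118) == 64) should hold; the port 
number there is arbitrary.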

With that last change in effect, everything now compiles.  (I have not tested 
anything, mind you.)

Note that I did not come across any errors regarding the iov handling in 
msg_build() that you mentioned.  Were you expecting compilation to fail there?  
Or were you expecting it to succeed, but the resulting TIPC functionality to 
simply be erroneous at run-time?

Peter





-----Original Message-----
From: Butler, Peter
Sent: February-23-17 2:48 PM
To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; 
tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>;
 Parthasarathy Bhuvaragan 
<parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
Subject: RE: TIPC Oops in tipc_sk_recv

I have made the following change so as to work around the missing 
skwq_has_sleeper() function in our 4.4.0 kernel source stream (as required for 
the 4.9.11 TIPC source).  This change was based on a comparison of 4.4.0 and 
4.9.11 kernel code (include/net/sock.h and include/linux/wait.h).

Change:
if (skwq_has_sleeper(wq))

To be:
if (wq && wq_has_sleeper(wq))

Let me know if that seems reasonable to you.
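
The same workaround could also be written as a small wrapper, so that the 
4.9.11 call sites keep using skwq_has_sleeper() unchanged.  A rough userspace 
sketch, assuming (as the change above implies) that the 4.4.0 wq_has_sleeper() 
accepts the struct socket_wq pointer; the struct layout and the 'sleepers' 
field are stand-ins invented so that this compiles outside the kernel:

```c
#include <assert.h>	/* for the checks mentioned in the note below */
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct socket_wq (hypothetical, userspace only). */
struct socket_wq {
	int sleepers;	/* models "someone is waiting on this queue" */
};

/* Stand-in for the 4.4.0 helper, assumed here to take a socket_wq pointer. */
static bool wq_has_sleeper(struct socket_wq *wq)
{
	return wq->sleepers > 0;
}

/* Compat wrapper: gives the 4.9.11 code the name it expects and keeps the
 * NULL check from the workaround above in one place. */
static bool skwq_has_sleeper(struct socket_wq *wq)
{
	return wq && wq_has_sleeper(wq);
}
```

In this model skwq_has_sleeper(NULL) is false, and a queue with sleepers set 
to 0 also reports false.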

With this change in effect, my compilation now proceeds further (see below).  
As always, any insight is much appreciated.

  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHK     include/generated/bounds.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  LD      net/tipc/built-in.o
  CC [M]  net/tipc/addr.o
  CC [M]  net/tipc/bcast.o
  CC [M]  net/tipc/bearer.o
  CC [M]  net/tipc/core.o
  CC [M]  net/tipc/link.o
  CC [M]  net/tipc/discover.o
  CC [M]  net/tipc/msg.o
  CC [M]  net/tipc/name_distr.o
  CC [M]  net/tipc/subscr.o
  CC [M]  net/tipc/monitor.o
  CC [M]  net/tipc/name_table.o
  CC [M]  net/tipc/net.o
  CC [M]  net/tipc/netlink.o
  CC [M]  net/tipc/netlink_compat.o
  CC [M]  net/tipc/node.o
  CC [M]  net/tipc/socket.o
  CC [M]  net/tipc/eth_media.o
  CC [M]  net/tipc/server.o
  CC [M]  net/tipc/udp_media.o
net/tipc/udp_media.c: In function 'tipc_udp_xmit':
net/tipc/udp_media.c:199:9: error: too many arguments to function 
'udp_tunnel6_xmit_skb'
include/net/udp_tunnel.h:87:5: note: declared here
make[1]: *** [net/tipc/udp_media.o] Error 1
make: *** [net/tipc/] Error 2

-----Original Message-----
From: Butler, Peter
Sent: February-23-17 2:14 PM
To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; 
tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>;
 Parthasarathy Bhuvaragan 
<parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
Subject: RE: TIPC Oops in tipc_sk_recv

I have changed TIPC_DEF_MON_THRESHOLD (in core.h) from 32 to 100 as suggested.

I still (of course) had to comment out all functionality within 
__tipc_nl_add_monitor_peer() so as to get around the undefined 
nla_put_u64_64bit() function call.  As such, __tipc_nl_add_monitor_peer() is 
now reduced to nothing more than a "return 0" statement.

Note that I did not bother to similarly comment out other 
netlink-monitoring-related functions in monitor.c, since I assume that 
monitoring is now explicitly disabled (as per your suggestion to change 
TIPC_DEF_MON_THRESHOLD) - correct?

As such my compilation now makes it this far (see below).  I will look at this 
error but as always am open to (more enlightened) insight.

  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  CHK     include/generated/bounds.h
  CHK     include/generated/timeconst.h
  CHK     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  LD      net/tipc/built-in.o
  CC [M]  net/tipc/addr.o
  CC [M]  net/tipc/bcast.o
  CC [M]  net/tipc/bearer.o
  CC [M]  net/tipc/core.o
  CC [M]  net/tipc/link.o
  CC [M]  net/tipc/discover.o
  CC [M]  net/tipc/msg.o
  CC [M]  net/tipc/name_distr.o
  CC [M]  net/tipc/subscr.o
  CC [M]  net/tipc/monitor.o
  CC [M]  net/tipc/name_table.o
  CC [M]  net/tipc/net.o
  CC [M]  net/tipc/netlink.o
  CC [M]  net/tipc/netlink_compat.o
  CC [M]  net/tipc/node.o
  CC [M]  net/tipc/socket.o
net/tipc/socket.c: In function 'tipc_write_space':
net/tipc/socket.c:1492:2: error: implicit declaration of function 
'skwq_has_sleeper' [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[1]: *** [net/tipc/socket.o] Error 1
make: *** [net/tipc/] Error 2

-----Original Message-----
From: Butler, Peter
Sent: February-23-17 1:45 PM
To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; 
tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>;
 Parthasarathy Bhuvaragan 
<parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
Subject: RE: TIPC Oops in tipc_sk_recv

I definitely don't want to be moving into dangerous waters, so I'll take your 
suggestion right now and start over....

-----Original Message-----
From: Jon Maloy [mailto:jon.ma...@ericsson.com]
Sent: February-23-17 1:43 PM
To: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; 
tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>;
 Parthasarathy Bhuvaragan 
<parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
Subject: RE: TIPC Oops in tipc_sk_recv



> -----Original Message-----
> From: Butler, Peter [mailto:pbut...@sonusnet.com]
> Sent: Thursday, February 23, 2017 01:23 PM
> To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>; 
> Parthasarathy Bhuvaragan
> <parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
> Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> Subject: RE: TIPC Oops in tipc_sk_recv
>
> That might be a possibility - I know the customer is close to 32 nodes
> however, so it might not be.
>
> I'm also looking at porting the required functionality from
> include/net/netlink.h and lib/nlattr.c directly into the TIPC
> monitor.c file (as opposed to changing any code directly in include/net and 
> lib/.....

I think you are moving into dangerous waters here, unless you only want the 
code to compile.
A simpler and safer option: change #define TIPC_DEF_MON_THRESHOLD in core.h 
from 32 to e.g. 100, and the hierarchical monitoring will be disabled. This is 
the way we have been running forever until 4.7, so this is a safe bet.
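
To illustrate why raising the value has this effect, here is a minimal 
sketch.  It assumes the activation check is a plain "peer count greater than 
threshold" comparison; mon_is_active() is a made-up name and the code is a 
userspace model, not the code in monitor.c:

```c
#include <assert.h>	/* for the checks mentioned in the note below */
#include <stdbool.h>

/* Value after the suggested edit to core.h (32 before the edit). */
#define TIPC_DEF_MON_THRESHOLD 100

/* Assumed activation rule: hierarchical monitoring is only used once the
 * number of peers exceeds the threshold. */
static bool mon_is_active(int peer_cnt, int threshold)
{
	return peer_cnt > threshold;
}
```

Under that assumption a cluster of 32 or 40 peers leaves hierarchical 
monitoring off with the threshold at 100, whereas with the threshold at 32 it 
would switch on at 33 peers.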

//jon

>
>
>
> -----Original Message-----
> From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> Sent: February-23-17 1:19 PM
> To: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; tipc-
> discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>; 
> Parthasarathy Bhuvaragan
> <parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
> Subject: RE: TIPC Oops in tipc_sk_recv
>
>
>
> > -----Original Message-----
> > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > Sent: Thursday, February 23, 2017 01:09 PM
> > To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>; 
> > Parthasarathy Bhuvaragan
> > <parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
> > Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> > Partha - an update for you
> >
> > I've ported all the TIPC code from 4.9.11 into our 4.4.0 kernel code
> > base.  By this I mean I have completely removed all the existing
> > TIPC files in their entirety from:
> >
> > include/uapi/linux/tipc*
> > net/tipc/*
> >
> > in our 4.4.0 kernel source tree, and replaced these with all the
> > files from 4.9.11.
> >
> > As Jon indeed forewarned me, there will be a hurdle or two to
> > integrate this with the 4.4.0 kernel's internal API.  As it stands
> > this is where the compilation first fails.  I can certainly look
> > into this myself but am told you are the expert.
> > (I am far from a kernel expert myself.)
> >
> >   LD      net/tipc/built-in.o
> >   CC [M]  net/tipc/addr.o
> >   CC [M]  net/tipc/bcast.o
> >   CC [M]  net/tipc/bearer.o
> >   CC [M]  net/tipc/core.o
> >   CC [M]  net/tipc/link.o
> >   CC [M]  net/tipc/discover.o
> >   CC [M]  net/tipc/msg.o
> >   CC [M]  net/tipc/name_distr.o
> >   CC [M]  net/tipc/subscr.o
> >   CC [M]  net/tipc/monitor.o
> > net/tipc/monitor.c: In function '__tipc_nl_add_monitor_peer':
>
> Unless you are running a cluster > 32 nodes and need the hierarchical
> neighbor monitoring feature, you can just comment out the contents of
> this function and the other monitor-related netlink functions.
>
> ///jon
>
> > net/tipc/monitor.c:707:3: error: implicit declaration of function
> > 'nla_put_u64_64bit' [-Werror=implicit-function-declaration]
> > cc1: some warnings being treated as errors
> > make[2]: *** [net/tipc/monitor.o] Error 1
> > make[1]: *** [net/tipc] Error 2
> > make: *** [net] Error 2
> >
> >
> >
> > -----Original Message-----
> > From: Butler, Peter
> > Sent: February-23-17 10:56 AM
> > To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>; 
> > Parthasarathy Bhuvaragan
> > <parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
> > Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> > Hi Partha,
> >
> > I'll give you the short version here to save you the time of reading
> > this entire thread.
> >
> > Basically I need to port the latest and greatest TIPC code (i.e.
> > from the latest longterm kernel release, namely 4.9.11) into a 4.4.0
> > kernel source base.  (I know that sounds ugly but it's for an
> > emergency quick-fix and upgrading the entire kernel is not an option
> > at this
> > time...)
> >
> > Jon has said this is entirely doable but that you are the expert,
> > and that there will be at least one minor hurdle in doing so, namely
> > in iov handling in msg_build().
> >
> > Thanks,
> >
> > Peter
> >
> >
> >
> > -----Original Message-----
> > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > Sent: February-23-17 10:45 AM
> > To: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; tipc-
> > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>; 
> > Parthasarathy Bhuvaragan
> > <parthasarathy.bhuvara...@ericsson.com<mailto:parthasarathy.bhuvara...@ericsson.com>>
> > Subject: RE: TIPC Oops in tipc_sk_recv
> >
> >
> >
> > > -----Original Message-----
> > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > Sent: Thursday, February 23, 2017 10:25 AM
> > > To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; 
> > > tipc-
> > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > Subject: RE: TIPC Oops in tipc_sk_recv
> > >
> > > Hi Jon,
> > >
> > > Thanks for the info.  The solution we are considering (to give the
> > > customer an emergency patch) is to backport the TIPC code from kernel
> > > 4.4.50 into our 4.4.0 kernel source tree.  From what I can see, I
> > > should be able to do so with little effort.  I am assuming (?)
> > > that since 4.4.x is a longterm kernel release that the
> > > 4.4.50 TIPC code is considered stable and devoid of the original
> > > bug associated with this section of code in tipc_sk_rcv() - am I
> > > wrong to assume that?
> >
> > Unfortunately yes. The only safe solution to the deadlock problem is
> > the one you find in later versions.
> > The patch fixing this particular problem hasn't been applied this
> > far back, probably because it didn't apply cleanly.
> >
> > > The section of code in question is entirely different in 4.4.50
> > > than what we currently have:
> > >
> > >       if (likely(tsk)) {
> > >          sk = &tsk->sk;
> > >          if (likely(spin_trylock_bh(&sk->sk_lock.slock))) {
> > >             tipc_sk_enqueue(inputq, sk, dport);
> > >             spin_unlock_bh(&sk->sk_lock.slock);
> > >          }
> > >          sock_put(sk);
> > >          continue;
> > >       }
> > >
> > > Does this mean that the 4.4.50 version (as shown above) is still
> > > susceptible to the original bug?  (Our original O/S maintainer
> > > patched this section because of the original bug that was causing
> > > an oops there - but obviously the patch he implemented was also
> > > buggy, as previously discussed.)
> > >
> > > Ultimately we would rather upgrade our entire kernel (say, to
> > > 4.9.11
> > > - the latest and greatest longterm release) but I see the TIPC
> > > design has changed significantly and I'm not sure if it would
> > > backport into our 4.4.0 kernel without significant effort; i.e.
> > > perhaps this change in design also depends on other API changes
> > > within other layers of the kernel.  If I am wrong in this and you
> > > think that the 4.9.11 TIPC code should be able to be backported to
> > > our 4.4.0 base then I will do so,
> >
> > It is absolutely doable. As a matter of fact, this is what Partha
> > has been doing in one of our own product lines.
> > AFAIK, the only build issue you will encounter is a change to the
> > iov handling in msg_build(), and that is easily fixed by reverting
> > to the old method.
> > (Correct me Partha, if I am wrong here). But, with new functionality
> > (e.g., new flow control) there are new issues which still haven't
> > been ironed out completely. I think Partha is the one to give a
> > better update here.
> >
> > ///jon
> >
> > > as there are far more fixes in 4.9.11 than in 4.4.50.  The reason
> > > we can't upgrade the entire kernel to 4.4.50 or 4.9.11 in the
> > > short term is a bit of a long story (which I will spare you), but
> > > suffice it to say that that is only an option for a long-term fix
> > > for our customers and not for this short term emergency fix which
> > > we need released asap.
> > >
> > > All this to say, the goal here is to move to the latest possible
> > > TIPC code which will (relatively) seamlessly integrate with our
> > > 4.4.0 kernel, and also be free of the aforementioned bug.  Let me
> > > know what you think.
> > >
> > > Thanks,
> > >
> > > Peter
> > >
> > >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > Sent: February-23-17 8:22 AM
> > > To: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; 
> > > tipc-
> > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > Subject: RE: TIPC Oops in tipc_sk_recv
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > Sent: Wednesday, February 22, 2017 04:31 PM
> > > > To: Jon Maloy <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; 
> > > > tipc-
> > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > >
> > > > Hi Jon,
> > > >
> > > > I think I found the problem, which ultimately may only exist on
> > > > our end (see below for an explanation, and let me know if you agree).
> > > >
> > > > The fellow that was maintaining our O/S previously (no longer
> > > > with the
> > > > company) had made some patches to the 4.4.0 kernel TIPC code,
> > > > and indeed one of them is in the offending tipc_sk_rcv() function.
> > > >
> > > > Specifically, note this segment of code from our kernel source tree:
> > > >
> > > >                        /* Send pending response/rejected messages, if any */
> > > >                        while (!skb_queue_empty(&sk->sk_write_queue)) {
> > > >                                skb = skb_dequeue(&sk->sk_write_queue);
> > > >                                dnode = msg_destnode(buf_msg(skb));
> > > >                                tipc_node_xmit_skb(net, skb, dnode, dport);
> > > >                        }
> > >
> > > Yes, this is wrong. The socket write queue is only used for
> > > outgoing regular messages (Partha has later changed that), and
> > > should only be emptied by the sending thread. Running this code in
> > > interrupt context will give exactly the symptom you see, because
> > > the writing thread might already have freed or sent the buffer in
> > > question.
> > > >
> > > > Whereas the latest and greatest official longterm 4.9.11 kernel has:
> > > >
> > > >          /* Send pending response/rejected messages, if any */
> > > >          while ((skb = __skb_dequeue(&xmitq))) {
> > > >             dnode = msg_destnode(buf_msg(skb));
> > > >             tipc_node_xmit_skb(net, skb, dnode, dport);
> > > >          }
> > > >
> > > > The code path that triggers the oops (in our source code) is from:
> > > >
> > > > dnode = msg_destnode(buf_msg(skb));
> > > >
> > > > where msg_destnode() calls msg_word() which calls:
> > > >
> > > > ntohl(m->hdr[pos]);
> > > >
> > > > which is precisely where the oops occurred.
> > > >
> > > > I'm not exactly sure where he got that code change - my guess is
> > > > he posted a question on the tipc-discussion list and got a
> > > > suggestion to try a code snippet, but in the end the actual
> > > > changes (that were officially released at kernel.org) differed,
> > > > as per above.
> > >
> > > I rather suspect he might have looked at the more recent code and
> > > tried to do the same, while misunderstanding the role of the write
> > > queue.
> > >
> > > > Indeed, on Google I can see some threads discussing a 'deadly
> > > > embrace' deadlock (for example
> > > > http://www.spinics.net/lists/netdev/msg382379.html) between
> > > > yourself and him.  Another possibility is that the offending
> > > > source code in question was indeed released sometime after
> > > > 4.4.0, but has since been modified/fixed, thus explaining the discrepancy.
> > >
> > > The loop was introduced in conjunction with that discussion, but
> > > it should not be done in the way it is done above. Indeed, I
> > > cannot see that this can have solved the "deadly embrace" problem
> > > at all, unless he made other changes and added the
> > > rejected/returned messages to the write queue. That might work
> > > most of the time, but will still sooner or later interfere with a sending
> > > thread.
> > >
> > > There are two ways you can solve this:
> > > 1: Introduce a stack based queue for reject/return messages, as we
> > > do, and pass it along in the calls.
> > > 2: Put send messages on a stack based queue, as Partha has done in
> > > the later versions. This assumes that the rejected messages are
> > > added to the write queue, as I am speculating above.
> > >
> > > BR
> > > ///jon
> > >
> > > >
> > > > If either of these possibilities is what actually happened, then this
> > > > may not be a bug you need to worry about.  Granted, the same
> > > > msg_destnode() call still exists in the current (4.9.11 and
> > > > 4.10) code, but the semantics of the encapsulating while loop
> > > > are different, and maybe as such that eliminates the issue.
> > > > Thoughts?
> > > >
> > > > Peter
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > > Sent: February-22-17 3:01 PM
> > > > To: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; 
> > > > tipc-
> > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > > Sent: Wednesday, February 22, 2017 02:15 PM
> > > > > To: Jon Maloy 
> > > > > <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > Cc: Butler, Peter <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > >
> > > > > For the "Source file is more recent than executable" message,
> > > > > could this simply be due to the fact that I copied the kernel
> > > > > source to the lab and then ran the gdb commands as shown?  As
> > > > > such, the newly copied files would have a newer timestamp than
> > > > > the kernel/tipc.ko files.
> > > > > (The kernel is actually built on a separate compile machine
> > > > > rather than on the test lab machine.)
> > > >
> > > > If you are certain that the build was made from the same source
> > > > this is a false alarm, caused by the timestamp as you suggest.
> > > >
> > > > ///jon
> > > >
> > > > >
> > > > > Or could I get that message for another reason?
> > > > >
> > > > >
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > > > Sent: February-22-17 2:11 PM
> > > > > To: Butler, Peter 
> > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; tipc-
> > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > > > Sent: Wednesday, February 22, 2017 01:04 PM
> > > > > > To: Jon Maloy 
> > > > > > <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > Cc: Butler, Peter 
> > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > >
> > > > > > I took a stab at it this way - not sure if I am doing this
> > > > > > correctly or not.
> > > > > >
> > > > > > [root@myVMslot12 ~]# gdb /boot/vmlinuz-4.4.0 /proc/kcore
> > > > > > GNU gdb (GDB) Fedora (7.3.50.20110722-13.fc16)
> > > > > > Copyright (C) 2011 Free Software Foundation, Inc.
> > > > > > License GPLv3+: GNU GPL version 3 or later
> > > > > > <http://gnu.org/licenses/gpl.html>
> > > > > > This is free software: you are free to change and redistribute it.
> > > > > > There is NO WARRANTY, to the extent permitted by law.  Type
> > > > > > "show copying"
> > > > > > and "show warranty" for details.
> > > > > > This GDB was configured as "x86_64-redhat-linux-gnu".
> > > > > > For bug reporting instructions, please see:
> > > > > > <http://www.gnu.org/software/gdb/bugs/>...
> > > > > > BFD: /boot/vmlinuz-4.4.0: Warning: Ignoring section flag
> > > > > > IMAGE_SCN_MEM_NOT_PAGED in section .bss
> > > > > > BFD: /boot/vmlinuz-4.4.0: Warning: Ignoring section flag
> > > > > > IMAGE_SCN_MEM_NOT_PAGED in section .bss
> > > > > > Reading symbols from /boot/vmlinuz-4.4.0...(no debugging symbols found)...done.
> > > > > >
> > > > > > warning: core file may not match specified executable file.
> > > > > > [New process 1]
> > > > > > Core was generated by `BOOT_IMAGE=/vmlinuz-4.4.0 root=UUID=b419f9ff-80ce-459e-855c-614d86a48105 ro rd.'.
> > > > > > #0  0x0000000000000000 in ?? ()
> > > > > > (gdb) file /lib/modules/4.4.0/kernel/net/tipc/tipc.ko
> > > > > > warning: core file may not match specified executable file.
> > > > > > Reading symbols from /lib/modules/4.4.0/kernel/net/tipc/tipc.ko...done.
> > > > > > (gdb) list *(tipc_sk_rcv+0x238)
> > > > > > 0x14898 is in tipc_sk_rcv (net/tipc/msg.h:131).
> > > > > > warning: Source file is more recent than executable.
> > > > >
> > > > > Seems like you didn't rebuild after you updated the source file?
> > > > > Try again just to make sure.
> > > > >
> > > > > > 126             return (struct tipc_msg *)skb->data;
> > > > > > 127     }
> > > > > > 128
> > > > > > 129     static inline u32 msg_word(struct tipc_msg *m, u32 pos)
> > > > > > 130     {
> > > > > > 131             return ntohl(m->hdr[pos]);
> > > > >
> > > > > If this is correct, you are receiving a corrupt buffer where
> > > > > the data pointer is invalid. This is typical if the buffer
> > > > > already has been released.
> > > > >
> > > > > ///jon
> > > > >
> > > > > > 132     }
> > > > > > 133
> > > > > > 134     static inline void msg_set_word(struct tipc_msg *m, u32 w, u32 val)
> > > > > > 135     {
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Butler, Peter
> > > > > > Sent: February-22-17 12:45 PM
> > > > > > To: Jon Maloy 
> > > > > > <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > Cc: Butler, Peter 
> > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > >
> > > > > > Hi Jon
> > > > > >
> > > > > > Thanks for the info.
> > > > > >
> > > > > > One thing I should clarify.  Although we are running the
> > > > > > 4.4.0 kernel, we had backported a number of post-4.4.0 TIPC
> > > > > > patches into our 4.4.0 kernel.  As such, the offset in
> > > > > > question
> > > > > > (tipc_sk_rcv+0x238) will not match that in the vanilla 4.4.0 source.
> > > > > >
> > > > > > Should I post the entire socket.c file to this list for your review?
> > > > > > Or is there an easy way for me to do a similar listing using
> > > > > > our actual tipc.ko file here in the lab?
> > > > > >
> > > > > > Peter
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > > > > Sent: February-22-17 12:29 PM
> > > > > > To: Butler, Peter 
> > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; tipc-
> > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > >
> > > > > > Hi Peter,
> > > > > > Very hard to make any suggestions on how to reproduce this.
> > > > > > What I can see is that it is a STREAM message being sent
> > > > > > from a node local socket, i.e., it doesn't go via any interface.
> > > > > > The crash seems to happen when the receiving socket is owned
> > > > > > by the user, and while we are instead adding the message to
> > > > > > the backlog queue:
> > > > > >
> > > > > > Reading symbols from net/tipc/tipc.ko...done.
> > > > > > (gdb) list *(tipc_sk_rcv+0x238)
> > > > > > 0x13d78 is in tipc_sk_rcv (./arch/x86/include/asm/atomic.h:214).
> > > > > > 209     static __always_inline int __atomic_add_unless(atomic_t *v, int a, int u)
> > > > > > 210     {
> > > > > > 211             int c, old;
> > > > > > 212             c = atomic_read(v);
> > > > > > 213             for (;;) {
> > > > > > 214                     if (unlikely(c == (u)))
> > > > > > 215                             break;
> > > > > > 216                     old = atomic_cmpxchg((v), c, c + (a));
> > > > > > 217                     if (likely(old == c))
> > > > > > 218                             break;
> > > > > >
> > > > > > This is about what I can get out of it at the moment. Maybe
> > > > > > you should try a high-load test between two local sockets
> > > > > > (try the benchmark demo from
> > > > > > tipcutils) and see what you can achieve.
> > > > > >
> > > > > > BR
> > > > > > ///jon
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > > > > Sent: Wednesday, February 22, 2017 10:40 AM
> > > > > > > To: Jon Maloy 
> > > > > > > <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > > Cc: Butler, Peter 
> > > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > > >
> > > > > > > If you have any suggestions as to procedures/tricks you
> > > > > > > think might trigger this bug I can certainly attempt to do
> > > > > > > > so in the lab.
> > > > > > > > Obviously we can't attempt to reproduce it on the
> > > > > > > > customer's (live) system.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Butler, Peter
> > > > > > > Sent: February-21-17 3:39 PM
> > > > > > > To: Jon Maloy 
> > > > > > > <jon.ma...@ericsson.com<mailto:jon.ma...@ericsson.com>>; tipc-
> > > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > > Cc: Butler, Peter 
> > > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > > >
> > > > > > > Unfortunately this occurred on a customer system so it is
> > > > > > > not readily reproducible.  We have not seen this occur in our lab.
> > > > > > >
> > > > > > > For what it's worth, it occurred while the process was in
> > > > > > > TASK_UNINTERRUPTIBLE.  As such, the kernel could not
> > > > > > > actually kill off the associated process despite the Oops,
> > > > > > > and the process remained forever frozen in the 'D' state
> > > > > > > > and the card had to be rebooted.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Jon Maloy [mailto:jon.ma...@ericsson.com]
> > > > > > > Sent: February-21-17 3:36 PM
> > > > > > > To: Butler, Peter 
> > > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>; tipc-
> > > > > > > discuss...@lists.sourceforge.net<mailto:discuss...@lists.sourceforge.net>
> > > > > > > Subject: RE: TIPC Oops in tipc_sk_recv
> > > > > > >
> > > > > > > Hi Peter,
> > > > > > > I don't think this is any known bug. Is it repeatable?
> > > > > > >
> > > > > > > ///jon
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Butler, Peter [mailto:pbut...@sonusnet.com]
> > > > > > > > Sent: Tuesday, February 21, 2017 12:14 PM
> > > > > > > > To: 
> > > > > > > > tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>
> > > > > > > > Cc: Butler, Peter 
> > > > > > > > <pbut...@sonusnet.com<mailto:pbut...@sonusnet.com>>
> > > > > > > > Subject: [tipc-discussion] TIPC Oops in tipc_sk_recv
> > > > > > > >
> > > > > > > > This was with kernel 4.4.0, however I don't see any fix
> > > > > > > > specifically related to this in any subsequent 4.4.x kernel...
> > > > > > > >
> > > > > > > > > BUG: unable to handle kernel NULL pointer dereference at 00000000000000d8
> > > > > > > > > IP: [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > > > > > > PGD 34f4c0067 PUD 34ed95067 PMD 0
> > > > > > > > > Oops: 0000 [#1] SMP
> > > > > > > > > Modules linked in: nf_log_ipv4 nf_log_common xt_LOG sctp
> > > > > > > > > libcrc32c e1000e tipc udp_tunnel ip6_udp_tunnel iTCO_wdt
> > > > > > > > > 8021q garp xt_physdev br_netfilter bridge stp llc
> > > > > > > > > nf_conntrack_ipv4 ipmiq_drv(O) nf_defrag_ipv4 sio_mmc(O)
> > > > > > > > > ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6
> > > > > > > > > nf_defrag_ipv6 xt_state nf_conntrack event_drv(O)
> > > > > > > > > ip6table_filter lockd ip6_tables pt_timer_info(O) ddi(O)
> > > > > > > > > grace usb_storage ixgbe igb iTCO_vendor_support
> > > > > > > > > i2c_algo_bit ptp i2c_i801 pps_core lpc_ich i2c_core
> > > > > > > > > intel_ips mfd_core pcspkr ioatdma sunrpc dca tpm_tis mdio
> > > > > > > > > tpm [last unloaded: iTCO_wdt]
> > > > > > > > > CPU: 2 PID: 12144 Comm: dinamo Tainted: G           O    4.4.0 #23
> > > > > > > > > Hardware name: PT AMC124/Base Board Product Name, BIOS LGNAJFIP.PTI.0012.P15 01/15/2014
> > > > > > > > > task: ffff880036ad8000 ti: ffff880036900000 task.ti: ffff880036900000
> > > > > > > > > RIP: 0010:[<ffffffffa0148868>]  [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > > > > > > RSP: 0018:ffff880036903bb8  EFLAGS: 00010292
> > > > > > > > > RAX: 0000000000000000 RBX: ffff88034def3970 RCX: 0000000000000001
> > > > > > > > > RDX: 0000000000000101 RSI: 0000000000000292 RDI: ffff88034def3984
> > > > > > > > > RBP: ffff880036903c28 R08: 0000000000000101 R09: 0000000000000004
> > > > > > > > > R10: 0000000000000001 R11: 0000000000000000 R12: ffff880036903d28
> > > > > > > > > R13: 00000000bd1fd8b2 R14: ffff88034def3840 R15: ffff880036903d3c
> > > > > > > > > FS:  00007f1e86299740(0000) GS:ffff88035fc40000(0000) knlGS:0000000000000000
> > > > > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > > > CR2: 00000000000000d8 CR3: 0000000036835000 CR4: 00000000000006e0
> > > > > > > > > Stack:
> > > > > > > > >  000000000000009b ffff880036903d28 0000000000000018 ffff88034def38c8
> > > > > > > > >  ffffffff81ce6240 ffff8802b9bdba00 ffff880036903ca8 ffffffffa013bd7e
> > > > > > > > >  ffff8802b99d5ee8 ffff880036903c60 0000000000000000 ffff88003693cb00
> > > > > > > > > Call Trace:
> > > > > > > > >  [<ffffffffa013bd7e>] ? tipc_msg_build+0xde/0x4f0 [tipc]
> > > > > > > > >  [<ffffffffa014358f>] tipc_node_xmit+0x11f/0x150 [tipc]
> > > > > > > > >  [<ffffffffa01470ba>] __tipc_send_stream+0x16a/0x300 [tipc]
> > > > > > > > >  [<ffffffff81625eb5>] ? tcp_sendmsg+0x4d5/0xb00
> > > > > > > > >  [<ffffffffa0147292>] tipc_send_stream+0x42/0x70 [tipc]
> > > > > > > > >  [<ffffffff815bcf77>] sock_sendmsg+0x47/0x50
> > > > > > > > >  [<ffffffff815bd03f>] sock_write_iter+0x7f/0xd0
> > > > > > > > >  [<ffffffff811d799a>] __vfs_write+0xaa/0xe0
> > > > > > > > >  [<ffffffff811d8b16>] vfs_write+0xb6/0x1a0
> > > > > > > > >  [<ffffffff811d8e3f>] SyS_write+0x4f/0xb0
> > > > > > > > >  [<ffffffff816de6d7>] entry_SYSCALL_64_fastpath+0x12/0x6a
> > > > > > > > > Code: 89 de 4c 89 f7 e8 29 d3 ff ff 48 8b 7d a8 e8 60 59 59 e1
> > > > > > > > > 49 8d 9e 30 01 00 00 49 3b 9e 30 01 00 00 74 30 48 89 df e8 b8
> > > > > > > > > b6 47 e1 <48> 8b 90 d8 00 00 00 48 8b 7d b0 44 89 e9 48 89 c6
> > > > > > > > > 48 89 45 c0
> > > > > > > > > RIP  [<ffffffffa0148868>] tipc_sk_rcv+0x238/0x4d0 [tipc]
> > > > > > > > >  RSP <ffff880036903bb8>
> > > > > > > > > CR2: 00000000000000d8
> > > > > > > > > ---[ end trace 1c2d69738941d565 ]---
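
As a cross-check on the Oops above, the bytes marked <48> on the "Code:" line
are the faulting instruction. The sketch below is a hand-rolled decoder for
this one encoding only (REX.W 8B /r with a 32-bit displacement and no SIB
byte), not a general disassembler, and its function names are hypothetical.
It shows that the instruction is a 64-bit load from RAX + 0xd8, which with
RAX = 0 gives the CR2 address reported in the trace:

```python
# Values copied from the Oops in this thread.
CODE = (
    "89 de 4c 89 f7 e8 29 d3 ff ff 48 8b 7d a8 e8 60 59 59 e1 49 8d 9e 30 01 "
    "00 00 49 3b 9e 30 01 00 00 74 30 48 89 df e8 b8 b6 47 e1 <48> 8b 90 d8 "
    "00 00 00 48 8b 7d b0 44 89 e9 48 89 c6 48 89 45 c0"
)
RAX = 0x0000000000000000
CR2 = 0x00000000000000D8

# Register numbering used by ModRM when no REX.R/REX.B extension bit is set.
REGS64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..> marker to the end of the Code: line."""
    tokens = code_line.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load(b):
    """Decode 'mov disp32(%base),%dest' encoded as 48 8B modrm disp32."""
    rex, opcode, modrm = b[0], b[1], b[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W 8B /r instruction")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 2 or rm == 4:
        raise ValueError("only mod=10 without a SIB byte is handled here")
    disp = int.from_bytes(bytes(b[3:7]), "little", signed=True)
    return REGS64[reg], REGS64[rm], disp


if __name__ == "__main__":
    dest, base, disp = decode_mov_load(faulting_bytes(CODE))
    print(f"mov {disp:#x}(%{base}),%{dest}")
    print("fault address matches CR2:", RAX + disp == CR2)
```

In other words, the pointer held in RAX was NULL and the code read a field at
offset 0xd8 from it. Which structure and field that is depends on the exact
build of tipc.ko, so that part is not inferred here.
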
> > > > > > > >
> > > > > > > >
> > > > > > > > --------------------------------------------------------
> > > > > > > > --
> > > > > > > > --
> > > > > > > > --
> > > > > > > > --
> > > > > > > > --
> > > > > > > > --
> > > > > > > > --
> > > > > > > > -------- Check out the vibrant tech community on one of
> > > > > > > > the world's most engaging tech sites, SlashDot.org!
> > > > > > > > http://sdm.link/slashdot
> > > > > > > > _______________________________________________
> > > > > > > > tipc-discussion mailing list
> > > > > > > > tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion@lists.sourceforge.net>
> > > > > > > > https://lists.sourceforge.net/lists/listinfo/tipc-discus
> > > > > > > > si
> > > > > > > > on
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion