Re: [tipc-discussion] Tipc: name table mismatch between different cards in a system

2016-05-04 Thread GUNA
I have seen the patch "tipc: move netlink policies to netlink.c".
Does this patch include a fix for the tipc-config compatibility problem that Jon found earlier?
I don't see any change in netlink_compat.c, though.

From Jon:
"
When I run the "tipc" tool  I see the correct value, i.e., key == (portid + 1).
When I run tipc-config (which is deprecated in the new version), I see
the wrong value key == portid for the same publications!
"

// Guna
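
For audit scripts that compare name tables across cards, the key relation Jon describes can be tolerated explicitly. The following is a minimal sketch, assuming each row has the shape "type lower upper <Z.C.N:ref> key" as in the tipc-config -nt output quoted in this thread. The struct and function names are hypothetical and are not part of any TIPC tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* One name table row; field names are hypothetical, chosen for this sketch. */
struct pub_entry {
    unsigned int type, lower, upper;
    unsigned int zone, cluster, node;   /* the <Z.C.N:...> part */
    unsigned int portref;               /* the port reference after ':' */
    unsigned int key;                   /* publication key column */
};

/* Parse "type lower upper <Z.C.N:ref> key [scope]"; true on success.
 * Whitespace between '>' and the key is optional, as in the pasted output. */
static bool parse_pub(const char *line, struct pub_entry *e)
{
    return sscanf(line, "%u %u %u <%u.%u.%u:%u> %u",
                  &e->type, &e->lower, &e->upper,
                  &e->zone, &e->cluster, &e->node,
                  &e->portref, &e->key) == 8;
}

/* A key is accepted in either presentation: portid or portid + 1. */
static bool key_plausible(const struct pub_entry *e)
{
    return e->key == e->portref || e->key == e->portref + 1;
}

/* Two rows describe the same publication if everything except the key
 * matches and both keys are one of the two known presentations. */
static bool same_publication(const struct pub_entry *a,
                             const struct pub_entry *b)
{
    if (a->type != b->type || a->lower != b->lower || a->upper != b->upper)
        return false;
    if (a->zone != b->zone || a->cluster != b->cluster || a->node != b->node)
        return false;
    if (a->portref != b->portref)
        return false;
    return key_plausible(a) && key_plausible(b);
}
```

With a check like this, a row showing key == portid on one card and key == portid + 1 on another would not raise an alarm, while any other difference still would.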

On Mon, May 2, 2016 at 1:22 PM, GUNA  wrote:
> Is there any possibility of getting the fix soon? Our audit scripts raise
> an alarm because of the name table mismatch. If you point me to the code
> that needs fixing, I will fix it in my kernel. I am using kernel 4.4.0 on
> Fedora.
> Thanks in advance.
> Guna
>
> On Fri, Apr 29, 2016 at 11:55 AM, Jon Maloy  wrote:
>>
>>
>>> -Original Message-
>>> From: GUNA [mailto:gbala...@gmail.com]
>>> Sent: Friday, 29 April, 2016 10:48
>>> To: Jon Maloy
>>> Cc: tipc-discussion@lists.sourceforge.net
>>> Subject: Re: Tipc: name table mismatch between different cards in a system
>>>
>>> The two skb_linearize() calls and the update of 'hdr' are
>>> already in my load, but they did not solve this issue. The issue remains
>>> the same even after today's ACTIVE state fix (before that fix, one of the
>>> links was STANDBY even with the same priority).
>>>
>>> // IO card, note this does not run latest kernel or tipc
>>> [root@10 ~]# tipc-config -nt |grep 2334480598
>>> 20012  20012  <1.1.12:2334480598>  2334480599  cluster
>>>
>>> // runs latest kernel on all CPU cards.
>>> [root@2 ~]# tipc-config -nt |grep 2334480598
>>> 50009  20012  20012  <1.1.12:2334480598>  2334480598  cluster
>>
>> This was easy to reproduce, and it actually looks like another presentation
>> problem.
>>
>> When I run the "tipc" tool I see the correct value, i.e., key == (portid + 1).
>> When I run tipc-config (which is deprecated in the new version), I see the
>> wrong value, key == portid, for the same publications!
>>
>> So your code will probably work correctly, but the values presented will be
>> wrong on the new version and correct on the old one.
>> I think this is something for Richard Alpe, who wrote the new netlink
>> compatibility code, to take a look at.
>>
>> ///jon
>>
>>>
>>>
>>> On Thu, Apr 28, 2016 at 8:19 PM, GUNA  wrote:
>>> > Thanks Jon. I already applied this patch on currently running module.
>>> >
>>> > On Thursday, April 28, 2016, Jon Maloy  wrote:
>>> >>
>>> >> Here it is. (This is just pasted into Outlook, so don’t try to apply it)
>>> >>
>>> >> If you manually add the two skb_linearize() calls and the update of ‘hdr’
>>> >> you should be safe.
>>> >>
>>> >>
>>> >>
>>> >> Good luck!
>>> >>
>>> >> ///jon
>>> >>
>>> >> commit c7cad0d6f70cd4ce8644ffe528a4df1cdc2e77f5
>>> >> Author: Jon Paul Maloy
>>> >> Date:   Thu Nov 19 14:30:40 2015 -0500
>>> >>
>>> >> tipc: move linearization of buffers to generic code
>>> >>
>>> >> In commit 5cbb28a4bf65c7e4 ("tipc: linearize arriving NAME_DISTR
>>> >> and LINK_PROTO buffers") we added linearization of NAME_DISTRIBUTOR,
>>> >> LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE to the function
>>> >> tipc_udp_recv(). The location of the change was selected in order
>>> >> to make the commit easily appliable to 'net' and 'stable'.
>>> >>
>>> >> We now move this linearization to where it should be done, in the
>>> >> functions tipc_named_rcv() and tipc_link_proto_rcv() respectively.
>>> >>
>>> >> Reviewed-by: Ying Xue
>>> >> Signed-off-by: Jon Maloy
>>> >> Signed-off-by: David S. Miller
>>> >>
>>> >>
>>> >> diff --git a/net/tipc/link.c b/net/tipc/link.c
>>> >> index 9efbdbd..fa452fb 100644
>>> >> --- a/net/tipc/link.c
>>> >> +++ b/net/tipc/link.c
>>> >> @@ -1260,6 +1260,8 @@ static int tipc_link_proto_rcv(struct tipc_link *l, struct sk_buff *skb,
>>> >> /* fall thru' */
>>> >> case ACTIVATE_MSG:
>>> >> +   skb_linearize(skb);
>>> >> +   hdr = buf_msg(skb);
>>> >> /* Complete own link name with peer's interface name */
>>> >> if_name =  strrchr(l->name, ':') + 1;
>>> >> diff --git a/net/tipc/name_distr.c b/net/tipc/name_distr.c
>>> >> index c07612b..f51c8bd 100644
>>> >> --- a/net/tipc/name_distr.c
>>> >> +++ b/net/tipc/name_distr.c
>>> >> @@ -397,6 +397,7 @@ void tipc_named_rcv(struct net *net, struct sk_buff_head *inputq)
>>> >> spin_lock_bh(&tn->nametbl_lock);
>>> >> for (skb = skb_dequeue(inputq); skb; skb = 
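
The reason the patch re-reads 'hdr' right after skb_linearize() is that linearizing may move the packet data, so a header pointer taken earlier can go stale. Below is a userspace sketch of that pattern only. It uses a toy buffer type, not the kernel's sk_buff, and every name in it (toy_buf, toy_linearize, toy_hdr and so on) is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a fragmented packet buffer; NOT the kernel's sk_buff. */
struct toy_buf {
    unsigned char *head;   /* contiguous part, starts with the header */
    size_t head_len;
    unsigned char *frag;   /* separately stored tail, NULL once linear */
    size_t frag_len;
};

/* Build a buffer from a head and an optional fragment; 0 on success. */
static int toy_init(struct toy_buf *b, const void *head, size_t head_len,
                    const void *frag, size_t frag_len)
{
    b->head = malloc(head_len);
    b->frag = frag_len ? malloc(frag_len) : NULL;
    if (!b->head || (frag_len && !b->frag)) {
        free(b->head);
        free(b->frag);
        return -1;
    }
    memcpy(b->head, head, head_len);
    if (frag_len)
        memcpy(b->frag, frag, frag_len);
    b->head_len = head_len;
    b->frag_len = frag_len;
    return 0;
}

/* Merge the fragment into one contiguous area. On success the data lives
 * in a new allocation, so pointers into the old head are invalid. */
static int toy_linearize(struct toy_buf *b)
{
    unsigned char *merged;

    if (!b->frag)
        return 0;
    merged = malloc(b->head_len + b->frag_len);
    if (!merged)
        return -1;
    memcpy(merged, b->head, b->head_len);
    memcpy(merged + b->head_len, b->frag, b->frag_len);
    free(b->head);
    free(b->frag);
    b->head = merged;
    b->head_len += b->frag_len;
    b->frag = NULL;
    b->frag_len = 0;
    return 0;
}

/* Plays the role buf_msg() has in the patch: derive the header pointer
 * from the buffer's current state. */
static unsigned char *toy_hdr(struct toy_buf *b)
{
    return b->head;
}

/* Read one byte at 'off', linearizing first if it lies in the fragment. */
static int toy_read_byte(struct toy_buf *b, size_t off, unsigned char *out)
{
    unsigned char *hdr = toy_hdr(b);

    if (off >= b->head_len + b->frag_len)
        return -1;
    if (off >= b->head_len) {
        if (toy_linearize(b))
            return -1;
        hdr = toy_hdr(b);  /* refresh: the earlier pointer is stale now */
    }
    *out = hdr[off];
    return 0;
}

static void toy_free(struct toy_buf *b)
{
    free(b->head);
    free(b->frag);
}
```

Dropping the refresh line in toy_read_byte() would leave 'hdr' pointing at freed memory, which is the kind of bug the two added lines in tipc_link_proto_rcv() guard against.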

Re: [tipc-discussion] tipc: tipc_recv_stream with kernel panic

2016-05-04 Thread Xue, Ying
Thank you for the testing and report!

To be honest, we have never met so many issues before, so I suspect that the
problems you encountered may have been introduced by the recent changes.
If you have more detailed failure logs, please share them with us so that we
can look into the root cause.

Thanks,
Ying

-Original Message-
From: GUNA [mailto:gbala...@gmail.com] 
Sent: 4 May 2016 0:41
To: Xue, Ying
Cc: Erik Hugne; tipc-discussion@lists.sourceforge.net; 
parthasarathy.bhuvara...@ericsson.com; Richard Alpe
Subject: Re: [tipc-discussion] tipc: tipc_recv_stream with kernel panic

Thanks Ying.

As you suggested, I will revert the "tipc: avoid packets leaking on socket
receive queue" patch. Since the issue is not reproducible on demand, I may need
to wait and see whether it shows up again with the new driver.

We also experience traffic throughput problems because TIPC connections are
failing (not related to this change). If anyone is aware of such failures,
please let me know.


Background 
The system was originally based on kernel 3.4.2 on Fedora 16 and was stable.
Recently, I updated the system to kernel 4.4.0. Only stock kernel drivers are
used, with no customization except that I ported some of the latest TIPC
patches to fix some TIPC issues.

All 6 routing CPUs went down over the course of the weekend. The noted output 
is from one CPU; others are unknown, but assumed to have gone down with the 
same cause. Also note that in addition, one of the routing cards (no output 
available) went down again on Monday.

The new kernel 4.4.0 has been in use in the system since April 1st, and we
have seen the issue 3 times so far. Each time, the load was mostly
heartbeat-type traffic, not heavy traffic.



On Tue, May 3, 2016 at 6:26 AM, Xue, Ying  wrote:
> I agree with Erik too.
>
>
>
> The oops is probably caused by the socket being freed too early. But
>
>
>
> GUNA, can you reproduce the issue? If so, please try to revert the 
> commit
> f4195d1eac954a67adf112dd53404560cc55b942 (“tipc: avoid packets leaking 
> on socket receive queue”), and verify whether the issue occurs or not.
>
>
>
> I suspect the commit brings some unknown side effect, leading to the panic.
>
>
>
> Thanks,
>
> Ying
>
> From: Erik Hugne [mailto:erik.hu...@gmail.com]
> Sent: 3 May 2016 13:09
> To: GUNA
> Cc: tipc-discussion@lists.sourceforge.net;
> parthasarathy.bhuvara...@ericsson.com; Richard Alpe; Xue, Ying
> Subject: Re: [tipc-discussion] tipc: tipc_recv_stream with kernel 
> panic
>
>
>
> (On mobile)
>
> At first glance, it seems that the socket was freed, but there was a 
> pending wakeup signal for it. Which then causes the subsequent 
> spin_lock_bh() to deref freed mem.
>
> //E
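
The hazard Erik describes (an object freed while a wakeup for it is still pending) is commonly avoided by letting the pending work hold its own reference. The following is a single-threaded userspace sketch of that idea only. It is not the TIPC code and not a proposed fix for this oops. All names are hypothetical, and the counter is a plain int rather than the atomic reference count the kernel manages through sock_hold() and sock_put().

```c
#include <assert.h>
#include <stdlib.h>

/* Toy socket with a plain reference count; NOT the kernel's struct sock. */
struct toy_sock {
    int refcnt;
    int lock;          /* stand-in for the spinlock the wakeup path takes */
    int wakeups;       /* number of wakeups delivered */
    int *freed_flag;   /* demo only: set to 1 when the object is released */
};

static struct toy_sock *toy_sock_new(int *freed_flag)
{
    struct toy_sock *sk = calloc(1, sizeof(*sk));

    if (!sk)
        return NULL;
    sk->refcnt = 1;    /* reference owned by the creator */
    sk->freed_flag = freed_flag;
    return sk;
}

static void toy_sock_hold(struct toy_sock *sk)
{
    sk->refcnt++;
}

/* Drop one reference; the last put releases the object. */
static void toy_sock_put(struct toy_sock *sk)
{
    if (--sk->refcnt == 0) {
        *sk->freed_flag = 1;
        free(sk);
    }
}

/* Queue a deferred wakeup. The pending work takes its own reference, so
 * the socket cannot go away before the wakeup has run. */
static struct toy_sock *toy_schedule_wakeup(struct toy_sock *sk)
{
    toy_sock_hold(sk);
    return sk;
}

/* The deferred wakeup itself: touches the lock, then drops its reference. */
static void toy_run_wakeup(struct toy_sock *sk)
{
    sk->lock = 1;
    sk->wakeups++;
    sk->lock = 0;
    toy_sock_put(sk);
}
```

In this sketch the owner may drop its reference while the wakeup is still queued; the memory is released only when toy_run_wakeup() performs the final put. Without the hold in toy_schedule_wakeup(), the wakeup would touch 'lock' in freed memory, which is the failure mode suggested above.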
>
> On May 3, 2016 02:43, "GUNA"  wrote [...]
>>> [375832.498126] BUG: unable to handle kernel paging request at
>>> 01a400015ff4
>> [375832.505300] IP: []
>> queued_spin_lock_slowpath+0xe6/0x160
>> [375832.512394] PGD 0
>> [375832.514657] Oops: 0002 [#1] SMP
>> [375832.518306] Modules linked in: nf_log_ipv6 nf_log_ipv4 
>> nf_log_common xt_LOG sctp libcrc32c e1000e tipc udp_tunnel 
>> ip6_udp_tunnel 8021q garp iTCO_wdt xt_physdev br_netfilter bridge stp 
>> llc nf_conntrack_ipv4 nf_defrag_ipv4 ipmiq_drv(O) sio_mmc(O) 
>> ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state 
>> nf_conntrack lockd ip6table_filter event_drv(O) ip6_tables grace
>> pt_timer_info(O) ddi(O) usb_storage ixgbe igb i2c_i801 
>> iTCO_vendor_support i2c_algo_bit ioatdma intel_ips i2c_core pcspkr 
>> sunrpc ptp mdio dca pps_core lpc_ich tpm_tis mfd_core tpm [last
>> unloaded: iTCO_wdt]
>> [375832.573693] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G   O
>>  4.4.0 #14
>> [375832.581385] Hardware name: PT AMC124/Base Board Product Name, 
>> BIOS
>> LGNAJFIP.PTI.0012.P15 01/15/2014
>> [375832.591028] task: 880351a89b40 ti: 880351a9 task.ti:
>> 880351a9
>> [375832.599026] RIP: 0010:[]  []
>> queued_spin_lock_slowpath+0xe6/0x160
>> [375832.608964] RSP: 0018:88035fc83d58  EFLAGS: 00010002 
>> [375832.614825] RAX: 1447 RBX: 0292 RCX:
>> 88035fc95fc0
>> [375832.622743] RDX: 01a400015ff4 RSI: 0014 RDI:
>> 880351232f80
>> [375832.630567] RBP: 88035fc83d58 R08: 0101 R09:
>> 0004
>> [375832.638348] R10:  R11:  R12:
>> 01001002
>> [375832.645919] R13: 0001 R14:  R15:
>> 
>> [375832.653610] FS:  () GS:88035fc8()
>> knlGS:
>> [375832.662317] CS:  0010 DS:  ES:  CR0: 8005003b 
>> [375832.668483] CR2: 01a400015ff4 CR3: 01c0a000 CR4:
>> 06e0
>> [375832.676133] Stack:
>> [375832.678344]  88035fc83d78 816de2c1 88034a8bba60
>> 880351232f80
>> [375832.686163]  88035fc83db8 810bc592 88035fc83dc8
>> 880351758000
>> [375832.694139]  01001002  b802f4bd
>>