Re: [tipc-discussion] [Kernel oops in 4.4.18]

Jon Maloy Mon, 12 Sep 2016 07:41:48 -0700

Hi,
What you are seeing is a typical symptom of a problem that was fixed in
commit c7cad0d6f70cd4ce86 (“tipc: move linearization of buffers to generic
code”).


For some unknown reason this fix doesn’t seem to have made it back to 4.4.20
(the code I was checking), and so probably not to 4.4.18 either.

BR
///jon
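
One way to check a stable tree for the fix named above is to search by commit subject, since a backport normally gets a new hash. The following is only a minimal sketch: the helper names are hypothetical, and it assumes a local linux-stable checkout in which the tag v4.4.18 exists.

```python
import subprocess


def build_git_log_cmd(subject, rev):
    """Build a `git log` command that searches `rev` for a commit subject."""
    return [
        "git", "log", "--oneline",
        "--fixed-strings",       # match the subject literally, not as a regex
        "--grep=" + subject,
        rev,
    ]


def find_commits(subject, rev, repo="."):
    """Return matching one-line log entries; an empty list means no match."""
    result = subprocess.run(
        build_git_log_cmd(subject, rev),
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Example call (assumes the current directory is a linux-stable checkout):
#   find_commits("tipc: move linearization of buffers to generic code",
#                "v4.4.18")
```

An empty result would suggest the commit was not backported to that tag, although a backport with a reworded subject would be missed by this search.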


From: Arndt, Jonas [mailto:jonas.ar...@hpe.com]
Sent: Sunday, 11 September, 2016 20:12
To: Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>
Cc: Jon Maloy <jon.ma...@ericsson.com>; tipc-discussion@lists.sourceforge.net; 
Ying Xue <ying....@windriver.com>
Subject: Re: [tipc-discussion] [Kernel oops in 4.4.18]


On 09/02/2016 02:02 AM, Parthasarathy Bhuvaragan wrote:

Hi,

You need this fix:

https://sourceforge.net/p/tipc/mailman/message/34768934/

But it won't apply cleanly, so you need the entire series below to fix all
issues related to the topology server.

https://sourceforge.net/p/tipc/mailman/message/34768927/

These patches were too intrusive to be pushed to net, so they were pushed to
net-next and merged in 4.7.

/Partha

Partha,
Thanks for this. While it appears TIPC is up and my nodes can join the cluster,
they cannot leave and come back. What I get in the syslog on the node that tries
to join is:

2016-09-11T14:14:33.394673-06:00 rack13-ctrl4 kernel: [  880.688856] Dropping 
name table update (0) of {1651649891, 1819082752, 0} from <1.1.1> key=402710022
2016-09-11T14:14:33.394688-06:00 rack13-ctrl4 kernel: [  880.688862] Dropping 
name table update (0) of {4029808599, 2711729614, 1639218685} from <1.1.1> 
key=18102394
2016-09-11T14:14:33.394690-06:00 rack13-ctrl4 kernel: [  880.688865] Dropping 
name table update (0) of {134218495, 4278191616, 100669184} from <1.1.1> key=0
2016-09-11T14:14:33.394692-06:00 rack13-ctrl4 kernel: [  880.688868] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394693-06:00 rack13-ctrl4 kernel: [  880.688870] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394694-06:00 rack13-ctrl4 kernel: [  880.688872] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394696-06:00 rack13-ctrl4 kernel: [  880.688875] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394697-06:00 rack13-ctrl4 kernel: [  880.688877] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
2016-09-11T14:14:33.394699-06:00 rack13-ctrl4 kernel: [  880.688879] Dropping 
name table update (0) of {0, 0, 16463} from <1.1.1> key=4294915584
2016-09-11T14:14:33.394700-06:00 rack13-ctrl4 kernel: [  880.688882] Dropping 
name table update (0) of {0, 0, 0} from <1.1.1> key=0
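
To summarize messages like the ones above, a small parser can pull out the numeric fields and count the all-zero entries. This is only a sketch: the field names (type, lower, upper, key) are my assumed reading of the message layout, and the sample text is a three-record subset of the log above.

```python
import re

# Assumed layout: (op) of {type, lower, upper} from <node> key=<key>
PATTERN = re.compile(
    r"Dropping name table update \((?P<op>\d+)\) of "
    r"\{(?P<type>\d+), (?P<lower>\d+), (?P<upper>\d+)\} "
    r"from <(?P<node>[\d.]+)> key=(?P<key>\d+)"
)


def parse_dropped_updates(text):
    """Return one dict per dropped update found in `text`."""
    # Mail clients wrap long log lines, so collapse all whitespace runs
    # before matching; a record split across lines then matches again.
    flat = re.sub(r"\s+", " ", text)
    return [
        {k: (v if k == "node" else int(v)) for k, v in m.groupdict().items()}
        for m in PATTERN.finditer(flat)
    ]


SAMPLE = """\
[  880.688856] Dropping
name table update (0) of {1651649891, 1819082752, 0} from <1.1.1> key=402710022
[  880.688868] Dropping
name table update (0) of {0, 0, 0} from <1.1.1> key=0
[  880.688879] Dropping
name table update (0) of {0, 0, 16463} from <1.1.1> key=4294915584
"""

records = parse_dropped_updates(SAMPLE)
# Count records whose three name fields and key are all zero.
all_zero = sum(
    1 for r in records
    if r["type"] == r["lower"] == r["upper"] == r["key"] == 0
)
print(len(records), "dropped updates,", all_zero, "all-zero")
```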

We are running an HA cluster called OpenSAF that is using the TIPC protocol.

I have also traced (tshark) the traffic. Not sure if I can attach a file here 
though as that is what made my earlier mail not go through (MIME key 
attachment). Please advise if you want to look at the trace.

I also tried these patches from the stable queue:

0086-tipc-fix-nullptr-crash-during-subscription-cancel.patch
0134-tipc-fix-an-infoleak-in-tipc_nl_compat_link_dump.patch
0135-tipc-fix-nl-compat-regression-for-link-statistics.patch

and got the same result. With a 4.5 kernel it works great.

Thanks,

// Jonas


On 09/01/2016 08:43 PM, Jon Maloy wrote:

Hi Jonas,

I don’t think there is any such thing as a “long-term” kernel from the
community viewpoint. But distros such as SLES or Ubuntu use this term, so I
suspect that is what you mean. I believe the latest versions of both are
based on 4.4.

I honestly don’t know how often and by which criteria those distros pick
upgrades from the upstream kernel, but if this is a serious problem we
certainly have to push them to adopt a fix for this.



I believe Partha will recognize this bug, and can tell whether there is a fix
for it or not. If so, he can also tell what has happened to it. If this is a
distro-specific problem, we need to know which one you are using.



Regards

///jon



From: Arndt, Jonas [mailto:jonas.ar...@hpe.com]
Sent: Thursday, 01 September, 2016 14:11
To: Jon Maloy <jon.ma...@ericsson.com>
Subject: Fwd: [tipc-discussion] [Kernel oops in 4.4.18]



Jon,



Sorry for reaching out to you directly. I have posted to the mailing list
multiple times and I don't understand why it is getting stuck. I am a subscriber
and got an email indicating that I can post.



Cheers,



// Jonas



-------- Forwarded Message --------



Subject: [tipc-discussion] [Kernel oops in 4.4.18]
Date: Wed, 31 Aug 2016 09:11:42 -0600
From: Jonas Arndt <jonas.ar...@hpe.com>
To: tipc-discussion@lists.sourceforge.net





Resending, as it appears this didn't show up on the mailing list. Sorry for any
duplicates.



Hi Guys,



My apologies if this has been covered before.



I am getting this kernel NULL pointer dereference when trying TIPC with a 4.4.18
kernel (running OpenSAF). It works fine with 4.5.x. There seem to have been a
number of patches applied to net/tipc between the versions. Why were they not
back-ported to 4.4.x? Isn't that a longterm kernel?



Thanks,



// Jonas



================================================================================

2016-08-17T09:19:49.656792-06:00 rack13-ctrl2 kernel: [ 302.348407] BUG: unable 
to handle kernel NULL pointer dereference at 0000000000000018

2016-08-17T09:19:49.656808-06:00 rack13-ctrl2 kernel: [ 302.348474] IP: 
[<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]

2016-08-17T09:19:49.656810-06:00 rack13-ctrl2 kernel: [ 302.348540] PGD 0

2016-08-17T09:19:49.656812-06:00 rack13-ctrl2 kernel: [ 302.348559] Oops: 0000
[#1] SMP

2016-08-17T09:19:49.656814-06:00 rack13-ctrl2 kernel: [ 302.348585] Modules 
linked in: tipc rpcsec_gss_krb5 nfsv4 dns_resolver ebtable_filter ebtables 
ip6table_filter ip6_tables iptable_filter ip_tables x_tables openvswitch 
nf_defrag_ipv6 nf_conntrack libcrc32c crc32c_generic nfsd auth_rpcgss nfs_acl 
nfs lockd grace fscache sunrpc x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm irqbypass mgag200 ttm crc32_pclmul drm_kms_helper drm hmac drbg 
fb_sys_fops ansi_cprng syscopyarea aesni_intel sysfillrect aes_x86_64 sysimgblt 
lrw gf128mul glue_helper ablk_helper cryptd ipmi_si iTCO_wdt hpilo evdev pcspkr 
wmi ipmi_msghandler iTCO_vendor_support hpwdt acpi_power_meter button sb_edac 
ioatdma lpc_ich edac_core pcc_cpufreq mfd_core acpi_cpufreq processor autofs4 
ext4 crc16 mbcache jbd2 dm_mod sg sd_mod ata_generic pata_acpi crc32c_intel 
psmouse ata_piix libata uhci_hcd ehci_pci ehci_hcd igb scsi_mod i2c_algo_bit 
i2c_core usbcore usb_common ixgbe dca mdio ptp pps_core thermal

2016-08-17T09:19:49.656817-06:00 rack13-ctrl2 kernel: [ 302.349237] CPU: 16 
PID: 98 Comm: kworker/u130:0 Not tainted 4.4.18-tipc #1

2016-08-17T09:19:49.656843-06:00 rack13-ctrl2 kernel: [ 302.349278] Hardware 
name: HP ProLiant SL210t Gen8/, BIOS P83 11/01/2014

2016-08-17T09:19:49.656846-06:00 rack13-ctrl2 kernel: [ 302.349321] Workqueue: 
tipc_rcv tipc_recv_work [tipc]

2016-08-17T09:19:49.656848-06:00 rack13-ctrl2 kernel: [ 302.349354] task: 
ffff881ff93a5640 ti: ffff881ff93b0000 task.ti: ffff881ff93b0000

2016-08-17T09:19:49.656850-06:00 rack13-ctrl2 kernel: [ 302.349395] RIP: 
0010:[<ffffffffa0702749>] [<ffffffffa0702749>] 
tipc_nametbl_subscribe+0x19/0x180 [tipc]

2016-08-17T09:19:49.656852-06:00 rack13-ctrl2 kernel: [ 302.349464] RSP: 
0018:ffff881ff93b3cc0 EFLAGS: 00010286

2016-08-17T09:19:49.656853-06:00 rack13-ctrl2 kernel: [ 302.349494] RAX: 
0000000000000000 RBX: 0000000000000000 RCX: 0000000180200017

2016-08-17T09:19:49.656855-06:00 rack13-ctrl2 kernel: [ 302.349534] RDX: 
0000000180200018 RSI: 0000000000000200 RDI: 0000000000000000

2016-08-17T09:19:49.656857-06:00 rack13-ctrl2 kernel: [ 302.349573] RBP: 
ffff881ff93b3d00 R08: 00000000f7970601 R09: 0000000180200017

2016-08-17T09:19:49.656858-06:00 rack13-ctrl2 kernel: [ 302.349613] R10: 
ffffea003fde5c00 R11: ffff880ff7970600 R12: 0000000000000000

2016-08-17T09:19:49.656859-06:00 rack13-ctrl2 kernel: [ 302.349652] R13: 
ffff881ff54ac0a0 R14: ffff880fee6edd00 R15: ffff880ff7970200

2016-08-17T09:19:49.656860-06:00 rack13-ctrl2 kernel: [ 302.349692] FS: 
0000000000000000(0000) GS:ffff88203f880000(0000) knlGS:0000000000000000

2016-08-17T09:19:49.656860-06:00 rack13-ctrl2 kernel: [ 302.349736] CS: 0010 
DS: 0000 ES: 0000 CR0: 0000000080050033

2016-08-17T09:19:49.656861-06:00 rack13-ctrl2 kernel: [ 302.349785] CR2: 
0000000000000018 CR3: 0000000001a09000 CR4: 00000000001406e0

2016-08-17T09:19:49.656863-06:00 rack13-ctrl2 kernel: [ 302.349833] Stack:

2016-08-17T09:19:49.656865-06:00 rack13-ctrl2 kernel: [ 302.349853] 
ffffffff811a5d1b ffff880ff7970200 ffff880ff80f6000 0000000000000000

2016-08-17T09:19:49.656865-06:00 rack13-ctrl2 kernel: [ 302.349915] 
ffff880ff87898c0 ffff881ff54ac0a0 ffff880fee6edd00 ffff880ff7970200

2016-08-17T09:19:49.656866-06:00 rack13-ctrl2 kernel: [ 302.349976] 
ffff881ff93b3d48 ffffffffa070143a ffff881ff93b3d48 ffff880ff87898c8

2016-08-17T09:19:49.656867-06:00 rack13-ctrl2 kernel: [ 302.350037] Call Trace:

2016-08-17T09:19:49.656868-06:00 rack13-ctrl2 kernel: [ 302.350069] 
[<ffffffff811a5d1b>] ? kfree+0x13b/0x150

2016-08-17T09:19:49.656870-06:00 rack13-ctrl2 kernel: [ 302.350114] 
[<ffffffffa070143a>] tipc_subscrb_rcv_cb+0xfa/0x370 [tipc]

2016-08-17T09:19:49.656872-06:00 rack13-ctrl2 kernel: [ 302.350165] 
[<ffffffffa070d43f>] tipc_receive_from_sock+0xaf/0x100 [tipc]

2016-08-17T09:19:49.656874-06:00 rack13-ctrl2 kernel: [ 302.350219] 
[<ffffffffa070d61b>] tipc_recv_work+0x2b/0x60 [tipc]

2016-08-17T09:19:49.656874-06:00 rack13-ctrl2 kernel: [ 302.350266] 
[<ffffffff8107bad8>] process_one_work+0x158/0x420

2016-08-17T09:19:49.656875-06:00 rack13-ctrl2 kernel: [ 302.350310] 
[<ffffffff8107c529>] worker_thread+0x69/0x480

2016-08-17T09:19:49.656876-06:00 rack13-ctrl2 kernel: [ 302.350351] 
[<ffffffff8107c4c0>] ? rescuer_thread+0x310/0x310

2016-08-17T09:19:49.656877-06:00 rack13-ctrl2 kernel: [ 302.350395] 
[<ffffffff810818cb>] kthread+0xdb/0x100

2016-08-17T09:19:49.656879-06:00 rack13-ctrl2 kernel: [ 302.350434] 
[<ffffffff810817f0>] ? kthread_park+0x60/0x60

2016-08-17T09:19:49.656880-06:00 rack13-ctrl2 kernel: [ 302.350487] 
[<ffffffff815575cf>] ret_from_fork+0x3f/0x70

2016-08-17T09:19:49.656881-06:00 rack13-ctrl2 kernel: [ 302.350528] 
[<ffffffff810817f0>] ? kthread_park+0x60/0x60

2016-08-17T09:19:49.656882-06:00 rack13-ctrl2 kernel: [ 302.350567] Code: 41 5c 
41 5d 41 5e 41 5f 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 
57 41 56 41 55 41 54 49 89 fc 53 48 83 ec 18 <48> 8b 47 18 8b 5f 08 48 8b 90 e0 
12 00 00 8b 05 27 ff 00 00 83

2016-08-17T09:19:49.656883-06:00 rack13-ctrl2 kernel: [ 302.350870] RIP 
[<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]

2016-08-17T09:19:49.656884-06:00 rack13-ctrl2 kernel: [ 302.352594] RSP 
<ffff881ff93b3cc0>

2016-08-17T09:19:49.656886-06:00 rack13-ctrl2 kernel: [ 302.354220] CR2: 
0000000000000018

2016-08-17T09:19:49.656888-06:00 rack13-ctrl2 kernel: [ 302.355816] ---[ end
trace 3bc92e0fb0a9c178 ]---

2016-08-17T09:19:49.656894-06:00 rack13-ctrl2 kernel: [ 302.362309] BUG: unable 
to handle kernel paging request at ffffffffffffffd8

2016-08-17T09:19:49.670952-06:00 rack13-ctrl2 osafntfd[1776]: Started

2016-08-17T09:19:57.670994-06:00 rack13-ctrl2 osafntfd[1776]: MDTM:TIPC Failed 
to connect to topology server in mdtm_check_for_endianness err :Connection 
timed out

2016-08-17T09:19:57.671340-06:00 rack13-ctrl2 osafntfd[1776]: ER 
ncs_core_agents_startup FAILED

2016-08-17T09:19:57.671695-06:00 rack13-ctrl2 osafntfd[1776]: 
ncs_sel_obj_rmv_ind: recv failed - Socket operation on non-socket, raise_obj: 0 
rmv_obj: 0

2016-08-17T09:19:57.671935-06:00 rack13-ctrl2 osafntfd[1776]: osaf_abort(-1) 
called from 0x7f3fca8d8938 with errno=88

2016-08-17T09:19:57.693637-06:00 rack13-ctrl2 osafclmd[1783]: Started

2016-08-17T09:20:05.695009-06:00 rack13-ctrl2 osafclmd[1783]: MDTM:TIPC Failed 
to connect to topology server in mdtm_check_for_endianness err :Connection 
timed out

2016-08-17T09:20:05.695408-06:00 rack13-ctrl2 osafclmd[1783]: ER clms_init 
failed

2016-08-17T09:20:05.695678-06:00 rack13-ctrl2 osafclmd[1783]: ER Failed, 
exiting...

2016-08-17T09:20:05.695932-06:00 rack13-ctrl2 opensafd[1699]: ER Failed #012 
DESC:CLMD

2016-08-17T09:20:05.696303-06:00 rack13-ctrl2 opensafd[1699]: ER Going for 
recovery

====================================================================================
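
For triage, the key facts of an oops like the one above (fault address, faulting function, offset into it) can be extracted with a short script. This is a rough sketch with hypothetical helper names, fed with the first two records of the dump. The fault address 0x18, together with RDI being 0 and the marked Code bytes `<48> 8b 47 18` (a load from 0x18(%rdi)), looks consistent with reading a member at offset 0x18 through a NULL pointer, though that remains an interpretation.

```python
import re

FAULT_RE = re.compile(r"NULL pointer dereference at ([0-9a-f]{16})")
IP_RE = re.compile(
    r"IP:\s*\[<([0-9a-f]{16})>\]\s*(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
    r"(?:\s*\[(\w+)\])?"
)


def summarize_oops(text):
    """Extract fault address and faulting symbol from oops text, or None."""
    # Undo mail-client line wrapping before matching.
    flat = re.sub(r"\s+", " ", text)
    fault = FAULT_RE.search(flat)
    ip = IP_RE.search(flat)
    if not (fault and ip):
        return None
    return {
        "fault_address": int(fault.group(1), 16),
        "function": ip.group(2),
        "offset": int(ip.group(3), 16),   # offset of the faulting instruction
        "size": int(ip.group(4), 16),     # total size of the function
        "module": ip.group(5),            # None if the symbol is built in
    }


SAMPLE = """\
[ 302.348407] BUG: unable
to handle kernel NULL pointer dereference at 0000000000000018
[ 302.348474] IP:
[<ffffffffa0702749>] tipc_nametbl_subscribe+0x19/0x180 [tipc]
"""

summary = summarize_oops(SAMPLE)
print(summary)
```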




------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
