Looking at the changes and the backtraces, I would guess something is getting into a timer loop. But I'm not sure how that would happen in this case, since the timer is started at jiffies + 10ms. None of the backtraces have anything IPMI in them, and one is in the netdev code.

Does this system use interrupts on IPMI? You can look at the system log messages when loading the IPMI module or look in /proc/interrupts after it is loaded. I can't see why it would matter, but it might be useful information.

The easiest way to debug this would be to add some tracing to the driver to see what is happening, then do a kdump and pull the data from the kernel core.

The other way would be to add the patch a bit at a time and see where it breaks. You could start by commenting out all but the last line of start_new_msg(). That should put things functionally back exactly like they were before, and would tell us whether it's due to starting the timer/thread or whether the problem is in the restructure.

This is going to be hard to do remotely. Is there any way I can get access and load kernels onto a system and test?

-corey

On 03/10/2016 03:55 PM, Jaroslav Pulchart wrote:
> Hello,
>
> thanks for the response. Getting a backtrace is a little bit difficult; I
> have to use the iDRAC console at a terminal server and VNC across continents.
> Nevertheless, it is "possible". I stored a video and merged several frames
> into one picture; the final merge is attached as a PNG file.
>
> There are several lockups; the beginning is different for each reboot
> and depends on "Not tainted" / locked process.
> Sometimes I can copy-paste the trace at this state (thanks to still-working
> ssh, if I'm lucky):
>
> general protection fault: 0000 [#1] SMP
> Modules linked in: uas usb_storage ip6table_filter ip6_tables
> ebtable_nat ebtables mpt3sas mpt2sas scsi_transport_sas raid_class
> mptctl mptbase dell_rbu xt_comment xt_CHECKSUM xt_conntrack xt_nat
> iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> nf_conntrack iptable_mangle iptable_filter ip_tables nfsv3 nfs_acl nfs
> fscache lockd sunrpc grace 8021q garp bonding be2iscsi
> iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi
> iw_cxgb3 cxgb3 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core
> ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
> dm_multipath vhost_net macvtap macvlan vhost tun br_netfilter bridge
> ipv6 stp llc ipmi_devintf joydev sg 8250_fintek ipmi_ssif
> ipmi_msghandler acpi_pad iTCO_wdt iTCO_vendor_support acpi_power_meter
> dcdbas ixgbe ptp pps_core vxlan udp_tunnel ip6_udp_tunnel mdio
> coretemp hwmon x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm
> crct10dif_pclmul crc32_pclmul crc32c_intel microcode pcspkr sb_edac
> edac_core lpc_ich mei_me mei ioatdma dca shpchp ext4 jbd2 mbcache
> sd_mod megaraid_sas aesni_intel ablk_helper cryptd lrw gf128mul
> glue_helper aes_x86_64 wmi ttm drm_kms_helper drm i2c_algo_bit
> sysimgblt sysfillrect syscopyarea dm_mirror dm_region_hash dm_log
> dm_mod [last unloaded: ipmi_si]
> CPU: 31 PID: 11700 Comm: check_iostat.sh Not tainted 4.1.19-1.1.el6.gdc.x86_64 #1
> Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.5.2 01/28/2015
> task: ffff885f50a40e80 ti: ffff88018d82c000 task.ti: ffff88018d82c000
> RIP: 0010:[<ffffffff8111b1f7>] [<ffffffff8111b1f7>] __audit_syscall_exit+0x117/0x2d0
> RSP: 0018:ffff88018d82fed0 EFLAGS: 00010213
> RAX: 0e41280ec1020683 RBX: ffff8801a1a3e800 RCX: ffff88018d82c000
> RDX: 0000000000000080 RSI: 0000000000000000 RDI: dead000000200200
> RBP: ffff88018d82ff10 R08: dead000000100100 R09: 0000000000000000
> R10: ffffffff81070f82 R11: 0000000000000000 R12: ffff885f50a40e80
> R13: 00000000004c4b40 R14: 300e410586280e41 R15: 300e410586280e41
> FS: 00007f3243365700(0000) GS:ffff885f6f3c0000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000004c4b40 CR3: 0000000199ae3000 CR4: 00000000001406e0
> Stack:
>  000000000001ad60 ffff8801a1a3ea48 00007f32433659d0 ffff88018d82ff58
>  0000000000000000 0000000000000000 0000000000e2cf60 0000000000e37180
>  ffff88018d82ff40 ffffffff8102331d ffff88018d82ff30 00007ffcaf7217f0
> Call Trace:
>  [<ffffffff8102331d>] syscall_trace_leave+0x9d/0x110
>  [<ffffffff8163c921>] int_very_careful+0x38/0x41
> Code: eb 11 66 90 4c 3b 7d c8 4d 8b 36 74 53 4d 89 fd 4d 89 f7 49 8b
> 45 08 48 bf 00 02 20 00 00 00 ad de 49 b8 00 01 10 00 00 00 ad de <49>
> 89 46 08 4c 89 30 49 89 7d 08 49 8b 7d 10 4d 89 45 00 48 85
> RIP [<ffffffff8111b1f7>] __audit_syscall_exit+0x117/0x2d0
>  RSP <ffff88018d82fed0>
> ---[ end trace 6b5da3183e739ab3 ]---
>
> However, "the end" is always the same (see attached PNG): the system is
> completely unresponsive.
>
> -Jaroslav
>
> 2016-03-09 23:39 GMT+01:00 Corey Minyard <[email protected]>:
>
> > On 03/09/2016 09:51 PM, Jaroslav Pulchart wrote:
> >
> > > Hello Corey and Gouji,
> > >
> > > I'm sorry for contacting you directly, however I have "bad"
> > > experience in using Kernel's bugzilla to report some issues.
> > > I would like to start a discussion about a problem introduced
> > > by commit 0cfec916e86d881e209de4b4ae9959a6271e6660 of the Linux
> > > kernel (4.1.x, 4.4.x):
> >
> > Contacting directly is fine, that's what's normally done, though
> > it's best to copy the mail list, too.
> >
> > Nobody else has reported this and it has been quite a while. So
> > that's a little strange, but not unheard of.
> >
> > Can you enable the NMI watchdog and get a backtrace for this? I have
> > no idea how that change could have caused a lockup. It's just
> > doing something for some messages (ones generated internally) that
> > was done on all other messages, so it's really nothing new.
> >
> > -corey
> >
> > > --------------------------------------------------------------------------------------
> > > commit 8dfca273353b9131dfd82c2720ccd78f89fd44ae
> > > Author: Corey Minyard <[email protected]>
> > > Date: Sat Sep 5 17:44:13 2015 -0500
> > >
> > >     ipmi: Start the timer and thread on internal msgs
> > >
> > >     commit 0cfec916e86d881e209de4b4ae9959a6271e6660 upstream.
> > >
> > >     The timer and thread were not being started for internal messages,
> > >     so in interrupt mode if something hung the timer would never go
> > >     off and clean things up. Factor out the internal message sending
> > >     and start the timer for those messages, too.
> > >
> > >     Signed-off-by: Corey Minyard <[email protected]>
> > >     Tested-by: Gouji, Masayuki <[email protected]>
> > >     Signed-off-by: Greg Kroah-Hartman <[email protected]>
> > >
> > > --------------------------------------------------------------------------------------
> > >
> > > I found that Linux kernel >= 4.1.17 (with this commit) running
> > > on Dell R720xd servers will always panic with a report about
> > > "hard LOCKUP" after Dell's services are started (using IPMI).
> > >
> > > Reverting this commit from 4.1.17 (or .18, .19) fixes the issue.
> > >
> > > Please propose next steps. I can help you with the testing on
> > > these servers.
> > >
> > > Best regards,
> > > Jaroslav Pulchart
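
For the interrupt question at the top of this thread, here is a minimal sketch of one way to check /proc/interrupts after the IPMI module is loaded. It is only an illustration: the function names are made up for this example, and it assumes the driver's interrupt line contains the string "ipmi" (for example ipmi_si), which may not hold on every system. An empty result would suggest the interface is not using an interrupt, but that should be confirmed against the system log messages from module load.

```python
def ipmi_irq_lines(interrupts_text):
    """Return the lines of /proc/interrupts-style text that mention IPMI.

    Assumption: the IPMI driver's IRQ shows up with "ipmi" somewhere in
    the device-name column (case-insensitive match).
    """
    return [
        line.strip()
        for line in interrupts_text.splitlines()
        if "ipmi" in line.lower()
    ]


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the interrupt table; the default path exists only on Linux."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    try:
        matches = ipmi_irq_lines(read_proc_interrupts())
    except OSError as err:
        print("could not read /proc/interrupts:", err)
    else:
        if matches:
            print("IPMI interrupt lines found:")
            for line in matches:
                print("  " + line)
        else:
            print("no IPMI interrupt lines found")
```

Run it once right after loading the module and again after the Dell services have started; comparing the per-CPU counters between the two runs shows whether the interrupt is actually firing.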
