On 03/14/2016 04:35 PM, Jaroslav Pulchart wrote:
> Hello Corey,
>
> please see inline.
>
> 2016-03-10 16:40 GMT+01:00 Corey Minyard <[email protected]>:
>
>     Looking at the changes and the backtraces, I would guess something
>     is getting into a timer loop. But I'm not sure how that would
>     happen in this case, the timer start is jiffies + 10ms.  And none
>     of the backtraces have anything IPMI in them, and one is in the
>     netdev code.
>
>     Does this system use interrupts on IPMI?  You can look at the
>     system log messages when loading the IPMI module or look in
>     /proc/interrupts after it is loaded.  I can't see why it would
>     matter, but it might be useful information.
>
>
> yes, it uses interrupts according to this documentation 
> <http://en.community.dell.com/techcenter/b/techcenter/archive/2012/03/08/ipmi-kcs-interrupt-support-on-12g-servers-on-linux-oses>
>  
> and output from /proc/ipmi/0/si_stats:
>
> # cat /proc/ipmi/0/si_stats
> interrupts_enabled:    1
> short_timeouts:        0
> long_timeouts:         171361
> idles:                 1641377
> interrupts:            25665558
> attentions:            0
> flag_fetches:          0
> hosed_count:           0
> complete_transactions: 1572192
> events:                0
> watchdog_pretimeouts:  0
> incoming_messages:     0
>
> dmesg output:
>
> IPMI System Interface driver.
> ipmi_si: probing via SMBIOS
> ipmi_si: SMBIOS: io 0xca8 regsize 1 spacing 4 irq 10
> ipmi_si: Adding SMBIOS-specified kcs state machine
> ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 
> 0xca8, slave address 0x20, irq 10
> ipmi_si ipmi_si.0: Using irq 10
> ipmi_si ipmi_si.0: Could not set the global enables: 0xcc.

This message is not good.  Is this the only time it comes out, or does it
come out continuously when the system doesn't crash?

> ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0002a2, prod_id: 0x0100, 
> dev_id: 0x20)
> ipmi_si ipmi_si.0: IPMI kcs interface initialized
>
>
>     The easiest way to debug this would be to add some tracing to the
>     driver to see what is happening, then do a kdump and pull the data
>     from the kernel core.  The other way would be to add the patch a
>     bit at a time and see where it breaks.  You could start by
>     commenting out all but the last line of start_new_msg(), that
>     should put things functionally back exactly like they were before,
>     and would tell if it's due to starting the timer/thread or if the
>     problem is in the restructure.
>
>
> I started by commenting out everything except the last line in start_new_msg():
>
> @@ -417,10 +417,13 @@ static void smi_mod_timer(struct smi_info *smi_info, unsigned long new_val)
>  static void start_new_msg(struct smi_info *smi_info, unsigned char *msg,
>                           unsigned int size)
>  {
> -       smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_JIFFIES);
> +
> +/* Lets comment this new stuff */
> +/*     smi_mod_timer(smi_info, jiffies + SI_TIMEOUT_JIFFIES);
>
>         if (smi_info->thread)
>                 wake_up_process(smi_info->thread);
> +*/
>
>         smi_info->handlers->start_transaction(smi_info->si_sm, msg, size);
>  }
>
> With this change the system is stable. Let's take this as confirmation 
> that the issue is not in the restructure.
>

Dang.  I have no idea why this could happen.  Could you print out 
SI_TIMEOUT_JIFFIES in the init code to make sure it is not zero?
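
A userspace sketch of the arithmetic behind that check, assuming
SI_TIMEOUT_JIFFIES is derived from the 10ms timeout and HZ by integer
division. The SI_TIMEOUT_TIME_USEC name, its value, and the helper functions
below are assumptions for illustration, not taken from this thread; compare
them with the macros in your own tree. HZ is passed as a parameter here
because the kernel config value is not available in userspace.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed: 10ms timeout in microseconds (matches "jiffies + 10ms"). */
#define SI_TIMEOUT_TIME_USEC 10000UL

/* Microseconds per jiffy for a given HZ; integer division, as a macro would do. */
unsigned long si_usec_per_jiffy(unsigned long hz)
{
	assert(hz > 0 && hz <= 1000000UL);
	return 1000000UL / hz;
}

/* Timeout expressed in jiffies; truncates toward zero. */
unsigned long si_timeout_jiffies(unsigned long hz)
{
	return SI_TIMEOUT_TIME_USEC / si_usec_per_jiffy(hz);
}

/* Print the resulting timeout for a few common HZ settings. */
void print_si_timeouts(void)
{
	static const unsigned long hz_values[] = { 100, 250, 300, 1000 };
	size_t i;

	for (i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++)
		printf("HZ=%lu -> timeout=%lu jiffies\n",
		       hz_values[i], si_timeout_jiffies(hz_values[i]));
}
```

Under these assumptions the value only truncates to zero when HZ is below
100, so a zero on a typical config would point at something other than the
arithmetic.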

-corey


>
>     This is going to be hard to do remote.  Any way I can get access
>     and load kernels onto a system and test?
>
>
> I agree, unfortunately I cannot give you access to these servers. 
> We have to manage this issue remotely by email.
>
> -Jaroslav
>
>     -corey
>
>
>     On 03/10/2016 03:55 PM, Jaroslav Pulchart wrote:
>
>         Hello,
>
>         thanks for the response. Getting a backtrace is a little bit
>         difficult: I have to use the iDrac console at a terminal server and
>         VNC across continents; nevertheless it is "possible". I stored
>         a video, merged several frames into one picture, and the final
>         merge is attached as a PNG file.
>
>         There are several lockups; the beginning is different for each
>         reboot and depends on "Not tainted" / locked process.
>         Sometimes I can copy and paste the trace in this state (thanks to
>         ssh still working, if I'm lucky):
>
>
>         general protection fault: 0000 [#1] SMP
>         Modules linked in: uas usb_storage ip6table_filter ip6_tables
>         ebtable_nat ebtables mpt3sas mpt2sas scsi_transport_sas
>         raid_class mptctl mptbase dell_rbu xt_comment xt_CHECKSUM
>         xt_conntrack xt_nat iptable_nat nf_conntrack_ipv4
>         nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
>         iptable_filter ip_tables nfsv3 nfs_acl nfs fscache lockd
>         sunrpc grace 8021q garp bonding be2iscsi iscsi_boot_sysfs
>         bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3
>         cxgb3 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr
>         iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat
>         dm_multipath vhost_net macvtap macvlan vhost tun br_netfilter
>         bridge ipv6 stp llc ipmi_devintf joydev sg 8250_fintek
>         ipmi_ssif ipmi_msghandler acpi_pad iTCO_wdt
>         iTCO_vendor_support acpi_power_meter dcdbas ixgbe ptp pps_core
>         vxlan udp_tunnel ip6_udp_tunnel mdio coretemp hwmon
>         x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm
>         crct10dif_pclmul crc32_pclmul crc32c_intel microcode pcspkr
>         sb_edac edac_core lpc_ich mei_me mei ioatdma dca shpchp ext4
>         jbd2 mbcache sd_mod megaraid_sas aesni_intel ablk_helper
>         cryptd lrw gf128mul glue_helper aes_x86_64 wmi ttm
>         drm_kms_helper drm i2c_algo_bit sysimgblt sysfillrect
>         syscopyarea dm_mirror dm_region_hash dm_log dm_mod [last
>         unloaded: ipmi_si]
>         CPU: 31 PID: 11700 Comm: check_iostat.sh Not tainted
>         4.1.19-1.1.el6.gdc.x86_64 #1
>         Hardware name: Dell Inc. PowerEdge R720xd/0020HJ, BIOS 2.5.2
>         01/28/2015
>         task: ffff885f50a40e80 ti: ffff88018d82c000 task.ti:
>         ffff88018d82c000
>         RIP: 0010:[<ffffffff8111b1f7>] [<ffffffff8111b1f7>]
>         __audit_syscall_exit+0x117/0x2d0
>         RSP: 0018:ffff88018d82fed0  EFLAGS: 00010213
>         RAX: 0e41280ec1020683 RBX: ffff8801a1a3e800 RCX: ffff88018d82c000
>         RDX: 0000000000000080 RSI: 0000000000000000 RDI: dead000000200200
>         RBP: ffff88018d82ff10 R08: dead000000100100 R09: 0000000000000000
>         R10: ffffffff81070f82 R11: 0000000000000000 R12: ffff885f50a40e80
>         R13: 00000000004c4b40 R14: 300e410586280e41 R15: 300e410586280e41
>         FS:  00007f3243365700(0000) GS:ffff885f6f3c0000(0000)
>         knlGS:0000000000000000
>         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>         CR2: 00000000004c4b40 CR3: 0000000199ae3000 CR4: 00000000001406e0
>         Stack:
>          000000000001ad60 ffff8801a1a3ea48 00007f32433659d0
>         ffff88018d82ff58
>          0000000000000000 0000000000000000 0000000000e2cf60
>         0000000000e37180
>          ffff88018d82ff40 ffffffff8102331d ffff88018d82ff30
>         00007ffcaf7217f0
>         Call Trace:
>          [<ffffffff8102331d>] syscall_trace_leave+0x9d/0x110
>          [<ffffffff8163c921>] int_very_careful+0x38/0x41
>         Code: eb 11 66 90 4c 3b 7d c8 4d 8b 36 74 53 4d 89 fd 4d 89 f7
>         49 8b 45 08 48 bf 00 02 20 00 00 00 ad de 49 b8 00 01 10 00 00
>         00 ad de <49> 89 46 08 4c 89 30 49 89 7d 08 49 8b 7d 10 4d 89
>         45 00 48 85
>         RIP  [<ffffffff8111b1f7>] __audit_syscall_exit+0x117/0x2d0
>          RSP <ffff88018d82fed0>
>         ---[ end trace 6b5da3183e739ab3 ]---
>
>
>         However, "the end" is always the same (see attached PNG): the
>         system is completely unresponsive.
>
>         -Jaroslav
>
>         2016-03-09 23:39 GMT+01:00 Corey Minyard <[email protected]>:
>
>             On 03/09/2016 09:51 PM, Jaroslav Pulchart wrote:
>
>                 Hello Corey and Gouji
>
>                 I'm sorry for contacting you directly, however I have had
>                 "bad" experience in using the kernel's bugzilla to report
>                 some issues. I would like to start some discussion about a
>                 problem introduced by commit
>                 0cfec916e86d881e209de4b4ae9959a6271e6660 of the Linux
>                 kernel (4.1.x, 4.4.x):
>
>
>             Contacting directly is fine, that's what's normally done, though
>             it's best to copy the mail list, too.
>
>             Nobody else has reported this and it has been quite a while.  So
>             that's a little strange, but not unheard of.
>
>             Can you enable nmi watchdog and get a backtrace for this?  I have
>             no idea how that change could have caused a lockup. It's just
>             doing something for some messages (ones generated internally) that
>             was done on all other messages, so it's really nothing new.
>
>             -corey
>
>                 --------------------------------------------------------------------------------------
>                 commit 8dfca273353b9131dfd82c2720ccd78f89fd44ae
>                 Author: Corey Minyard <[email protected]>
>                 Date:   Sat Sep 5 17:44:13 2015 -0500
>
>                     ipmi: Start the timer and thread on internal msgs
>
>                     commit 0cfec916e86d881e209de4b4ae9959a6271e6660 upstream.
>
>                     The timer and thread were not being started for internal messages,
>                     so in interrupt mode if something hung the timer would never go
>                     off and clean things up.  Factor out the internal message sending
>                     and start the timer for those messages, too.
>
>                     Signed-off-by: Corey Minyard <[email protected]>
>                     Tested-by: Gouji, Masayuki <[email protected]>
>                     Signed-off-by: Greg Kroah-Hartman <[email protected]>
>                 --------------------------------------------------------------------------------------
>
>                 I found that Linux kernel >= 4.1.17 (with this commit) running
>                 on Dell R720xd servers will always panic with a report about
>                 "hard LOCKUP" after Dell's services are started (using IPMI).
>
>                 Reverting this commit from 4.1.17 (or .18, .19) fixes the issue.
>
>                 Please propose next steps. I can help you with the testing on
>                 these servers.
>
>                 Best regards,
>                 Jaroslav Pulchart
>
>
>
>
>

