Mike Christie (mchristi) writes:
> 
> I've never seen this kernel crash before. It might be helpful to send
> more of the log before the kernel warning below.

        These are the messages leading up to the warning (pretty much the same,
        with the occasional notice about an ongoing deep scrub
        (2 active+clean+scrubbing+deep+repair)); at 05:09:34 the first ABORT_TASK
        shows up. I don't know how busy the cluster was, but we normally have no
        backups running at that time (though veeam might have been taking snapshots).


May 12 05:09:32 ceph4 ceph-mon[485955]: 2020-05-12 05:09:32.727 7f80cb9c4700  0 
mon.ceph4@3(peon) e13 handle_command mon_command({"prefix": "status", "format": 
"json"} v 0) v1
May 12 05:09:33 ceph8 ceph-mon[256815]: 2020-05-12 05:09:33.431 7f2082335700  0 
log_channel(audit) log [DBG] : from='client.61723502 10.254.212.60:0/72439561' 
entity='client.admin' cmd=[{"prefix": "osd lspools", "format": "json"}]: 
dispatch
May 12 05:09:33 ceph8 ceph-mon[256815]: 2020-05-12 05:09:33.431 7f2082335700  0 
mon.ceph8@2(peon) e13 handle_command mon_command({"prefix": "osd lspools", 
"format": "json"} v 0) v1
May 12 05:09:34 ceph10 ceph-mgr[242781]: 2020-05-12 05:09:34.547 7f2813721700  
0 log_channel(cluster) log [DBG] : pgmap v848512: 896 pgs: 2 
active+clean+scrubbing+deep+repair, 894 active+clean; 10 TiB data, 32 TiB used, 
87 TiB / 120 TiB avail; 3.3 MiB/s rd, 4.8 MiB/s wr, 313 op/s
May 12 05:09:34 ceph1 kernel: ABORT_TASK: Found referenced iSCSI task_tag: 
94555972
May 12 05:09:34 ceph1 kernel: Call Trace:
May 12 05:09:34 ceph1 kernel:  __cancel_work_timer+0x10a/0x190
May 12 05:09:34 ceph1 kernel: Code: 69 2c 04 00 0f 0b e9 4a d3 ff ff 48 c7 c7 
d8 f9 83 be e8 56 2c 04 00 0f 0b e9 41 d6 ff ff 48 c7 c7 d8 f9 83 be e8 43 2c 
04 00 <0f> 0b 45 31 ed e9 2b d6 ff ff 49 8d b4 24 b0 00 00 00 48 c7 c7 b8
May 12 05:09:34 ceph1 kernel:  core_tmr_abort_task+0xd6/0x130 [target_core_mod]
May 12 05:09:34 ceph1 kernel: CPU: 11 PID: 2448784 Comm: kworker/u32:0 Tainted: 
G        W         4.19.0-8-amd64 #1 Debian 4.19.98-1
May 12 05:09:34 ceph1 kernel: CR2: 0000555de2e07000 CR3: 000000048240a005 CR4: 
00000000003606e0
May 12 05:09:34 ceph1 kernel:  ? create_worker+0x1a0/0x1a0
May 12 05:09:34 ceph1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 12 05:09:34 ceph1 kernel: ------------[ cut here ]------------
May 12 05:09:34 ceph1 kernel:  dca usbcore i2c_i801 scsi_mod mfd_core mdio 
usb_common nvme_core crc32c_intel i2c_algo_bit wmi button
May 12 05:09:34 ceph1 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
May 12 05:09:34 ceph1 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
May 12 05:09:34 ceph1 kernel: ---[ end trace 3835b5fe0aa98ff0 ]---
May 12 05:09:34 ceph1 kernel: FS:  0000000000000000(0000) 
GS:ffff94f77fac0000(0000) knlGS:0000000000000000
May 12 05:09:34 ceph1 kernel:  ? get_work_pool+0x40/0x40
May 12 05:09:34 ceph1 kernel: Hardware name: Supermicro SYS-6018R-TD8/X10DDW-i, 
BIOS 3.2 12/16/2019
May 12 05:09:34 ceph1 kernel:  ? irq_work_queue+0x46/0x50
May 12 05:09:34 ceph1 kernel:  ? __irq_work_queue_local+0x50/0x60
May 12 05:09:34 ceph1 kernel:  kthread+0x112/0x130
May 12 05:09:34 ceph1 kernel:  ? kthread_bind+0x30/0x30
May 12 05:09:34 ceph1 kernel: Modules linked in: fuse cbc ceph libceph 
libcrc32c crc32c_generic fscache target_core_pscsi target_core_file 
target_core_iblock iscsi_target_mod target_core_user uio target_core_mod 
configfs binfmt_misc 8021q garp stp mrp llc bonding intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul 
crc32_pclmul ghash_clmulni_intel ast cryptd ipmi_ssif ttm intel_cstate 
drm_kms_helper intel_uncore mei_me iTCO_wdt drm pcc_cpufreq intel_rapl_perf 
joydev pcspkr sg iTCO_vendor_support mei ioatdma ipmi_si ipmi_devintf 
ipmi_msghandler acpi_pad acpi_power_meter evdev dm_mod sunrpc ip_tables 
x_tables autofs4 squashfs zstd_decompress xxhash loop overlay hid_generic 
usbhid hid sd_mod ahci xhci_pci ehci_pci libahci ehci_hcd xhci_hcd libata ixgbe 
igb nvme mxm_wmi lpc_ich
May 12 05:09:34 ceph1 kernel:  ? printk+0x58/0x6f
May 12 05:09:34 ceph1 kernel:  process_one_work+0x1a7/0x3a0

        [...]

> From the bit of log you sent, it looks like commands might have started
> to time out on the vmware side. That then kicked off the target layer's
> error handler, which we then barf in.
> 
> You used the ceph-iscsi tools to set this up right? You didn't use
> targetcli with tcmu-runner directly?

        Actually this bit is managed by croit.io, so I'm not 100% sure
        (we're using their software as we didn't want to deal with iscsi setup,
        among other things). They said they hadn't seen this particular error
        before, which is why I'm asking here (and totally ok if that means we're
        on our own).

        I've got the option of trying a 5.x kernel, if it helps.

> Just to double check: you didn't see old/slow requests as described here:
> 
> https://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
> 
> right?

        Absolutely none in the logs (going back 2 weeks).
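
        For reference, this is roughly how that check could be scripted against
        the mons via librados instead of eyeballing the logs. Just a rough sketch,
        assuming a readable /etc/ceph/ceph.conf plus admin keyring on this host;
        the health-check key names vary by release (REQUEST_SLOW on older
        releases, SLOW_OPS on newer ones):

# Rough sketch (not what croit deploys): poll cluster health via librados
# and flag any slow/blocked request warnings.
import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')  # path is an assumption
cluster.connect()
try:
    cmd = json.dumps({"prefix": "health", "detail": "detail", "format": "json"})
    ret, out, errs = cluster.mon_command(cmd, b'')
    if ret != 0:
        raise RuntimeError(errs)
    checks = json.loads(out).get("checks", {})
    # keep only the health checks that look like slow/stuck request warnings
    slow = {k: v.get("summary", {}).get("message", "")
            for k, v in checks.items() if "SLOW" in k or "REQUEST" in k}
    print("slow/blocked request checks:", slow if slow else "none")
finally:
    cluster.shutdown()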

> When you see the "ABORT_TASK" messages in the log it means vmware timed out
> the command. We see that when disks start to die, when a heavier than normal
> workload starts and commands take longer than usual, when there are connection
> issues, and sometimes when operations like scrubs are running.

        Apart from the 2 pgs scrubbing, no significant activity that I could
        see. There are only 4 VMs using iscsi at this point, with veeam doing
        regular snapshots in the background.
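
        If it's useful, I can tally when the ABORT_TASKs happen to see whether
        they line up with the veeam snapshot windows; something like this rough
        sketch (the log path and syslog timestamp format are assumptions based
        on the excerpt above):

# Hypothetical helper: count ABORT_TASK occurrences per hour in the gateway's
# kernel log, to correlate them with snapshot/backup windows.
import re
from collections import Counter

counts = Counter()
with open("/var/log/kern.log") as log:          # path is an assumption
    for line in log:
        if "ABORT_TASK" in line:
            # capture month, day and hour, e.g. "May 12 05"
            m = re.match(r"(\w+ +\d+ \d+):", line)
            if m:
                counts[m.group(1)] += 1

# note: keys sort lexicographically, which is fine within a single month
for hour, n in sorted(counts.items()):
    print(f"{hour}:xx  {n} ABORT_TASK events")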

        Wrt. LUN provisioning, would it improve performance or spread the
        workload more efficiently to allocate multiple RBDs rather than one
        large RBD for all VMs? (In case that could be part of the issue.)

        Cheers,
        Phil
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
