[Bug 199435] HPSA + P420i resetting logical Direct-Access never complete
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #27 from Gaetan Trellu (gaetan.tre...@incloudus.com) ---

Compiling the hpsa kernel module from SourceForge on Ubuntu 16.04 with kernel 4.4 solved the issue for us. Steps:

# apt-get install dkms build-essential
# tar xjvf hpsa-3.4.20-141.tar.bz2
# cd hpsa-3.4.20/drivers/
# sudo cp -a scsi /usr/src/hpsa-3.4.20.141
# dkms add -m hpsa -v 3.4.20.141
# dkms build -m hpsa -v 3.4.20.141
# dkms install -m hpsa -v 3.4.20.141

Link: https://sourceforge.net/projects/cciss/

-- You are receiving this mail because: You are the assignee for the bug.
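Editor's note: the dkms add/build/install steps above assume a dkms.conf exists in /usr/src/hpsa-3.4.20.141. If the tarball does not ship one, a minimal sketch could look like the following; the variable values are assumptions inferred from the commands above, not taken from the tarball:

```shell
# /usr/src/hpsa-3.4.20.141/dkms.conf -- hypothetical minimal DKMS recipe
PACKAGE_NAME="hpsa"
PACKAGE_VERSION="3.4.20.141"
BUILT_MODULE_NAME[0]="hpsa"
BUILT_MODULE_LOCATION[0]="."
DEST_MODULE_LOCATION[0]="/kernel/drivers/scsi/"
MAKE[0]="make -C ${kernel_source_dir} M=${dkms_tree}/${PACKAGE_NAME}/${PACKAGE_VERSION}/build modules"
CLEAN="make -C ${kernel_source_dir} M=${dkms_tree}/${PACKAGE_NAME}/${PACKAGE_VERSION}/build clean"
AUTOINSTALL="yes"
```

After `dkms install`, `modinfo hpsa` can be used to confirm which driver version the running kernel will load.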
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #26 from Gaetan Trellu (gaetan.tre...@incloudus.com) ---

Moved from Ubuntu 16.04.5 to CentOS 7.5 with the hpsa kernel module (kmod-hpsa-3.4.20-141.rhel7u5.x86_64.rpm) from the HPE website. It has been running without a kernel panic for more than a week.
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #25 from Gaetan Trellu (gaetan.tre...@incloudus.com) ---

More logs.

[5.272077] HP HPSA Driver (v 3.4.14-0)
[5.340589] hpsa 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
[5.352372] hpsa 0000:03:00.0: MSI-X capable controller
[5.358775] hpsa 0000:03:00.0: Logical aborts not supported
[5.410577] scsi host6: hpsa
[5.620173] hpsa 0000:03:00.0: scsi 6:3:0:0: added RAID HP P440ar controller SSDSmartPathCap- En- Exp=1
[5.633345] hpsa 0000:03:00.0: scsi 6:0:0:0: masked Direct-Access ATA TK0120GDJXT PHYS DRV SSDSmartPathCap- En- Exp=0
[5.651921] hpsa 0000:03:00.0: scsi 6:0:1:0: masked Direct-Access ATA TK0120GDJXT PHYS DRV SSDSmartPathCap- En- Exp=0
[5.682879] ata6.00: ATA-9: VR0120GEJXL, 4IWTHPG0, max UDMA/100
[5.682891] ata5.00: ATA-9: VR0120GEJXL, 4IWTHPG0, max UDMA/100
[5.800257] hpsa 0000:03:00.0: scsi 6:0:2:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.813417] hpsa 0000:03:00.0: scsi 6:0:3:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.826488] hpsa 0000:03:00.0: scsi 6:0:4:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.839558] hpsa 0000:03:00.0: scsi 6:0:5:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.852628] hpsa 0000:03:00.0: scsi 6:0:6:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.865698] hpsa 0000:03:00.0: scsi 6:0:7:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.878769] hpsa 0000:03:00.0: scsi 6:0:8:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.891839] hpsa 0000:03:00.0: scsi 6:0:9:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.904910] hpsa 0000:03:00.0: scsi 6:0:10:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.918076] hpsa 0000:03:00.0: scsi 6:0:11:0: masked Direct-Access ATA MB3000GCWDB PHYS DRV SSDSmartPathCap- En- Exp=0
[5.931242] hpsa 0000:03:00.0: scsi 6:0:12:0: masked Direct-Access ATA TK0120GDJXT PHYS DRV SSDSmartPathCap- En- Exp=0
[5.92] hpsa 0000:03:00.0: scsi 6:0:13:0: masked Direct-Access ATA TK0120GDJXT PHYS DRV SSDSmartPathCap- En- Exp=0
[5.957609] hpsa 0000:03:00.0: scsi 6:0:14:0: masked Enclosure HPE 12G SAS Exp Card enclosure SSDSmartPathCap- En- Exp=0
[5.970871] hpsa 0000:03:00.0: scsi 6:1:0:0: added Direct-Access HP LOGICAL VOLUME RAID-1(+0) SSDSmartPathCap+ En+ Exp=1
[5.984038] hpsa 0000:03:00.0: scsi 6:1:0:1: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En+ Exp=1
[5.996822] hpsa 0000:03:00.0: scsi 6:1:0:2: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap+ En+ Exp=1
[6.009606] hpsa 0000:03:00.0: scsi 6:1:0:3: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.022391] hpsa 0000:03:00.0: scsi 6:1:0:4: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.035176] hpsa 0000:03:00.0: scsi 6:1:0:5: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.047960] hpsa 0000:03:00.0: scsi 6:1:0:6: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.060759] hpsa 0000:03:00.0: scsi 6:1:0:7: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.073545] hpsa 0000:03:00.0: scsi 6:1:0:8: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.086329] hpsa 0000:03:00.0: scsi 6:1:0:9: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.099113] hpsa 0000:03:00.0: scsi 6:1:0:10: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.111991] hpsa 0000:03:00.0: scsi 6:1:0:11: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.124869] hpsa 0000:03:00.0: scsi 6:1:0:12: added Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[6.138251] scsi 6:0:0:0: RAID HP P440ar 6.60 PQ: 0 ANSI: 5
[6.147610] scsi 6:1:0:0: Direct-Access HP LOGICAL VOLUME 6.60 PQ: 0 ANSI: 5
[6.156967] scsi 6:1:0:1: Direct-Access HP LOGICAL VOLUME 6.60 PQ: 0 ANSI: 5
[6.171837] scsi 6:1:0:2: Direct-Access HP LOGICAL VOLUME 6.60 PQ: 0 ANSI: 5
[6.181197] scsi 6:1:0:3: Direct-Access HP LOGICAL VOLUME 6.60 PQ: 0 ANSI: 5
[6.190653] scsi 6:1:0:4: Direct-Access HP LOGICAL VOLUME 6.60 PQ: 0 ANSI: 5
https://bugzilla.kernel.org/show_bug.cgi?id=199435

goldyfruit (gaetan.tre...@incloudus.com) changed:

           What    |Removed |Added
           CC      |        |gaetan.tre...@incloudus.com

--- Comment #24 from goldyfruit (gaetan.tre...@incloudus.com) ---

Same behavior here with controllers P440ar and P420i on DL480 G8 and DL480p G8.

Firmware:
- P440ar: 6.60
- P420i: 8.32

[128958.979859] hpsa 0000:08:00.0: scsi 0:1:0:9: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[129170.663840] INFO: task scsi_eh_0:446 blocked for more than 120 seconds.
[129170.671251] Not tainted 4.15.0-33-generic #36~16.04.1-Ubuntu
[129170.678176] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129170.686930] scsi_eh_0 D0 446 2 0x8000
[129170.686934] Call Trace:
[129170.686945] __schedule+0x3d6/0x8b0
[129170.686947] schedule+0x36/0x80
[129170.686950] schedule_timeout+0x1db/0x370
[129170.686954] ? __dev_printk+0x3c/0x80
[129170.686956] ? dev_printk+0x56/0x80
[129170.686959] io_schedule_timeout+0x1e/0x50
[129170.686961] wait_for_completion_io+0xb4/0x140
[129170.686965] ? wake_up_q+0x70/0x70
[129170.686972] hpsa_scsi_do_simple_cmd.isra.56+0xc7/0xf0 [hpsa]
[129170.686975] hpsa_eh_device_reset_handler+0x3bb/0x790 [hpsa]
[129170.686978] ? sched_clock_cpu+0x11/0xb0
[129170.686983] ? scsi_device_put+0x2b/0x30
[129170.686987] scsi_eh_ready_devs+0x368/0xc10
[129170.686993] ? __pm_runtime_resume+0x5b/0x80
[129170.686995] scsi_error_handler+0x4c3/0x5c0
[129170.687000] kthread+0x105/0x140
[129170.687003] ? scsi_eh_get_sense+0x240/0x240
[129170.687005] ? kthread_destroy_worker+0x50/0x50
[129170.687012] ? do_syscall_64+0x73/0x130
[129170.687015] ? SyS_exit_group+0x14/0x20
[129170.687017] ret_from_fork+0x35/0x40
[129170.687021] INFO: task jbd2/sda1-8:636 blocked for more than 120 seconds.
[129170.694649] Not tainted 4.15.0-33-generic #36~16.04.1-Ubuntu
[129170.701598] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129170.710343] jbd2/sda1-8 D0 636 2 0x8000
[129170.710346] Call Trace:
[129170.710349] __schedule+0x3d6/0x8b0
[129170.710351] ? bit_wait+0x60/0x60
[129170.710352] schedule+0x36/0x80
[129170.710354] io_schedule+0x16/0x40
[129170.710359] bit_wait_io+0x11/0x60
[129170.710362] __wait_on_bit+0x63/0x90
[129170.710367] out_of_line_wait_on_bit+0x8e/0xb0
[129170.710373] ? bit_waitqueue+0x40/0x40
[129170.710377] __wait_on_buffer+0x32/0x40
[129170.710381] jbd2_journal_commit_transaction+0xdf6/0x1760
[129170.710387] kjournald2+0xc8/0x250
[129170.710392] ? kjournald2+0xc8/0x250
[129170.710395] ? wait_woken+0x80/0x80
[129170.710398] kthread+0x105/0x140
[129170.710399] ? commit_timeout+0x20/0x20
[129170.710402] ? kthread_destroy_worker+0x50/0x50
[129170.710404] ? do_syscall_64+0x73/0x130
[129170.710407] ? SyS_exit_group+0x14/0x20
[129170.710412] ret_from_fork+0x35/0x40
[129170.710423] INFO: task rs:main Q:Reg:2907 blocked for more than 120 seconds.
[129170.718358] Not tainted 4.15.0-33-generic #36~16.04.1-Ubuntu
[129170.725305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[129170.734076] rs:main Q:Reg D0 2907 1 0x
[129170.734079] Call Trace:
[129170.734082] __schedule+0x3d6/0x8b0
[129170.734086] ? bit_waitqueue+0x40/0x40
[129170.734087] ? bit_wait+0x60/0x60
[129170.734089] schedule+0x36/0x80
[129170.734091] io_schedule+0x16/0x40
[129170.734092] bit_wait_io+0x11/0x60
[129170.734094] __wait_on_bit+0x63/0x90
[129170.734096] out_of_line_wait_on_bit+0x8e/0xb0
[129170.734098] ? bit_waitqueue+0x40/0x40
[129170.734100] do_get_write_access+0x202/0x410
[129170.734102] jbd2_journal_get_write_access+0x51/0x70
[129170.734107] __ext4_journal_get_write_access+0x3b/0x80
[129170.734111] ext4_reserve_inode_write+0x95/0xc0
[129170.734115] ? ext4_dirty_inode+0x48/0x70
[129170.734117] ext4_mark_inode_dirty+0x53/0x1d0
[129170.734119] ? __ext4_journal_start_sb+0x6d/0x120
[129170.734121] ext4_dirty_inode+0x48/0x70
[129170.734125] __mark_inode_dirty+0x184/0x3b0
[129170.734129] generic_update_time+0x7b/0xd0
[129170.734132] ? current_time+0x32/0x70
[129170.734134] file_update_time+0xbe/0x110
[129170.734140] __generic_file_write_iter+0x9d/0x1f0
[129170.734142] ext4_file_write_iter+0xc4/0x3f0
[129170.734147] ? futex_wake+0x90/0x170
[129170.734151] new_sync_write+0xe5/0x140
[129170.734155] __vfs_write+0x29/0x40
[129170.734156] vfs_write+0xb8/0x1b0
[129170.734158] SyS_write+0x55/0xc0
[129170.734160] do_syscall_64+0x73/0x130
[129170.734163] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[129170.734165] RIP: 0033:0x7feefa9394bd
[129170.734166] RSP: 002b:7feef7ce8600 EFLAGS: 0293 ORIG_RAX: 0001
[129170.734168] RAX: ffda RBX: 7feeec00d120 RCX:
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #23 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

Oh, I forgot to mention that before the hpsa driver took any action, I had several errors on the disk the badblocks run was targeting:

...
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 Sense Key : Medium Error [current]
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 Add. Sense: Unrecovered read error
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 CDB: Read(16) 88 00 00 00 00 01 37 0c c5 b0 00 00 00 08 00 00
[Mon Apr 30 22:21:18 2018] print_req_error: critical medium error, dev sdt, sector 5218551216
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] Unaligned partial completion (resid=242, sector_sz=512)
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 Sense Key : Medium Error [current]
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 Add. Sense: Unrecovered read error
[Mon Apr 30 22:21:18 2018] sd 0:1:0:19: [sdt] tag#0 CDB: Read(16) 88 00 00 00 00 01 37 0c c5 b8 00 00 00 08 00 00
[Mon Apr 30 22:21:18 2018] print_req_error: critical medium error, dev sdt, sector 5218551224
[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: aborted: LUN:00c03901 CDB:12003100
[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: scsi 0:0:50:0: removed Direct-Access ATA MB4000GCWDC PHYS DRV SSDSmartPathCap- En- Exp=0
[Tue May 1 06:28:24 2018] hpsa 0000:08:00.0: aborted: LUN:00c03901 CDB:12003100
[Tue May 1 06:28:24 2018] hpsa 0000:08:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
...
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #22 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

Created attachment 275723
--> https://bugzilla.kernel.org/attachment.cgi?id=275723&action=edit
Load on server during reset problem
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #21 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

I have reproduced the problem. Here are the conditions I used:

Kernel: 4.16.3-041603-generic
hpsa: 3.4.20-125, with the patch to use a local work-queue instead of the system work-queue.

I needed to run badblocks in a read-only test on a disk that had failed before:

~# while :; do badblocks -v -b 4096 -s /dev/sdt; done

Several days later, the bug appeared. You'll find a graph of the load in an attachment.

Before the reset, I got an "hpsa_update_device_info: inquiry failed" and a stack trace from badblocks (the latter seems expected).

Load: 850

[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: aborted: LUN:00c03901 CDB:12003100
[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
[Tue May 1 06:27:37 2018] hpsa 0000:08:00.0: scsi 0:0:50:0: removed Direct-Access ATA MB4000GCWDC PHYS DRV SSDSmartPathCap- En- Exp=0
[Tue May 1 06:28:24 2018] hpsa 0000:08:00.0: aborted: LUN:00c03901 CDB:12003100
[Tue May 1 06:28:24 2018] hpsa 0000:08:00.0: hpsa_update_device_info: inquiry failed, device will be skipped.
[Tue May 1 06:29:51 2018] INFO: task badblocks:46824 blocked for more than 120 seconds.
[Tue May 1 06:29:51 2018] Tainted: G OE 4.16.3-041603-generic #201804190730
[Tue May 1 06:29:51 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue May 1 06:29:51 2018] badblocks D0 46824 48728 0x0004
[Tue May 1 06:29:51 2018] Call Trace:
[Tue May 1 06:29:51 2018] __schedule+0x297/0x880
[Tue May 1 06:29:51 2018] ? iov_iter_get_pages+0xc0/0x2c0
[Tue May 1 06:29:51 2018] schedule+0x2c/0x80
[Tue May 1 06:29:51 2018] io_schedule+0x16/0x40
[Tue May 1 06:29:51 2018] __blkdev_direct_IO_simple+0x1ff/0x360
[Tue May 1 06:29:51 2018] ? bdget+0x120/0x120
[Tue May 1 06:29:51 2018] blkdev_direct_IO+0x3a2/0x3f0
[Tue May 1 06:29:51 2018] ? blkdev_direct_IO+0x3a2/0x3f0
[Tue May 1 06:29:51 2018] ? current_time+0x32/0x70
[Tue May 1 06:29:51 2018] ? __atime_needs_update+0x7f/0x190
[Tue May 1 06:29:51 2018] generic_file_read_iter+0xc6/0xc10
[Tue May 1 06:29:51 2018] ? __blkdev_direct_IO_simple+0x360/0x360
[Tue May 1 06:29:51 2018] ? generic_file_read_iter+0xc6/0xc10
[Tue May 1 06:29:51 2018] ? __wake_up+0x13/0x20
[Tue May 1 06:29:51 2018] ? tty_ldisc_deref+0x16/0x20
[Tue May 1 06:29:51 2018] ? tty_write+0x1fb/0x320
[Tue May 1 06:29:51 2018] blkdev_read_iter+0x35/0x40
[Tue May 1 06:29:51 2018] __vfs_read+0xfb/0x170
[Tue May 1 06:29:51 2018] vfs_read+0x8e/0x130
[Tue May 1 06:29:51 2018] SyS_read+0x55/0xc0
[Tue May 1 06:29:51 2018] do_syscall_64+0x73/0x130
[Tue May 1 06:29:51 2018] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[Tue May 1 06:29:51 2018] RIP: 0033:0x7fe31b97c330
[Tue May 1 06:29:51 2018] RSP: 002b:7fffcea10258 EFLAGS: 0246 ORIG_RAX:
[Tue May 1 06:29:51 2018] RAX: ffda RBX: 026e1980 RCX: 7fe31b97c330
[Tue May 1 06:29:51 2018] RDX: 0004 RSI: 7fe31c26e000 RDI: 0003
[Tue May 1 06:29:51 2018] RBP: 1000 R08: 26e19800 R09: 7fffcea10008
[Tue May 1 06:29:51 2018] R10: 7fffcea10020 R11: 0246 R12: 0003
[Tue May 1 06:29:51 2018] R13: 7fe31c26e000 R14: 0040 R15: 0004
[Tue May 1 06:31:52 2018] INFO: task badblocks:46824 blocked for more than 120 seconds.
[Tue May 1 06:31:52 2018] Tainted: G OE 4.16.3-041603-generic #201804190730
[Tue May 1 06:31:52 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue May 1 06:31:52 2018] badblocks D0 46824 48728 0x0004
[Tue May 1 06:31:52 2018] Call Trace:
[Tue May 1 06:31:52 2018] __schedule+0x297/0x880
[Tue May 1 06:31:52 2018] ? iov_iter_get_pages+0xc0/0x2c0
[Tue May 1 06:31:52 2018] schedule+0x2c/0x80
[Tue May 1 06:31:52 2018] io_schedule+0x16/0x40
[Tue May 1 06:31:52 2018] __blkdev_direct_IO_simple+0x1ff/0x360
[Tue May 1 06:31:52 2018] ? bdget+0x120/0x120
[Tue May 1 06:31:52 2018] blkdev_direct_IO+0x3a2/0x3f0
[Tue May 1 06:31:52 2018] ? blkdev_direct_IO+0x3a2/0x3f0
[Tue May 1 06:31:52 2018] ? current_time+0x32/0x70
[Tue May 1 06:31:52 2018] ? __atime_needs_update+0x7f/0x190
[Tue May 1 06:31:52 2018] generic_file_read_iter+0xc6/0xc10
[Tue May 1 06:31:52 2018] ? __blkdev_direct_IO_simple+0x360/0x360
[Tue May 1 06:31:52 2018] ? generic_file_read_iter+0xc6/0xc10
[Tue May 1 06:31:52 2018] ? __wake_up+0x13/0x20
[Tue May 1 06:31:52 2018] ? tty_ldisc_deref+0x16/0x20
[Tue May 1 06:31:52 2018] ? tty_write+0x1fb/0x320
[Tue May 1 06:31:52 2018] blkdev_read_iter+0x35/0x40
[Tue May 1 06:31:52 2018] __vfs_read+0xfb/0x170
[Tue May 1 06:31:52 2018] vfs_read+0x8e/0x130
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #20 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

So here are all my tests. With the agent enabled, using the HP disk check commands (hpacucli/ssacli and hpssacli) and launching a sg_reset, the reset completes without problem on the problematic disk:

Apr 26 14:31:20 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
Apr 26 14:31:21 kernel: hpsa 0000:08:00.0: device is ready.
Apr 26 14:31:21 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1

The reset only took 1 second. The "bug" seems to appear only when the disk returns Unrecovered read errors (for example when running a badblocks read-only test). I am trying to reproduce it.
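Editor's note: for reference, a reset like the one above can be provoked by hand with sg_reset from the sg3_utils package. A sketch follows; /dev/sg1 is a placeholder for the sg node of the logical volume under test (`lsscsi -g` shows the mapping), and the guard makes the snippet a no-op on machines without the tool or the device:

```shell
# Issue a SCSI device (LUN) reset against a placeholder sg node.
DEV=/dev/sg1
if command -v sg_reset >/dev/null 2>&1 && [ -e "$DEV" ]; then
    # -d requests a device reset (as opposed to -b for bus or -h for host reset)
    sg_reset -d "$DEV"
else
    echo "sg_reset or $DEV not available; skipping"
fi
```

Watching dmesg while this runs shows whether the "resetting logical ... reset logical completed successfully" pair arrives quickly or hangs.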
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #19 from lober...@redhat.com ---

I was concerned about the agents, but Anthony disabled them and still saw this. I have seen this timeout sometimes when the agents probe via passthrough.

I did just bump into this reset on a RHEL 7.5 kernel with no agents, but it recovered almost immediately. I need to chase that down.
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #18 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

Unfortunately I don't know what process was consuming the CPU cycles at that time. I'll try to reproduce the problem to get that information.

I'm not using sg_reset to test the LV reset; I am actually launching a badblocks command on a problematic disk, and the reset is invoked when it begins to fail. I'll use sg_reset to reproduce the problem and test with and without the agent.

I invoke the agent every 5 minutes to check the controller and disk states.

I'll keep you informed of my tests. By the way, thank you for your help.
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #17 from Don (don.br...@microsemi.com) ---

(In reply to Anthony Hausman from comment #16)
> So I'm actually running the kernel 4.16.3 (build 18-04-19) with the hpsa
> module patched to use the local work-queue instead of the system work-queue.
>
> I have reproduced a reset with no stack trace (which is good news).
> The only thing is that 2 hours passed between the start of the logical reset
> and its completion, causing a heavy load on the server during this time:
>
> Apr 25 01:31:09 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
> Apr 25 03:31:00 kernel: hpsa 0000:08:00.0: device is ready.
> Apr 25 03:31:00 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
>
> The good thing is that after the reset completed, the device was removed:
>
> Apr 25 03:31:45 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: removed Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1

The driver was notified by the P420i that the volume went offline, so the driver removed it from the SML.

> Apr 25 03:31:48 kernel: scsi 0:1:0:0: rejecting I/O to dead device

There were I/O requests for the device, but the SML detected that it was deleted.

> So the question is whether it is normal for the logical reset to take such a
> long time (and cause trouble on the server)?

It is not normal. For a logical volume reset, the P420i flushes out any outstanding I/O requests and then returns. The SML should block any new requests from coming down while the reset is in progress.

Do you know what process was consuming the CPU cycles?

ps -deo psr,pid,cls,cmd:50,pmem,size,vsz,nice,psr,pcpu,wchan:30,comm:30 | sort -nk1 | head -20

Are you using sg_reset to test LV resets? Or does the device have some intermittent issue which is causing the SML to issue the reset operation?

If you turn off the agents, do the resets complete more quickly? I am wondering if the agents are frequently probing the P420i for changes while the reset is active and the agents are consuming the CPU cycles.
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #16 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

Don,

So I'm actually running the kernel 4.16.3 (build 18-04-19) with the hpsa module patched to use the local work-queue instead of the system work-queue.

I have reproduced a reset with no stack trace (which is good news). The only thing is that 2 hours passed between the start of the logical reset and its completion, causing a heavy load on the server during this time:

Apr 25 01:31:09 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
Apr 25 03:31:00 kernel: hpsa 0000:08:00.0: device is ready.
Apr 25 03:31:00 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1

The good thing is that after the reset completed, the device was removed:

Apr 25 03:31:45 kernel: hpsa 0000:08:00.0: scsi 0:1:0:0: removed Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
Apr 25 03:31:48 kernel: scsi 0:1:0:0: rejecting I/O to dead device

So the question is whether it is normal for the logical reset to take such a long time (and cause trouble on the server).
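Editor's note: the duration Anthony reports can be checked by subtracting the two syslog timestamps quoted in comment #16. A small sketch using GNU date (the log lines carry no year; date assumes the current one, which is fine since both stamps fall in the same year):

```shell
# Compute the gap between the "resetting logical" and
# "reset logical completed successfully" syslog lines quoted above.
start="Apr 25 01:31:09"
end="Apr 25 03:31:00"
s=$(date -d "$start" +%s)   # seconds since epoch for the reset start
e=$(date -d "$end" +%s)     # seconds since epoch for the completion
echo "reset took $(( (e - s) / 60 )) minutes"
# prints: reset took 119 minutes
```

That is just under the 2 hours mentioned in the comment, which makes a fixed internal timeout in the firmware or driver a plausible suspect.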
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #15 from Don (don.br...@microsemi.com) ---

(In reply to Anthony Hausman from comment #11)
> The only patch that I'm sure that I have is the "scsi: hpsa: fix selection
> of reply queue" one. Otherwise I'm using an out-of-the-box 4.11 kernel, so
> I'm really not sure that the other patches are present.
>
> Unfortunately, the module does not compile using 4.11.0-14-generic headers.
>
> # make -C /lib/modules/4.11.0-14-generic/build M=$(pwd) --makefile="/root/hpsa-3.4.20-136/hpsa-3.4.20/drivers/scsi/Makefile.alt"
> make: Entering directory '/usr/src/linux-headers-4.11.0-14-generic'
> make -C /lib/modules/4.4.0-96-generic/build M=/usr/src/linux-headers-4.11.0-14-generic EXTRA_CFLAGS+=-DKCLASS4A modules
> make[1]: Entering directory '/usr/src/linux-headers-4.4.0-96-generic'
> make[2]: *** No rule to make target 'kernel/bounds.c', needed by 'kernel/bounds.s'. Stop.
> Makefile:1423: recipe for target '_module_/usr/src/linux-headers-4.11.0-14-generic' failed
> make[1]: *** [_module_/usr/src/linux-headers-4.11.0-14-generic] Error 2
> make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-96-generic'
> /root/hpsa-3.4.20-136/hpsa-3.4.20/drivers/scsi/Makefile.alt:96: recipe for target 'default' failed
> make: *** [default] Error 2
> make: Leaving directory '/usr/src/linux-headers-4.11.0-14-generic'
>
> But if you tell me the principal problem is using the 4.11 kernel, I can
> upgrade to the 4.16.3 kernel.
>
> If I do, should I use the out-of-box 3.4.20-136 hpsa driver or your previous
> patch on 3.4.20-125?

The 4.16.3 driver should be OK to use.

Were you not able to untar the sources I gave you in /tmp and build with make -f Makefile.alt?

If you copy the source code into the kernel tree, you should be able to do:

make modules SUBDIRS=drivers/scsi hpsa.ko
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #14 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

Indeed I have a charged battery (capacitor) and the writeback cache enabled.

I run the hp-health component too; I already tried disabling it on the 4.11 kernel and reproduced the load problem without it. The cma-related call trace shows up after the logical drive reset is called.

Right now I am testing, on one server, the kernel 4.16.3-041603-generic with the hpsa module patched to use the local work-queue instead of the system work-queue. So far I have not reproduced the problem. I had a disk with bad blocks (before, launching a read-only badblocks test returned a lot of block errors), but since I upgraded the kernel with the patched hpsa module I get no more errors.

I'm still trying to reproduce the problem by launching a badblocks read-only test on the "ex-faulty" disk. I'll let you know the result of the test.
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #13 from lober...@redhat.com ---

Apr 18 01:29:16 kernel: cmaidad D0 3442 1 0x
Apr 18 01:29:16 kernel: Call Trace:
Apr 18 01:29:16 kernel: __schedule+0x3b9/0x8f0
Apr 18 01:29:16 kernel: schedule+0x36/0x80
Apr 18 01:29:16 kernel: scsi_block_when_processing_errors+0xd5/0x110
Apr 18 01:29:16 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 18 01:29:16 kernel: sg_open+0x14a/0x5c0

* Likely a pass-through from the cma* management daemons.

Can you try to reproduce with all the HP health daemons disabled?
https://bugzilla.kernel.org/show_bug.cgi?id=199435

lober...@redhat.com changed:

           What    |Removed |Added
           CC      |        |lober...@redhat.com

--- Comment #12 from lober...@redhat.com ---

We had a bunch of issues with the HPSA, as already mentioned above. The specific issue that we had to revert was this commit:

8b834bff1b73dce46f4e9f5e84af6f73fed8b0ef

I assume your array has a charged battery (capacitor) and the writeback cache is enabled on the 420i.

Are you only seeing this when you have cmaeventd running? That daemon can use pass-through commands and has been known to cause issues. I am not running any of the HPE ProLiant SPP daemons on my system.

I have not seen this load-related issue (without those daemons running) on my DL380G7 or DL380G8 here, so I will work on trying to reproduce and assist.

Thanks
Laurence
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #11 from Anthony Hausman (anthonyhaussm...@gmail.com) ---

The only patch that I'm sure that I have is the "scsi: hpsa: fix selection of reply queue" one. Otherwise I'm using an out-of-the-box 4.11 kernel, so I'm really not sure that the other patches are present.

Unfortunately, the module does not compile using 4.11.0-14-generic headers.

# make -C /lib/modules/4.11.0-14-generic/build M=$(pwd) --makefile="/root/hpsa-3.4.20-136/hpsa-3.4.20/drivers/scsi/Makefile.alt"
make: Entering directory '/usr/src/linux-headers-4.11.0-14-generic'
make -C /lib/modules/4.4.0-96-generic/build M=/usr/src/linux-headers-4.11.0-14-generic EXTRA_CFLAGS+=-DKCLASS4A modules
make[1]: Entering directory '/usr/src/linux-headers-4.4.0-96-generic'
make[2]: *** No rule to make target 'kernel/bounds.c', needed by 'kernel/bounds.s'. Stop.
Makefile:1423: recipe for target '_module_/usr/src/linux-headers-4.11.0-14-generic' failed
make[1]: *** [_module_/usr/src/linux-headers-4.11.0-14-generic] Error 2
make[1]: Leaving directory '/usr/src/linux-headers-4.4.0-96-generic'
/root/hpsa-3.4.20-136/hpsa-3.4.20/drivers/scsi/Makefile.alt:96: recipe for target 'default' failed
make: *** [default] Error 2
make: Leaving directory '/usr/src/linux-headers-4.11.0-14-generic'

But if you tell me the principal problem is using the 4.11 kernel, I can upgrade to the 4.16.3 kernel.

If I do, should I use the out-of-box 3.4.20-136 hpsa driver or your previous patch on 3.4.20-125?
https://bugzilla.kernel.org/show_bug.cgi?id=199435

--- Comment #10 from Don (don.br...@microsemi.com) ---

Created attachment 275473
--> https://bugzilla.kernel.org/attachment.cgi?id=275473&action=edit
Latest out-of-box hpsa driver.

This tar file contains our latest out-of-box driver.

1. tar xf hpsa-3.4.20-136.tar.bz2
2. cd hpsa-3.4.20/drivers/scsi
3. make -f Makefile.alt

If you are booted from hpsa, you will need to update your initrd and reboot. If you are using hpsa for non-boot drives, you can:

1. rmmod hpsa
2. insmod ./hpsa.ko
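Editor's note: before the rmmod/insmod step it may help to confirm that hpsa is loaded and that its use count is zero, since rmmod fails while any mounted filesystem or open device still sits on the controller. A sketch reading /proc/modules (assumes a Linux /proc):

```shell
# Print whether hpsa is loaded and, if so, its reference count
# (field 3 of /proc/modules); rmmod only succeeds when that count is 0.
if grep -q '^hpsa ' /proc/modules 2>/dev/null; then
    awk '$1 == "hpsa" { print "hpsa loaded, use count " $3 }' /proc/modules
else
    echo "hpsa not loaded"
fi
```

A non-zero use count usually means something on the controller must be unmounted or stopped first.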
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #9 from Don (don.br...@microsemi.com) --- When you applied the 4.16 hpsa driver patches, was this patch also applied? commit 84676c1f21e8ff54befe985f4f14dc1edc10046b Author: Christoph HellwigDate: Fri Jan 12 10:53:05 2018 +0800 genirq/affinity: assign vectors to all possible CPUs Currently we assign managed interrupt vectors to all present CPUs. This works fine for systems were we only online/offline CPUs. But in case of systems that support physical CPU hotplug (or the virtualized version of it) this means the additional CPUs covered for in the ACPI tables or on the command line are not catered for. To fix this we'd either need to introduce new hotplug CPU states just for this case, or we can start assining vectors to possible but not present CPUs. Reported-by: Christian Borntraeger Tested-by: Christian Borntraeger Tested-by: Stefan Haberland Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU") Cc: linux-ker...@vger.kernel.org Cc: Thomas Gleixner Signed-off-by: Christoph Hellwig Signed-off-by: Jens Axboe The above patch is why the hpsa-fix-selection-of-reply-queue patch was needed. If not, I would redact that patch because it may be causing your issues. There was another patch required for the hpsa-fix-selection-of-reply-queue patch: scsi-introduce-force-blk-mq. The errors shown in your logs indicate issues with DMA transfers of your data. Unaligned partial completion errors are usually issues with the scatter/gather buffers that represent your data buffers. I would like to eliminate using the 4.16 hpsa driver in a 4.11 kernel. Can you try our out-of-box driver? I'll attach this to the BZ. 
You compile it with: make -f Makefile.alt
The name is hpsa-3.4.20-136.tar.bz2

commit 8b834bff1b73dce46f4e9f5e84af6f73fed8b0ef
Author: Ming Lei
Date: Tue Mar 13 17:42:39 2018 +0800

    scsi: hpsa: fix selection of reply queue

    Since commit 84676c1f21e8 ("genirq/affinity: assign vectors to all
    possible CPUs") we could end up with an MSI-X vector that did not have
    any online CPUs mapped. This would lead to I/O hangs since there was no
    CPU to receive the completion. Retrieve IRQ affinity information using
    pci_irq_get_affinity() and use this mapping to choose a reply queue.

    [mkp: tweaked commit desc]
    Cc: Hannes Reinecke
    Cc: "Martin K. Petersen"
    Cc: James Bottomley
    Cc: Christoph Hellwig
    Cc: Don Brace
    Cc: Kashyap Desai
    Cc: Laurence Oberman
    Cc: Meelis Roos
    Cc: Artem Bityutskiy
    Cc: Mike Snitzer
    Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
    Signed-off-by: Ming Lei
    Tested-by: Laurence Oberman
    Tested-by: Don Brace
    Tested-by: Artem Bityutskiy
    Acked-by: Don Brace
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin K. Petersen

I believe this patch is also required.

commit cf2a0ce8d1c25c8cc4509874d270be8fc6026cc3
Author: Ming Lei
Date: Tue Mar 13 17:42:41 2018 +0800

    scsi: introduce force_blk_mq

    From the scsi driver view, it is a bit troublesome to support both
    blk-mq and non-blk-mq at the same time, especially when drivers need to
    support multi hw-queue. This patch introduces 'force_blk_mq' to
    scsi_host_template so that drivers can provide blk-mq only support, so
    driver code can avoid the trouble of supporting both.

    Cc: Omar Sandoval
    Cc: "Martin K. Petersen"
    Cc: James Bottomley
    Cc: Christoph Hellwig
    Cc: Don Brace
    Cc: Kashyap Desai
    Cc: Mike Snitzer
    Cc: Laurence Oberman
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
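The failure mode described in these commit messages can be sketched in a few lines of Python. This is a toy model, not driver code: vector spreading over "possible" CPUs and the reply-queue choice are grossly simplified, and the CPU counts are made up for illustration.

```python
# Toy model (not driver code) of the reply-queue problem described in the
# commits above: managed IRQ affinity spreads vectors over all *possible*
# CPUs, but only *online* CPUs can service a completion.

def spread_vectors(possible_cpus, nvec):
    """Chunk the possible CPUs onto vectors (a crude stand-in for the
    kernel's affinity spreading)."""
    per_vec = len(possible_cpus) // nvec
    return {v: possible_cpus[v * per_vec:(v + 1) * per_vec] for v in range(nvec)}

def vector_has_online_cpu(mapping, vector, online_cpus):
    """A completion on this vector is only seen if one of its CPUs is online."""
    return any(cpu in online_cpus for cpu in mapping[vector])

possible = list(range(8))   # CPUs allowed by the ACPI tables (incl. hotpluggable)
online = {0, 1, 2, 3}       # CPUs actually online
mapping = spread_vectors(possible, nvec=4)

# Before the hpsa fix, a reply queue could be picked whose CPUs are all
# offline, so the completion interrupt had no CPU to land on:
dead = [v for v in mapping if not vector_has_online_cpu(mapping, v, online)]
print("vectors with no online CPU:", dead)  # -> [2, 3]: I/O on these hangs
```

The fix quoted above (using pci_irq_get_affinity()) chooses the reply queue from the submitting CPU's own vector, which by construction always has at least one online CPU.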
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #8 from Anthony Hausman (anthonyhaussm...@gmail.com) --- So I have reproduced the problem with the patched driver.

At the beginning, one disk returned a lot of "blk_update_request: critical medium error/Unrecovered read error" messages, and afterwards the driver triggered a logical reset on every disk. The first resets all completed successfully, but on the third reset of the problematic disk the system hung and the reset never completed. The load on the server was lower at that time, but applications still seem to have their I/O stuck. And the faulty disk is still considered healthy by the HP utilities (ssacli).

Here is the stack trace:

[Fri Apr 20 20:56:58 2018] sd 0:1:0:15: [sdp] Unaligned partial completion (resid=32, sector_sz=512)
[Fri Apr 20 20:56:58 2018] sd 0:1:0:15: [sdp] tag#50 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Apr 20 20:56:58 2018] sd 0:1:0:15: [sdp] tag#50 Sense Key : Medium Error [current]
[Fri Apr 20 20:56:58 2018] sd 0:1:0:15: [sdp] tag#50 Add. Sense: Unrecovered read error
[Fri Apr 20 20:56:58 2018] sd 0:1:0:15: [sdp] tag#50 CDB: Read(16) 88 00 00 00 00 02 36 46 b5 a8 00 00 04 00 00 00
[Fri Apr 20 20:56:58 2018] blk_update_request: critical medium error, dev sdp, sector 9500538280
[Fri Apr 20 20:57:30 2018] hpsa :08:00.0: scsi 0:1:0:15: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 20:59:06 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 20:59:06 2018] hpsa :08:00.0: scsi 0:1:0:15: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 21:00:05 2018] sd 0:1:0:15: [sdp] Unaligned partial completion (resid=198, sector_sz=512)
[Fri Apr 20 21:00:05 2018] sd 0:1:0:15: [sdp] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Apr 20 21:00:05 2018] sd 0:1:0:15: [sdp] tag#7 Sense Key : Medium Error [current]
[Fri Apr 20 21:00:05 2018] sd 0:1:0:15: [sdp] tag#7 Add. Sense: Unrecovered read error
[Fri Apr 20 21:00:05 2018] sd 0:1:0:15: [sdp] tag#7 CDB: Read(16) 88 00 00 00 00 02 36 46 b9 a8 00 00 04 00 00 00
[Fri Apr 20 21:00:05 2018] blk_update_request: critical medium error, dev sdp, sector 9500539304
[Fri Apr 20 21:00:56 2018] sd 0:1:0:15: [sdp] Unaligned partial completion (resid=48, sector_sz=512)
[Fri Apr 20 21:00:56 2018] sd 0:1:0:15: [sdp] tag#2 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Fri Apr 20 21:00:56 2018] sd 0:1:0:15: [sdp] tag#2 Sense Key : Medium Error [current]
[Fri Apr 20 21:00:56 2018] sd 0:1:0:15: [sdp] tag#2 Add. Sense: Unrecovered read error
[Fri Apr 20 21:00:56 2018] sd 0:1:0:15: [sdp] tag#2 CDB: Read(16) 88 00 00 00 00 02 36 46 a9 a8 00 00 04 00 00 00
[Fri Apr 20 21:00:56 2018] blk_update_request: critical medium error, dev sdp, sector 9500535208
[Fri Apr 20 21:09:59 2018] hpsa :08:00.0: scsi 0:1:0:15: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 21:48:43 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 21:48:43 2018] hpsa :08:00.0: scsi 0:1:0:15: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 21:51:44 2018] hpsa :08:00.0: scsi 0:1:0:0: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:05 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 22:14:05 2018] hpsa :08:00.0: scsi 0:1:0:0: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:05 2018] hpsa :08:00.0: scsi 0:1:0:1: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:06 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 22:14:06 2018] hpsa :08:00.0: scsi 0:1:0:1: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:06 2018] hpsa :08:00.0: scsi 0:1:0:2: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:07 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 22:14:07 2018] hpsa :08:00.0: scsi 0:1:0:2: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:07 2018] hpsa :08:00.0: scsi 0:1:0:3: resetting logical Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:08 2018] hpsa :08:00.0: device is ready.
[Fri Apr 20 22:14:08 2018] hpsa :08:00.0: scsi 0:1:0:3: reset logical completed successfully Direct-Access HP LOGICAL VOLUME RAID-0 SSDSmartPathCap- En- Exp=1
[Fri Apr 20 22:14:08 2018] hpsa :08:00.0: scsi 0:1:0:4: resetting logical Direct-Access HP LOGICAL VOLUME
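The symptom in the log above is a "resetting logical" message that is never followed by a matching "reset logical completed" message for the same device. That pairing can be checked mechanically; here is a small sketch (the sample lines are synthetic, written in the same format as the dmesg output above):

```python
import re

# Pair "resetting logical" messages with their "reset logical completed"
# counterparts per SCSI address; whatever is left pending is a reset that
# never completed (the symptom reported in this bug).
RESET_START = re.compile(r"scsi (\d+:\d+:\d+:\d+): resetting logical")
RESET_DONE = re.compile(r"scsi (\d+:\d+:\d+:\d+): reset logical completed successfully")

def outstanding_resets(lines):
    """Return the SCSI addresses whose latest reset never completed."""
    pending = set()
    for line in lines:
        if (m := RESET_START.search(line)):
            pending.add(m.group(1))
        elif (m := RESET_DONE.search(line)):
            pending.discard(m.group(1))
    return pending

# Synthetic sample lines mirroring the dmesg format above:
log = [
    "[Fri Apr 20 21:09:59 2018] hpsa 0000:08:00.0: scsi 0:1:0:15: resetting logical Direct-Access",
    "[Fri Apr 20 21:48:43 2018] hpsa 0000:08:00.0: scsi 0:1:0:15: reset logical completed successfully",
    "[Fri Apr 20 21:51:44 2018] hpsa 0000:08:00.0: scsi 0:1:0:0: resetting logical Direct-Access",
]
print(outstanding_resets(log))  # -> {'0:1:0:0'}: this reset is still hanging
```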
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #7 from Anthony Hausman (anthonyhaussm...@gmail.com) --- I had a similar stack trace:

Apr 20 14:57:18 kernel: INFO: task jbd2/sdt-8:10890 blocked for more than 120 seconds.
Apr 20 14:57:18 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 20 14:57:18 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 14:57:18 kernel: jbd2/sdt-8 D0 10890 2 0x
Apr 20 14:57:18 kernel: Call Trace:
Apr 20 14:57:18 kernel: __schedule+0x3b9/0x8f0
Apr 20 14:57:18 kernel: schedule+0x36/0x80
Apr 20 14:57:18 kernel: jbd2_journal_commit_transaction+0x241/0x1830
Apr 20 14:57:18 kernel: ? update_load_avg+0x84/0x560
Apr 20 14:57:18 kernel: ? update_load_avg+0x84/0x560
Apr 20 14:57:18 kernel: ? dequeue_entity+0xed/0x4c0
Apr 20 14:57:18 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 20 14:57:18 kernel: ? lock_timer_base+0x7d/0xa0
Apr 20 14:57:18 kernel: kjournald2+0xca/0x250
Apr 20 14:57:18 kernel: ? kjournald2+0xca/0x250
Apr 20 14:57:18 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 20 14:57:18 kernel: kthread+0x109/0x140
Apr 20 14:57:18 kernel: ? commit_timeout+0x10/0x10
Apr 20 14:57:18 kernel: ? kthread_create_on_node+0x70/0x70
Apr 20 14:57:18 kernel: ret_from_fork+0x25/0x30
Apr 20 14:57:18 kernel: INFO: task task:13497 blocked for more than 120 seconds.
Apr 20 14:57:18 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 20 14:57:18 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 14:57:18 kernel: task D0 13497 13196 0x
Apr 20 14:57:18 kernel: Call Trace:
Apr 20 14:57:18 kernel: __schedule+0x3b9/0x8f0
Apr 20 14:57:18 kernel: schedule+0x36/0x80
Apr 20 14:57:18 kernel: rwsem_down_write_failed+0x237/0x3b0
Apr 20 14:57:18 kernel: ? copy_page_to_iter_iovec+0x97/0x170
Apr 20 14:57:18 kernel: call_rwsem_down_write_failed+0x17/0x30
Apr 20 14:57:18 kernel: ? call_rwsem_down_write_failed+0x17/0x30
Apr 20 14:57:18 kernel: down_write+0x2d/0x40
Apr 20 14:57:18 kernel: ext4_file_write_iter+0x70/0x3c0
Apr 20 14:57:18 kernel: ? futex_wake+0x90/0x170
Apr 20 14:57:18 kernel: new_sync_write+0xd3/0x130
Apr 20 14:57:18 kernel: __vfs_write+0x26/0x40
Apr 20 14:57:18 kernel: vfs_write+0xb8/0x1b0
Apr 20 14:57:18 kernel: SyS_pwrite64+0x95/0xb0
Apr 20 14:57:18 kernel: entry_SYSCALL_64_fastpath+0x1e/0xad
Apr 20 14:57:18 kernel: RIP: 0033:0x7fa085d92d23
Apr 20 14:57:18 kernel: RSP: 002b:7fa0801acc90 EFLAGS: 0293 ORIG_RAX: 0012
Apr 20 14:57:18 kernel: RAX: ffda RBX: 7fa0480009d0 RCX: 7fa085d92d23
Apr 20 14:57:18 kernel: RDX: 0200 RSI: 7fa004000b30 RDI: 000f
Apr 20 14:57:18 kernel: RBP: 7fa0801ad060 R08: 7fa0801acd2c R09: 0001
Apr 20 14:57:18 kernel: R10: 0001f86be000 R11: 0293 R12: 7fa0040014c0
Apr 20 14:57:18 kernel: R13: 7fa004000d80 R14: 002e R15: 7fa0480009d0
Apr 20 14:57:18 kernel: INFO: task task:13499 blocked for more than 120 seconds.
Apr 20 14:57:18 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 20 14:57:18 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 20 14:57:18 kernel: task D0 13499 13196 0x
Apr 20 14:57:18 kernel: Call Trace:
Apr 20 14:57:18 kernel: __schedule+0x3b9/0x8f0
Apr 20 14:57:18 kernel: schedule+0x36/0x80
Apr 20 14:57:18 kernel: rwsem_down_write_failed+0x237/0x3b0
Apr 20 14:57:18 kernel: ? copy_page_to_iter_iovec+0x97/0x170
Apr 20 14:57:18 kernel: call_rwsem_down_write_failed+0x17/0x30
Apr 20 14:57:18 kernel: ? call_rwsem_down_write_failed+0x17/0x30
Apr 20 14:57:18 kernel: down_write+0x2d/0x40
Apr 20 14:57:18 kernel: ext4_file_write_iter+0x70/0x3c0
Apr 20 14:57:18 kernel: ? futex_wake+0x90/0x170
Apr 20 14:57:18 kernel: new_sync_write+0xd3/0x130
Apr 20 14:57:18 kernel: __vfs_write+0x26/0x40
Apr 20 14:57:18 kernel: vfs_write+0xb8/0x1b0
Apr 20 14:57:18 kernel: SyS_pwrite64+0x95/0xb0
Apr 20 14:57:18 kernel: entry_SYSCALL_64_fastpath+0x1e/0xad
Apr 20 14:57:18 kernel: RIP: 0033:0x7fa085d92d23
Apr 20 14:57:18 kernel: RSP: 002b:7fa07f9abc90 EFLAGS: 0293 ORIG_RAX: 0012
Apr 20 14:57:18 kernel: RAX: ffda RBX: 7f9fac008d00 RCX: 7fa085d92d23
Apr 20 14:57:18 kernel: RDX: 0200 RSI: 7fa0080013b0 RDI: 000f
Apr 20 14:57:18 kernel: RBP: 7fa07f9ac060 R08: 7fa07f9abd2c R09: 0001
Apr 20 14:57:18 kernel: R10: 000219541000 R11: 0293 R12: 7fa008001140
Apr 20 14:57:18 kernel: R13: 7fa0080008c0 R14: 002e R15: 7f9fac008d00
Apr 20 14:57:18 kernel: INFO: task task:13510 blocked for more than 120 seconds.
Apr 20 14:57:18 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 20 14:57:18 kernel: "echo 0 >
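The "blocked for more than 120 seconds" reports above come from the kernel's hung-task watchdog (khungtaskd): tasks in uninterruptible sleep that have not been scheduled within hung_task_timeout_secs are flagged. A toy model of that check (simplified; the real watchdog compares a task's context-switch count between scans rather than tracking timestamps, and the task dicts below are hypothetical):

```python
# Toy model of the hung-task watchdog behind the messages above.
HUNG_TASK_TIMEOUT_SECS = 120  # default of /proc/sys/kernel/hung_task_timeout_secs

def check_hung_tasks(tasks, now):
    """Flag tasks in uninterruptible sleep ('D') that have not run for
    longer than the timeout; these produce the 'blocked for more than
    120 seconds' report."""
    return [t["name"] for t in tasks
            if t["state"] == "D"
            and now - t["last_switch_time"] > HUNG_TASK_TIMEOUT_SECS]

# Hypothetical task states mirroring the report above:
tasks = [
    {"name": "jbd2/sdt-8", "state": "D", "last_switch_time": 0},
    {"name": "task", "state": "D", "last_switch_time": 0},
    {"name": "sshd", "state": "S", "last_switch_time": 0},  # interruptible: ignored
]
print(check_hung_tasks(tasks, now=130))  # -> ['jbd2/sdt-8', 'task']
```

Note that the watchdog only reports the symptom; here the tasks are stuck behind the jbd2 journal commit, which in turn is waiting on I/O to the device whose reset never completes.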
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #6 from Anthony Hausman (anthonyhaussm...@gmail.com) --- I have a stack trace for the workqueue:

Apr 19 11:22:52 kernel: INFO: task kworker/u129:28:428 blocked for more than 120 seconds.
Apr 19 11:22:52 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 19 11:22:52 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 11:22:52 kernel: kworker/u129:28 D0 428 2 0x
Apr 19 11:22:52 kernel: Workqueue: writeback wb_workfn (flush-67:80)
Apr 19 11:22:52 kernel: Call Trace:
Apr 19 11:22:52 kernel: __schedule+0x3b9/0x8f0
Apr 19 11:22:52 kernel: schedule+0x36/0x80
Apr 19 11:22:52 kernel: wait_transaction_locked+0x8a/0xd0
Apr 19 11:22:52 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 19 11:22:52 kernel: add_transaction_credits+0x1c1/0x2a0
Apr 19 11:22:52 kernel: start_this_handle+0x103/0x3f0
Apr 19 11:22:52 kernel: ? find_get_pages_tag+0x19f/0x2b0
Apr 19 11:22:52 kernel: ? kmem_cache_alloc+0xd7/0x1b0
Apr 19 11:22:52 kernel: jbd2__journal_start+0xdb/0x1f0
Apr 19 11:22:52 kernel: ? ext4_writepages+0x4e6/0xe20
Apr 19 11:22:52 kernel: __ext4_journal_start_sb+0x6d/0x120
Apr 19 11:22:52 kernel: ext4_writepages+0x4e6/0xe20
Apr 19 11:22:52 kernel: ? generic_writepages+0x67/0x90
Apr 19 11:22:52 kernel: ? sd_init_command+0x30/0xb0
Apr 19 11:22:52 kernel: do_writepages+0x1e/0x30
Apr 19 11:22:52 kernel: ? do_writepages+0x1e/0x30
Apr 19 11:22:52 kernel: __writeback_single_inode+0x45/0x330
Apr 19 11:22:52 kernel: writeback_sb_inodes+0x26a/0x5f0
Apr 19 11:22:52 kernel: __writeback_inodes_wb+0x92/0xc0
Apr 19 11:22:52 kernel: wb_writeback+0x26e/0x320
Apr 19 11:22:52 kernel: wb_workfn+0x2cf/0x3a0
Apr 19 11:22:52 kernel: ? wb_workfn+0x2cf/0x3a0
Apr 19 11:22:52 kernel: process_one_work+0x16b/0x4a0
Apr 19 11:22:52 kernel: worker_thread+0x4b/0x500
Apr 19 11:22:52 kernel: kthread+0x109/0x140
Apr 19 11:22:52 kernel: ? process_one_work+0x4a0/0x4a0
Apr 19 11:22:52 kernel: ? kthread_create_on_node+0x70/0x70
Apr 19 11:22:52 kernel: ret_from_fork+0x25/0x30
Apr 19 11:22:52 kernel: INFO: task jbd2/sdbb-8:10556 blocked for more than 120 seconds.
Apr 19 11:22:52 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 19 11:22:52 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 11:22:52 kernel: jbd2/sdbb-8 D0 10556 2 0x
Apr 19 11:22:52 kernel: Call Trace:
Apr 19 11:22:52 kernel: __schedule+0x3b9/0x8f0
Apr 19 11:22:52 kernel: ? update_cfs_rq_load_avg.constprop.91+0x227/0x4e0
Apr 19 11:22:52 kernel: schedule+0x36/0x80
Apr 19 11:22:52 kernel: jbd2_journal_commit_transaction+0x241/0x1830
Apr 19 11:22:52 kernel: ? update_load_avg+0x84/0x560
Apr 19 11:22:52 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 19 11:22:52 kernel: ? lock_timer_base+0x7d/0xa0
Apr 19 11:22:52 kernel: kjournald2+0xca/0x250
Apr 19 11:22:52 kernel: ? kjournald2+0xca/0x250
Apr 19 11:22:52 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 19 11:22:52 kernel: kthread+0x109/0x140
Apr 19 11:22:52 kernel: ? commit_timeout+0x10/0x10
Apr 19 11:22:52 kernel: ? kthread_create_on_node+0x70/0x70
Apr 19 11:22:52 kernel: ret_from_fork+0x25/0x30
Apr 19 11:22:52 kernel: INFO: task task:14138 blocked for more than 120 seconds.
Apr 19 11:22:52 kernel: Tainted: G OE 4.11.0-14-generic #20~16.04.1-Ubuntu
Apr 19 11:22:52 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 11:22:52 kernel: task D0 14138 14058 0x
Apr 19 11:22:52 kernel: Call Trace:
Apr 19 11:22:52 kernel: __schedule+0x3b9/0x8f0
Apr 19 11:22:52 kernel: schedule+0x36/0x80
Apr 19 11:22:52 kernel: wait_transaction_locked+0x8a/0xd0
Apr 19 11:22:52 kernel: ? wake_atomic_t_function+0x60/0x60
Apr 19 11:22:52 kernel: add_transaction_credits+0x1c1/0x2a0
Apr 19 11:22:52 kernel: ? autoremove_wake_function+0x40/0x40
Apr 19 11:22:52 kernel: start_this_handle+0x103/0x3f0
Apr 19 11:22:52 kernel: ? dquot_file_open+0x3d/0x50
Apr 19 11:22:52 kernel: ? kmem_cache_alloc+0xd7/0x1b0
Apr 19 11:22:52 kernel: jbd2__journal_start+0xdb/0x1f0
Apr 19 11:22:52 kernel: ? ext4_dirty_inode+0x32/0x70
Apr 19 11:22:52 kernel: __ext4_journal_start_sb+0x6d/0x120
Apr 19 11:22:52 kernel: ext4_dirty_inode+0x32/0x70
Apr 19 11:22:52 kernel: __mark_inode_dirty+0x176/0x370
Apr 19 11:22:52 kernel: generic_update_time+0x7b/0xd0
Apr 19 11:22:52 kernel: ? current_time+0x38/0x80
Apr 19 11:22:52 kernel: ? ext4_xattr_security_set+0x30/0x30
Apr 19 11:22:52 kernel: file_update_time+0xb7/0x110
Apr 19 11:22:52 kernel: ? ext4_xattr_security_set+0x30/0x30
Apr 19 11:22:52 kernel: __generic_file_write_iter+0x9d/0x1f0
Apr 19 11:22:52 kernel: ext4_file_write_iter+0x21a/0x3c0
Apr 19 11:22:52 kernel: ? __slab_free+0x9e/0x2e0
Apr 19 11:22:52 kernel: new_sync_write+0xd3/0x130
Apr 19 11:22:52 kernel: __vfs_write+0x26/0x40
Apr 19 11:22:52
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #5 from Anthony Hausman (anthonyhaussm...@gmail.com) --- Don, I have applied the patch; it is running now, and I am trying to reproduce the problem. I'll keep you informed about the diagnosis.
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #4 from Don (don.br...@microsemi.com) --- Your stack trace does not show any hpsa driver components, but I do see the reset being issued and never completing. I'm hoping that the attached patch helps diagnose the issue a little better.
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #3 from Don (don.br...@microsemi.com) --- Created attachment 275437 --> https://bugzilla.kernel.org/attachment.cgi?id=275437&action=edit Patch to use a local work-queue instead of the system work-queue.

If the driver initiates a re-scan from a system work-queue, the kernel can hang. This patch has not yet been submitted to linux-scsi; I will be sending it out soon.
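The rationale for this patch can be illustrated with a grossly simplified, single-worker simulation (pure Python, not kernel code; real concurrency-managed workqueues behave more subtly than a plain FIFO): if the driver's re-scan blocks while sharing a queue with other subsystems, work queued behind it stalls too, whereas a driver-private queue confines the damage to the driver.

```python
from collections import deque

# Grossly simplified model of the shared- vs. private-workqueue argument:
# the hpsa re-scan blocks (e.g. waiting on a device reset that never
# completes), and anything queued behind it on the same worker stalls.

def run_queue(queue, stuck_items):
    """Drain a FIFO of work items in order; the (single) worker blocks
    forever inside the first stuck item it reaches."""
    completed = []
    while queue:
        item = queue.popleft()
        if item in stuck_items:
            return completed  # worker never returns from this item
        completed.append(item)
    return completed

stuck = {"hpsa-rescan"}  # the re-scan waits on a reset that never completes

# Shared queue: writeback is queued behind the stuck re-scan and never runs.
shared = deque(["hpsa-rescan", "writeback-flush"])
print(run_queue(shared, stuck))  # -> []  (everything behind the re-scan stalls)

# Driver-private queue: only the re-scan is affected; writeback on the
# system queue proceeds normally.
private, system = deque(["hpsa-rescan"]), deque(["writeback-flush"])
run_queue(private, stuck)
print(run_queue(system, stuck))  # -> ['writeback-flush']
```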
https://bugzilla.kernel.org/show_bug.cgi?id=199435 --- Comment #2 from Anthony Hausman (anthonyhaussm...@gmail.com) --- Unfortunately, I don't have any "Controller lockup detected" message in the syslog. In the iLO IML log, the last message was about the cache module:

CAUTION: POST Messages - POST Error: 1792-Slot X Drive Array - Valid Data Found in Cache Module. Data will automatically be written to drive array.

I have nothing about lockup entries. Indeed, we use the driver from the latest kernel, compiled for 4.11. I am ready to test the patch you are proposing. Where can I retrieve it?
https://bugzilla.kernel.org/show_bug.cgi?id=199435

Don (don.br...@microsemi.com) changed:

           What    |Removed |Added
           CC|     |don.br...@microsemi.com

--- Comment #1 from Don (don.br...@microsemi.com) --- Do you see any lockup messages in the console logs? "Controller lockup detected"... The driver you used is from the 4.16 kernel, running on a 4.11 kernel? I have not tested this configuration.

I notice that the driver is still using the kernel work-queue for monitoring. I will be sending a patch to change this to local work-queues soon. Perhaps you can test this patch? It may help to discover more information on what is happening.

Also, after you rebooted, were there any lockup entries in the iLO IML log?