Public bug reported:

[Impact]
On deployments with lots of disks, timeouts can occur that escalate into nexus resets. This can cause disk devices to disappear from the system, possibly requiring a reboot to recover:

[18324.951189] cq: iptt:892, task:ffff8026fbde5000, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:16,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18324.951190] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18324.951191] cmd table: 0x0,0x0,0x0,0x0,0x0
[18324.951192] itct: 0x12fa0345,0x5000cca25d31dac1,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18324.951334] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8026fbde5000) ignored
[18325.039774] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18325.044467] cmd table: 0x0,0x0,0x0,0x0,0x0
[18325.048553] itct: 0x12fa0345,0x5000c50094c65c55,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18325.057058] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8027dc8e7500) ignored
[18326.951312] cq: iptt:1705, task:ffff8027820d0200, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:18,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18326.968247] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18326.972938] cmd table: 0x0,0x0,0x0,0x0,0x0
[18326.977023] itct: 0x12fa0345,0x5000cca0803e9c1d,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18326.985496] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8027820d0200) ignored
[18329.384695] hisi_sas_v3_hw 0000:74:02.0: internal task abort: timeout and not done.
[18329.392344] hisi_sas_v3_hw 0000:74:02.0: start dump all regs,reason:abort timeout!
[18329.399904] ***************DUMP IS DISABLED***************
[18329.405467] dump reg fail.
[18329.408162] hisi_sas_v3_hw 0000:74:02.0: I_T nexus reset: internal abort (-5)
[18329.936017] cq: iptt:649, task:ffff8027981f8500, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:19,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18329.936154] cq: iptt:1091, task:ffff8026ff666d00, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:49,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18329.936155] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18329.936156] cmd table: 0x0,0x0,0x0,0x0,0x0
[18329.936158] itct: 0x12fa0345,0x5000cca2552b2855,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18329.936301] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8026ff666d00) ignored

[Test Case]
This was seen on a system with 100s of disks, something I don't have access to, so verification testing will be regression-only.

[Fix]
A fix queued in the scsi maintainer's tree adjusts some magic registers in the controller, and that somehow fixes the problem (I don't have programming docs for this controller, so I can only hand-wave here).

[Regression Risk]
The fix is localized to the hisi_sas_v3_hw driver, which is only used in Ubuntu for the D06 platform.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: dann frazier (dannf)
         Status: In Progress

** Affects: linux (Ubuntu Bionic)
     Importance: Undecided
     Assignee: dann frazier (dannf)
         Status: In Progress

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Summary changed:

- hisi_sas:
+ hisi_sas_v3_hw: internal task abort: timeout and not done.

** Changed in: linux (Ubuntu)
       Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
       Status: New => In Progress

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => dann frazier (dannf)

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => dann frazier (dannf)

-- 
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1777736

Title:
  hisi_sas_v3_hw: internal task abort: timeout and not done.

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1777736/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp