Public bug reported:

[Impact]
On deployments with many disks, internal task abort timeouts can occur and 
escalate into I_T nexus resets. This can cause disk devices to disappear from 
the system, possibly requiring a reboot to recover:

[18324.951189] cq: iptt:892, task:ffff8026fbde5000, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:16,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18324.951190] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18324.951191] cmd table: 0x0,0x0,0x0,0x0,0x0
[18324.951192] itct: 0x12fa0345,0x5000cca25d31dac1,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18324.951334] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8026fbde5000) ignored

[18325.039774] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18325.044467] cmd table: 0x0,0x0,0x0,0x0,0x0
[18325.048553] itct: 0x12fa0345,0x5000c50094c65c55,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18325.057058] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8027dc8e7500) ignored

[18326.951312] cq: iptt:1705, task:ffff8027820d0200, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:18,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18326.968247] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18326.972938] cmd table: 0x0,0x0,0x0,0x0,0x0
[18326.977023] itct: 0x12fa0345,0x5000cca0803e9c1d,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18326.985496] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8027820d0200) ignored

[18329.384695] hisi_sas_v3_hw 0000:74:02.0: internal task abort: timeout and not done.
[18329.392344] hisi_sas_v3_hw 0000:74:02.0: start dump all regs,reason:abort timeout!
[18329.399904] ***************DUMP IS DISABLED***************
[18329.405467] dump reg fail.
[18329.408162] hisi_sas_v3_hw 0000:74:02.0: I_T nexus reset: internal abort (-5)
[18329.936017] cq: iptt:649, task:ffff8027981f8500, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:19,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18329.936154] cq: iptt:1091, task:ffff8026ff666d00, cmp_st:3, err_rcrd_xfrd:1,rspns_xfrd:0,error_phase:6,devid:49,io_cfg_err_code:0,err_code:0, ft = 0x0, ata_st=0x0, tgt_io_st=0x0,disk_err=0x0
[18329.936155] sb dw0:0x8001,dw1:0x0,dw2:0x0,dw3:0x0
[18329.936156] cmd table: 0x0,0x0,0x0,0x0,0x0
[18329.936158] itct: 0x12fa0345,0x5000cca2552b2855,0x1000000001388,0x0,0x0,0x0,0x0,0x0,0x0,0x0
[18329.936301] hisi_sas_v3_hw 0000:74:02.0: slot complete: task(ffff8026ff666d00) ignored

[Test Case]
This was seen on a system with hundreds of disks, which I don't have access to, 
so verification testing will be regression-only.
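
For anyone who does have access to a large enough system, the following is a
rough sketch (not part of the fix, and not something I have run on affected
hardware) of how a saved kernel log could be scanned for the failure
signatures quoted in [Impact]. The function name scan_kernel_log and the
layout of the summary dict are made up for illustration; the matched strings
are copied from the log excerpt above. It assumes one message per line, as
printed by dmesg, rather than the line-wrapped form an email client may
produce.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt in this report.
ABORT_TIMEOUT = "internal task abort: timeout and not done"
NEXUS_RESET = "I_T nexus reset"

# The devid field as it appears in the "cq:" error records in this report.
DEVID_RE = re.compile(r"\bcq: iptt:\d+,.*?\bdevid:(\d+)")


def scan_kernel_log(text):
    """Count the failure signatures in kernel log text (hypothetical helper).

    Expects one kernel message per line. Returns a dict with the number of
    abort timeouts, the number of I_T nexus resets, and a Counter of "cq:"
    error records keyed by devid.
    """
    summary = {
        "abort_timeouts": 0,
        "nexus_resets": 0,
        "cq_errors_by_devid": Counter(),
    }
    for line in text.splitlines():
        if ABORT_TIMEOUT in line:
            summary["abort_timeouts"] += 1
        if NEXUS_RESET in line:
            summary["nexus_resets"] += 1
        match = DEVID_RE.search(line)
        if match:
            summary["cq_errors_by_devid"][int(match.group(1))] += 1
    return summary
```

Passing it the contents of a file saved with "dmesg > dmesg.txt" gives a quick
before/after comparison. Non-zero abort_timeouts or nexus_resets would suggest
the problem is still present, though a clean result on a small system says
little either way.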

[Fix]
A fix queued in the SCSI maintainer's tree adjusts some magic registers in the 
controller, and that somehow fixes the problem (I don't have programming docs 
for this controller, so I can only hand-wave here).

[Regression Risk]
The fix is localized to the hisi_sas_v3_hw driver, which is only used in Ubuntu 
for the D06 platform.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: dann frazier (dannf)
         Status: In Progress

** Affects: linux (Ubuntu Bionic)
     Importance: Undecided
     Assignee: dann frazier (dannf)
         Status: In Progress

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Summary changed:

- hisi_sas: 
+ hisi_sas_v3_hw: internal task abort: timeout and not done.

** Changed in: linux (Ubuntu)
       Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
       Status: New => In Progress

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => dann frazier (dannf)

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => dann frazier (dannf)

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1777736

Title:
  hisi_sas_v3_hw: internal task abort: timeout and not done.

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Bionic:
  In Progress
