Hello, I'm looking for some help diagnosing write stalls on SATA drives connected to HighPoint controllers running the mvsas driver.

I'm seeing writes stall for several seconds at a time on md RAID arrays using HighPoint 2740 and 2720SGL controllers in JBOD mode (mvsas driver). The same behavior occurs on two servers that use the same controllers but are otherwise configured differently.

Reads are fine; I only encounter problems when writing. When writing large amounts of data, all of the drives in the array go idle (according to iostat), with every column reading zero except avgqu-sz (values below 10) and %util (pegged at 100%). After a few seconds, iostat shows activity resuming, and the first report after the stall shows await times for all drives roughly equal to the length of the stall. These stalls recur periodically throughout the write.

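The snapshots below are extended per-device statistics in MB at a one-second interval, captured with something along the lines of:

iostat -dxm 1   # -d device report only, -x extended stats, -m megabytes, 1s interval
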
Here's iostat during a stall:

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00    0.00    0.00   0.00 100.10
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00    0.00    0.00   0.00 100.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     9.00    0.00    0.00    0.00   0.00 100.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.00
sdm               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00    0.00    0.00   0.00 100.00
sdn               0.00     0.00    0.00    0.00     0.00     0.00     0.00     2.00    0.00    0.00    0.00   0.00 100.10
sdo               0.00     0.00    0.00    0.00     0.00     0.00     0.00     3.00    0.00    0.00    0.00   0.00 100.10
sdq               0.00     0.00    0.00    0.00     0.00     0.00     0.00     1.00    0.00    0.00    0.00   0.00 100.10
sdp               0.00     0.00    0.00    0.00     0.00     0.00     0.00     4.00    0.00    0.00    0.00   0.00 100.00


Then, after a few seconds, here is the first iostat report showing non-zero activity once the stall ends:

sdf               0.00     0.00    0.00    3.00     0.00     0.01     5.67     1.03 3361.67    0.00 3361.67 175.67  52.70
sdg               0.00     0.00    0.00    3.00     0.00     0.01     5.67     1.01 3357.00    0.00 3357.00 171.33  51.40
sdj               0.00     0.00    1.00    9.00     0.00     0.22    46.50     4.61 3839.90    4.00 4266.11  55.30  55.30
sdl               0.00     0.00    0.00    2.00     0.00     0.01    12.50     0.54 2534.00    0.00 2534.00 269.50  53.90
sdm               0.00     0.00    0.00    4.00     0.00     0.01     4.25     1.01 2516.00    0.00 2516.00 126.75  50.70
sdn               0.00     0.00    0.00    3.00     0.00     0.12    83.00     1.02 1685.33    0.00 1685.33 175.33  52.60
sdo               0.00     0.00    0.00    4.00     0.00     0.02    10.25     1.60 3798.00    0.00 3798.00 149.00  59.60
sdq               0.00     0.00    0.00    2.00     0.00     0.06    64.50     0.53 1558.50    0.00 1558.50 266.50  53.30
sdp               0.00     0.00    0.00    5.00     0.00     0.02     9.80     2.11 4046.00    0.00 4046.00 120.80  60.40


There are no error messages printed to the console (the serial console is logged, and I've also checked dmesg and /var/log/messages).

The drives in the array show high ioerr_cnt values (0x330, i.e. 816 decimal, in the example below).

[8:0:4:0]    disk    ATA      ST8000DM002-1YW1 DN02  /dev/sdj
  device_blocked=0
  iocounterbits=32
  iodone_cnt=0x147fa
  ioerr_cnt=0x330
  iorequest_cnt=0x1487a
  queue_depth=31
  queue_type=simple
  scsi_level=6
  state=running
  timeout=30
  type=0

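For a quick comparison across the whole array, the same counters can be read straight out of sysfs; a loop along these lines (plain sysfs paths, nothing controller-specific) dumps them for every attached SCSI device:

for dev in /sys/class/scsi_device/*/device; do
    echo "$dev: ioerr_cnt=$(cat "$dev"/ioerr_cnt) iodone_cnt=$(cat "$dev"/iodone_cnt)"
done
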
SMART does not show any errors (no pending or reallocated sectors, no UDMA CRC errors, etc.).
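
The SMART check was just plain smartctl against each member drive, along the lines of:

smartctl -a /dev/sdj    # repeated for each drive in the array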

My guess is that the write transfer to the drive is failing and being retried by the driver. Since the stall starts and ends at the same time for all of the drives in the array, I find it hard to believe the issue is related to individual drives or cabling.
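
If the driver really is retrying failed transfers, temporarily raising the SCSI mid-layer logging level during a test write might make that visible; dev.scsi.logging_level is a bitmask sysctl that is only present when the kernel is built with CONFIG_SCSI_LOGGING, and the exact mask depends on which log categories are wanted:

sysctl dev.scsi.logging_level            # current mask (requires CONFIG_SCSI_LOGGING)
sysctl -w dev.scsi.logging_level=<mask>  # enable e.g. error/timeout logging for the test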

Ideas?

Are there any other sources of diagnostic data that would help to debug this?
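
So far the only other things I can think of are catching a stall with blktrace on one of the member drives, and dumping blocked-task state via sysrq while the drives sit at 100% util, roughly:

blktrace -d /dev/sdf -o - | blkparse -i -   # per-request trace, decoded live by blkparse

echo w > /proc/sysrq-trigger                # during a stall: dump blocked (D-state) task backtraces to the kernel log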

Kernel machine A: 4.13.5-100.fc25.x86_64
Kernel machine B: 4.13.10-200.fc26.x86_64

--Larkin
