2017-03-07 1:54 GMT+01:00 V <viswesh.vi...@gmail.com>:
> Hi,
>
> I am reaching out regarding a SCSI error handler issue, which we see
> in our lab systems.
>
> Environment: Ubuntu 14.04, kernel 4.4.0-31-generic
>
> Issue: Write IOs randomly get stuck on our systems. All drives that
> show the issue are SATA drives.
> Suspected root cause: the SCSI error handler is either not woken up,
> or it is up but unable to flush the failed commands. Either way the
> requests are never flushed out, the host is never restarted, and the
> IOs stay stuck.
>
> We have seen 6 instances so far.
>
> Here is what we see:
> 1) As part of testing our drives, we run a lot of reads/writes, along
> with some user-space utilities that check the drive health.
> 2) In our recent testing, we see a lot of commands going down the
> "abort scheduled" path, and at some point, for one of the commands,
> the error handler is not woken up and most of the writing processes
> stall. (A sketch of the code path behind these messages follows this
> list.)
> 3) We are still trying to figure out what is causing this many
> aborts. Is it usual? Are there other debug options I could enable to
> get more information? We know these are user-space commands being
> aborted, but we are not yet sure which exact command it is.
> 4) I see the "abort scheduled" messages on almost all drives in all
> systems, so I don't believe it is a drive issue.
> 5) I checked the host_failed and host_busy variables; both are set to
> 1 in system 1. System 2 is still alive; I haven't taken a crash dump
> of it yet.
> 6) All drives seen with the issue are SATA drives.
> 7) I have attached the udevadm info of a drive that failed in system 2.
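>
> For context, here is roughly where those messages come from, heavily
> simplified from scmd_eh_abort_handler() in drivers/scsi/scsi_error.c
> of a 4.4-era tree (the real function also handles the eh_deadline
> and retry/finish-on-success cases, and wraps the prints in
> SCSI_LOG_ERROR_RECOVERY):
>
>     /* runs from the abort_work that "abort scheduled" announces */
>     void scmd_eh_abort_handler(struct work_struct *work)
>     {
>             struct scsi_cmnd *scmd =
>                     container_of(work, struct scsi_cmnd, abort_work.work);
>             struct scsi_device *sdev = scmd->device;
>
>             scmd_printk(KERN_INFO, scmd, "aborting command\n");
>             if (scsi_try_to_abort_cmd(sdev->host->hostt, scmd) != SUCCESS) {
>                     /* the "cmd abort failed" case in our logs */
>                     scmd_printk(KERN_INFO, scmd, "cmd abort failed\n");
>                     /*
>                      * Hand the command to the error handler:
>                      * scsi_eh_scmd_add() bumps shost->host_failed and
>                      * calls scsi_eh_wakeup(), which should then print
>                      * "Waking error handler thread".
>                      */
>                     scsi_eh_scmd_add(scmd, 0);
>             }
>     }
>
> The failure we are chasing is that, after "cmd abort failed", the
> wakeup never happens for one of the hosts.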
>
> Thanks in advance,
> Viswesh
>
>
> Logs
> -------
> System 1
>
> [371546.605736] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [371546.606727] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [371546.607721] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [371546.618113] sd 9:0:0:0: [sg19] tag#20 abort scheduled
> [371546.624039] sd 9:0:0:0: [sg19] tag#20 aborting command
> [371546.624045] sd 9:0:0:0: [sg19] tag#20 cmd abort failed
> [371546.624051] scsi host9: Waking error handler thread
> [371546.624060] scsi host9: scsi_eh_9: waking up 0/1/1
> [371546.624081] sd 9:0:0:0: [sg19] tag#20 scsi_eh_9: flush finish cmd
> [371546.624095] scsi host9: waking up host to restart
> [371546.624098] scsi host9: scsi_eh_9: sleeping
>
>
> [371546.635314] sd 8:0:0:0: [sg17] tag#13 abort scheduled
> [371546.640031] sd 8:0:0:0: [sg17] tag#13 aborting command
> [371546.640037] sd 8:0:0:0: [sg17] tag#13 cmd abort failed
> [371546.640043] scsi host8: Waking error handler thread
> [371546.640078] scsi host8: scsi_eh_8: waking up 0/1/1
> [371546.640098] sd 8:0:0:0: [sg17] tag#13 scsi_eh_8: flush finish cmd
> [371546.640113] scsi host8: waking up host to restart
> [371546.640117] scsi host8: scsi_eh_8: sleeping
>
> [371546.657269] sd 6:0:0:0: [sg12] tag#1 abort scheduled
> [371546.664034] sd 6:0:0:0: [sg12] tag#1 aborting command
> [371546.664040] sd 6:0:0:0: [sg12] tag#1 cmd abort failed
>
> Here we can see that the error handler is not woken up, and the
> entire IO subsystem stalls.
>
> [371777.594510] INFO: task md2_raid1:508 blocked for more than 120 seconds.
> [371777.594571]      Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
> [371777.594613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [371777.594665] md2_raid1      D ffff883fed2afc78    0  508      2 0x00000000
> [371777.594673]  ffff883fed2afc78 ffff881fef0e8000 ffff883fed21c4c0 ffff883fed2b0000
> [371777.594678]  ffff883ff236e298 ffff883ff236e000 ffff883ff236e018 ffff883ff236e018
>
> By enabling more SCSI logging (on a different system), including the
> LL (low-level) categories, we get some more information. I have
> enabled the full SCSI logging level for all categories, and that
> system is still running; a sketch of how I set the level follows.
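>
> In case it is useful to others: the logging_level word packs ten
> 3-bit per-category levels (error, timeout, scan, the ML/LL/HL queue
> and completion pairs, and ioctl), so setting all 30 bits gives
> maximum verbosity everywhere. A minimal sketch, assuming a kernel
> built with CONFIG_SCSI_LOGGING (the scsi_logging_level script, which
> I believe ships with sg3_utils, can do the same per category):
>
>     /* max_scsi_logging.c - set every SCSI log category to level 7 */
>     #include <stdio.h>
>
>     #define SCSI_LOG_CATEGORIES 10  /* ten 3-bit fields in logging_level */
>     #define SCSI_LOG_BITS       3
>
>     int main(void)
>     {
>             /* all fields at 7 -> 0x3fffffff */
>             unsigned int level =
>                     (1u << (SCSI_LOG_CATEGORIES * SCSI_LOG_BITS)) - 1;
>             FILE *f = fopen("/proc/sys/dev/scsi/logging_level", "w");
>
>             if (!f) {
>                     perror("open logging_level (need root?)");
>                     return 1;
>             }
>             fprintf(f, "%u\n", level);
>             fclose(f);
>             printf("logging_level set to 0x%x\n", level);
>             return 0;
>     }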
>
>
> System 2
>
> [405197.171574] sd 5:0:0:0: [sg3] sg_write: count=88
> [405197.171577] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [405197.171581] sd 5:0:0:0: [sg3] sg_common_write:  scsi opcode=0x85, cmd_size=16
> [405197.171583] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
> [405197.171605] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
> [405197.171623] sd 5:0:0:0: [sg3] sg_poll: res=0x104
> [405197.172648] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
> [405197.172701] sd 5:0:0:0: [sg3] sg_poll: res=0x145
> [405197.172710] sd 5:0:0:0: [sg3] sg_read: count=88
> [405197.172716] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
> [405197.172721] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
> [405197.172756] sd 5:0:0:0: [sg3] sg_write: count=88
> [405197.172760] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [405197.172764] sd 5:0:0:0: [sg3] sg_common_write:  scsi opcode=0x85, cmd_size=16
> [405197.172767] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
> [405197.172774] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
> [405197.172793] sd 5:0:0:0: [sg3] sg_poll: res=0x104
> [405197.173806] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
> [405197.173829] sd 5:0:0:0: [sg3] sg_poll: res=0x145
> [405197.173836] sd 5:0:0:0: [sg3] sg_read: count=88
> [405197.173839] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
> [405197.173843] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
> [405197.173893] sd 5:0:0:0: [sg3] sg_write: count=88
> [405197.173896] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
> [405197.173900] sd 5:0:0:0: [sg3] sg_common_write:  scsi opcode=0x85, cmd_size=16
> [405197.173903] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=0
> [405197.173921] sd 5:0:0:0: [sg3] sg_poll: res=0x104
>
> [405197.174082] sd 5:0:0:0: [sg3] tag#5 abort scheduled
> [405197.179372] sd 5:0:0:0: [sg3] tag#5 aborting command
> [405197.179378] sd 5:0:0:0: [sg3] tag#5 cmd abort failed
>
> [405257.225584] sd 5:0:0:0: [sg3] sg_poll: res=0x104
> [405257.225599] sd 5:0:0:0: [sg3] sg_release
> [405257.225617] sd 5:0:0:0: [sg3] sg_open: flags=0x8002
>
> The error handler thread is not woken up here; or, if it was already
> awake, I don't see the "flush finish cmd" trace.
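>
> To make the sg_poll lines easier to read: res is a standard poll(2)
> event mask, so 0x104 is POLLOUT|POLLWRNORM (ready to accept a new
> command) and 0x145 is POLLIN|POLLRDNORM|POLLOUT|POLLWRNORM (a
> response is ready to be read). A throwaway decoder I used (my own
> helper, nothing from the kernel tree):
>
>     /* decode_poll.c - expand the sg_poll res=0x... masks from the log */
>     #define _XOPEN_SOURCE 600
>     #include <poll.h>
>     #include <stdio.h>
>
>     static void decode(unsigned int res)
>     {
>             printf("0x%x =", res);
>             if (res & POLLIN)     printf(" POLLIN");
>             if (res & POLLRDNORM) printf(" POLLRDNORM");
>             if (res & POLLOUT)    printf(" POLLOUT");
>             if (res & POLLWRNORM) printf(" POLLWRNORM");
>             if (res & POLLERR)    printf(" POLLERR");
>             if (res & POLLHUP)    printf(" POLLHUP");
>             printf("\n");
>     }
>
>     int main(void)
>     {
>             decode(0x104);  /* last poll before the hang: writable only */
>             decode(0x145);  /* the healthy pattern: response readable   */
>             return 0;
>     }
>
> After the failed abort, sg_poll keeps returning 0x104 and never
> 0x145, i.e. the response for tag#5 never completes. The
> sg_common_write lines also show opcode 0x85, which is ATA
> PASS-THROUGH(16), so the aborted command most likely comes from our
> drive-health utility rather than from the filesystem writes.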
>
> [405349.082863] INFO: task md2_raid1:484 blocked for more than 120 seconds
> [405349.082922]      Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
> [405349.082964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [405349.083016] md2_raid1      D ffff881feca2bc78    0  484      2 0x00000000

Can you reproduce without md raid1? I suppose it shouldn't be related.

You mentioned it happens only with SATA; are they all the same HDD
model, or from different vendors?

Regards,
Jack
