On Fri, 2014-06-13 at 14:01 +0200, Hannes Reinecke wrote:
> Hi all,
>
> I've received reports claiming they're seeing a double command completion
> occasionally when a command timeout happens.
>
> Delving into it I found this this indeed might happening; reason being
> that the LLDD will only be informed about a timed-out command by
> calling scsi_try_to_abort_command(). Anytime before that the LLDD
> is free to assume the command is valid and might call scsi_done() on it.
> Which then will lead to interesting issues in the error handler.

Actually, I'm afraid this reasoning isn't correct. Completions of timed out commands are mediated by the block layer using the REQ_ATOM_COMPLETE flag. The reason it's an atomic flag is that whoever sets it first owns the completion.

Before a timeout fires, the timeout and the completion race. If the completion occurs first, the timeout may still fire, but it will be harmlessly ignored because the REQ_ATOM_COMPLETE flag is already set (this is mediated by blk_mark_rq_complete()). Conversely, after the timeout has fired, the flag is set and any incoming completion gets ignored (see the code in blk_complete_request()). The atomicity of the flag should guarantee we never see double completions.

However, in between the timeout firing and us doing something with the command in the error handler, we have to force the LLD to give it up. This requires that we take actions to ensure that we've really killed the command within the LLD before we start doing things with the command in the error handler. The way we do this is either a successful abort, which ensures the LLD won't complete the command, or a successful reset, which should kill all commands for the LUN/Target/Device etc.

If you're seeing double completions, one possibility is that we have a bug in SCSI and are doing something with the command before we know block has relinquished it. That's actually why this bug was so serious:

commit d555a2abf3481f81303d835046a5ec2c4fb3ca8e
Author: James Bottomley <jbottom...@parallels.com>
Date:   Fri Mar 28 10:50:17 2014 -0700

    [SCSI] Fix spurious request sense in error handling

We'd wrongly call request sense on a timed out command and that could cause double completions.

Assuming SCSI is correct, the other possibility is that drivers don't actually kill the queued command on abort or reset ... there was a nasty bug like this within hpsa for a while.

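To make the "whoever sets the flag first owns the completion" rule concrete, here is a minimal userspace sketch. It is not the kernel code: struct sketch_request and the sketch_* functions are hypothetical names invented for illustration, and a C11 atomic_flag stands in for the REQ_ATOM_COMPLETE bit. It only models the mediation step, not the abort/reset work the error handler has to do afterwards.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request; models only the completion flag. */
struct sketch_request {
    atomic_flag complete;  /* plays the role of REQ_ATOM_COMPLETE */
    int completions;       /* times the normal completion path ran */
    int timeouts;          /* times the timeout path ran */
};

/*
 * Atomically set the flag and report whether it was already set.
 * Returns false for the one caller that set it first (the owner)
 * and true for everyone who arrives later.
 */
static bool sketch_mark_complete(struct sketch_request *rq)
{
    return atomic_flag_test_and_set(&rq->complete);
}

/* Completion path: only proceeds if it wins ownership of the flag. */
static void sketch_complete(struct sketch_request *rq)
{
    if (!sketch_mark_complete(rq))
        rq->completions++;
    /* else: flag already set, this completion is ignored */
}

/* Timeout path: only proceeds if it wins ownership of the flag. */
static void sketch_timeout(struct sketch_request *rq)
{
    if (!sketch_mark_complete(rq))
        rq->timeouts++;
    /* else: the command already completed, the timeout is ignored */
}
```

Whichever of sketch_complete() and sketch_timeout() runs first on a given request takes effect exactly once; every later call on that request is a no-op, which is the property that rules out double completion at this layer.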
James