Re: Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)
James Bottomley wrote:
> > > However, there's still devloss_tmo to consider ... even in
> > > multipath, I don't think you want to signal path failure until
> > > devloss_tmo has fired otherwise you'll get too many transient
> > > up/down events which damage performance if the array has an
> > > expensive failover model.
> >
> > Yes. But currently we have a very high failover latency as we always
> > have to wait for the requeued commands to time out. Hence we're
> > damaging performance on arrays with inexpensive failover.
>
> If it's an either/or choice between the two, that shows our current
> approach to multipath is broken.
>
> > > The other problem is what to do with in-flight commands at the time
> > > the link went down. With your current patch, they're still stuck
> > > until they time out ... surely there needs to be some type of
> > > recovery mechanism for these?
> >
> > Well, the in-flight commands are owned by the HBA driver, which
> > should have the proper code to terminate / return those commands with
> > the appropriate codes. They will then be rescheduled and will be
> > caught like 'normal' IO requests.
>
> But my point is that if a driver goes blocked, those commands will be
> forced to wait out the blocked timeout anyway, so your proposed patch
> does nothing to improve the case for dm ... you only avoid commands
> stuck on a blocked device if by chance its request queue was empty.

How about my patches to use new transport error values and make iSCSI and
FC behave the same? The problem I think Hannes and I are both trying to
solve is this:

1. We do not want to wait dev_loss_tmo seconds for failover.
2. The FC drivers can hook into the fast_io_fail_tmo related callouts,
   and with that set that tmo to a very low value, like a couple of
   seconds, if they are using multipath, so failovers are fast.

However, there is a bug where, when fast_io_fail_tmo fires, requests that
made it to the driver get failed and returned to the multipath layer, but
commands in the blocked request queue are stuck there until dev_loss_tmo
fires.
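For readers following along at home: with the FC transport class the two
timeouts above are tunable per remote port through sysfs. A minimal sketch
(the rport name below is made up; list yours under
/sys/class/fc_remote_ports/ — the attribute names are the standard ones):

```shell
# Hypothetical rport name; substitute one from:
#   ls /sys/class/fc_remote_ports/
RPORT=/sys/class/fc_remote_ports/rport-0:0-1

# Fail fast-failable IO after 5 seconds instead of waiting out
# dev_loss_tmo, so multipath can fail over quickly.
echo 5 > $RPORT/fast_io_fail_tmo

# Keep dev_loss_tmo comparatively long so transient link flaps do not
# tear down the SCSI devices behind the rport.
echo 60 > $RPORT/dev_loss_tmo
```

The point of the split is exactly the trade-off discussed above: a short
fast_io_fail_tmo gives cheap-failover arrays quick path failure, while a
long dev_loss_tmo still damps transient up/down events for arrays with an
expensive failover model.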
With my patches here (they need to be rediffed, and for FC I need to
handle JamesS's comments about not using a new field for the
fast_fail_timeout state bit):

http://marc.info/?l=linux-scsi&m=117399843216280&w=2
http://marc.info/?l=linux-scsi&m=117399544112073&w=2
http://marc.info/?l=linux-scsi&m=117399844316771&w=2
http://marc.info/?l=linux-scsi&m=117400203324693&w=2
http://marc.info/?l=linux-scsi&m=117400203324690&w=2

For FC we can use fast_io_fail_tmo for fast failovers, and commands will
not get stuck in a blocked queue for dev_loss_tmo seconds, because when
fast_io_fail_tmo fires the target's queues are unblocked and
fc_remote_port_chkready() kicks in (iSCSI does the same with the patches
in the links). And with the patches, if multipath-tools sends its
path-testing IO it will get a DID_TRANSPORT_* error code that it can use
to make a decent path-failing decision.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)
On Mon, 2008-01-07 at 15:05 +0100, Hannes Reinecke wrote:
> James Bottomley wrote:
> > On Fri, 2007-12-14 at 10:00 +0100, Hannes Reinecke wrote:
> >> James Bottomley wrote:
> >>> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
> >>>> OK, thanks. I'll assume that James and Hannes have this in hand
> >>>> (or will have, by mid-week) and I won't do anything here.
> >>>
> >>> Just to confirm what I think I'm going to be doing: rebasing the
> >>> scsi-misc tree to remove this commit:
> >>>
> >>> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
> >>> Author: Hannes Reinecke <[EMAIL PROTECTED]>
> >>> Date:   Tue Nov 6 09:23:40 2007 +0100
> >>>
> >>>     [SCSI] Do not requeue requests if REQ_FAILFAST is set
> >>>
> >>> And its allied fix-ups:
> >>>
> >>> commit 983289045faa96fba8841d3c51b98bb8623d9504
> >>> Author: James Bottomley <[EMAIL PROTECTED]>
> >>> Date:   Sat Nov 24 19:47:25 2007 +0200
> >>>
> >>>     [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
> >>>
> >>> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
> >>> Author: James Bottomley <[EMAIL PROTECTED]>
> >>> Date:   Sat Nov 24 19:55:53 2007 +0200
> >>>
> >>>     [SCSI] fix domain validation to work again
> >>>
> >>> James
> >>
> >> Or just apply my latest patch (cf "Undo __scsi_kill_request").
> >> The main point is that we shouldn't retry requests with FAILFAST set
> >> when the queue is blocked. AFAICS only the FC and iSCSI transports
> >> set the queue to blocked, and use this to indicate a loss of
> >> connection. So any retry with the queue blocked is futile.
> >
> > I still don't think this is the right approach.
> >
> > For link up/down events, those are direct pathing events and should
> > be signalled along a kernel notifier, not by mucking with the SCSI
> > state machine.
>
> Of course they will be signalled. And eventually we should patch up
> multipath-tools to read the existing events from the uevent socket.
> But even with that patch there is a quite large window during which
> IOs will be sent to the blocked device, and hence will be stuck in the
> request queue until the timer expires.

But the assumption your code makes is that if REQ_FAILFAST is set then
it's a dm request ... and that's not true. The code in question
negatively impacts other users of REQ_FAILFAST. For every user other
than dm, the right thing to do is to wait out the block.

> > However, there's still devloss_tmo to consider ... even in
> > multipath, I don't think you want to signal path failure until
> > devloss_tmo has fired otherwise you'll get too many transient
> > up/down events which damage performance if the array has an
> > expensive failover model.
>
> Yes. But currently we have a very high failover latency as we always
> have to wait for the requeued commands to time out. Hence we're
> damaging performance on arrays with inexpensive failover.

If it's an either/or choice between the two, that shows our current
approach to multipath is broken.

> > The other problem is what to do with in-flight commands at the time
> > the link went down. With your current patch, they're still stuck
> > until they time out ... surely there needs to be some type of
> > recovery mechanism for these?
>
> Well, the in-flight commands are owned by the HBA driver, which should
> have the proper code to terminate / return those commands with the
> appropriate codes. They will then be rescheduled and will be caught
> like 'normal' IO requests.

But my point is that if a driver goes blocked, those commands will be
forced to wait out the blocked timeout anyway, so your proposed patch
does nothing to improve the case for dm ... you only avoid commands
stuck on a blocked device if by chance its request queue was empty.
James
Multipath failover handling (Was: Re: 2.6.24-rc3-mm1)
James Bottomley wrote:
> On Fri, 2007-12-14 at 10:00 +0100, Hannes Reinecke wrote:
>> James Bottomley wrote:
>>> On Mon, 2007-11-26 at 22:15 -0800, Andrew Morton wrote:
>>>> OK, thanks. I'll assume that James and Hannes have this in hand
>>>> (or will have, by mid-week) and I won't do anything here.
>>>
>>> Just to confirm what I think I'm going to be doing: rebasing the
>>> scsi-misc tree to remove this commit:
>>>
>>> commit 8655a546c83fc43f0a73416bbd126d02de7ad6c0
>>> Author: Hannes Reinecke <[EMAIL PROTECTED]>
>>> Date:   Tue Nov 6 09:23:40 2007 +0100
>>>
>>>     [SCSI] Do not requeue requests if REQ_FAILFAST is set
>>>
>>> And its allied fix-ups:
>>>
>>> commit 983289045faa96fba8841d3c51b98bb8623d9504
>>> Author: James Bottomley <[EMAIL PROTECTED]>
>>> Date:   Sat Nov 24 19:47:25 2007 +0200
>>>
>>>     [SCSI] fix up REQ_FASTFAIL not to fail when state is QUIESCE
>>>
>>> commit 9dd15a13b332e9f5c8ee752b1ccd9b84cb5bdf17
>>> Author: James Bottomley <[EMAIL PROTECTED]>
>>> Date:   Sat Nov 24 19:55:53 2007 +0200
>>>
>>>     [SCSI] fix domain validation to work again
>>>
>>> James
>>
>> Or just apply my latest patch (cf "Undo __scsi_kill_request").
>> The main point is that we shouldn't retry requests with FAILFAST set
>> when the queue is blocked. AFAICS only the FC and iSCSI transports
>> set the queue to blocked, and use this to indicate a loss of
>> connection. So any retry with the queue blocked is futile.
>
> I still don't think this is the right approach.
>
> For link up/down events, those are direct pathing events and should be
> signalled along a kernel notifier, not by mucking with the SCSI state
> machine.

Of course they will be signalled. And eventually we should patch up
multipath-tools to read the existing events from the uevent socket.

But even with that patch there is a quite large window during which IOs
will be sent to the blocked device, and hence will be stuck in the
request queue until the timer expires.

> However, there's still devloss_tmo to consider ... even in multipath,
> I don't think you want to signal path failure until devloss_tmo has
> fired otherwise you'll get too many transient up/down events which
> damage performance if the array has an expensive failover model.

Yes. But currently we have a very high failover latency as we always
have to wait for the requeued commands to time out. Hence we're damaging
performance on arrays with inexpensive failover.

> The other problem is what to do with in-flight commands at the time
> the link went down. With your current patch, they're still stuck until
> they time out ... surely there needs to be some type of recovery
> mechanism for these?

Well, the in-flight commands are owned by the HBA driver, which should
have the proper code to terminate / return those commands with the
appropriate codes. They will then be rescheduled and will be caught like
'normal' IO requests.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
[EMAIL PROTECTED]                     +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)