Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow, on 10/24/2009 02:08 AM wrote: On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: David Dillow wrote: I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers.

Unfortunately, it isn't always so simple. Sometimes there are ordering dependencies between requests, so you have to drain the failed path before sending retries down the alternate path, to make sure that the retries will not interfere with not-yet-aborted requests from the failed path. Just to make things more precise. Vlad

-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > + if (timer_pending(&target->qp_err_timer)) > + del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); As was pointed out, I don't think you can do del_timer_sync while holding the lock, since the timer function takes the same lock. I don't think it takes the same lock (i.e., host_lock); however, the problem is that with CONFIG_LOCKDEP configured, del_timer_sync() calls local_irq_save/restore() to acquire/release timer->lockdep_map. But I don't know that just switching to del_timer without the sync works here ... without the sync the timeout function could still run any time after the del_timer, even after everything gets freed. I will move del_timer_sync() out of the host_lock. It's safe because we only set up the timer when the target state is ALIVE, and the target state is already REMOVED when del_timer_sync() is called. BTW the test of timer_pending isn't needed here... del_timer does the test internally anyway. I do agree it would be very good to improve the SRP error handling. I have some concerns about the overall design here -- it seems that if we handle connection failure and allow a new connection to proceed while cleaning up asynchronously, then this opens the door to a lot of complexity, and I don't see that complexity handled in this patchset. For example, the new connection could fail too before the old one is done cleaning up, etc, etc and we end up with an arbitrarily large queue of things waiting to clean up. Or maybe it really is simpler than that. It is complicated; however, it is not as complicated as you suggest. There are two flows/places where we clean up the connection and create a new one (i.e., call srp_reconnect_target()): the async event timer, and SCSI error handling via srp_reset_host(). However, in srp_reset_host() patch 3/4 checks -- under host_lock -- whether the timer is pending, qp_in_error is set, or the target state has changed, and in that case it won't call srp_reconnect_target().
That leaves one flow/place to clean up and create the new connection. We destroy the CM connection/CQ/QP and create the new connection/CQ/QP in the same flow/place; we also clean up all SCSI/SRP resources associated with the old connection before creating the new one; therefore, the only resources pending cleanup are the CM/connection resources and the low-level QP/CQ resources. And I believe the lower IB stack handles that well. I think the best way to move this forward would be to post another cleaned-up version of your patch set. Please try to reorganize things so each patch is reasonably self-contained. Of course your patchset is taking multiple steps to improve things. But as much as possible, please try to avoid combining two things into a single patch, and conversely also try to avoid putting things into a patch that don't make sense without a later patch. Avoiding policy in the kernel as much as possible in terms of hard-coded timeouts etc is a good goal too. Also it would help to give each patch a separate descriptive subject, and put as much detail in the changelogs as you can. Thanks, Roland I'll follow the instructions above, rework the patch set, and resubmit. thanks, -vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> +if (timer_pending(&target->qp_err_timer)) > +del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); As was pointed out, I don't think you can do del_timer_sync while holding the lock, since the timer function takes the same lock. But I don't know that just switching to del_timer without the sync works here ... without the sync the timeout function could still run any time after the del_timer, even after everything gets freed. BTW the test of timer_pending isn't needed here... del_timer does the test internally anyway. I do agree it would be very good to improve the SRP error handling. I have some concerns about the overall design here -- it seems that if we handle connection failure and allow a new connection to proceed while cleaning up asynchronously, then this opens the door to a lot of complexity, and I don't see that complexity handled in this patchset. For example, the new connection could fail too before the old one is done cleaning up, etc, etc and we end up with an arbitrarily large queue of things waiting to clean up. Or maybe it really is simpler than that. I think the best way to move this forward would be to post another cleaned up version of your patch set. Please try to reorganize things so each patch is reasonably self contained. Of course your patchset is taking multiple steps to improve things. But as much as possible, please try to avoid combining two things into a single patch, and conversely also try to avoid putting things into a patch that don't make sense without a later patch. Avoiding policy in the kernel as much as possible in terms of hard-coded timeouts etc is a good goal too. Also it would help to give each patch a separate descriptive subject, and put as much detail in the changelogs as you can. 
Thanks, Roland
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Sat, 2009-10-24 at 03:35 -0400, Vu Pham wrote: > It's a big improvement from 3-5 minutes cutting down to 1s, and now you > talk about device_loss_timeout=0. I'll look at the trade-off to have > it; however, to receive and process the async event (port error) > already costs you a fair amount of cycles. I agree that it is a great improvement over just sending packets blindly to the link and waiting for SCSI to time them out -- I've been using the variant of the patches from OFED -- but it is harder to change things once they are in the mainstream kernel, so I'd like to see it done better. And hey, maybe I'm just overly touchy about this. These should be rare events, and there's nothing we can do about the commands sent prior to being told about the link error. It's just that I don't want my file system to stall the petaflop simulation platforms if I can avoid it -- and there's no reason to send any command down the wire once we've been told there is no link or the target is not there. Maybe we don't need to destroy the link immediately, but we need to let the SCSI mid-layer know that things are failing. -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: David Dillow wrote: I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. Thanks for explaining. Without these patches, it will take ~3-5 minutes before the srp driver propagates errors up so that dm-multipath can switch paths. You need these patches -- test them and you'll see. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers. Can you switch paths manually in user mode (while there are commands stuck in the current active path)? I've not tried, but given the above I don't see why not. Without this patch, all outstanding I/Os have to go thru error recovery before being returned with an error code so that dm-multipath can fail over. I think we're talking about two separate things here -- I agree that the idea of failing IO early when we've lost our local connection, or know the target is not in the fabric, is a good one. 
I want a fast failure so that I can immediately start using my alternate paths. I'll have to deal with the timeouts on the requests in flight at some point, but they don't hold back independent requests. We're talking about the same thing here. Like I said above, these patches are needed so that errors can be propagated up faster. Without them, you have to wait 3-5 minutes. The difference of opinion seems to be in how long to wait after being notified of a connection loss -- or the target leaving the fabric -- before we start kicking back errors at the SRP command dispatch handler. I agree that it makes sense to wait a moment before forcing an RDAC path change, as they seem to be slow. But I also want it to get out of the way for my case, when I don't incur much of a penalty to immediately light up my backup path. Both RDAC and ALUA need errors propagated up sooner. With the introduction of device_loss_timeout, srp satisfies both RDAC and ALUA modes. You can set device_loss_timeout=1, and RDAC setups can set it to 60s or so. If you want to fail requests right away, you can just set device_loss_timeout=1; others don't want dm-multipath to switch paths right away. That's the whole idea of these patches that I submitted. The thing is, I don't want to wait even 1 second to use my backup path, and I don't want all of those requests going into a black hole for that time, forcing me to wait for the SCSI timeout on requests that could have been immediately processed. On our system, 1 second is up to 1500 MB of data transferred over this one connection, and waiting around twiddling our thumbs for a single second can potentially cost 1.3 thousand trillion operations. It's a big improvement from 3-5 minutes cutting down to 1s, and now you talk about device_loss_timeout=0. I'll look at the trade-off to have it; however, receiving and processing the async event (port error) already costs you a fair amount of cycles. 
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: > David Dillow wrote: > I don't know much about multipath in ALUA mode. > How would the multipath driver (in ALUA mode) switch paths? (i.e., based on > what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers. > Can you switch paths manually in user mode (while there are commands > stuck in the current active path)? I've not tried, but given the above I don't see why not. > Without this patch, all outstanding I/Os have to go thru error recovery > before being returned with an error code so that dm-multipath can fail over. I think we're talking about two separate things here -- I agree that the idea of failing IO early when we've lost our local connection, or know the target is not in the fabric, is a good one. I want a fast failure so that I can immediately start using my alternate paths. I'll have to deal with the timeouts on the requests in flight at some point, but they don't hold back independent requests. 
The difference of opinion seems to be in how long to wait after being notified of a connection loss -- or the target leaving the fabric -- before we start kicking back errors at the SRP command dispatch handler. I agree that it makes sense to wait a moment before forcing an RDAC path change, as they seem to be slow. But I also want it to get out of the way for my case, when I don't incur much of a penalty to immediately light up my backup path. > If you want to fail requests right away, you can just set > device_loss_timeout=1; others don't want dm-multipath to switch paths > right away. That's the whole idea of these patches that I submitted. The thing is, I don't want to wait even 1 second to use my backup path, and I don't want all of those requests going into a black hole for that time, forcing me to wait for the SCSI timeout on requests that could have been immediately processed. On our system, 1 second is up to 1500 MB of data transferred over this one connection, and waiting around twiddling our thumbs for a single second can potentially cost 1.3 thousand trillion operations.
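For reference, the failover setup Dave describes -- ALUA-prioritized path groups with immediate use of the best available group -- corresponds roughly to a multipath.conf fragment like the following. The directive names are from multipath-tools (and vary somewhat across its versions); the vendor/product section and the specific values are illustrative, not a recommendation for any particular array.

```
devices {
    device {
        vendor                  "EXAMPLE"       # hypothetical array
        product                 "ARRAY"
        path_grouping_policy    group_by_prio   # one path group per ALUA priority
        prio                    alua            # derive priorities from ALUA state
        path_selector           "round-robin 0" # spread I/O within the active group
        failback                immediate       # return to the primary group on recovery
        no_path_retry           fail            # error out fast instead of queueing forever
    }
}
```

With `group_by_prio` plus `alua`, paths to the LUN's primary controller land in the preferred group, and dm-multipath falls back to the lower-priority group only when every preferred path has been failed.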
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Bart Van Assche wrote: On Fri, Oct 23, 2009 at 1:13 AM, Vu Pham wrote: [ ... ] Here is the updated patch, which implements device_loss_timeout for each target instead of a module parameter. It also reflects changes from previous feedback. Please review. Introducing device_loss_timeout at per-target granularity. Creating a timer to clean up the connection after device_loss_timeout expires. During device_loss_timeout, the QP is in the error state; srp will return DID_RESET for outstanding I/Os, return FAILED for abort_cmd and reset_lun, and return SUCCESS (without retrying reconnect) on reset_host. Signed-off-by: Vu Pham --- drivers/infiniband/ulp/srp/ib_srp.c | 94 ++- drivers/infiniband/ulp/srp/ib_srp.h | 3 + 2 files changed, 95 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index e44939a..12404d5 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work) return; } target->state = SRP_TARGET_REMOVED; + + if (timer_pending(&target->qp_err_timer)) + del_timer_sync(&target->qp_err_timer); + spin_unlock_irq(target->scsi_host->host_lock); spin_lock(&target->srp_host->target_lock); [ ... ] Calling del_timer_sync() while holding a spinlock can cause lock inversion. Has this code been tested on a kernel with LOCKDEP enabled? Bart. I have not tested with LOCKDEP enabled; probably del_timer() would be sufficient inside target_lock.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote: David Dillow wrote: On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. You can still do so with these patches applied by using the right device name (i.e., /dev/sdXXX). Not in a multipath situation configured for failover. I have to use the multipath device, which will then use the appropriate path as prioritized by ALUA. I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) Can you switch paths manually in user mode (while there are commands stuck in the current active path)? Without this patch, all outstanding I/Os have to go thru error recovery before being returned with an error code so that dm-multipath can fail over. I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. Perhaps that part should be in the patch that adds that support, then? That's patch #4 Sure, and perhaps the part that massages the timeout should be in the patch that introduces it and actually uses it, no? I will look at it and rework the patch. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. As I said above, it's not magic at all; it's just that some unknown number of seconds has already passed by, therefore I just pick X seconds to sleep. 
Sorry, this is a common idiom here -- a bare number in source code, with no explanation as to where it came from or why it was picked, is often called a "magic number." I'm saying you should comment on it, either in the commit message or in a comment in the code. Or better yet, give it a #define and a comment above that definition that says why you picked it. In other words, don't make someone who comes along after us have to search for this mail thread to figure out that the 5 second sleep was pulled out of thin air. Understood. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. It requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and the srp driver (handling this event). Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? Not that I'm sure it is worth it, though. If it's done, you still need to sleep target->device_loss_timeout (instead of some unknown seconds + 5) to tear down the connection so that dm-multipath can fail over. Or I can just start failing requests, knowing they won't get to the target, so dm-multipath will use the backup path immediately. I can sleep as long as I want before killing the connection, just in case it comes back, but my commands will still be going to the other path. If you want to fail requests right away, you can just set device_loss_timeout=1; others don't want dm-multipath to switch paths right away. That's the whole idea of these patches that I submitted.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Fri, Oct 23, 2009 at 1:13 AM, Vu Pham wrote: > > [ ... ] > > Here is the updated patch which implement the device_loss_timeout for each > target instead of module parameter. It also reflects changes from previous > feedbacks. Please review > > > > Introducing device_loss_timeout per target granularity. Creating a timer to > clean up connection after device_loss_timeout expired. During > device_loss_timeout, the QP is in error state, srp will return DID_RESET > for outstanding I/Os and return FAILED for abort_cmd, reset_lun, and return > SUCCESS (without retrying reconnect) on reset_host > > Signed-off-by: Vu Pham > > --- > > drivers/infiniband/ulp/srp/ib_srp.c | 94 ++- > drivers/infiniband/ulp/srp/ib_srp.h | 3 + > 2 files changed, 95 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > b/drivers/infiniband/ulp/srp/ib_srp.c > index e44939a..12404d5 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work) > return; > } > target->state = SRP_TARGET_REMOVED; > + > + if (timer_pending(&target->qp_err_timer)) > + del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); > > spin_lock(&target->srp_host->target_lock); > > [ ... ] Calling del_timer_sync() while holding a spinlock can cause lock inversion. Has this code been tested on a kernel with LOCKDEP enabled? Bart.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote: > David Dillow wrote: > > On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: > >> Yes, and you cannot disable it entirely. I'm still looking at > >> benefits/advantages to disable it entirely > > > > To me, the advantage is I have a perfectly viable backup path to the > > storage, and can immediately start issuing commands to it rather than > > waiting for any timeout. On my systems, 1 second can be up to 1500 MB > > transferred and a _huge_ number of compute cycles. And I expect those > > numbers to grow. > > > You can still do so with these patches applied by using the right device > name (i.e., /dev/sdXXX) Not in a multipath situation configured for failover. I have to use the multipath device, which will then use the appropriate path as prioritized by ALUA. > >> I use the user-supplied setting for the local async event on port error > >> where the link is broken from host to switch > > > > Perhaps that part should be in the patch that adds that support, then? > > > That's patch #4 Sure, and perhaps the part that massages the timeout should be in the patch that introduces it and actually uses it, no? > > This makes a certain amount of sense; I was confused by the two > > unrelated changes in this patch. I'm still not all that happy about a > > hard-coded 5 seconds, especially with no explanation about the magic > > number. > > > As I said above, it's not magic at all; it's just that some unknown number of > seconds has already passed by, therefore I just pick X seconds to sleep. Sorry, this is a common idiom here -- a bare number in source code, with no explanation as to where it came from or why it was picked, is often called a "magic number." I'm saying you should comment on it, either in the commit message or in a comment in the code. Or better yet, give it a #define and a comment above that definition that says why you picked it. 
In other words, don't make someone who comes along after us have to search for this mail thread to figure out that the 5 second sleep was pulled out of thin air. > >> To really sleep the user-supplied number of seconds, we need to register a > >> trap with the SM and receive the trap for a node leaving the fabric. > >> It requires a lot of changes in srp_daemon (registering for the trap, passing > >> the event down to the srp driver) and the srp driver (handling this event) > >> > > > > Well, if this were done, then you wouldn't need to sleep at all, would > > you? Just wait for the trap telling you the target rejoined the fabric? > > Perhaps you'd want a delay before tearing down the target connection, > > but then that could be part of the user settings above? > > > > Not that I'm sure it is worth it, though. > > > If it's done, you still need to sleep target->device_loss_timeout > (instead of some unknown seconds + 5) to tear down the connection so that > dm-multipath can fail over. Or I can just start failing requests, knowing they won't get to the target, so dm-multipath will use the backup path immediately. I can sleep as long as I want before killing the connection, just in case it comes back, but my commands will still be going to the other path.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: David Dillow wrote: Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. You can still do so with these patches applied by using the right device name (i.e., /dev/sdXXX). You also don't seem to use the user-supplied setting, but hard code the time to 5 seconds? I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. Perhaps that part should be in the patch that adds that support, then? That's patch #4 For the case of the link broken from target port to switch: we detect this case by receiving a connection-closed or WQE error, and when this happens some unknown number of seconds has already passed; therefore, I sleep 5 seconds instead of using the user-supplied value. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. As I said above, it's not magic at all; it's just that some unknown number of seconds has already passed by, therefore I just pick X seconds to sleep. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. It requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and the srp driver (handling this event). Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? 
Not that I'm sure it is worth it, though. If it's done, you still need to sleep target->device_loss_timeout (instead of some unknown seconds + 5) to tear down the connection so that dm-multipath can fail over. srp_daemon gets the trap right away when a target port goes in or out of the fabric; it passes these events down to the srp driver, and the srp driver needs to sleep target->device_loss_timeout.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: > David Dillow wrote: > > Yes, and you cannot disable it entirely. I'm still looking at > benefits/advantages to disable it entirely To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. > > You also don't seem to use the user-supplied setting, but hard code the > > time to 5 seconds? > > I use the user-supplied setting for the local async event on port error > where the link is broken from host to switch Perhaps that part should be in the patch that adds that support, then? > For the case of the link broken from target port to switch: we detect this case by > receiving a connection-closed or WQE error, and when this happens some > unknown number of seconds has already passed; therefore, I sleep 5 seconds instead > of using the user-supplied value. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. > To really sleep the user-supplied number of seconds, we need to register a > trap with the SM and receive the trap for a node leaving the fabric. > It requires a lot of changes in srp_daemon (registering for the trap, passing > the event down to the srp driver) and the srp driver (handling this event) Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? Not that I'm sure it is worth it, though. 
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 19:34 -0400, David Dillow wrote: On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: Jason Gunthorpe wrote: On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing. Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Here is the updated patch, which implements device_loss_timeout for each target instead of a module parameter. It also reflects changes from previous feedback. Please review. Please put your patches inline for ease of reply... You still seem to have a 30 second minimum timeout, and no way to disable it altogether. And if I could read a patch, I'd notice that that was not a minimum, but a default. Still, I have to have a 1 second timeout with no way to disable entirely. Better, but... Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. Last time I reply to myself tonight, I promise... :/ You also don't seem to use the user-supplied setting, but hard code the time to 5 seconds? I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. For the case of the link broken from target port to switch: we detect this case by receiving a connection-closed or WQE error, and when this happens some unknown number of seconds has already passed; therefore, I sleep 5 seconds instead of using the user-supplied value. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. 
That requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and in the srp driver (handling the event).
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:34 -0400, David Dillow wrote: > On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: > > On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > > > Jason Gunthorpe wrote: > > > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > > > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > > > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > > > >> much. I'd rather multipath be delivering traffic to the backup path > > > >> than > > > >> sitting on its thumbs for 60 seconds doing nothing. > > > > > > Certainly an enforced lower limit in the kernel is silly, and a > > > > per-device setting does make some sense. > > > > > Here is the updated patch which implement the device_loss_timeout for > > > each target instead of module parameter. It also reflects changes from > > > previous feedbacks. Please review > > > > Please put your patches inline for ease of reply... > > > > You still seem to have a 30 second minimum timeout, and no way to > > disable it altogether. > > And if I could read a patch, I'd notice that that was not a minimum, but > a default. Still, I have to have a 1 second timeout with no way to > disable entirely. Better, but... Last time I reply to myself tonight, I promise... :/ You also don't seem to use the user supplied setting, but hard code the time to 5 seconds?
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: > On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > > Jason Gunthorpe wrote: > > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > > >> much. I'd rather multipath be delivering traffic to the backup path than > > >> sitting on its thumbs for 60 seconds doing nothing. > > > > Certainly an enforced lower limit in the kernel is silly, and a > > > per-device setting does make some sense. > > > Here is the updated patch which implement the device_loss_timeout for > > each target instead of module parameter. It also reflects changes from > > previous feedbacks. Please review > > Please put your patches inline for ease of reply... > > You still seem to have a 30 second minimum timeout, and no way to > disable it altogether. And if I could read a patch, I'd notice that that was not a minimum, but a default. Still, I have to have a 1 second timeout with no way to disable entirely. Better, but...
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > Jason Gunthorpe wrote: > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > >> much. I'd rather multipath be delivering traffic to the backup path than > >> sitting on its thumbs for 60 seconds doing nothing. > > Certainly an enforced lower limit in the kernel is silly, and a > > per-device setting does make some sense. > Here is the updated patch which implement the device_loss_timeout for > each target instead of module parameter. It also reflects changes from > previous feedbacks. Please review Please put your patches inline for ease of reply... You still seem to have a 30 second minimum timeout, and no way to disable it altogether.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Jason Gunthorpe wrote: On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: David Dillow wrote: And if I want to disable this completely? Unless these patches are bad and affect the stability of the driver, why do you want to disable it? If you don't use multipath/device-mapper and use /dev/sd**, everything will be the same. I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing. I've been left with a similar impression for several multipath things I've seen in the past. True active/active multipath setups should have a shorter timeout - there is no penalty for directing more traffic to the 2nd path (the paths should be load balancing existing traffic in the standard case anyhow). An active/passive configuration might be different... Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Jason

Here is the updated patch, which implements device_loss_timeout for each target instead of as a module parameter. It also reflects changes from previous feedback. Please review.

Introducing device_loss_timeout at per-target granularity. Creating a timer to clean up the connection after device_loss_timeout expires.
During device_loss_timeout, the QP is in error state, srp will return
DID_RESET for outstanding I/Os and return FAILED for abort_cmd, reset_lun,
and return SUCCESS (without retrying reconnect) on reset_host.

Signed-off-by: Vu Pham
---
 drivers/infiniband/ulp/srp/ib_srp.c | 94 ++-
 drivers/infiniband/ulp/srp/ib_srp.h |  3 +
 2 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index e44939a..12404d5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work)
 		return;
 	}
 	target->state = SRP_TARGET_REMOVED;
+
+	if (timer_pending(&target->qp_err_timer))
+		del_timer_sync(&target->qp_err_timer);
+
 	spin_unlock_irq(target->scsi_host->host_lock);

 	spin_lock(&target->srp_host->target_lock);
@@ -896,6 +900,50 @@ static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc)
 			    DMA_FROM_DEVICE);
 }

+static void srp_reconnect_work(struct work_struct *work)
+{
+	struct srp_target_port *target =
+		container_of(work, struct srp_target_port, work);
+
+	srp_reconnect_target(target);
+	spin_lock_irq(target->scsi_host->host_lock);
+	target->work_in_progress = 0;
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_in_err_timer(unsigned long data)
+{
+	struct srp_target_port *target = (struct srp_target_port *)data;
+	struct srp_request *req, *tmp;
+
+	if (target->state != SRP_TARGET_LIVE)
+		return;
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	list_for_each_entry_safe(req, tmp, &target->req_queue, list)
+		srp_reset_req(target, req);
+	spin_unlock_irq(target->scsi_host->host_lock);
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	if (!target->work_in_progress) {
+		target->work_in_progress = 1;
+		INIT_WORK(&target->work, srp_reconnect_work);
+		schedule_work(&target->work);
+	}
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_err_add_timer(struct srp_target_port *target, int time)
+{
+	if (!timer_pending(&target->qp_err_timer)) {
+		setup_timer(&target->qp_err_timer,
+			    srp_qp_in_err_timer,
+			    (unsigned long)target);
+		target->qp_err_timer.expires = round_jiffies(time*HZ + jiffies);
+		add_timer(&target->qp_err_timer);
+	}
+}
+
 static void srp_completion(struct ib_cq *cq, void *target_ptr)
 {
 	struct srp_target_port *target = target_ptr;
@@ -904,11 +952,19 @@ static void srp_completion(struct ib_cq *cq, void *target_ptr)
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	while (ib_poll_cq(cq, 1, &wc) > 0) {
 		if (wc.status) {
+			unsigned long flags;
+
 			shost_printk(KERN_ERR, target->scsi_host,
				     PFX "failed %s status %d\n",
				     wc.wr_id & SRP_OP_RECV ? "receive" : "send",
				     wc.status);
-			target->qp_in_error = 1;
+			spin_loc
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: > > David Dillow wrote: > > > And if I want to disable this completely? > > > > > > > Unless these patches are bad and affect the stability of the driver, why > > do you want to disable it? If you don't use multipath/device-mapper and > > use /dev/sd**, everything will be same > > I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > much. I'd rather multipath be delivering traffic to the backup path than > sitting on its thumbs for 60 seconds doing nothing. I've been left with a similar impression for several multipath things I've seen in the past. True active/active multipath setups should have a shorter timeout - there is no penalty for directing more traffic to the 2nd path (the paths should be load balancing existing traffic in the standard case anyhow). An active/passive configuration might be different... Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Jason
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: > David Dillow wrote: > > And if I want to disable this completely? > > > > Unless these patches are bad and affect the stability of the driver, why > do you want to disable it? If you don't use multipath/device-mapper and > use /dev/sd**, everything will be same I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Wed, 2009-10-14 at 15:47 -0700, Roland Dreier wrote: > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. On that note, does this have to be a module parameter -- what if I have connections to different devices, that have different reboot guarantees?

It can be tuned down to the target/device level instead of being a module parameter. What do you think, Roland? It could be a parameter in the login string, stored in the target structure.

And if I really want a timeout of less than 60 seconds, why should I have to patch my kernel?

Then a target parameter would be the right approach.

And if I want to disable this completely?

Unless these patches are bad and affect the stability of the driver, why do you want to disable it? If you don't use multipath/device-mapper and use /dev/sd**, everything will be the same.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Wed, 2009-10-14 at 15:47 -0700, Roland Dreier wrote: > > > > First it does not make sense for user to set it below 60; therefore, > > > > it is forced to have 60 and above > > > > Why not? A minute seems to be a really long time given the point of > > > these patches is supposed to be failing over faster. Surely we can tell > > > if a path really failed sooner than 60 seconds on an IB fabric. > > > When we fail-over, it will cause the luns ownership transfer in > > target/storage. It's undesirable op unless necessary > > Target/storage most likely can reboot and come back within 60 seconds > > We don't want to create the situation of path bouncing > > OK, I can see why in some (many) situations it makes sense to wait a > while before reporting a target as gone. But why do we hard code the > policy of a minimum timeout of 60 seconds in the kernel? Why not a > minimum of 120 seconds? What if I know my storage is guaranteed to > reboot in 2 seconds -- why can't I have a timeout of 5 seconds? > > You haven't really explained where the magic number of 60 seconds comes from. On that note, does this have to be a module parameter -- what if I have connections to different devices, that have different reboot guarantees? And if I really want a timeout of less than 60 seconds, why should I have to patch my kernel? And if I want to disable this completely?
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > > > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. - R.

I don't have a clear answer. It's the same as asking why SCSI waits 30 seconds before starting to abort commands. I think 60 seconds is neither too long nor too short for a target to come back online.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> > > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > First it does not make sense for user to set it below 60; therefore, > it is forced to have 60 and above Why not? A minute seems to be a really long time given the point of these patches is supposed to be failing over faster. Surely we can tell if a path really failed sooner than 60 seconds on an IB fabric. - R.

When we fail over, it causes LUN ownership to transfer on the target/storage. That is an undesirable operation unless necessary. The target/storage can most likely reboot and come back within 60 seconds, and we don't want to create a path-bouncing situation.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > +static int srp_dev_loss_tmo = 60; I don't think the name needs to be this abbreviated. We don't necessarily need the srp_ prefix, but probably "device_loss_timeout" is much clearer without being too much longer.

OK

> + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); I can't understand this description. What does "insulate the lost" of a port mean?

I should change "remote port" to just "port". It means that the multipath driver won't learn about a port-offline event (a pulled cable, a power-cycled switch or target, ...) and won't fail over, because srp won't return an error code until this timeout expires.

> +static void srp_reconnect_work(struct work_struct *work) > +{ > + struct srp_target_port *target = > + container_of(work, struct srp_target_port, work); > + > + srp_reconnect_target(target); > + target->work_in_progress = 0; surely this is racy... isn't it possible for a context to see work_in_progress as 1, decide not to schedule the work, and then have it set to 0 immediately afterwards by the workqueue context?

Yes, it is racy. It should be done under spin_lock_irq on the SCSI host_lock.

> + target->qp_err_timer.expires = time * HZ + jiffies; given that this is only with 1 second resolution, probably makes sense to either make it a deferrable timer or round the timeout to avoid extra wakeups.

OK - I'll round the timeout.

> + add_timer(&target->qp_err_timer); I don't see anywhere that this is canceled on module unload etc?

My mistake. Bart also pointed it out. I'll fix this.

> + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + if (srp_dev_loss_tmo < 60) > + srp_dev_loss_tmo = 60; I don't understand the 55 and the 60 here... what are these magic numbers? Wouldn't it make sense for the user to specify the actual timeout that is used (value - 55) rather than the value and then secretly subtracting 55? - R.
First, it does not make sense for the user to set it below 60; therefore, it is forced to 60 or above.

With the async event handler, srp can detect the local port going offline and set the timer to exactly device_loss_timeout; however, it has no mechanism to detect the remote port going offline (srp_daemon would need to register for the trap and pass remote ports entering/leaving the fabric down to the srp driver).

I should just add a timer of X seconds instead of (device_loss_tmo - 55) when receiving a CQE error and/or a connection-close event.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> First it does not make sense for user to set it below 60; therefore, > it is forced to have 60 and above Why not? A minute seems to be a really long time given the point of these patches is supposed to be failing over faster. Surely we can tell if a path really failed sooner than 60 seconds on an IB fabric. - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> +static int srp_dev_loss_tmo = 60; I don't think the name needs to be this abbreviated. We don't necessarily need the srp_ prefix, but probably "device_loss_timeout" is much clearer without being too much longer. > + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); I can't understand this description. What does "insulate the lost" of a port mean? > +static void srp_reconnect_work(struct work_struct *work) > +{ > +struct srp_target_port *target = > +container_of(work, struct srp_target_port, work); > + > +srp_reconnect_target(target); > +target->work_in_progress = 0; surely this is racy... isn't it possible for a context to see work_in_progress as 1, decide not to schedule the work, and then have it set to 0 immediately afterwards by the workqueue context? > +target->qp_err_timer.expires = time * HZ + jiffies; given that this is only with 1 second resolution, probably makes sense to either make it a deferrable timer or round the timeout to avoid extra wakeups. > +add_timer(&target->qp_err_timer); I don't see anywhere that this is canceled on module unload etc? > +srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > +if (srp_dev_loss_tmo < 60) > +srp_dev_loss_tmo = 60; I don't understand the 55 and the 60 here... what are these magic numbers? Wouldn't it make sense for the user to specify the actual timeout that is used (value - 55) rather than the value and then secretly subtracting 55? - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Tue, Oct 13, 2009 at 12:57 AM, Vu Pham wrote: > > > Introducing srp_dev_loss_tmo module parameter. Creating a timer to clean up > connection after srp_dev_loss_tmo expired. During srp_dev_loss_tmo, the qp > is in error state, srp will return DID_RESET for outstanding I/O and return > FAILED for abort_cmd, reset_lun, and return SUCCESS (without trying > reconnect) on reset_host. > > Signed-off-by: Vu Pham > > > > Index: ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c > === > --- ofed_kernel.orig/drivers/infiniband/ulp/srp/ib_srp.c > +++ ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c > @@ -78,6 +77,13 @@ > MODULE_PARM_DESC(mellanox_workarounds, > "Enable workarounds for Mellanox SRP target bugs if != 0"); > > +static int srp_dev_loss_tmo = 60; > + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); > + > static void srp_add_one(struct ib_device *device); > static void srp_remove_one(struct ib_device *device); > static void srp_completion(struct ib_cq *cq, void *target_ptr); > @@ -898,6 +926,48 @@ > DMA_FROM_DEVICE); > } > > +static void srp_reconnect_work(struct work_struct *work) > +{ > + struct srp_target_port *target = > + container_of(work, struct srp_target_port, work); > + > + srp_reconnect_target(target); > + target->work_in_progress = 0; > +} > + > +static void srp_qp_in_err_timer(unsigned long data) > +{ > + struct srp_target_port *target = (struct srp_target_port *)data; > + struct srp_request *req, *tmp; > + > + if (target->state != SRP_TARGET_LIVE) > + return; > + > + spin_lock_irq(target->scsi_host->host_lock); > + list_for_each_entry_safe(req, tmp, &target->req_queue, list) > + srp_reset_req(target, req); > + spin_unlock_irq(target->scsi_host->host_lock); > + > + spin_lock_irq(target->scsi_host->host_lock); > + if (!target->work_in_progress) { > + target->work_in_progress = 1; > + 
INIT_WORK(&target->work, srp_reconnect_work); > + schedule_work(&target->work); > + } > + spin_unlock_irq(target->scsi_host->host_lock); > +} > + > +static void srp_qp_err_add_timer(struct srp_target_port *target, int time) > +{ > + if (!timer_pending(&target->qp_err_timer)) { > + setup_timer(&target->qp_err_timer, > + srp_qp_in_err_timer, > + (unsigned long)target); > + target->qp_err_timer.expires = time * HZ + jiffies; > + add_timer(&target->qp_err_timer); > + } > +} What will happen when the ib_srp kernel module is removed after the timer has been set up but before the timer has fired ? Isn't a call to del_timer_sync() missing in srp_remove_work() ? > + > static void srp_completion(struct ib_cq *cq, void *target_ptr) > { > struct srp_target_port *target = target_ptr; > @@ -960,11 +980,20 @@ > ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); > while (ib_poll_cq(cq, 1, &wc) > 0) { > if (wc.status) { > + unsigned long flags; > + > shost_printk(KERN_ERR, target->scsi_host, > PFX "failed %s status %d\n", > wc.wr_id & SRP_OP_RECV ? "receive" : > "send", > wc.status); > - target->qp_in_error = 1; > + spin_lock_irqsave(target->scsi_host->host_lock, > flags); > + if (!target->qp_in_error && > + target->state == SRP_TARGET_LIVE) { > + target->qp_in_error = 1; > + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + } > + spin_unlock_irqrestore(target->scsi_host->host_lock, > flags); > break; > } > > @@ -1274,5 +1299,6 @@ > int attr_mask = 0; > int comp = 0; > int opcode = 0; > + unsigned long flags; > > switch (event->event) { > @@ -1301,6 +1381,14 @@ > shost_printk(KERN_ERR, target->scsi_host, > PFX "connection closed\n"); > > + spin_lock_irqsave(target->scsi_host->host_lock, flags); > + if (!target->qp_in_error && > + target->state == SRP_TARGET_LIVE) { > + target->qp_in_error = 1; > + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + } > + s
[ofa-general][PATCH 3/4] SRP fail-over faster
--- Begin Message ---

Introducing srp_dev_loss_tmo module parameter. Creating a timer to clean up
connection after srp_dev_loss_tmo expired. During srp_dev_loss_tmo, the qp
is in error state, srp will return DID_RESET for outstanding I/O and return
FAILED for abort_cmd, reset_lun, and return SUCCESS (without trying
reconnect) on reset_host.

Signed-off-by: Vu Pham

Index: ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c
===
--- ofed_kernel.orig/drivers/infiniband/ulp/srp/ib_srp.c
+++ ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c
@@ -78,6 +77,13 @@
 MODULE_PARM_DESC(mellanox_workarounds,
		 "Enable workarounds for Mellanox SRP target bugs if != 0");

+static int srp_dev_loss_tmo = 60;
+
+module_param(srp_dev_loss_tmo, int, 0444);
+MODULE_PARM_DESC(srp_dev_loss_tmo,
+		 "Default number of seconds that srp transport should \
+		  insulate the lost of a remote port (default is 60 secs");
+
 static void srp_add_one(struct ib_device *device);
 static void srp_remove_one(struct ib_device *device);
 static void srp_completion(struct ib_cq *cq, void *target_ptr);
@@ -898,6 +926,48 @@
			    DMA_FROM_DEVICE);
 }

+static void srp_reconnect_work(struct work_struct *work)
+{
+	struct srp_target_port *target =
+		container_of(work, struct srp_target_port, work);
+
+	srp_reconnect_target(target);
+	target->work_in_progress = 0;
+}
+
+static void srp_qp_in_err_timer(unsigned long data)
+{
+	struct srp_target_port *target = (struct srp_target_port *)data;
+	struct srp_request *req, *tmp;
+
+	if (target->state != SRP_TARGET_LIVE)
+		return;
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	list_for_each_entry_safe(req, tmp, &target->req_queue, list)
+		srp_reset_req(target, req);
+	spin_unlock_irq(target->scsi_host->host_lock);
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	if (!target->work_in_progress) {
+		target->work_in_progress = 1;
+		INIT_WORK(&target->work, srp_reconnect_work);
+		schedule_work(&target->work);
+	}
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_err_add_timer(struct srp_target_port *target, int time)
+{
+	if (!timer_pending(&target->qp_err_timer)) {
+		setup_timer(&target->qp_err_timer,
+			    srp_qp_in_err_timer,
+			    (unsigned long)target);
+		target->qp_err_timer.expires = time * HZ + jiffies;
+		add_timer(&target->qp_err_timer);
+	}
+}
+
 static void srp_completion(struct ib_cq *cq, void *target_ptr)
 {
 	struct srp_target_port *target = target_ptr;
@@ -960,11 +980,20 @@
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	while (ib_poll_cq(cq, 1, &wc) > 0) {
 		if (wc.status) {
+			unsigned long flags;
+
			shost_printk(KERN_ERR, target->scsi_host,
				     PFX "failed %s status %d\n",
				     wc.wr_id & SRP_OP_RECV ? "receive" : "send",
				     wc.status);
-			target->qp_in_error = 1;
+			spin_lock_irqsave(target->scsi_host->host_lock, flags);
+			if (!target->qp_in_error &&
+			    target->state == SRP_TARGET_LIVE) {
+				target->qp_in_error = 1;
+				srp_qp_err_add_timer(target,
+						     srp_dev_loss_tmo - 55);
+			}
+			spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
			break;
		}

@@ -1274,5 +1299,6 @@
 	int attr_mask = 0;
 	int comp = 0;
 	int opcode = 0;
+	unsigned long flags;

 	switch (event->event) {
@@ -1301,6 +1381,14 @@
		shost_printk(KERN_ERR, target->scsi_host,
			     PFX "connection closed\n");

+		spin_lock_irqsave(target->scsi_host->host_lock, flags);
+		if (!target->qp_in_error &&
+		    target->state == SRP_TARGET_LIVE) {
+			target->qp_in_error = 1;
+			srp_qp_err_add_timer(target,
+					     srp_dev_loss_tmo - 55);
+		}
+		spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
		target->status = 0;
		break;
@@ -1443,9 +1529,22 @@
 static int srp_reset_host(struct scsi_cmnd *scmnd)
 {
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
+	struct srp_request *req, *tmp;
 	int ret = FAILED;

-	shost_printk(KERN_ERR, target->scsi_host, PFX "SRP reset_host called\n