Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow, on 10/24/2009 02:08 AM wrote: On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: David Dillow wrote: I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers.

Unfortunately, it isn't always so simple. Sometimes there are ordering dependencies between requests, so you have to drain the failed path before sending retries down the alternate path, to make sure that the retries will not interfere with not-yet-aborted requests from the failed path. Just to make things more precise. Vlad

-- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > + if (timer_pending(&target->qp_err_timer)) > + del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); As was pointed out, I don't think you can do del_timer_sync while holding the lock, since the timer function takes the same lock. I don't think it takes the same lock (i.e., host_lock); however, the problem is that with CONFIG_LOCKDEP configured, del_timer_sync() calls local_irq_save/restore() to acquire/release timer->lockdep_map. But I don't know that just switching to del_timer without the sync works here ... without the sync the timeout function could still run any time after the del_timer, even after everything gets freed. I will move del_timer_sync() out of the host_lock. It's safe because we only set up the timer when the target state is ALIVE, and the target state is already REMOVED when del_timer_sync() is called. BTW the test of timer_pending isn't needed here... del_timer does the test internally anyway. I do agree it would be very good to improve the SRP error handling. I have some concerns about the overall design here -- it seems that if we handle connection failure and allow a new connection to proceed while cleaning up asynchronously, then this opens the door to a lot of complexity, and I don't see that complexity handled in this patchset. For example, the new connection could fail too before the old one is done cleaning up, etc, etc and we end up with an arbitrarily large queue of things waiting to clean up. Or maybe it really is simpler than that. It is complicated; however, it is not as complicated as you suggest. There are two flows/places where we clean up the connection and create a new one (i.e., call srp_reconnect_target()): the async event timer, and SCSI error handling via srp_reset_host(). However, in srp_reset_host() patch 3/4 checks -- under host_lock -- whether the timer is pending, qp_in_error is set, or the target state has changed, and in that case it won't call srp_reconnect_target().
That leaves one flow/place to clean up and create the new connection. We destroy the CM connection/CQ/QP and create the new connection/CQ/QP in the same flow/place; we also clean up all SCSI/SRP resources associated with the old connection before creating the new one; therefore, the only resources pending cleanup are the CM/connection resources and the low-level QP/CQ resources. And I believe the lower IB stack handles that well. I think the best way to move this forward would be to post another cleaned-up version of your patch set. Please try to reorganize things so each patch is reasonably self-contained. Of course your patchset is taking multiple steps to improve things. But as much as possible, please try to avoid combining two things into a single patch, and conversely also try to avoid putting things into a patch that don't make sense without a later patch. Avoiding policy in the kernel as much as possible in terms of hard-coded timeouts etc is a good goal too. Also it would help to give each patch a separate descriptive subject, and put as much detail in the changelogs as you can. Thanks, Roland I'll follow the instructions above, rework the patch set, and resubmit. thanks, -vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> +if (timer_pending(&target->qp_err_timer)) > +del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); As was pointed out, I don't think you can do del_timer_sync while holding the lock, since the timer function takes the same lock. But I don't know that just switching to del_timer without the sync works here ... without the sync the timeout function could still run any time after the del_timer, even after everything gets freed. BTW the test of timer_pending isn't needed here... del_timer does the test internally anyway. I do agree it would be very good to improve the SRP error handling. I have some concerns about the overall design here -- it seems that if we handle connection failure and allow a new connection to proceed while cleaning up asynchronously, then this opens the door to a lot of complexity, and I don't see that complexity handled in this patchset. For example, the new connection could fail too before the old one is done cleaning up, etc, etc and we end up with an arbitrarily large queue of things waiting to clean up. Or maybe it really is simpler than that. I think the best way to move this forward would be to post another cleaned up version of your patch set. Please try to reorganize things so each patch is reasonably self contained. Of course your patchset is taking multiple steps to improve things. But as much as possible, please try to avoid combining two things into a single patch, and conversely also try to avoid putting things into a patch that don't make sense without a later patch. Avoiding policy in the kernel as much as possible in terms of hard-coded timeouts etc is a good goal too. Also it would help to give each patch a separate descriptive subject, and put as much detail in the changelogs as you can. 
Thanks, Roland
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Sat, 2009-10-24 at 03:35 -0400, Vu Pham wrote: > It's a big improvement from 3-5 minutes cutting down to 1s, and now you > talk about device_loss_timeout=0. I'll look at the trade-off to have > it; however, to receive and process the async event (port error) > already costs you a fair amount of cycles. I agree that it is a great improvement over just sending packets blindly to the link and waiting for SCSI to time them out -- I've been using the variant of the patches from OFED -- but it is harder to change things once they are in the mainstream kernel, so I'd like to see it done better. And hey, maybe I'm just overly touchy about this. These should be rare events, and there's nothing we can do about the commands sent prior to being told about the link error. It's just that I don't want my file system to stall the petaflop simulation platforms if I can avoid it -- and there's no reason to send any command down the wire once we've been told there is no link or the target is not there. Maybe we don't need to destroy the link immediately, but we need to let the SCSI mid-layer know that things are failing. -- Dave Dillow National Center for Computational Science Oak Ridge National Laboratory (865) 241-6602 office
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: David Dillow wrote: I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. Thanks for explaining. Without these patches, it will take ~3-5 minutes before the srp driver propagates errors up so that dm-multipath can switch paths. You need these patches -- test them and you'll see. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers. Can you switch paths manually in user mode (while there are commands stuck in the current active path)? I've not tried, but given the above I don't see why not. Without this patch, all outstanding I/Os have to go thru error recovery before being returned with an error code so that dm-multipath can fail over. I think we're talking about two separate things here -- I agree that the idea of failing IO early when we've lost our local connection, or know the target is not in the fabric, is a good one. 
I want a fast failure so that I can immediately start using my alternate paths. I'll have to deal with the timeouts on the requests in flight at some point, but they don't hold back independent requests. We're talking about the same thing here. Like I said above, these patches are needed so that errors can be propagated up faster. Without them, you have to wait 3-5 minutes. The difference of opinion seems to be in how long to wait after being notified of a connection loss -- or the target leaving the fabric -- before we start kicking back errors at the SRP command dispatch handler. I agree that it makes sense to wait a moment before forcing an RDAC path change, as they seem to be slow. But I also want it to get out of the way for my case, when I don't incur much of a penalty to immediately light up my backup path. Both RDAC and ALUA need errors propagated up sooner. With the introduction of device_loss_timeout, srp satisfies both RDAC and ALUA modes. You can set device_loss_timeout=1, and RDAC setups can set it to 60s or so. If you want to fail requests right away, you can just set device_loss_timeout=1; others don't want dm-multipath to switch paths right away. That's the whole idea of these patches that I submitted. The thing is, I don't want to wait even 1 second to use my backup path, and I don't want all of those requests going into a black hole for that time, forcing me to wait for the SCSI timeout on requests that could have been immediately processed. On our system, 1 second is up to 1500 MB of data transferred over this one connection, and waiting around twiddling our thumbs for a single second can potentially cost 1.3 thousand trillion operations. It's a big improvement from 3-5 minutes cutting down to 1s, and now you talk about device_loss_timeout=0. I'll look at the trade-off to have it; however, receiving and processing the async event (port error) already costs you a fair amount of cycles. 
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Fri, 2009-10-23 at 12:50 -0400, Vu Pham wrote: > David Dillow wrote: > I don't know much about multipath in ALUA mode. > How would the multipath driver (in ALUA mode) switch paths? (i.e., based on > what criteria?) ALUA sets the priority of each path, and generally multipath is set to round-robin among all paths of the same priority. So, paths going to the primary controller of a LUN get the best priority and are used preferentially over the backup paths. Once no more paths in a priority group are active, the round-robin selector will fall back to the next-highest-priority path group. The multipath core will immediately fail a path if the lower layers propagate an error up to it, such that an I/O request completes in error. If it has failed the path, it will start sending requests down alternate paths without waiting for the queue to drain on the first one. ALUA is not like RDAC -- in ALUA, all paths are valid to use, but some paths will give better performance. You do not necessarily need to give the array a command to move the LUN to another controller, so there's no reason to wait for a queue to drain. At least that is the way I understand things, having picked my way through the block layer, multipath core, and device handlers. > Can you switch paths manually in user mode (while there are commands > stuck in the current active path)? I've not tried, but given the above I don't see why not. > Without this patch, all outstanding I/Os have to go thru error recovery > before being returned with an error code so that dm-multipath can fail over. I think we're talking about two separate things here -- I agree that the idea of failing IO early when we've lost our local connection, or know the target is not in the fabric, is a good one. I want a fast failure so that I can immediately start using my alternate paths. I'll have to deal with the timeouts on the requests in flight at some point, but they don't hold back independent requests. 
The difference of opinion seems to be in how long to wait after being notified of a connection loss -- or the target leaving the fabric -- before we start kicking back errors at the SRP command dispatch handler. I agree that it makes sense to wait a moment before forcing an RDAC path change, as they seem to be slow. But I also want it to get out of the way for my case, when I don't incur much of a penalty to immediately light up my backup path. > If you want to fail requests right away, you can just set > device_loss_timeout=1; others don't want dm-multipath to switch paths > right away. That's the whole idea of these patches that I submitted. The thing is, I don't want to wait even 1 second to use my backup path, and I don't want all of those requests going into a black hole for that time, forcing me to wait for the SCSI timeout on requests that could have been immediately processed. On our system, 1 second is up to 1500 MB of data transferred over this one connection, and waiting around twiddling our thumbs for a single second can potentially cost 1.3 thousand trillion operations.
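For reference, the failover setup Dave describes -- ALUA-prioritized path groups with immediate use of the best available group -- corresponds roughly to a multipath.conf fragment like the following. The directive names are from multipath-tools (and vary somewhat across its versions); the vendor/product section and the specific values are illustrative, not a recommendation for any particular array.

```
devices {
    device {
        vendor                  "EXAMPLE"       # hypothetical array
        product                 "ARRAY"
        path_grouping_policy    group_by_prio   # one path group per ALUA priority
        prio                    alua            # derive priorities from ALUA state
        path_selector           "round-robin 0" # spread I/O within the active group
        failback                immediate       # return to the primary group on recovery
        no_path_retry           fail            # error out fast instead of queueing forever
    }
}
```

With `group_by_prio` plus `alua`, paths to the LUN's primary controller land in the preferred group, and dm-multipath falls back to the lower-priority group only when every preferred path has been failed.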
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Bart Van Assche wrote: On Fri, Oct 23, 2009 at 1:13 AM, Vu Pham wrote: [ ... ] Here is the updated patch, which implements device_loss_timeout for each target instead of a module parameter. It also reflects changes from previous feedback. Please review. Introducing device_loss_timeout at per-target granularity. Creating a timer to clean up the connection after device_loss_timeout expires. During device_loss_timeout, the QP is in the error state; srp will return DID_RESET for outstanding I/Os, return FAILED for abort_cmd and reset_lun, and return SUCCESS (without retrying reconnect) on reset_host. Signed-off-by: Vu Pham --- drivers/infiniband/ulp/srp/ib_srp.c | 94 ++- drivers/infiniband/ulp/srp/ib_srp.h | 3 + 2 files changed, 95 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index e44939a..12404d5 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work) return; } target->state = SRP_TARGET_REMOVED; + + if (timer_pending(&target->qp_err_timer)) + del_timer_sync(&target->qp_err_timer); + spin_unlock_irq(target->scsi_host->host_lock); spin_lock(&target->srp_host->target_lock); [ ... ] Calling del_timer_sync() while holding a spinlock can cause lock inversion. Has this code been tested on a kernel with LOCKDEP enabled? Bart. I have not tested with LOCKDEP enabled; probably del_timer() would be sufficient inside target_lock.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote: David Dillow wrote: On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. You can still do so with these patches applied by using the right device name (i.e., /dev/sdXXX). Not in a multipath situation configured for failover. I have to use the multipath device, which will then use the appropriate path as prioritized by ALUA. I don't know much about multipath in ALUA mode. How would the multipath driver (in ALUA mode) switch paths? (i.e., based on what criteria?) Can you switch paths manually in user mode (while there are commands stuck in the current active path)? Without this patch, all outstanding I/Os have to go thru error recovery before being returned with an error code so that dm-multipath can fail over. I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. Perhaps that part should be in the patch that adds that support, then? That's patch #4 Sure, and perhaps the part that massages the timeout should be in the patch that introduces it and actually uses it, no? I will look at it and rework the patch. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. As I said above, it's not magic at all; it's just that some unknown number of seconds has already passed by, therefore I just pick X seconds to sleep. 
Sorry, this is a common idiom here -- a bare number in source code, with no explanation as to where it came from or why it was picked, is often called a "magic number." I'm saying you should comment on it, either in the commit message or in a comment in the code. Or better yet, give it a #define and a comment above that definition that says why you picked it. In other words, don't make someone who comes along after us have to search for this mail thread to figure out that the 5 second sleep was pulled out of thin air. Understood. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. It requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and the srp driver (handling this event). Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? Not that I'm sure it is worth it, though. If it's done, you still need to sleep target->device_loss_timeout (instead of some unknown seconds + 5) to tear down the connection so that dm-multipath can fail over. Or I can just start failing requests, knowing they won't get to the target, so dm-multipath will use the backup path immediately. I can sleep as long as I want before killing the connection, just in case it comes back, but my commands will still be going to the other path. If you want to fail requests right away, you can just set device_loss_timeout=1; others don't want dm-multipath to switch paths right away. That's the whole idea of these patches that I submitted.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Fri, Oct 23, 2009 at 1:13 AM, Vu Pham wrote: > > [ ... ] > > Here is the updated patch which implement the device_loss_timeout for each > target instead of module parameter. It also reflects changes from previous > feedbacks. Please review > > > > Introducing device_loss_timeout per target granularity. Creating a timer to > clean up connection after device_loss_timeout expired. During > device_loss_timeout, the QP is in error state, srp will return DID_RESET > for outstanding I/Os and return FAILED for abort_cmd, reset_lun, and return > SUCCESS (without retrying reconnect) on reset_host > > Signed-off-by: Vu Pham > > --- > > drivers/infiniband/ulp/srp/ib_srp.c | 94 ++- > drivers/infiniband/ulp/srp/ib_srp.h | 3 + > 2 files changed, 95 insertions(+), 2 deletions(-) > > diff --git a/drivers/infiniband/ulp/srp/ib_srp.c > b/drivers/infiniband/ulp/srp/ib_srp.c > index e44939a..12404d5 100644 > --- a/drivers/infiniband/ulp/srp/ib_srp.c > +++ b/drivers/infiniband/ulp/srp/ib_srp.c > @@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work) > return; > } > target->state = SRP_TARGET_REMOVED; > + > + if (timer_pending(&target->qp_err_timer)) > + del_timer_sync(&target->qp_err_timer); > + > spin_unlock_irq(target->scsi_host->host_lock); > > spin_lock(&target->srp_host->target_lock); > > [ ... ] Calling del_timer_sync() while holding a spinlock can cause lock inversion. Has this code been tested on a kernel with LOCKDEP enabled? Bart.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 20:24 -0400, Vu Pham wrote: > David Dillow wrote: > > On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: > >> Yes, and you cannot disable it entirely. I'm still looking at > >> benefits/advantages to disable it entirely > > > > To me, the advantage is I have a perfectly viable backup path to the > > storage, and can immediately start issuing commands to it rather than > > waiting for any timeout. On my systems, 1 second can be up to 1500 MB > > transferred and a _huge_ number of compute cycles. And I expect those > > numbers to grow. > > > You can still do so with these patches applied by using the right device > name (i.e., /dev/sdXXX) Not in a multipath situation configured for failover. I have to use the multipath device, which will then use the appropriate path as prioritized by ALUA. > >> I use the user-supplied setting for the local async event on port error > >> where the link is broken from host to switch > > > > Perhaps that part should be in the patch that adds that support, then? > > > That's patch #4 Sure, and perhaps the part that massages the timeout should be in the patch that introduces it and actually uses it, no? > > This makes a certain amount of sense; I was confused by the two > > unrelated changes in this patch. I'm still not all that happy about a > > hard-coded 5 seconds, especially with no explanation about the magic > > number. > > > As I said above, it's not magic at all; it's just that some unknown number of > seconds has already passed by, therefore I just pick X seconds to sleep. Sorry, this is a common idiom here -- a bare number in source code, with no explanation as to where it came from or why it was picked, is often called a "magic number." I'm saying you should comment on it, either in the commit message or in a comment in the code. Or better yet, give it a #define and a comment above that definition that says why you picked it. 
In other words, don't make someone who comes along after us have to search for this mail thread to figure out that the 5 second sleep was pulled out of thin air. > >> To really sleep the user-supplied number of seconds, we need to register a > >> trap with the SM and receive the trap for a node leaving the fabric. > >> It requires a lot of changes in srp_daemon (registering for the trap, passing > >> the event down to the srp driver) and the srp driver (handling this event) > >> > > > > Well, if this were done, then you wouldn't need to sleep at all, would > > you? Just wait for the trap telling you the target rejoined the fabric? > > Perhaps you'd want a delay before tearing down the target connection, > > but then that could be part of the user settings above? > > > > Not that I'm sure it is worth it, though. > > > If it's done, you still need to sleep target->device_loss_timeout > (instead of some unknown seconds + 5) to tear down the connection so that > dm-multipath can fail over. Or I can just start failing requests, knowing they won't get to the target, so dm-multipath will use the backup path immediately. I can sleep as long as I want before killing the connection, just in case it comes back, but my commands will still be going to the other path.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: David Dillow wrote: Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. You can still do so with these patches applied by using the right device name (i.e., /dev/sdXXX). You also don't seem to use the user-supplied setting, but hard code the time to 5 seconds? I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. Perhaps that part should be in the patch that adds that support, then? That's patch #4 For the case of the link broken from target port to switch: we detect this case by receiving a connection-closed or WQE error, and when this happens some unknown number of seconds has already passed; therefore, I sleep 5 seconds instead of using the user-supplied value. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. As I said above, it's not magic at all; it's just that some unknown number of seconds has already passed by, therefore I just pick X seconds to sleep. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. It requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and the srp driver (handling this event). Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? 
Not that I'm sure it is worth it, though. If it's done, you still need to sleep target->device_loss_timeout (instead of some unknown seconds + 5) to tear down the connection so that dm-multipath can fail over. srp_daemon gets the trap right away when a target port goes in or out of the fabric; it passes these events down to the srp driver, and the srp driver needs to sleep target->device_loss_timeout.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 20:04 -0400, Vu Pham wrote: > David Dillow wrote: > > Yes, and you cannot disable it entirely. I'm still looking at > benefits/advantages to disable it entirely To me, the advantage is I have a perfectly viable backup path to the storage, and can immediately start issuing commands to it rather than waiting for any timeout. On my systems, 1 second can be up to 1500 MB transferred and a _huge_ number of compute cycles. And I expect those numbers to grow. > > You also don't seem to use the user-supplied setting, but hard code the > > time to 5 seconds? > > I use the user-supplied setting for the local async event on port error > where the link is broken from host to switch Perhaps that part should be in the patch that adds that support, then? > For the case of the link broken from target port to switch: we detect this case by > receiving a connection-closed or WQE error, and when this happens some > unknown number of seconds has already passed; therefore, I sleep 5 seconds instead > of using the user-supplied value. This makes a certain amount of sense; I was confused by the two unrelated changes in this patch. I'm still not all that happy about a hard-coded 5 seconds, especially with no explanation about the magic number. > To really sleep the user-supplied number of seconds, we need to register a > trap with the SM and receive the trap for a node leaving the fabric. > It requires a lot of changes in srp_daemon (registering for the trap, passing > the event down to the srp driver) and the srp driver (handling this event) Well, if this were done, then you wouldn't need to sleep at all, would you? Just wait for the trap telling you the target rejoined the fabric? Perhaps you'd want a delay before tearing down the target connection, but then that could be part of the user settings above? Not that I'm sure it is worth it, though. 
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Thu, 2009-10-22 at 19:34 -0400, David Dillow wrote: On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: Jason Gunthorpe wrote: On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing. Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Here is the updated patch, which implements device_loss_timeout for each target instead of a module parameter. It also reflects changes from previous feedback. Please review. Please put your patches inline for ease of reply... You still seem to have a 30 second minimum timeout, and no way to disable it altogether. And if I could read a patch, I'd notice that that was not a minimum, but a default. Still, I have to have a 1 second timeout with no way to disable entirely. Better, but... Yes, and you cannot disable it entirely. I'm still looking at the benefits/advantages of disabling it entirely. Last time I reply to myself tonight, I promise... :/ You also don't seem to use the user-supplied setting, but hard code the time to 5 seconds? I use the user-supplied setting for the local async event on port error, where the link is broken from host to switch. For the case of the link broken from target port to switch: we detect this case by receiving a connection-closed or WQE error, and when this happens some unknown number of seconds has already passed; therefore, I sleep 5 seconds instead of using the user-supplied value. To really sleep the user-supplied number of seconds, we need to register a trap with the SM and receive the trap for a node leaving the fabric. 
That requires a lot of changes in srp_daemon (registering for the trap, passing the event down to the srp driver) and in the srp driver (handling the event).
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:34 -0400, David Dillow wrote: > On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: > > On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > > > Jason Gunthorpe wrote: > > > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > > > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > > > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > > > >> much. I'd rather multipath be delivering traffic to the backup path > > > >> than > > > >> sitting on its thumbs for 60 seconds doing nothing. > > > > > > Certainly an enforced lower limit in the kernel is silly, and a > > > > per-device setting does make some sense. > > > > > Here is the updated patch which implement the device_loss_timeout for > > > each target instead of module parameter. It also reflects changes from > > > previous feedbacks. Please review > > > > Please put your patches inline for ease of reply... > > > > You still seem to have a 30 second minimum timeout, and no way to > > disable it altogether. > > And if I could read a patch, I'd notice that that was not a minimum, but > a default. Still, I have to have a 1 second timeout with no way to > disable entirely. Better, but... Last time I reply to myself tonight, I promise... :/ You also don't seem to use the user supplied setting, but hard code the time to 5 seconds?
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:33 -0400, David Dillow wrote: > On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > > Jason Gunthorpe wrote: > > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > > >> much. I'd rather multipath be delivering traffic to the backup path than > > >> sitting on its thumbs for 60 seconds doing nothing. > > > > Certainly an enforced lower limit in the kernel is silly, and a > > > per-device setting does make some sense. > > > Here is the updated patch which implement the device_loss_timeout for > > each target instead of module parameter. It also reflects changes from > > previous feedbacks. Please review > > Please put your patches inline for ease of reply... > > You still seem to have a 30 second minimum timeout, and no way to > disable it altogether. And if I could read a patch, I'd notice that that was not a minimum, but a default. Still, I have to have a 1 second timeout with no way to disable entirely. Better, but...
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-22 at 19:13 -0400, Vu Pham wrote: > Jason Gunthorpe wrote: > > On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > >> I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > >> seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > >> much. I'd rather multipath be delivering traffic to the backup path than > >> sitting on its thumbs for 60 seconds doing nothing. > > Certainly an enforced lower limit in the kernel is silly, and a > > per-device setting does make some sense. > Here is the updated patch which implement the device_loss_timeout for > each target instead of module parameter. It also reflects changes from > previous feedbacks. Please review Please put your patches inline for ease of reply... You still seem to have a 30 second minimum timeout, and no way to disable it altogether.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Jason Gunthorpe wrote: On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: David Dillow wrote: And if I want to disable this completely? Unless these patches are bad and affect the stability of the driver, why do you want to disable it? If you don't use multipath/device-mapper and use /dev/sd**, everything will be the same. I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing. I've been left with a similar impression for several multipath things I've seen in the past. True active/active multipath setups should have a shorter timeout - there is no penalty for directing more traffic to the 2nd path (the paths should be load balancing existing traffic in the standard case anyhow). An active/passive configuration might be different... Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Jason

Here is the updated patch, which implements device_loss_timeout for each target instead of as a module parameter. It also reflects changes from previous feedback. Please review.

Introducing device_loss_timeout at per-target granularity. Creating a timer to clean up the connection after device_loss_timeout expires.
During device_loss_timeout, the QP is in error state, srp will return
DID_RESET for outstanding I/Os and return FAILED for abort_cmd, reset_lun,
and return SUCCESS (without retrying reconnect) on reset_host.

Signed-off-by: Vu Pham
---
 drivers/infiniband/ulp/srp/ib_srp.c | 94 ++-
 drivers/infiniband/ulp/srp/ib_srp.h |  3 +
 2 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index e44939a..12404d5 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -433,6 +433,10 @@ static void srp_remove_work(struct work_struct *work)
 		return;
 	}
 	target->state = SRP_TARGET_REMOVED;
+
+	if (timer_pending(&target->qp_err_timer))
+		del_timer_sync(&target->qp_err_timer);
+
 	spin_unlock_irq(target->scsi_host->host_lock);

 	spin_lock(&target->srp_host->target_lock);
@@ -896,6 +900,50 @@ static void srp_handle_recv(struct srp_target_port *target, struct ib_wc *wc)
 			    DMA_FROM_DEVICE);
 }

+static void srp_reconnect_work(struct work_struct *work)
+{
+	struct srp_target_port *target =
+		container_of(work, struct srp_target_port, work);
+
+	srp_reconnect_target(target);
+	spin_lock_irq(target->scsi_host->host_lock);
+	target->work_in_progress = 0;
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_in_err_timer(unsigned long data)
+{
+	struct srp_target_port *target = (struct srp_target_port *)data;
+	struct srp_request *req, *tmp;
+
+	if (target->state != SRP_TARGET_LIVE)
+		return;
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	list_for_each_entry_safe(req, tmp, &target->req_queue, list)
+		srp_reset_req(target, req);
+	spin_unlock_irq(target->scsi_host->host_lock);
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	if (!target->work_in_progress) {
+		target->work_in_progress = 1;
+		INIT_WORK(&target->work, srp_reconnect_work);
+		schedule_work(&target->work);
+	}
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_err_add_timer(struct srp_target_port *target, int time)
+{
+	if (!timer_pending(&target->qp_err_timer)) {
+		setup_timer(&target->qp_err_timer,
+			    srp_qp_in_err_timer,
+			    (unsigned long)target);
+		target->qp_err_timer.expires = round_jiffies(time*HZ + jiffies);
+		add_timer(&target->qp_err_timer);
+	}
+}
+
 static void srp_completion(struct ib_cq *cq, void *target_ptr)
 {
 	struct srp_target_port *target = target_ptr;
@@ -904,11 +952,19 @@ static void srp_completion(struct ib_cq *cq, void *target_ptr)
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	while (ib_poll_cq(cq, 1, &wc) > 0) {
 		if (wc.status) {
+			unsigned long flags;
+
 			shost_printk(KERN_ERR, target->scsi_host,
				     PFX "failed %s status %d\n",
				     wc.wr_id & SRP_OP_RECV ? "receive" : "send",
				     wc.status);
-			target->qp_in_error = 1;
+			spin_loc
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, Oct 15, 2009 at 03:25:15PM -0400, David Dillow wrote: > On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: > > David Dillow wrote: > > > And if I want to disable this completely? > > > > > > > Unless these patches are bad and affect the stability of the driver, why > > do you want to disable it? If you don't use multipath/device-mapper and > > use /dev/sd**, everything will be same > > I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 > seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me > much. I'd rather multipath be delivering traffic to the backup path than > sitting on its thumbs for 60 seconds doing nothing. I've been left with a similar impression for several multipath things I've seen in the past. True active/active multipath setups should have a shorter timeout - there is no penalty for directing more traffic to the 2nd path (the paths should be load balancing existing traffic in the standard case anyhow). An active/passive configuration might be different... Certainly an enforced lower limit in the kernel is silly, and a per-device setting does make some sense. Jason
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Thu, 2009-10-15 at 09:23 -0700, Vu Pham wrote: > David Dillow wrote: > > And if I want to disable this completely? > > > > Unless these patches are bad and affect the stability of the driver, why > do you want to disable it? If you don't use multipath/device-mapper and > use /dev/sd**, everything will be same I use multipath with ALUA, and I don't mind if the link flaps a bit. 60 seconds is near my SCSI timeout of 77 seconds, so it doesn't buy me much. I'd rather multipath be delivering traffic to the backup path than sitting on its thumbs for 60 seconds doing nothing.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
David Dillow wrote: On Wed, 2009-10-14 at 15:47 -0700, Roland Dreier wrote: > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. On that note, does this have to be a module parameter -- what if I have connections to different devices, that have different reboot guarantees?

It can be tuned down to the target/device level instead of being a module parameter. What do you think, Roland? It could be a parameter in the login string, stored in the target structure.

And if I really want a timeout of less than 60 seconds, why should I have to patch my kernel?

Then a target parameter would be the right approach.

And if I want to disable this completely?

Unless these patches are bad and affect the stability of the driver, why do you want to disable it? If you don't use multipath/device-mapper and use /dev/sd**, everything will be the same.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Wed, 2009-10-14 at 15:47 -0700, Roland Dreier wrote: > > > > First it does not make sense for user to set it below 60; therefore, > > > > it is forced to have 60 and above > > > > Why not? A minute seems to be a really long time given the point of > > > these patches is supposed to be failing over faster. Surely we can tell > > > if a path really failed sooner than 60 seconds on an IB fabric. > > > When we fail-over, it will cause the luns ownership transfer in > > target/storage. It's undesirable op unless necessary > > Target/storage most likely can reboot and come back within 60 seconds > > We don't want to create the situation of path bouncing > > OK, I can see why in some (many) situations it makes sense to wait a > while before reporting a target as gone. But why do we hard code the > policy of a minimum timeout of 60 seconds in the kernel? Why not a > minimum of 120 seconds? What if I know my storage is guaranteed to > reboot in 2 seconds -- why can't I have a timeout of 5 seconds? > > You haven't really explained where the magic number of 60 seconds comes from. On that note, does this have to be a module parameter -- what if I have connections to different devices, that have different reboot guarantees? And if I really want a timeout of less than 60 seconds, why should I have to patch my kernel? And if I want to disable this completely?
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > > > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. - R.

I don't have a clear answer. It's the same as asking why SCSI waits 30 seconds before starting to abort commands. I think 60 seconds is neither too long nor too short for a target to come back online.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> > > First it does not make sense for user to set it below 60; therefore, > > > it is forced to have 60 and above > > Why not? A minute seems to be a really long time given the point of > > these patches is supposed to be failing over faster. Surely we can tell > > if a path really failed sooner than 60 seconds on an IB fabric. > When we fail-over, it will cause the luns ownership transfer in > target/storage. It's undesirable op unless necessary > Target/storage most likely can reboot and come back within 60 seconds > We don't want to create the situation of path bouncing OK, I can see why in some (many) situations it makes sense to wait a while before reporting a target as gone. But why do we hard code the policy of a minimum timeout of 60 seconds in the kernel? Why not a minimum of 120 seconds? What if I know my storage is guaranteed to reboot in 2 seconds -- why can't I have a timeout of 5 seconds? You haven't really explained where the magic number of 60 seconds comes from. - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > First it does not make sense for user to set it below 60; therefore, > it is forced to have 60 and above Why not? A minute seems to be a really long time given the point of these patches is supposed to be failing over faster. Surely we can tell if a path really failed sooner than 60 seconds on an IB fabric. - R.

When we fail over, it causes LUN ownership to transfer on the target/storage. That is an undesirable operation unless necessary. The target/storage can most likely reboot and come back within 60 seconds, and we don't want to create a path-bouncing situation.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
Roland Dreier wrote: > +static int srp_dev_loss_tmo = 60; I don't think the name needs to be this abbreviated. We don't necessarily need the srp_ prefix, but probably "device_loss_timeout" is much clearer without being too much longer.

OK

> + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); I can't understand this description. What does "insulate the lost" of a port mean?

I should change "remote port" to just "port". It means that the multipath driver won't learn about a port-offline event (a pulled cable, a power-cycled switch or target, ...) and won't fail over, because srp won't return an error code until this timeout expires.

> +static void srp_reconnect_work(struct work_struct *work) > +{ > + struct srp_target_port *target = > + container_of(work, struct srp_target_port, work); > + > + srp_reconnect_target(target); > + target->work_in_progress = 0; surely this is racy... isn't it possible for a context to see work_in_progress as 1, decide not to schedule the work, and then have it set to 0 immediately afterwards by the workqueue context?

Yes, it is racy. It should be done under spin_lock_irq on the SCSI host_lock.

> + target->qp_err_timer.expires = time * HZ + jiffies; given that this is only with 1 second resolution, probably makes sense to either make it a deferrable timer or round the timeout to avoid extra wakeups.

OK - I'll round the timeout.

> + add_timer(&target->qp_err_timer); I don't see anywhere that this is canceled on module unload etc?

My mistake. Bart also pointed it out. I'll fix this.

> + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + if (srp_dev_loss_tmo < 60) > + srp_dev_loss_tmo = 60; I don't understand the 55 and the 60 here... what are these magic numbers? Wouldn't it make sense for the user to specify the actual timeout that is used (value - 55) rather than the value and then secretly subtracting 55? - R.
First, it does not make sense for the user to set it below 60; therefore, it is forced to 60 or above.

With the async event handler, srp can detect the local port going offline and set the timer to exactly device_loss_timeout; however, it has no mechanism to detect the remote port going offline (srp_daemon would need to register for the trap and pass remote ports entering/leaving the fabric down to the srp driver).

I should just add a timer of X seconds instead of (device_loss_tmo - 55) when receiving a CQE error and/or a connection-close event.

-vu
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> First it does not make sense for user to set it below 60; therefore, > it is forced to have 60 and above Why not? A minute seems to be a really long time given the point of these patches is supposed to be failing over faster. Surely we can tell if a path really failed sooner than 60 seconds on an IB fabric. - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
> +static int srp_dev_loss_tmo = 60; I don't think the name needs to be this abbreviated. We don't necessarily need the srp_ prefix, but probably "device_loss_timeout" is much clearer without being too much longer. > + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); I can't understand this description. What does "insulate the lost" of a port mean? > +static void srp_reconnect_work(struct work_struct *work) > +{ > +struct srp_target_port *target = > +container_of(work, struct srp_target_port, work); > + > +srp_reconnect_target(target); > +target->work_in_progress = 0; surely this is racy... isn't it possible for a context to see work_in_progress as 1, decide not to schedule the work, and then have it set to 0 immediately afterwards by the workqueue context? > +target->qp_err_timer.expires = time * HZ + jiffies; given that this is only with 1 second resolution, probably makes sense to either make it a deferrable timer or round the timeout to avoid extra wakeups. > +add_timer(&target->qp_err_timer); I don't see anywhere that this is canceled on module unload etc? > +srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > +if (srp_dev_loss_tmo < 60) > +srp_dev_loss_tmo = 60; I don't understand the 55 and the 60 here... what are these magic numbers? Wouldn't it make sense for the user to specify the actual timeout that is used (value - 55) rather than the value and then secretly subtracting 55? - R.
Re: [ofa-general][PATCH 3/4] SRP fail-over faster
On Tue, Oct 13, 2009 at 12:57 AM, Vu Pham wrote: > > > Introducing srp_dev_loss_tmo module parameter. Creating a timer to clean up > connection after srp_dev_loss_tmo expired. During srp_dev_loss_tmo, the qp > is in error state, srp will return DID_RESET for outstanding I/O and return > FAILED for abort_cmd, reset_lun, and return SUCCESS (without trying > reconnect) on reset_host. > > Signed-off-by: Vu Pham > > > > Index: ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c > === > --- ofed_kernel.orig/drivers/infiniband/ulp/srp/ib_srp.c > +++ ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c > @@ -78,6 +77,13 @@ > MODULE_PARM_DESC(mellanox_workarounds, > "Enable workarounds for Mellanox SRP target bugs if != 0"); > > +static int srp_dev_loss_tmo = 60; > + > +module_param(srp_dev_loss_tmo, int, 0444); > +MODULE_PARM_DESC(srp_dev_loss_tmo, > + "Default number of seconds that srp transport should \ > + insulate the lost of a remote port (default is 60 secs"); > + > static void srp_add_one(struct ib_device *device); > static void srp_remove_one(struct ib_device *device); > static void srp_completion(struct ib_cq *cq, void *target_ptr); > @@ -898,6 +926,48 @@ > DMA_FROM_DEVICE); > } > > +static void srp_reconnect_work(struct work_struct *work) > +{ > + struct srp_target_port *target = > + container_of(work, struct srp_target_port, work); > + > + srp_reconnect_target(target); > + target->work_in_progress = 0; > +} > + > +static void srp_qp_in_err_timer(unsigned long data) > +{ > + struct srp_target_port *target = (struct srp_target_port *)data; > + struct srp_request *req, *tmp; > + > + if (target->state != SRP_TARGET_LIVE) > + return; > + > + spin_lock_irq(target->scsi_host->host_lock); > + list_for_each_entry_safe(req, tmp, &target->req_queue, list) > + srp_reset_req(target, req); > + spin_unlock_irq(target->scsi_host->host_lock); > + > + spin_lock_irq(target->scsi_host->host_lock); > + if (!target->work_in_progress) { > + target->work_in_progress = 1; > + 
INIT_WORK(&target->work, srp_reconnect_work); > + schedule_work(&target->work); > + } > + spin_unlock_irq(target->scsi_host->host_lock); > +} > + > +static void srp_qp_err_add_timer(struct srp_target_port *target, int time) > +{ > + if (!timer_pending(&target->qp_err_timer)) { > + setup_timer(&target->qp_err_timer, > + srp_qp_in_err_timer, > + (unsigned long)target); > + target->qp_err_timer.expires = time * HZ + jiffies; > + add_timer(&target->qp_err_timer); > + } > +} What will happen when the ib_srp kernel module is removed after the timer has been set up but before the timer has fired ? Isn't a call to del_timer_sync() missing in srp_remove_work() ? > + > static void srp_completion(struct ib_cq *cq, void *target_ptr) > { > struct srp_target_port *target = target_ptr; > @@ -960,11 +980,20 @@ > ib_req_notify_cq(cq, IB_CQ_NEXT_COMP); > while (ib_poll_cq(cq, 1, &wc) > 0) { > if (wc.status) { > + unsigned long flags; > + > shost_printk(KERN_ERR, target->scsi_host, > PFX "failed %s status %d\n", > wc.wr_id & SRP_OP_RECV ? "receive" : > "send", > wc.status); > - target->qp_in_error = 1; > + spin_lock_irqsave(target->scsi_host->host_lock, > flags); > + if (!target->qp_in_error && > + target->state == SRP_TARGET_LIVE) { > + target->qp_in_error = 1; > + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + } > + spin_unlock_irqrestore(target->scsi_host->host_lock, > flags); > break; > } > > @@ -1274,5 +1299,6 @@ > int attr_mask = 0; > int comp = 0; > int opcode = 0; > + unsigned long flags; > > switch (event->event) { > @@ -1301,6 +1381,14 @@ > shost_printk(KERN_ERR, target->scsi_host, > PFX "connection closed\n"); > > + spin_lock_irqsave(target->scsi_host->host_lock, flags); > + if (!target->qp_in_error && > + target->state == SRP_TARGET_LIVE) { > + target->qp_in_error = 1; > + srp_qp_err_add_timer(target, > + srp_dev_loss_tmo - 55); > + } > + s
[ofa-general][PATCH 3/4] SRP fail-over faster
--- Begin Message ---

Introducing srp_dev_loss_tmo module parameter. Creating a timer to clean up
connection after srp_dev_loss_tmo expired. During srp_dev_loss_tmo, the qp
is in error state, srp will return DID_RESET for outstanding I/O and return
FAILED for abort_cmd, reset_lun, and return SUCCESS (without trying
reconnect) on reset_host.

Signed-off-by: Vu Pham

Index: ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c
===
--- ofed_kernel.orig/drivers/infiniband/ulp/srp/ib_srp.c
+++ ofed_kernel/drivers/infiniband/ulp/srp/ib_srp.c
@@ -78,6 +77,13 @@
 MODULE_PARM_DESC(mellanox_workarounds,
		 "Enable workarounds for Mellanox SRP target bugs if != 0");

+static int srp_dev_loss_tmo = 60;
+
+module_param(srp_dev_loss_tmo, int, 0444);
+MODULE_PARM_DESC(srp_dev_loss_tmo,
+		 "Default number of seconds that srp transport should \
+		  insulate the lost of a remote port (default is 60 secs");
+
 static void srp_add_one(struct ib_device *device);
 static void srp_remove_one(struct ib_device *device);
 static void srp_completion(struct ib_cq *cq, void *target_ptr);
@@ -898,6 +926,48 @@
			    DMA_FROM_DEVICE);
 }

+static void srp_reconnect_work(struct work_struct *work)
+{
+	struct srp_target_port *target =
+		container_of(work, struct srp_target_port, work);
+
+	srp_reconnect_target(target);
+	target->work_in_progress = 0;
+}
+
+static void srp_qp_in_err_timer(unsigned long data)
+{
+	struct srp_target_port *target = (struct srp_target_port *)data;
+	struct srp_request *req, *tmp;
+
+	if (target->state != SRP_TARGET_LIVE)
+		return;
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	list_for_each_entry_safe(req, tmp, &target->req_queue, list)
+		srp_reset_req(target, req);
+	spin_unlock_irq(target->scsi_host->host_lock);
+
+	spin_lock_irq(target->scsi_host->host_lock);
+	if (!target->work_in_progress) {
+		target->work_in_progress = 1;
+		INIT_WORK(&target->work, srp_reconnect_work);
+		schedule_work(&target->work);
+	}
+	spin_unlock_irq(target->scsi_host->host_lock);
+}
+
+static void srp_qp_err_add_timer(struct srp_target_port *target, int time)
+{
+	if (!timer_pending(&target->qp_err_timer)) {
+		setup_timer(&target->qp_err_timer,
+			    srp_qp_in_err_timer,
+			    (unsigned long)target);
+		target->qp_err_timer.expires = time * HZ + jiffies;
+		add_timer(&target->qp_err_timer);
+	}
+}
+
 static void srp_completion(struct ib_cq *cq, void *target_ptr)
 {
 	struct srp_target_port *target = target_ptr;
@@ -960,11 +980,20 @@
 	ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 	while (ib_poll_cq(cq, 1, &wc) > 0) {
 		if (wc.status) {
+			unsigned long flags;
+
			shost_printk(KERN_ERR, target->scsi_host,
				     PFX "failed %s status %d\n",
				     wc.wr_id & SRP_OP_RECV ? "receive" : "send",
				     wc.status);
-			target->qp_in_error = 1;
+			spin_lock_irqsave(target->scsi_host->host_lock, flags);
+			if (!target->qp_in_error &&
+			    target->state == SRP_TARGET_LIVE) {
+				target->qp_in_error = 1;
+				srp_qp_err_add_timer(target,
+						     srp_dev_loss_tmo - 55);
+			}
+			spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
			break;
		}

@@ -1274,5 +1299,6 @@
 	int attr_mask = 0;
 	int comp = 0;
 	int opcode = 0;
+	unsigned long flags;

 	switch (event->event) {
@@ -1301,6 +1381,14 @@
		shost_printk(KERN_ERR, target->scsi_host,
			     PFX "connection closed\n");

+		spin_lock_irqsave(target->scsi_host->host_lock, flags);
+		if (!target->qp_in_error &&
+		    target->state == SRP_TARGET_LIVE) {
+			target->qp_in_error = 1;
+			srp_qp_err_add_timer(target,
+					     srp_dev_loss_tmo - 55);
+		}
+		spin_unlock_irqrestore(target->scsi_host->host_lock, flags);
		target->status = 0;
		break;
@@ -1443,9 +1529,22 @@
 static int srp_reset_host(struct scsi_cmnd *scmnd)
 {
 	struct srp_target_port *target = host_to_target(scmnd->device->host);
+	struct srp_request *req, *tmp;
 	int ret = FAILED;

-	shost_printk(KERN_ERR, target->scsi_host, PFX "SRP reset_host called\n