Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-08-04 Thread Tomas Henzl
On 08/01/2014 06:40 PM, Webb Scales wrote:
> On 7/31/14 9:56 AM, Tomas Henzl wrote:
 In cmd_tagged_alloc "Thus, there should never be a collision here between
 two requests" - if this is true, you don't need the refcount; a simple
 flag would be sufficient for your other needs. (And maybe you'd get to
 ~971k IOPS..)
>>> :-)  The code previously had the reference count in there to close
>>> certain race conditions' asynchronous accesses to the command block
>>> (e.g., between an abort and a completing command).  We hope that, using
>>> the block layer tags, those races no longer occur, but there didn't seem
>>> to be any point in removing the reference count until we'd had more
>>> experience with the code...and, in a fit of healthy paranoia I added the
>>> check to which you refer.  Unfortunately, in some of Rob Elliott's
>>> recent tests, he managed to drive a case where the check triggers.  So,
>>> until we reconcile that, I'm inclined to leave the check in place
>>> (hopefully it's not costing a full 1k IOPS :-) ).
>> Even if we assume that the block layer never sends duplicate tags to the
>> driver, I think there are some weaknesses in the way it's implemented,
>> for example here - from fail_all_outstanding_cmds:
>>	refcount = atomic_inc_return(&c->refcount);
>>	if (refcount > 1) {
>>		...
>>		finish_cmd(c); /* this finishes the command, and the tag
>>				  might now be reused; when that happens
>>				  you'll see a race */
>>		atomic_dec(&h->commands_outstanding);
>>		failcount++;
>>	} else {
>>		/* what happens if a call from the block layer arrives
>>		   right now? */
>>	}
> The next call from the block layer should find that the controller has 
> been marked as locked-up and return immediately with a no-connect error 
> (even before trying to allocate an HPSA command block), so I don't think 
> there is a race or conflict.

 
OK, regarding the 'locked-up' logic and when it's activated - I don't
claim to understand every part of your driver.  I have just seen that in
your current implementation of cmd_tagged_alloc you only print a debug
message and continue with the same command block.
That might cause issues later.
 

>
>
>> When reference counts are used, it usually means that when refcount==0
>> a release function frees the resource; in this case, that resource is
>> the tag number.  In your implementation you should ensure that when you
>> signal to the upper layer that the tag is free, nothing else holds a
>> reference.
> Actually, in this case, the resource is the HPSA command block which 
> corresponds to the tag number.  But, you are correct that trying to 
> manage them separately is apt to cause problems.  Having the 
> cmd_[tagged_]free() routine make the call to scsi_done() -- only when 
> the reference count is dropping to zero -- is an interesting 
> suggestion...but I'll have to do considerable investigation before I can 
> proceed with that.

I'm not sure whether this is doable (calling scsi_done when the refcount
drops to zero); I just see a weakness in the code, namely that scsi_done
can already have been called while the refcount is still > 0.

The code is still a work in progress, and I have found no more issues
(or rather, possible issues), so thanks for sharing the code now.

--tomash

>
>
>> If this is true, the conclusion is that you don't need this kind of
>> reference counting; a simple flag, just to debug not-yet-fixed races,
>> is enough.
>> I may be wrong, because I don't understand what you want to protect
>> in the above example, so I have some doubts about whether I understand
>> the code properly.
> Absent the block layer support, the HPSA code has to be able to defend 
> against various races where there may be more than two interested 
> parties looking at the command block -- so a simple flag will not 
> suffice.  On the other hand, with the block layer support in place, 
> we're hoping that we can get rid of the reference count altogether, but 
> we're not there, yet.
>
>
>  Webb
>
>
>
 On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:
> hpsa: Work In Progress: "lockless monster" patches
>
> To be clear, I am not suggesting that these patches be merged at this 
> time.
>
> This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
> may be found here: git://git.infradead.org/users/hch/scsi.git
>
> We've been working for a long time on a patchset for hpsa to remove
> all the locks from the main i/o path in pursuit of high IOPS.  Some
> of that work is already upstream, but a lot more of it is not quite
> yet ready to be merged.  However, I think we've "gone dark" for a bit
> too long on this, and even though the patches aren't really ready to
> be merged just yet, I thought I should let other people who might be
interested have a look anyway, as things are starting to be at least
more stable than unstable.

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-08-04 Thread Tomas Henzl
On 07/31/2014 05:16 PM, scame...@beardog.cce.hp.com wrote:
> On Thu, Jul 31, 2014 at 03:56:13PM +0200, Tomas Henzl wrote:
>> Hi Steve, Webb,
>>
>> let me start with the part most important for me - from my previous mail
>> "And please have a look at "[PATCH] hpsa: refine the pci enble/disable 
>> handling"
>> I posted in June and add it to your series or review in the list. Thanks."
> Ok.  I will try to get to this.  
> (some customer issues have been eating a lot of my time this week).
>
> When I first saw that patch I knew that Rob Elliott had a lot of patches in 
> the
> works that did a lot of cleanup to the initialization code, so I expect some
> conflicts.

Thanks, I think I should port my patch on top of your series and repost.

>
>> other comments see below
>>
>>> Hi Tomas,
>>>
>>> Thanks for taking a look and for the feedback.  I'm actually the one 
>>> responsible for the changes you refer to, so I'll try to address your 
>>> comments.
>>>
>>>
>>> On 7/30/14 10:55 AM, Tomas Henzl wrote:
 I'm not sure if I got it right: it seems the same cmd_pool is used by
 both alloc functions; from 0 to reserved_cmds it's cmd_alloc, and above
 that it's handled by cmd_tagged_alloc.
 The logic in both functions seems valid to me.
>>> Good.  With Christoph's proposed changes, the block layer will select a 
>>> tag for I/O requests, and it provides for the driver to reserve part of 
>>> the tag space for its own use.  Thus, HPSA uses a common pool for both 
>>> I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and 
>>> we index the pool using the tag.  The code is this way mostly to 
>>> minimize change from the previous version, but it's also handy in terms 
>>> of managing the pool (i.e., there is a single pool to administer, 
>>> instead of two, and all the items are initialized the same way, etc.).  
>>> The portion of the pool which is reserved to the driver (the 
>>> low-numbered tags) continues to be managed through a bitmap, just as it 
>>> was prior to this change; the bitmap does cover the whole pool (which is 
>>> a historical artifact), but the higher bits are never set, since 
>>> allocation of those entries is determined by the tag from the block 
>>> layer -- nevertheless, it is convenient to have the map extend over the 
>>> whole pool in certain corner cases (such as resets).
>>>
>>>
 In cmd_tagged_alloc "Thus, there should never be a collision here between
 two requests" - if this is true, you don't need the refcount; a simple
 flag would be sufficient for your other needs. (And maybe you'd get to
 ~971k IOPS..)
>>> :-)  The code previously had the reference count in there to close 
>>> certain race conditions' asynchronous accesses to the command block 
>>> (e.g., between an abort and a completing command).  We hope that, using 
>>> the block layer tags, those races no longer occur, but there didn't seem 
>>> to be any point in removing the reference count until we'd had more 
>>> experience with the code...and, in a fit of healthy paranoia I added the 
>>> check to which you refer.  Unfortunately, in some of Rob Elliott's 
>>> recent tests, he managed to drive a case where the check triggers.  So, 
>>> until we reconcile that, I'm inclined to leave the check in place 
>>> (hopefully it's not costing a full 1k IOPS :-) ).
>> Even if we assume that the block layer never sends duplicate tags to the
>> driver, I think there are some weaknesses in the way it's implemented,
>> for example here - from fail_all_outstanding_cmds:
>>	refcount = atomic_inc_return(&c->refcount);
>>	if (refcount > 1) {
>>		...
>>		finish_cmd(c); /* this finishes the command, and the tag
>>				  might now be reused; when that happens
>>				  you'll see a race */
>>		atomic_dec(&h->commands_outstanding);
>>		failcount++;
>>	} else {
>>		/* what happens if a call from the block layer arrives
>>		   right now? */
>>	}
>> cmd_free(h, c);
>> 
>> When reference counts are used, it usually means that when refcount==0
>> a release function frees the resource; in this case, that resource is
>> the tag number.  In your implementation you should ensure that when you
>> signal to the upper layer that the tag is free, nothing else holds a
>> reference.
>> If this is true, the conclusion is that you don't need this kind of
>> reference counting; a simple flag, just to debug not-yet-fixed races,
>> is enough.
> Part of the motivation is we want to have this code be workable for distros
> that may not have the necessary kernel changes for us to be able to use the
> block layer tagging (e.g. tags reserved for driver use).  So I am floating
> the block layer tag patches, keeping them at the top of my stack of patches
> so that all this lockless stuff can still work even without using the block
layer tags.

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-08-01 Thread Webb Scales

On 7/31/14 9:56 AM, Tomas Henzl wrote:

In cmd_tagged_alloc "Thus, there should never be a collision here between
two requests" - if this is true, you don't need the refcount; a simple
flag would be sufficient for your other needs. (And maybe you'd get to
~971k IOPS..)

:-)  The code previously had the reference count in there to close
certain race conditions' asynchronous accesses to the command block
(e.g., between an abort and a completing command).  We hope that, using
the block layer tags, those races no longer occur, but there didn't seem
to be any point in removing the reference count until we'd had more
experience with the code...and, in a fit of healthy paranoia I added the
check to which you refer.  Unfortunately, in some of Rob Elliott's
recent tests, he managed to drive a case where the check triggers.  So,
until we reconcile that, I'm inclined to leave the check in place
(hopefully it's not costing a full 1k IOPS :-) ).

Even if we assume that the block layer never sends duplicate tags to the
driver, I think there are some weaknesses in the way it's implemented,
for example here - from fail_all_outstanding_cmds:

	refcount = atomic_inc_return(&c->refcount);
	if (refcount > 1) {
		...
		finish_cmd(c); /* this finishes the command, and the tag
				  might now be reused; when that happens
				  you'll see a race */
		atomic_dec(&h->commands_outstanding);
		failcount++;
	} else {
		/* what happens if a call from the block layer arrives
		   right now? */
	}
The next call from the block layer should find that the controller has 
been marked as locked-up and return immediately with a no-connect error 
(even before trying to allocate an HPSA command block), so I don't think 
there is a race or conflict.




When reference counts are used, it usually means that when refcount==0
a release function frees the resource; in this case, that resource is
the tag number.  In your implementation you should ensure that when you
signal to the upper layer that the tag is free, nothing else holds a
reference.
Actually, in this case, the resource is the HPSA command block which 
corresponds to the tag number.  But, you are correct that trying to 
manage them separately is apt to cause problems.  Having the 
cmd_[tagged_]free() routine make the call to scsi_done() -- only when 
the reference count is dropping to zero -- is an interesting 
suggestion...but I'll have to do considerable investigation before I can 
proceed with that.




If this is true, the conclusion is that you don't need this kind of
reference counting; a simple flag, just to debug not-yet-fixed races,
is enough.
I may be wrong, because I don't understand what you want to protect
in the above example, so I have some doubts about whether I understand
the code properly.
Absent the block layer support, the HPSA code has to be able to defend 
against various races where there may be more than two interested 
parties looking at the command block -- so a simple flag will not 
suffice.  On the other hand, with the block layer support in place, 
we're hoping that we can get rid of the reference count altogether, but 
we're not there, yet.



Webb




On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:

hpsa: Work In Progress: "lockless monster" patches

To be clear, I am not suggesting that these patches be merged at this time.

This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
may be found here: git://git.infradead.org/users/hch/scsi.git

We've been working for a long time on a patchset for hpsa to remove
all the locks from the main i/o path in pursuit of high IOPS.  Some
of that work is already upstream, but a lot more of it is not quite
yet ready to be merged.  However, I think we've "gone dark" for a bit
too long on this, and even though the patches aren't really ready to
be merged just yet, I thought I should let other people who might be
interested have a look anyway, as things are starting to be at least
more stable than unstable.  Be warned though, there are still some
problems, esp. around error recovery.

That being said, with the right hardware (fast SAS SSDs, recent Smart
Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures)
with these patches and the scsi-mq patches, it is possible to get around
~970k IOPS from a single logical drive on a single controller.

There are about 150 patches in this set.  Rather than bomb the list
with that, here is a link to a tarball of the patches in the form of
an stgit patch series:
https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true

In some cases, I have probably erred on the side of having too many too
small patches, on the theory that it is easier to bake a cake than to
unbake a cake.  Before these are submitted "for reals", there will be
some squashing of patches, along with other cleaning up.

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-31 Thread scameron
On Sun, Jul 27, 2014 at 05:24:46PM +0200, Hannes Reinecke wrote:
> On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:
> >
> >hpsa: Work In Progress: "lockless monster" patches
> >
> >To be clear, I am not suggesting that these patches be merged at this time.
> >
> >This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
> >may be found here: git://git.infradead.org/users/hch/scsi.git
> >
> >We've been working for a long time on a patchset for hpsa to remove
> >all the locks from the main i/o path in pursuit of high IOPS.  Some
> >of that work is already upstream, but a lot more of it is not quite
> >yet ready to be merged.  However, I think we've "gone dark" for a bit
> >too long on this, and even though the patches aren't really ready to
> >be merged just yet, I thought I should let other people who might be
> >interested have a look anyway, as things are starting to be at least
> >more stable than unstable.  Be warned though, there are still some
> >problems, esp. around error recovery.
> >
> >That being said, with the right hardware (fast SAS SSDs, recent Smart
> >Arrays e.g. P430, with up-to-date firmware, attached to recent disk 
> >enclosures)
> >with these patches and the scsi-mq patches, it is possible to get around
> >~970k IOPS from a single logical drive on a single controller.
> >
> >There are about 150 patches in this set.  Rather than bomb the list
> >with that, here is a link to a tarball of the patches in the form of
> >an stgit patch series:
> >
> >https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true
> >
> >In some cases, I have probably erred on the side of having too many too
> >small patches, on the theory that it is easier to bake a cake than to
> >unbake a cake.  Before these are submitted "for reals", there will be
> >some squashing of patches, along with other cleaning up.
> >
> >There are some patches in this set which are already upstream in
> >James's tree which do not happen to be in Christoph's tree
> >(most of which are named "*_sent_to_james").  There are also
> >quite a few patches which are strictly for debugging and are not
> >ever intended to be merged.
> >
> Hmm. While you're about to engage on a massive rewrite _and_ we're 
> having 64bit LUN support now, what about getting rid of the weird 
> internal LUN mapping? That way you would get rid of this LUN rescan 
> thingie and the driver would look more sane.

Sorry for my slow reply to this.

I can sympathize with this desire, the device scanning code in the driver
is not my favorite, and it can undoubtedly be improved.  However, there
are some problems.

> Plus the REPORT LUN command would actually return the correct data ...

No, it wouldn't.  Smart Arrays are unfortunately pretty weird.

SCSI REPORT LUNS issued to a smart array will report only the logical
drives.  No physical devices (tape drives, etc.) will be reported.

Instead of SCSI REPORT LUNS, the driver uses a couple of proprietary
variants of this, CISS_REPORT_LOGICAL and CISS_REPORT_PHYSICAL, which
report logical drives and physical devices, and have their own oddities.

And it gets weirder.  This will require some exposition, so bear with me.

What's driving this giant patchball to remove locks from the driver is SSDs.
Suddenly, the assumption that disks are dirt slow relative to the host is not
quite so true anymore.  Smart Array looks something like this:

     +----------------------------------+
     |               host               |
     +-----------------+----------------+
                       |
                       | PCI bus
                       |
     +-----------------+----------------+
     |                 |                |
     |      RAID stack running on       |  <--- smart array
     |      an embedded system on the   |
     |      PCI board                   |
     |                 |                |
     |-----------------+----------------|
     |                 |                |
     |        back end firmware         |
     |                 |                |
     +-----------------+----------------+
                       |
                       | SAS
                       |
     +----------------------------------+
     |  physical devices (disks, etc.)  |
     +----------------------------------+

with the advent of SSDs, it turns out the RAID stack firmware
running on the little embedded system on the PCI board becomes
a bottleneck and hurts performance.

It turns out that with a (significant) firmware change, we
can do something like this:

     +----------------------------------+
     |               host               |
     +-----------------+----------------+
                       |
                       | PCI bus
                       |
     +-----------------+----------------+
     |                 |

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-31 Thread scameron
On Thu, Jul 31, 2014 at 03:56:13PM +0200, Tomas Henzl wrote:
> Hi Steve, Webb,
> 
> let me start with the part most important for me - from my previous mail
> "And please have a look at "[PATCH] hpsa: refine the pci enble/disable 
> handling"
> I posted in June and add it to your series or review in the list. Thanks."

Ok.  I will try to get to this.  
(some customer issues have been eating a lot of my time this week).

When I first saw that patch I knew that Rob Elliott had a lot of patches in the
works that did a lot of cleanup to the initialization code, so I expect some
conflicts.

> 
> other comments see below
> 
> > Hi Tomas,
> >
> > Thanks for taking a look and for the feedback.  I'm actually the one 
> > responsible for the changes you refer to, so I'll try to address your 
> > comments.
> >
> >
> > On 7/30/14 10:55 AM, Tomas Henzl wrote:
> >> I'm not sure if I got it right: it seems the same cmd_pool is used by
> >> both alloc functions; from 0 to reserved_cmds it's cmd_alloc, and above
> >> that it's handled by cmd_tagged_alloc.
> >> The logic in both functions seems valid to me.
> > Good.  With Christoph's proposed changes, the block layer will select a 
> > tag for I/O requests, and it provides for the driver to reserve part of 
> > the tag space for its own use.  Thus, HPSA uses a common pool for both 
> > I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and 
> > we index the pool using the tag.  The code is this way mostly to 
> > minimize change from the previous version, but it's also handy in terms 
> > of managing the pool (i.e., there is a single pool to administer, 
> > instead of two, and all the items are initialized the same way, etc.).  
> > The portion of the pool which is reserved to the driver (the 
> > low-numbered tags) continues to be managed through a bitmap, just as it 
> > was prior to this change; the bitmap does cover the whole pool (which is 
> > a historical artifact), but the higher bits are never set, since 
> > allocation of those entries is determined by the tag from the block 
> > layer -- nevertheless, it is convenient to have the map extend over the 
> > whole pool in certain corner cases (such as resets).
> >
> >
> >> In cmd_tagged_alloc "Thus, there should never be a collision here between
> >> two requests" - if this is true, you don't need the refcount; a simple
> >> flag would be sufficient for your other needs. (And maybe you'd get to
> >> ~971k IOPS..)
> > :-)  The code previously had the reference count in there to close 
> > certain race conditions' asynchronous accesses to the command block 
> > (e.g., between an abort and a completing command).  We hope that, using 
> > the block layer tags, those races no longer occur, but there didn't seem 
> > to be any point in removing the reference count until we'd had more 
> > experience with the code...and, in a fit of healthy paranoia I added the 
> > check to which you refer.  Unfortunately, in some of Rob Elliott's 
> > recent tests, he managed to drive a case where the check triggers.  So, 
> > until we reconcile that, I'm inclined to leave the check in place 
> > (hopefully it's not costing a full 1k IOPS :-) ).
> 
> Even if we assume that the block layer never sends duplicate tags to the
> driver, I think there are some weaknesses in the way it's implemented,
> for example here - from fail_all_outstanding_cmds:
>	refcount = atomic_inc_return(&c->refcount);
>	if (refcount > 1) {
>		...
>		finish_cmd(c); /* this finishes the command, and the tag
>				  might now be reused; when that happens
>				  you'll see a race */
>		atomic_dec(&h->commands_outstanding);
>		failcount++;
>	} else {
>		/* what happens if a call from the block layer arrives
>		   right now? */
>	}
> cmd_free(h, c);
> 
> When reference counts are used, it usually means that when refcount==0
> a release function frees the resource; in this case, that resource is
> the tag number.  In your implementation you should ensure that when you
> signal to the upper layer that the tag is free, nothing else holds a
> reference.
> If this is true, the conclusion is that you don't need this kind of
> reference counting; a simple flag, just to debug not-yet-fixed races,
> is enough.

Part of the motivation is we want to have this code be workable for distros
that may not have the necessary kernel changes for us to be able to use the
block layer tagging (e.g. tags reserved for driver use).  So I am floating
the block layer tag patches, keeping them at the top of my stack of patches
so that all this lockless stuff can still work even without using the block
layer tags.  So one reason the reference counting needs to go in is to
accommodate backporting the LLD without touching an older kernel's mid layer
or block layer code, which is important.

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-31 Thread Tomas Henzl
Hi Steve, Webb,

let me start with the part most important for me - from my previous mail
"And please have a look at "[PATCH] hpsa: refine the pci enble/disable handling"
I posted in June and add it to your series or review in the list. Thanks."

other comments see below

> Hi Tomas,
>
> Thanks for taking a look and for the feedback.  I'm actually the one 
> responsible for the changes you refer to, so I'll try to address your 
> comments.
>
>
> On 7/30/14 10:55 AM, Tomas Henzl wrote:
>> I'm not sure if I got it right: it seems the same cmd_pool is used by
>> both alloc functions; from 0 to reserved_cmds it's cmd_alloc, and above
>> that it's handled by cmd_tagged_alloc.
>> The logic in both functions seems valid to me.
> Good.  With Christoph's proposed changes, the block layer will select a 
> tag for I/O requests, and it provides for the driver to reserve part of 
> the tag space for its own use.  Thus, HPSA uses a common pool for both 
> I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and 
> we index the pool using the tag.  The code is this way mostly to 
> minimize change from the previous version, but it's also handy in terms 
> of managing the pool (i.e., there is a single pool to administer, 
> instead of two, and all the items are initialized the same way, etc.).  
> The portion of the pool which is reserved to the driver (the 
> low-numbered tags) continues to be managed through a bitmap, just as it 
> was prior to this change; the bitmap does cover the whole pool (which is 
> a historical artifact), but the higher bits are never set, since 
> allocation of those entries is determined by the tag from the block 
> layer -- nevertheless, it is convenient to have the map extend over the 
> whole pool in certain corner cases (such as resets).
>
>
>> In cmd_tagged_alloc "Thus, there should never be a collision here between
>> two requests" - if this is true, you don't need the refcount; a simple
>> flag would be sufficient for your other needs. (And maybe you'd get to
>> ~971k IOPS..)
> :-)  The code previously had the reference count in there to close 
> certain race conditions' asynchronous accesses to the command block 
> (e.g., between an abort and a completing command).  We hope that, using 
> the block layer tags, those races no longer occur, but there didn't seem 
> to be any point in removing the reference count until we'd had more 
> experience with the code...and, in a fit of healthy paranoia I added the 
> check to which you refer.  Unfortunately, in some of Rob Elliott's 
> recent tests, he managed to drive a case where the check triggers.  So, 
> until we reconcile that, I'm inclined to leave the check in place 
> (hopefully it's not costing a full 1k IOPS :-) ).

Even if we assume that the block layer never sends duplicate tags to the
driver, I think there are some weaknesses in the way it's implemented,
for example here - from fail_all_outstanding_cmds:

	refcount = atomic_inc_return(&c->refcount);
	if (refcount > 1) {
		...
		finish_cmd(c); /* this finishes the command, and the tag
				  might now be reused; when that happens
				  you'll see a race */
		atomic_dec(&h->commands_outstanding);
		failcount++;
	} else {
		/* what happens if a call from the block layer arrives
		   right now? */
	}
	cmd_free(h, c);

When reference counts are used, it usually means that when refcount==0
a release function frees the resource; in this case, that resource is
the tag number.  In your implementation you should ensure that when you
signal to the upper layer that the tag is free, nothing else holds a
reference.
If this is true, the conclusion is that you don't need this kind of
reference counting; a simple flag, just to debug not-yet-fixed races,
is enough.
I may be wrong, because I don't understand what you want to protect
in the above example, so I have some doubts about whether I understand
the code properly.

>> cmd_alloc - some minor changes are possible
>> - try to find free entries only up to reserved_cmds
>>    (the bitmap might be shorter too)
> Excellent catch!  I'll make that change.  (As I said, we currently rely 
> on the over-sized bit map, but there is no reason for this code to 
> search the whole thing...although, with the block layer doing the heavy 
> lifting, this allocation path is lightly used, and it will be rare that 
> it runs off the end of the reserved range.)
>
>
>> - the next search should start at (last result + 1)
> That's a fair point, but the suggestion is somewhat more complicated 
> than it seems:  if "start + 1" is greater than reserved_cmds, then we 
> want to start at zero, instead.  I think it ends up being more code than 
> it's worth.
>
> In fact, with the partitioned pool, the reserved-to-driver section is 
> now so small that it's probably not worth the effort to try to resume
> the search where the last search left off -- we might as well start at
> zero each time.

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-30 Thread Webb Scales

Hi Tomas,

Thanks for taking a look and for the feedback.  I'm actually the one 
responsible for the changes you refer to, so I'll try to address your 
comments.



On 7/30/14 10:55 AM, Tomas Henzl wrote:

I'm not sure if I got it right: it seems the same cmd_pool is used by
both alloc functions; from 0 to reserved_cmds it's cmd_alloc, and above
that it's handled by cmd_tagged_alloc.
The logic in both functions seems valid to me.
Good.  With Christoph's proposed changes, the block layer will select a 
tag for I/O requests, and it provides for the driver to reserve part of 
the tag space for its own use.  Thus, HPSA uses a common pool for both 
I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and 
we index the pool using the tag.  The code is this way mostly to 
minimize change from the previous version, but it's also handy in terms 
of managing the pool (i.e., there is a single pool to administer, 
instead of two, and all the items are initialized the same way, etc.).  
The portion of the pool which is reserved to the driver (the 
low-numbered tags) continues to be managed through a bitmap, just as it 
was prior to this change; the bitmap does cover the whole pool (which is 
a historical artifact), but the higher bits are never set, since 
allocation of those entries is determined by the tag from the block 
layer -- nevertheless, it is convenient to have the map extend over the 
whole pool in certain corner cases (such as resets).




In cmd_tagged_alloc "Thus, there should never be a collision here between
two requests" - if this is true, you don't need the refcount; a simple
flag would be sufficient for your other needs. (And maybe you'd get to
~971k IOPS..)
:-)  The code previously had the reference count in there to close 
certain race conditions' asynchronous accesses to the command block 
(e.g., between an abort and a completing command).  We hope that, using 
the block layer tags, those races no longer occur, but there didn't seem 
to be any point in removing the reference count until we'd had more 
experience with the code...and, in a fit of healthy paranoia I added the 
check to which you refer.  Unfortunately, in some of Rob Elliott's 
recent tests, he managed to drive a case where the check triggers.  So, 
until we reconcile that, I'm inclined to leave the check in place 
(hopefully it's not costing a full 1k IOPS :-) ).




cmd_alloc - some minor changes are possible
- try to find free entries only up to reserved_cmds
   (the bitmap might be shorter too)
Excellent catch!  I'll make that change.  (As I said, we currently rely 
on the over-sized bit map, but there is no reason for this code to 
search the whole thing...although, with the block layer doing the heavy 
lifting, this allocation path is lightly used, and it will be rare that 
it runs off the end of the reserved range.)




- the next search should start at (last result + 1)
That's a fair point, but the suggestion is somewhat more complicated 
than it seems:  if "start + 1" is greater than reserved_cmds, then we 
want to start at zero, instead.  I think it ends up being more code than 
it's worth.


In fact, with the partitioned pool, the reserved-to-driver section is 
now so small that it's probably not worth the effort to try to resume 
the search where the last search left off -- we might as well start at 
zero each time.  (Thanks for bringing that up!  ;-) )




- when nr_cmds is read from hw, it should be validated that nr_cmds > reserved_cmds

That's a fair point.



- what is the correct pronunciation of benignhly?
You would have to ask in certain parts of Boston or perhaps out on the 
South Fork of Long Island.  :-D  (I thought I fixed those typos -- I'm 
_sure_ I did! -- and yet)




@@ -5634,7 +5634,8 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h)
  
  	offset = h->last_allocation; /* benighly racy */

for (;;) {
-   i = find_next_zero_bit(h->cmd_pool_bits, h->nr_cmds, offset);
+   offset %= h->reserved_cmds;
+   i = find_next_zero_bit(h->cmd_pool_bits, h->reserved_cmds, offset);
Aside from changing the limit from nr_cmds to reserved_cmds, I don't 
think this change is helpful, because it inserts a modulo operation on 
every iteration, which is probably unnecessary.  (It would be better to 
ensure that offset is a reasonable value to begin with, and, even it is 
not, the call to find_next_zero_bit() will handle it properly and then 
the code will retry.)



@@ -5642,15 +5643,16 @@ static struct CommandList *cmd_alloc(struct ctlr_info 
*h)
c = h->cmd_pool + i;
refcount = atomic_inc_return(&c->refcount);
if (unlikely(refcount > 1)) {
-   cmd_free(h, c); /* already in use */
-   offset = (i + 1) % h->nr_cmds;
+   atomic_dec(&c->refcount); /* already in use, since we
+don't own the command just decrease the refcount */
+   offset = i + 1;
continue;
}

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-30 Thread Tomas Henzl
On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:
>
> hpsa: Work In Progress: "lockless monster" patches
>
> To be clear, I am not suggesting that these patches be merged at this time.
>
> This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
> may be found here: git://git.infradead.org/users/hch/scsi.git
>
> We've been working for a long time on a patchset for hpsa to remove
> all the locks from the main i/o path in pursuit of high IOPS.  Some
> of that work is already upstream, but a lot more of it is not quite
> yet ready to be merged.  However, I think we've "gone dark" for a bit
> too long on this, and even though the patches aren't really ready to
> be merged just yet, I thought I should let other people who might be
> interested have a look anyway, as things are starting to be at least
> more stable than unstable.  Be warned though, there are still some
> problems, esp. around error recovery.
>
> That being said, with the right hardware (fast SAS SSDs, recent Smart
> Arrays e.g. P430, with up-to-date firmware, attached to recent disk 
> enclosures)
> with these patches and the scsi-mq patches, it is possible to get around
> ~970k IOPS from a single logical drive on a single controller.
>
> There are about 150 patches in this set.  Rather than bomb the list
> with that, here is a link to a tarball of the patches in the form of
> an stgit patch series:

Hi Stephen,
I've looked only at the alloc functions - I'm not sure if I got it right,
it seems it uses the same cmd_pool for both alloc function,
from (0 - reserved_cmds) it's cmd_alloc and above that it's handled
by cmd_tagged_alloc.
The logic in both functions seems to be valid for me.

In cmd_tagged_alloc "Thus, there should never be a collision here between
two requests" if this is true you don't need the refcount, and a simple
flag would be sufficient for your other needs. (And maybe you get to ~971k IOPS..)

In cmd_alloc some minor changes are possible:
- try to find free entries only up to reserved_cmds
  (the bitmap might be shorter too)
- next thread should start + 1 from the last result

- when nr_cmds is read from hw it should be tested (nr_cmds > reserved_cmds)
- what is the correct pronunciation of benignhly?
@@ -5634,7 +5634,8 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h)
 
offset = h->last_allocation; /* benighly racy */
for (;;) {
-   i = find_next_zero_bit(h->cmd_pool_bits, h->nr_cmds, offset);
+   offset %= h->reserved_cmds;
+   i = find_next_zero_bit(h->cmd_pool_bits, h->reserved_cmds, offset);
if (unlikely(i >= h->reserved_cmds)) {
offset = 0;
continue;
@@ -5642,15 +5643,16 @@ static struct CommandList *cmd_alloc(struct ctlr_info 
*h)
c = h->cmd_pool + i;
refcount = atomic_inc_return(&c->refcount);
if (unlikely(refcount > 1)) {
-   cmd_free(h, c); /* already in use */
-   offset = (i + 1) % h->nr_cmds;
+   atomic_dec(&c->refcount); /* already in use, since we
+don't own the command just decrease the refcount */
+   offset = i + 1;
continue;
}
set_bit(i & (BITS_PER_LONG - 1),
h->cmd_pool_bits + (i / BITS_PER_LONG));
break; /* it's ours now. */
}
-   h->last_allocation = i; /* benignly racy */
+   h->last_allocation = i + 1; /* benignly racy */
hpsa_cmd_partial_init(h, i, c);
return c;
 }
@@ -5670,6 +5672,7 @@ static void cmd_free(struct ctlr_info *h, struct 
CommandList *c)
clear_bit(i & (BITS_PER_LONG - 1),
  h->cmd_pool_bits + (i / BITS_PER_LONG));
}
+   else ; /*BUG ? or just atomic_dec and no if */
 }


And please have a look at "[PATCH] hpsa: refine the pci enble/disable handling"
I posted in June and add it to your series or review in the list.

Thanks,
Tomas


>
> https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true
>
> In some cases, I have probably erred on the side of having too many too
> small patches, on the theory that it is easier to bake a cake than to 
> unbake a cake.  Before these are submitted "for reals", there will be
> some squashing of patches, along with other cleaning up.
>
> There are some patches in this set which are already upstream in
> James's tree which do not happen to be in Christoph's tree
> (most of which are named "*_sent_to_james").  There are also
> quite a few patches which are strictly for debugging and are not
> ever intended to be merged. 
>
>
> Here is a list of all the patches sorted by author:
>
> Arnd Bergmann (1):
>   [SCSI] hpsa: fix non-x86 builds
>
> 

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-27 Thread Hannes Reinecke

On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:


> hpsa: Work In Progress: "lockless monster" patches
>
> To be clear, I am not suggesting that these patches be merged at this time.
>
> This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
> may be found here: git://git.infradead.org/users/hch/scsi.git
>
> We've been working for a long time on a patchset for hpsa to remove
> all the locks from the main i/o path in pursuit of high IOPS.  Some
> of that work is already upstream, but a lot more of it is not quite
> yet ready to be merged.  However, I think we've "gone dark" for a bit
> too long on this, and even though the patches aren't really ready to
> be merged just yet, I thought I should let other people who might be
> interested have a look anyway, as things are starting to be at least
> more stable than unstable.  Be warned though, there are still some
> problems, esp. around error recovery.
>
> That being said, with the right hardware (fast SAS SSDs, recent Smart
> Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures)
> with these patches and the scsi-mq patches, it is possible to get around
> ~970k IOPS from a single logical drive on a single controller.
>
> There are about 150 patches in this set.  Rather than bomb the list
> with that, here is a link to a tarball of the patches in the form of
> an stgit patch series:
>
> https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true
>
> In some cases, I have probably erred on the side of having too many too
> small patches, on the theory that it is easier to bake a cake than to
> unbake a cake.  Before these are submitted "for reals", there will be
> some squashing of patches, along with other cleaning up.
>
> There are some patches in this set which are already upstream in
> James's tree which do not happen to be in Christoph's tree
> (most of which are named "*_sent_to_james").  There are also
> quite a few patches which are strictly for debugging and are not
> ever intended to be merged.

Hmm. While you're about to engage on a massive rewrite _and_ we're 
having 64bit LUN support now, what about getting rid of the weird 
internal LUN mapping? That way you would get rid of this LUN rescan 
thingie and the driver would look more sane.

Plus the REPORT LUN command would actually return the correct data ...

Cheers,

Hannes
--
Dr. Hannes Reinecke   zSeries & Storage
h...@suse.de  +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-26 Thread Christoph Hellwig
> Christoph Hellwig (1):
>   reserve block tags in scsi host

So you found this one useful.  This begs the question how we should
move forward with it, as it will only work with the blk-mq path in
its current form.

I can see three ways:

 - we implement equivalent functionality in the old block tagging code
   and make it available for any driver
 - we require drivers specific workarounds for the !mq path
 - we allow drivers to force the use of the blk-mq path if they want
   to use advanced features like this.  This might become necessary
   anyway when we expose actual multiqueue support to LLDDs.



[RFC] hpsa: work in progress "lockless monster" patches

2014-07-25 Thread scameron

hpsa: Work In Progress: "lockless monster" patches

To be clear, I am not suggesting that these patches be merged at this time.

This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
may be found here: git://git.infradead.org/users/hch/scsi.git

We've been working for a long time on a patchset for hpsa to remove
all the locks from the main i/o path in pursuit of high IOPS.  Some
of that work is already upstream, but a lot more of it is not quite
yet ready to be merged.  However, I think we've "gone dark" for a bit
too long on this, and even though the patches aren't really ready to
be merged just yet, I thought I should let other people who might be
interested have a look anyway, as things are starting to be at least
more stable than unstable.  Be warned though, there are still some
problems, esp. around error recovery.

That being said, with the right hardware (fast SAS SSDs, recent Smart
Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures)
with these patches and the scsi-mq patches, it is possible to get around
~970k IOPS from a single logical drive on a single controller.

There are about 150 patches in this set.  Rather than bomb the list
with that, here is a link to a tarball of the patches in the form of
an stgit patch series:

https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true

In some cases, I have probably erred on the side of having too many too
small patches, on the theory that it is easier to bake a cake than to 
unbake a cake.  Before these are submitted "for reals", there will be
some squashing of patches, along with other cleaning up.

There are some patches in this set which are already upstream in
James's tree which do not happen to be in Christoph's tree
(most of which are named "*_sent_to_james").  There are also
quite a few patches which are strictly for debugging and are not
ever intended to be merged. 


Here is a list of all the patches sorted by author:

Arnd Bergmann (1):
  [SCSI] hpsa: fix non-x86 builds

Christoph Hellwig (1):
  reserve block tags in scsi host

Joe Handzik (9):
  hpsa: add masked physical devices into h->dev[] array
  hpsa: add hpsa_bmic_id_physical_device function
  hpsa: get queue depth for physical devices
  hpsa: use ioaccel2 path to submit IOs to physical drives in HBA mode.
  hpsa: Get queue depth from identify physical bmic for physical disks.
  hpsa: add ioaccel sg chaining for the ioaccel2 path
  Set the phys_disk value for a CommandList structure that is
submitted. Squash
  hpsa: unmap ioaccel2 commands before, not after adding to resubmit
workqueue
  hpsa: add more ioaccel2 error handling, including underrun statuses.

Nicholas Bellinger (1):
  hpsa: Convert SCSI LLD ->queuecommand() for host_lock less operation

Robert Elliott (44):
  Change scsi.c scsi_log_completion() to print strings for QUEUED,
  Set scsi_logging_level to be more verbose to get better messages
  hpsa: do not unconditionally copy sense data
  hpsa: optimize cmd_alloc function by remembering last allocation
  From a154642aeed291c7cfe4b9ea9da932156030f7d1 Mon Sep 17 00:00:00
2001
  hpsa: Clean up host, channel, target, lun prints
  hpsa: do not check cmd_alloc return value - it cannnot return NULL
  hpsa: return -1 rather than -ENOMEM in hpsa_get_raid_map if fill_cmd
fails
  hpsa: return -ENOMEM not 1 from ioaccel2_alloc_cmds_and_bft on
allocation failure
  hpsa: return -ENOMEM not 1 from hpsa_alloc_ioaccel_cmd_and_bft on
allocation failure
  hpsa: return -ENOMEM not -EFAULT from hpsa_passthru_ioctl on kmalloc
failure
  hpsa: pass error from pci_set_consistent_dma_mask intact from
hpsa_message
  hpsa: detect and report failures changing controller transport modes
  hpsa: add hpsa_disable_interrupt_mode
  hpsa: rename free_irqs to hpsa_free_irqs and move before
hpsa_request_irq
  hpsa: rename hpsa_request_irq to hpsa_request_irqs
  hpsa: on failure of request_irq, free the irqs that we did get
  hpsa: make hpsa_pci_init disable interrupts and pci_disable_device on
critical failures
  hpsa: fix a string constant broken into two lines
  hpsa: avoid unneccesary calls to resource freeing functions in
hpsa_init_one
  hpsa: add hpsa_free_cfgtables function to undo what
hpsa_find_cfgtables does
  hpsa: report failure to ioremap config table
  hpsa: add hpsa_free_pci_init function
  hpsa: clean up error handling in hpsa_pci_init
  hpsa: make hpsa_remove_one use hpsa_pci_free_init
  hpsa: rename hpsa_alloc_ioaccel_cmd_and_bft to
hpsa_alloc_ioaccel1_cmd_and_bft
  hpsa: rename hpsa_allocate_cmd_pool to hpsa_alloc_cmd_pool
  hpsa: rename ioaccel2_alloc_cmds_and_bft to
hpsa_alloc_ioaccel2_cmd_and_bft
  hpsa: fix memory leak in hpsa_alloc_cmd_pool
  hpsa: return status not void from hpsa_put_ctlr_into_performant_mode
  hpsa: clean up error handling in hpsa_init_one
  hpsa: report allocation failures while alloca