Re: [RFC] hpsa: work in progress "lockless monster" patches
On 08/01/2014 06:40 PM, Webb Scales wrote: > On 7/31/14 9:56 AM, Tomas Henzl wrote: In cmd_tagged_alloc "Thus, there should never be a collision here between two requests" if this is true you don't need the refcount and just a simple flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..) >>> :-) The code previously had the reference count in there to close >>> certain race conditions' asynchronous accesses to the command block >>> (e.g., between an abort and a completing command). We hope that, using >>> the block layer tags, those races no longer occur, but there didn't seem >>> to be any point in removing the reference count until we'd had more >>> experience with the code...and, in a fit of healthy paranoia I added the >>> check to which you refer. Unfortunately, in some of Rob Elliott's >>> recent tests, he managed to drive a case where the check triggers. So, >>> until we reconcile that, I'm inclined to leave the check in place >>> (hopefully it's not costing a full 1k IOPS :-) ). >> Let us assume that the block layer never sends duplicate tags to the driver, >> I think there are some weaknesses in the way how it's implemented, >> for example here - from fail_all_outstanding_cmds: >> refcount = atomic_inc_return(&c->refcount); >> if (refcount > 1) { >> ... >> finish_cmd(c); - this finishes the command >> and the tag might be now reused, >> when that happens you'll see a race >> atomic_dec(&h->commands_outstanding); >> failcount++; >> } >> else {what happens when right now comes a call from block >> layer?} > The next call from the block layer should find that the controller has > been marked as locked-up and return immediately with a no-connect error > (even before trying to allocate an HPSA command block), so I don't think > there is a race or conflict. OK, the 'locked-up' logic and when it's activated - I mean I don't understand every part of your driver. I have just seen that in your current implementation of cmd_tagged_alloc that you only print a debug message and continue with the same block. That might bring some issues later. > > >> When references are used it usually means, that when refcount==0 that >> a release function frees the resources, in this case it is the tag number. >> In your implementation you should ensure that when you signal >> to the upper layer that the tag is free, that nothing else holds the >> reference. > Actually, in this case, the resource is the HPSA command block which > corresponds to the tag number. But, you are correct that trying to > manage them separately is apt to cause problems. Having the > cmd_[tagged_]free() routine make the call to scsi_done() -- only when > the reference count is dropping to zero -- is an interesting > suggestion...but I'll have to do considerable investigation before I can > proceed with that. I'm not sure if this is doable (calling scsi_done when refcount drops to zero) I just see a weakness of the code that you can have called the scsi_done and still having a refcount > 0. There is work in progress on the code and I have found no more issues (better said maybe issues), so thanks for sharing the code now. --tomash > > >> If this is true the conclusion is that you don't need this kind of references >> and a simple flag just to debug not yet fixed races is enough. >> I'm maybe wrong because I don't understand what you want to protect >> in the above example, so that makes me to have some doubts if I understand >> the code properly. > Absent the block layer support, the HPSA code has to be able to defend > against various races where there may be more than two interested > parties looking at the command block -- so a simple flag will not > suffice. On the other hand, with the block layer support in place, > we're hoping that we can get rid of the reference count altogether, but > we're not there, yet. > > > Webb > > > On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote: > hpsa: Work In Progress: "lockless monster" patches > > To be clear, I am not suggesting that these patches be merged at this > time. > > This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which > may be found here: git://git.infradead.org/users/hch/scsi.git > > We've been working for a long time on a patchset for hpsa to remove > all the locks from the main i/o path in pursuit of high IOPS. Some > of that work is already upstream, but a lot more of it is not quite > yet ready to be merged. However, I think we've "gone dark" for a bit > too long on this, and even though the patches aren't really ready to > be merged just yet, I thought I should let other people who might be > interested have a lo
Re: [RFC] hpsa: work in progress "lockless monster" patches
On 07/31/2014 05:16 PM, scame...@beardog.cce.hp.com wrote: > On Thu, Jul 31, 2014 at 03:56:13PM +0200, Tomas Henzl wrote: >> Hi Steve, Webb, >> >> let me start with the part most important for me - from my previous mail >> "And please have a look at "[PATCH] hpsa: refine the pci enble/disable >> handling" >> I posted in June and add it to your series or review in the list. Thanks." > Ok. I will try to get to this. > (some customer issues have been eating a lot of my time this week). > > When I first saw that patch I knew that Rob Elliott had a lot of patches in > the > works that did a lot of cleanup to the initialization code, so I expect some > conflicts. Thanks, I think I should port my patch on top of your series and repost. > >> other comments see below >> >>> Hi Tomas, >>> >>> Thanks for taking a look and for the feedback. I'm actually the one >>> responsible for the changes you refer to, so I'll try to address your >>> comments. >>> >>> >>> On 7/30/14 10:55 AM, Tomas Henzl wrote: I'm not sure if I got it right, it seems it uses the same cmd_pool for both alloc function, from (0 - reserved_cmds) it's cmd_alloc and above that it's handled by cmd_tagged_alloc. The logic in both functions seems to be valid for me. >>> Good. With Christoph's proposed changes, the block layer will select a >>> tag for I/O requests, and it provides for the driver to reserve part of >>> the tag space for its own use. Thus, HPSA uses a common pool for both >>> I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and >>> we index the pool using the tag. The code is this way mostly to >>> minimize change from the previous version, but it's also handy in terms >>> of managing the pool (i.e., there is a single pool to administer, >>> instead of two, and all the items are initialized the same way, etc.). >>> The portion of the pool which is reserved to the driver (the >>> low-numbered tags) continues to be managed through a bitmap, just as it >>> was prior to this change; the bitmap does cover the whole pool (which is >>> a historical artifact), but the higher bits are never set, since >>> allocation of those entries is determined by the tag from the block >>> layer -- nevertheless, it is convenient to have the map extend over the >>> whole pool in certain corner cases (such as resets). >>> >>> In cmd_tagged_alloc "Thus, there should never be a collision here between two requests" if this is true you don't need the refcount and just a simple flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..) >>> :-) The code previously had the reference count in there to close >>> certain race conditions' asynchronous accesses to the command block >>> (e.g., between an abort and a completing command). We hope that, using >>> the block layer tags, those races no longer occur, but there didn't seem >>> to be any point in removing the reference count until we'd had more >>> experience with the code...and, in a fit of healthy paranoia I added the >>> check to which you refer. Unfortunately, in some of Rob Elliott's >>> recent tests, he managed to drive a case where the check triggers. So, >>> until we reconcile that, I'm inclined to leave the check in place >>> (hopefully it's not costing a full 1k IOPS :-) ). >> Let us assume that the block layer never sends duplicate tags to the driver, >> I think there are some weaknesses in the way how it's implemented, >> for example here - from fail_all_outstanding_cmds: >> refcount = atomic_inc_return(&c->refcount); >> if (refcount > 1) { >> ... >> finish_cmd(c); - this finishes the command >> and the tag might be now reused, >> when that happens you'll see a race >> atomic_dec(&h->commands_outstanding); >> failcount++; >> } >> else {what happens when right now comes a call from block >> layer?} >> cmd_free(h, c); >> >> When references are used it usually means, that when refcount==0 that >> a release function frees the resources, in this case it is the tag number. >> In your implementation you should ensure that when you signal >> to the upper layer that the tag is free, that nothing else holds the >> reference. >> If this is true the conclusion is that you don't need this kind of references >> and a simple flag just to debug not yet fixed races is enough. > Part of the motivation is we want to have this code be workable for distros > that may not have the necessary kernel changes for us to be able to use the > block layer tagging (e.g. tags reserved for driver use). So I am floating > the block layer tag patches, keeping them at the top of my stack of patches > so that all this lockless stuff can still work even without using the block > l
Re: [RFC] hpsa: work in progress "lockless monster" patches
On 7/31/14 9:56 AM, Tomas Henzl wrote: In cmd_tagged_alloc "Thus, there should never be a collision here between two requests" if this is true you don't need the refcount and just a simple flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..) :-) The code previously had the reference count in there to close certain race conditions' asynchronous accesses to the command block (e.g., between an abort and a completing command). We hope that, using the block layer tags, those races no longer occur, but there didn't seem to be any point in removing the reference count until we'd had more experience with the code...and, in a fit of healthy paranoia I added the check to which you refer. Unfortunately, in some of Rob Elliott's recent tests, he managed to drive a case where the check triggers. So, until we reconcile that, I'm inclined to leave the check in place (hopefully it's not costing a full 1k IOPS :-) ). Let us assume that the block layer never sends duplicate tags to the driver, I think there are some weaknesses in the way how it's implemented, for example here - from fail_all_outstanding_cmds: refcount = atomic_inc_return(&c->refcount); if (refcount > 1) { ... finish_cmd(c); - this finishes the command and the tag might be now reused, when that happens you'll see a race atomic_dec(&h->commands_outstanding); failcount++; } else {what happens when right now comes a call from block layer?} The next call from the block layer should find that the controller has been marked as locked-up and return immediately with a no-connect error (even before trying to allocate an HPSA command block), so I don't think there is a race or conflict. When references are used it usually means, that when refcount==0 that a release function frees the resources, in this case it is the tag number. In your implementation you should ensure that when you signal to the upper layer that the tag is free, that nothing else holds the reference. Actually, in this case, the resource is the HPSA command block which corresponds to the tag number. But, you are correct that trying to manage them separately is apt to cause problems. Having the cmd_[tagged_]free() routine make the call to scsi_done() -- only when the reference count is dropping to zero -- is an interesting suggestion...but I'll have to do considerable investigation before I can proceed with that. If this is true the conclusion is that you don't need this kind of references and a simple flag just to debug not yet fixed races is enough. I'm maybe wrong because I don't understand what you want to protect in the above example, so that makes me to have some doubts if I understand the code properly. Absent the block layer support, the HPSA code has to be able to defend against various races where there may be more than two interested parties looking at the command block -- so a simple flag will not suffice. On the other hand, with the block layer support in place, we're hoping that we can get rid of the reference count altogether, but we're not there, yet. Webb On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote: hpsa: Work In Progress: "lockless monster" patches To be clear, I am not suggesting that these patches be merged at this time. This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which may be found here: git://git.infradead.org/users/hch/scsi.git We've been working for a long time on a patchset for hpsa to remove all the locks from the main i/o path in pursuit of high IOPS. Some of that work is already upstream, but a lot more of it is not quite yet ready to be merged. However, I think we've "gone dark" for a bit too long on this, and even though the patches aren't really ready to be merged just yet, I thought I should let other people who might be interested have a look anyway, as things are starting to be at least more stable than unstable. Be warned though, there are still some problems, esp. around error recovery. That being said, with the right hardware (fast SAS SSDs, recent Smart Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures) with these patches and the scsi-mq patches, it is possible to get around ~970k IOPS from a single logical drive on a single controller. There are about 150 patches in this set. Rather than bomb the list with that, here is a link to a tarball of the patches in the form of an stgit patch series: https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true In some cases, I have probably erred on the side of having too many too small patches, on the theory that it is easier to bake a cake than to unbake a cake. Before these are subm
Re: [RFC] hpsa: work in progress "lockless monster" patches
On Sun, Jul 27, 2014 at 05:24:46PM +0200, Hannes Reinecke wrote: > On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote: > > > >hpsa: Work In Progress: "lockless monster" patches > > > >To be clear, I am not suggesting that these patches be merged at this time. > > > >This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which > >may be found here: git://git.infradead.org/users/hch/scsi.git > > > >We've been working for a long time on a patchset for hpsa to remove > >all the locks from the main i/o path in pursuit of high IOPS. Some > >of that work is already upstream, but a lot more of it is not quite > >yet ready to be merged. However, I think we've "gone dark" for a bit > >too long on this, and even though the patches aren't really ready to > >be merged just yet, I thought I should let other people who might be > >interested have a look anyway, as things are starting to be at least > >more stable than unstable. Be warned though, there are still some > >problems, esp. around error recovery. > > > >That being said, with the right hardware (fast SAS SSDs, recent Smart > >Arrays e.g. P430, with up-to-date firmware, attached to recent disk > >enclosures) > >with these patches and the scsi-mq patches, it is possible to get around > >~970k IOPS from a single logical drive on a single controller. > > > >There are about 150 patches in this set. Rather than bomb the list > >with that, here is a link to a tarball of the patches in the form of > >an stgit patch series: > > > >https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true > > > >In some cases, I have probably erred on the side of having too many too > >small patches, on the theory that it is easier to bake a cake than to > >unbake a cake. Before these are submitted "for reals", there will be > >some squashing of patches, along with other cleaning up. > > > >There are some patches in this set which are already upstream in > >James's tree which do not happen to be in Christoph's tree > >(most of which are named "*_sent_to_james"). There are also > >quite a few patches which are strictly for debugging and are not > >ever intended to be merged. > > > Hmm. While you're about to engage on a massive rewrite _and_ we're > having 64bit LUN support now, what about getting rid of the weird > internal LUN mapping? That way you would get rid of this LUN rescan > thingie and the driver would look more sane. Sorry for my slow reply to this. I can sympathize with this desire, the device scanning code in the driver is not my favorite, and it can undoubtedly be improved. However, there are some problems. > Plus the REPORT LUN command would actually return the correct data ... No, it wouldn't. Smart Arrays are unfortunately pretty weird. SCSI REPORT LUNS issued to a smart array will report only the logical drives. No physical devices (tape drives, etc.) will be reported. Instead of SCSI REPORT LUNS, the driver uses a couple of proprietary variants of this, CISS_REPORT_LOGICAL and CISS_REPORT_PHYSICAL, which report logical drives and physical devices, and have their own oddities. And it gets weirder. This will require some exposition, so bear with me. What's driving this giant patchball to remove locks from the driver is SSDs. Suddenly, the assumption that disks are dirt slow relative to the host is not quite so true anymore. Smart Array looks something like this: +---+ | host | +-+-+ | | PCI bus | +-|--+ | | | | +--RAID stack |<--- smart array |running on | |embedded system | |on PCI board| | | | |-|--| | | | | Back end firmware | || | +|---+ | | SAS | +--+ | physical devices (disks, etc.) | +--+ with the advent of SSDs, it turns out the RAID stack firmware running on the little embedded system on the PCI board becomes a bottleneck and hurts performance. It turns out that with a (significant) firmware change, we can do something like this: +---+ | host | ++--+ | | PCI bus | +|---+ ||
Re: [RFC] hpsa: work in progress "lockless monster" patches
On Thu, Jul 31, 2014 at 03:56:13PM +0200, Tomas Henzl wrote: > Hi Steve, Webb, > > let me start with the part most important for me - from my previous mail > "And please have a look at "[PATCH] hpsa: refine the pci enble/disable > handling" > I posted in June and add it to your series or review in the list. Thanks." Ok. I will try to get to this. (some customer issues have been eating a lot of my time this week). When I first saw that patch I knew that Rob Elliott had a lot of patches in the works that did a lot of cleanup to the initialization code, so I expect some conflicts. > > other comments see below > > > Hi Tomas, > > > > Thanks for taking a look and for the feedback. I'm actually the one > > responsible for the changes you refer to, so I'll try to address your > > comments. > > > > > > On 7/30/14 10:55 AM, Tomas Henzl wrote: > >> I'm not sure if I got it right, > >> it seems it uses the same cmd_pool for both alloc function, > >> from (0 - reserved_cmds) it's cmd_alloc and above that it's handled > >> by cmd_tagged_alloc. > >> The logic in both functions seems to be valid for me. > > Good. With Christoph's proposed changes, the block layer will select a > > tag for I/O requests, and it provides for the driver to reserve part of > > the tag space for its own use. Thus, HPSA uses a common pool for both > > I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and > > we index the pool using the tag. The code is this way mostly to > > minimize change from the previous version, but it's also handy in terms > > of managing the pool (i.e., there is a single pool to administer, > > instead of two, and all the items are initialized the same way, etc.). > > The portion of the pool which is reserved to the driver (the > > low-numbered tags) continues to be managed through a bitmap, just as it > > was prior to this change; the bitmap does cover the whole pool (which is > > a historical artifact), but the higher bits are never set, since > > allocation of those entries is determined by the tag from the block > > layer -- nevertheless, it is convenient to have the map extend over the > > whole pool in certain corner cases (such as resets). > > > > > >> In cmd_tagged_alloc "Thus, there should never be a collision here between > >> two requests" if this is true you don't need the refcount and just a simple > >> flag were sufficient for your other needs. (And maybe you get to ~971k > >> IOPS..) > > :-) The code previously had the reference count in there to close > > certain race conditions' asynchronous accesses to the command block > > (e.g., between an abort and a completing command). We hope that, using > > the block layer tags, those races no longer occur, but there didn't seem > > to be any point in removing the reference count until we'd had more > > experience with the code...and, in a fit of healthy paranoia I added the > > check to which you refer. Unfortunately, in some of Rob Elliott's > > recent tests, he managed to drive a case where the check triggers. So, > > until we reconcile that, I'm inclined to leave the check in place > > (hopefully it's not costing a full 1k IOPS :-) ). > > Let us assume that the block layer never sends duplicate tags to the driver, > I think there are some weaknesses in the way how it's implemented, > for example here - from fail_all_outstanding_cmds: > refcount = atomic_inc_return(&c->refcount); > if (refcount > 1) { > ... > finish_cmd(c); - this finishes the command > and the tag might be now reused, > when that happens you'll see a race > atomic_dec(&h->commands_outstanding); > failcount++; > } > else {what happens when right now comes a call from block > layer?} > cmd_free(h, c); > > When references are used it usually means, that when refcount==0 that > a release function frees the resources, in this case it is the tag number. > In your implementation you should ensure that when you signal > to the upper layer that the tag is free, that nothing else holds the > reference. > If this is true the conclusion is that you don't need this kind of references > and a simple flag just to debug not yet fixed races is enough. Part of the motivation is we want to have this code be workable for distros that may not have the necessary kernel changes for us to be able to use the block layer tagging (e.g. tags reserved for driver use). So I am floating the block layer tag patches, keeping them at the top of my stack of patches so that all this lockless stuff can still work even without using the block layer tags. So one reason the reference counting needs to go in is to accomodate backporting the LLD without touching an older kernel's mid layer or block layer code, which is import
Re: [RFC] hpsa: work in progress "lockless monster" patches
Hi Steve, Webb, let me start with the part most important for me - from my previous mail "And please have a look at "[PATCH] hpsa: refine the pci enble/disable handling" I posted in June and add it to your series or review in the list. Thanks." other comments see below > Hi Tomas, > > Thanks for taking a look and for the feedback. I'm actually the one > responsible for the changes you refer to, so I'll try to address your > comments. > > > On 7/30/14 10:55 AM, Tomas Henzl wrote: >> I'm not sure if I got it right, >> it seems it uses the same cmd_pool for both alloc function, >> from (0 - reserved_cmds) it's cmd_alloc and above that it's handled >> by cmd_tagged_alloc. >> The logic in both functions seems to be valid for me. > Good. With Christoph's proposed changes, the block layer will select a > tag for I/O requests, and it provides for the driver to reserve part of > the tag space for its own use. Thus, HPSA uses a common pool for both > I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and > we index the pool using the tag. The code is this way mostly to > minimize change from the previous version, but it's also handy in terms > of managing the pool (i.e., there is a single pool to administer, > instead of two, and all the items are initialized the same way, etc.). > The portion of the pool which is reserved to the driver (the > low-numbered tags) continues to be managed through a bitmap, just as it > was prior to this change; the bitmap does cover the whole pool (which is > a historical artifact), but the higher bits are never set, since > allocation of those entries is determined by the tag from the block > layer -- nevertheless, it is convenient to have the map extend over the > whole pool in certain corner cases (such as resets). > > >> In cmd_tagged_alloc "Thus, there should never be a collision here between >> two requests" if this is true you don't need the refcount and just a simple >> flag were sufficient for your other needs. (And maybe you get to ~971k >> IOPS..) > :-) The code previously had the reference count in there to close > certain race conditions' asynchronous accesses to the command block > (e.g., between an abort and a completing command). We hope that, using > the block layer tags, those races no longer occur, but there didn't seem > to be any point in removing the reference count until we'd had more > experience with the code...and, in a fit of healthy paranoia I added the > check to which you refer. Unfortunately, in some of Rob Elliott's > recent tests, he managed to drive a case where the check triggers. So, > until we reconcile that, I'm inclined to leave the check in place > (hopefully it's not costing a full 1k IOPS :-) ). Let us assume that the block layer never sends duplicate tags to the driver, I think there are some weaknesses in the way how it's implemented, for example here - from fail_all_outstanding_cmds: refcount = atomic_inc_return(&c->refcount); if (refcount > 1) { ... finish_cmd(c); - this finishes the command and the tag might be now reused, when that happens you'll see a race atomic_dec(&h->commands_outstanding); failcount++; } else {what happens when right now comes a call from block layer?} cmd_free(h, c); When references are used it usually means, that when refcount==0 that a release function frees the resources, in this case it is the tag number. In your implementation you should ensure that when you signal to the upper layer that the tag is free, that nothing else holds the reference. If this is true the conclusion is that you don't need this kind of references and a simple flag just to debug not yet fixed races is enough. I'm maybe wrong because I don't understand what you want to protect in the above example, so that makes me to have some doubts if I understand the code properly. >> cmd_alloc some minor changes are possible >> - try to find free entries only to reserved_cmds >>(the bitmap might be shorter too) > Excellent catch! I'll make that change. (As I said, we currently rely > on the over-sized bit map, but there is no reason for this code to > search the whole thing...although, with the block layer doing the heavy > lifting, this allocation path is lightly used, and it will be rare that > it runs off the end of the reserved range.) > > >> - next thread should start + 1 from the last result > That's a fair point, but the suggestion is somewhat more complicated > than it seems: if "start + 1" is greater than reserved_cmds, then we > want to start at zero, instead. I think it ends up being more code than > it's worth. > > In fact, with the partitioned pool, the reserved-to-driver section is > now so small that it's probably not
Re: [RFC] hpsa: work in progress "lockless monster" patches
Hi Tomas, Thanks for taking a look and for the feedback. I'm actually the one responsible for the changes you refer to, so I'll try to address your comments. On 7/30/14 10:55 AM, Tomas Henzl wrote: I'm not sure if I got it right, it seems it uses the same cmd_pool for both alloc function, from (0 - reserved_cmds) it's cmd_alloc and above that it's handled by cmd_tagged_alloc. The logic in both functions seems to be valid for me. Good. With Christoph's proposed changes, the block layer will select a tag for I/O requests, and it provides for the driver to reserve part of the tag space for its own use. Thus, HPSA uses a common pool for both I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and we index the pool using the tag. The code is this way mostly to minimize change from the previous version, but it's also handy in terms of managing the pool (i.e., there is a single pool to administer, instead of two, and all the items are initialized the same way, etc.). The portion of the pool which is reserved to the driver (the low-numbered tags) continues to be managed through a bitmap, just as it was prior to this change; the bitmap does cover the whole pool (which is a historical artifact), but the higher bits are never set, since allocation of those entries is determined by the tag from the block layer -- nevertheless, it is convenient to have the map extend over the whole pool in certain corner cases (such as resets). In cmd_tagged_alloc "Thus, there should never be a collision here between two requests" if this is true you don't need the refcount and just a simple flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..) :-) The code previously had the reference count in there to close certain race conditions' asynchronous accesses to the command block (e.g., between an abort and a completing command). We hope that, using the block layer tags, those races no longer occur, but there didn't seem to be any point in removing the reference count until we'd had more experience with the code...and, in a fit of healthy paranoia I added the check to which you refer. Unfortunately, in some of Rob Elliott's recent tests, he managed to drive a case where the check triggers. So, until we reconcile that, I'm inclined to leave the check in place (hopefully it's not costing a full 1k IOPS :-) ). cmd_alloc some minor changes are possible - try to find free entries only to reserved_cmds (the bitmap might be shorter too) Excellent catch! I'll make that change. (As I said, we currently rely on the over-sized bit map, but there is no reason for this code to search the whole thing...although, with the block layer doing the heavy lifting, this allocation path is lightly used, and it will be rare that it runs off the end of the reserved range.) - next thread should start + 1 from the last result That's a fair point, but the suggestion is somewhat more complicated than it seems: if "start + 1" is greater than reserved_cmds, then we want to start at zero, instead. I think it ends up being more code than it's worth. In fact, with the partitioned pool, the reserved-to-driver section is now so small that it's probably not worth the effort to try to resume the search where the last search left off -- we might as well start at zero each time. (Thanks for bringing that up! ;-) ) - when nr_cmds is read from hw it should be tested (nr_cmds > reserved_cmds) That's a fair point. - what is the correct pronunciation of benignhly? You would have to ask in certain parts of Boston or perhaps out on the South Fork of Long Island. :-D (I thought I fixed those typos -- I'm _sure_ I did! -- and yet) @@ -5634,7 +5634,8 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h) offset = h->last_allocation; /* benighly racy */ for (;;) { - i = find_next_zero_bit(h->cmd_pool_bits, h->nr_cmds, offset); + offset %= h->reserved_cmds; + i = find_next_zero_bit(h->cmd_pool_bits, h->reserved_cmds, offset); Aside from changing the limit from nr_cmds to reserved_cmds, I don't think this change is helpful, because it inserts a modulo operation on every iteration, which is probably unnecessary. (It would be better to ensure that offset is a reasonable value to begin with, and, even it is not, the call to find_next_zero_bit() will handle it properly and then the code will retry.) @@ -5642,15 +5643,16 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h) c = h->cmd_pool + i; refcount = atomic_inc_return(&c->refcount); if (unlikely(refcount > 1)) { - cmd_free(h, c); /* already in use */ - offset = (i + 1) % h->nr_cmds; + atomic_dec(&c->refcount); /* already in use, since we +don't own the command just decrease the refcount */ +
Re: [RFC] hpsa: work in progress "lockless monster" patches
On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote: > > hpsa: Work In Progress: "lockless monster" patches > > To be clear, I am not suggesting that these patches be merged at this time. > > This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which > may be found here: git://git.infradead.org/users/hch/scsi.git > > We've been working for a long time on a patchset for hpsa to remove > all the locks from the main i/o path in pursuit of high IOPS. Some > of that work is already upstream, but a lot more of it is not quite > yet ready to be merged. However, I think we've "gone dark" for a bit > too long on this, and even though the patches aren't really ready to > be merged just yet, I thought I should let other people who might be > interested have a look anyway, as things are starting to be at least > more stable than unstable. Be warned though, there are still some > problems, esp. around error recovery. > > That being said, with the right hardware (fast SAS SSDs, recent Smart > Arrays e.g. P430, with up-to-date firmware, attached to recent disk > enclosures) > with these patches and the scsi-mq patches, it is possible to get around > ~970k IOPS from a single logical drive on a single controller. > > There are about 150 patches in this set. Rather than bomb the list > with that, here is a link to a tarball of the patches in the form of > an stgit patch series: Hi Stephen, I've looked only at the alloc functions - I'm not sure if I got it right, it seems it uses the same cmd_pool for both alloc function, from (0 - reserved_cmds) it's cmd_alloc and above that it's handled by cmd_tagged_alloc. The logic in both functions seems to be valid for me. In cmd_tagged_alloc "Thus, there should never be a collision here between two requests" if this is true you don't need the refcount and just a simple flag were sufficient for your other needs. (And maybe you get to ~971k IOPS..) cmd_alloc some minor changes are possible - try to find free entries only to reserved_cmds (the bitmap might be shorter too) - next thread should start + 1 from the last result - when nr_cmds is read from hw it should be tested (nr_cmds > reserved_cmds) - what is the correct pronunciation of benignhly? @@ -5634,7 +5634,8 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h) offset = h->last_allocation; /* benighly racy */ for (;;) { - i = find_next_zero_bit(h->cmd_pool_bits, h->nr_cmds, offset); + offset %= h->reserved_cmds; + i = find_next_zero_bit(h->cmd_pool_bits, h->reserved_cmds, offset); if (unlikely(i >= h->reserved_cmds)) { offset = 0; continue; @@ -5642,15 +5643,16 @@ static struct CommandList *cmd_alloc(struct ctlr_info *h) c = h->cmd_pool + i; refcount = atomic_inc_return(&c->refcount); if (unlikely(refcount > 1)) { - cmd_free(h, c); /* already in use */ - offset = (i + 1) % h->nr_cmds; + atomic_dec(&c->refcount); /* already in use, since we +don't own the command just decrease the refcount */ + offset = i + 1; continue; } set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / BITS_PER_LONG)); break; /* it's ours now. */ } - h->last_allocation = i; /* benignly racy */ + h->last_allocation = i + 1; /* benignly racy */ hpsa_cmd_partial_init(h, i, c); return c; } @@ -5670,6 +5672,7 @@ static void cmd_free(struct ctlr_info *h, struct CommandList *c) clear_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / BITS_PER_LONG)); } + else ; /*BUG ? or just atomic_dec and no if */ } And please have a look at "[PATCH] hpsa: refine the pci enble/disable handling" I posted in June and add it to your series or review in the list. Thanks, Tomas > > https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true > > In some cases, I have probably erred on the side of having too many too > small patches, on the theory that it is easier to bake a cake than to > unbake a cake. Before these are submitted "for reals", there will be > some squashing of patches, along with other cleaning up. > > There are some patches in this set which are already upstream in > James's tree which do not happen to be in Christoph's tree > (most of which are named "*_sent_to_james"). There are also > quite a few patches which are strictly for debugging and are not > ever intended to be merged. > > > Here is a list of all the patches sorted by author: > > Arnd Bergmann (1): > [SCSI] hpsa: fix non-x86 builds > >
Re: [RFC] hpsa: work in progress "lockless monster" patches
On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote: hpsa: Work In Progress: "lockless monster" patches To be clear, I am not suggesting that these patches be merged at this time. This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which may be found here: git://git.infradead.org/users/hch/scsi.git We've been working for a long time on a patchset for hpsa to remove all the locks from the main i/o path in pursuit of high IOPS. Some of that work is already upstream, but a lot more of it is not quite yet ready to be merged. However, I think we've "gone dark" for a bit too long on this, and even though the patches aren't really ready to be merged just yet, I thought I should let other people who might be interested have a look anyway, as things are starting to be at least more stable than unstable. Be warned though, there are still some problems, esp. around error recovery. That being said, with the right hardware (fast SAS SSDs, recent Smart Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures) with these patches and the scsi-mq patches, it is possible to get around ~970k IOPS from a single logical drive on a single controller. There are about 150 patches in this set. Rather than bomb the list with that, here is a link to a tarball of the patches in the form of an stgit patch series: https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true In some cases, I have probably erred on the side of having too many too small patches, on the theory that it is easier to bake a cake than to unbake a cake. Before these are submitted "for reals", there will be some squashing of patches, along with other cleaning up. There are some patches in this set which are already upstream in James's tree which do not happen to be in Christoph's tree (most of which are named "*_sent_to_james"). There are also quite a few patches which are strictly for debugging and are not ever intended to be merged. Hmm. While you're about to engage on a massive rewrite _and_ we're having 64bit LUN support now, what about getting rid of the weird internal LUN mapping? That way you would get rid of this LUN rescan thingie and the driver would look more sane. Plus the REPORT LUN command would actually return the correct data ... Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage h...@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg) -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] hpsa: work in progress "lockless monster" patches
> Christoph Hellwig (1): > reserve block tags in scsi host So you found this one useful. This begs the question how we should move forward with it, as it will only work with the blk-mq path in it's current form. I can see three ways: - we implement equivalent functionality in the old block tagging code and make it available for any driver - we require drivers specific workarounds for the !mq path - we allow drivers to force the use of the blk-mq path if they want to use advanced features like this. This might become nessecary anyway when we expose actual multiqueue support to LLDDs. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] hpsa: work in progress "lockless monster" patches
hpsa: Work In Progress: "lockless monster" patches To be clear, I am not suggesting that these patches be merged at this time. This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which may be found here: git://git.infradead.org/users/hch/scsi.git We've been working for a long time on a patchset for hpsa to remove all the locks from the main i/o path in pursuit of high IOPS. Some of that work is already upstream, but a lot more of it is not quite yet ready to be merged. However, I think we've "gone dark" for a bit too long on this, and even though the patches aren't really ready to be merged just yet, I thought I should let other people who might be interested have a look anyway, as things are starting to be at least more stable than unstable. Be warned though, there are still some problems, esp. around error recovery. That being said, with the right hardware (fast SAS SSDs, recent Smart Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures) with these patches and the scsi-mq patches, it is possible to get around ~970k IOPS from a single logical drive on a single controller. There are about 150 patches in this set. Rather than bomb the list with that, here is a link to a tarball of the patches in the form of an stgit patch series: https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true In some cases, I have probably erred on the side of having too many too small patches, on the theory that it is easier to bake a cake than to unbake a cake. Before these are submitted "for reals", there will be some squashing of patches, along with other cleaning up. There are some patches in this set which are already upstream in James's tree which do not happen to be in Christoph's tree (most of which are named "*_sent_to_james"). There are also quite a few patches which are strictly for debugging and are not ever intended to be merged. Here is a list of all the patches sorted by author: Arnd Bergmann (1): [SCSI] hpsa: fix non-x86 builds Christoph Hellwig (1): reserve block tags in scsi host Joe Handzik (9): hpsa: add masked physical devices into h->dev[] array hpsa: add hpsa_bmic_id_physical_device function hpsa: get queue depth for physical devices hpsa: use ioaccel2 path to submit IOs to physical drives in HBA mode. hpsa: Get queue depth from identify physical bmic for physical disks. hpsa: add ioaccel sg chaining for the ioaccel2 path Set the phys_disk value for a CommandList structure that is submitted. Squash hpsa: unmap ioaccel2 commands before, not after adding to resubmit workqueue hpsa: add more ioaccel2 error handling, including underrun statuses. Nicholas Bellinger (1): hpsa: Convert SCSI LLD ->queuecommand() for host_lock less operation Robert Elliott (44): Change scsi.c scsi_log_completion() to print strings for QUEUED, Set scsi_logging_level to be more verbose to get better messages hpsa: do not unconditionally copy sense data hpsa: optimize cmd_alloc function by remembering last allocation From a154642aeed291c7cfe4b9ea9da932156030f7d1 Mon Sep 17 00:00:00 2001 hpsa: Clean up host, channel, target, lun prints hpsa: do not check cmd_alloc return value - it cannnot return NULL hpsa: return -1 rather than -ENOMEM in hpsa_get_raid_map if fill_cmd fails hpsa: return -ENOMEM not 1 from ioaccel2_alloc_cmds_and_bft on allocation failure hpsa: return -ENOMEM not 1 from hpsa_alloc_ioaccel_cmd_and_bft on allocation failure hpsa: return -ENOMEM not -EFAULT from hpsa_passthru_ioctl on kmalloc failure hpsa: pass error from pci_set_consistent_dma_mask intact from hpsa_message hpsa: detect and report failures changing controller transport modes hpsa: add hpsa_disable_interrupt_mode hpsa: rename free_irqs to hpsa_free_irqs and move before hpsa_request_irq hpsa: rename hpsa_request_irq to hpsa_request_irqs hpsa: on failure of request_irq, free the irqs that we did get hpsa: make hpsa_pci_init disable interrupts and pci_disable_device on critical failures hpsa: fix a string constant broken into two lines hpsa: avoid unneccesary calls to resource freeing functions in hpsa_init_one hpsa: add hpsa_free_cfgtables function to undo what hpsa_find_cfgtables does hpsa: report failure to ioremap config table hpsa: add hpsa_free_pci_init function hpsa: clean up error handling in hpsa_pci_init hpsa: make hpsa_remove_one use hpsa_pci_free_init hpsa: rename hpsa_alloc_ioaccel_cmd_and_bft to hpsa_alloc_ioaccel1_cmd_and_bft hpsa: rename hpsa_allocate_cmd_pool to hpsa_alloc_cmd_pool hpsa: rename ioaccel2_alloc_cmds_and_bft to hpsa_alloc_ioaccel2_cmd_and_bft hpsa: fix memory leak in hpsa_alloc_cmd_pool hpsa: return status not void from hpsa_put_ctlr_into_performant_mode hpsa: clean up error handling in hpsa_init_one hpsa: report allocation failures while alloca