Re: [Ksummit-2013-discuss] [ATTEND] scsi-mq prototype discussion

2013-07-16 Thread scameron
On Tue, Jul 16, 2013 at 02:07:29PM -0700, Nicholas A. Bellinger wrote:
> On Sat, 2013-07-13 at 06:53 +, James Bottomley wrote:
> > On Fri, 2013-07-12 at 12:52 +0200, Hannes Reinecke wrote:
> > > On 07/12/2013 03:33 AM, Nicholas A. Bellinger wrote:
> > > > On Thu, 2013-07-11 at 18:02 -0700, Greg KH wrote:
> > > >> On Thu, Jul 11, 2013 at 05:23:32PM -0700, Nicholas A. Bellinger wrote:
> > > >>> Drilling down the work items ahead of a real mainline push is high on
> > > >>> priority list for discussion.
> > > >>>
> > > >>> The parties to be included in such a discussion are:
> > > >>>
> > > >>>   - Jens Axboe (blk-mq author)
> > > >>>   - James Bottomley (scsi maintainer)
> > > >>>   - Christoph Hellwig (scsi)
> > > >>>   - Martin Petersen (scsi)
> > > >>>   - Tejun Heo (block + libata)
> > > >>>   - Hannes Reinecke (scsi error recovery)
> > > >>>   - Kent Overstreet (block, per-cpu ida)
> > > >>>   - Stephen Cameron (scsi-over-pcie driver)
> > > >>>   - Andrew Vasquez (qla2xxx LLD)
> > > >>>   - James Smart (lpfc LLD)
> > > >>
> > > >> Isn't this something that should have been discussed at the storage
> > > >> mini-summit a few months ago?
> > > > 
> > > > The scsi-mq prototype, along with blk-mq (in it's current form) did not
> > > > exist a few short months ago.  ;)
> > > > 
> > > >>  It seems very specific to one subsystem to be a kernel summit topic,
> > > >> don't you think?
> > > > 
> > > > It's no more subsystem specific than half of the other proposals so far,
> > > > and given it's reach across multiple subsystems (block, scsi, target),
> > > > and the amount of off-list interest on the topic, I think it would make
> > > > a good candidate for discussion.
> > > > 
> > > And it'll open up new approaches which previously were dismissed,
> > > like re-implementing multipathing on top of scsi-mq, giving us the
> > > single scsi device like other UNIX systems.
> > > 
> > > Also I do think there's quite some synergy to be had, as with blk-mq
> > > we could nail each queue to a processor, which would eliminate the
> > > need for locking.
> > > Which could be useful for other subsystems, too.
> > 
> > Lets start with discussing this on the list, please, and then see where
> > we go from there ...
> > 
> 
> Yes, the discussion is beginning to make it's way to the list.  I've
> mostly been waiting for blk-mq to get a wider review before taking the
> early scsi-mq prototype driver to a larger public audience.
> 
> Primarily, I'm now reaching out to the people most effected by existing
> scsi_request_fn() based performance limitations.  Most of them have
> abandoned existing scsi_request_fn() based logic in favor of raw block
> make_request() based drivers, and are now estimating the amount of
> effort to move to an scsi-mq based approach.
> 
> Regardless, as the prototype progresses over the next months, having a
> face-to-face discussion with the key parties in the room would be very
> helpful given the large amount of effort involved to actually make this
> type of generational shift in SCSI actually happen.

I'd be very interested in attending this, if invited.

-- steve


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hpsa - BUG: using smp_processor_id() in preemptible [00000000 00000000] code: kworker/u:0/6

2013-07-26 Thread scameron
On Fri, Jul 26, 2013 at 06:28:02AM -0400, John Kacur wrote:
> 
> 
> - Original Message -
> > [Adding missing cc to linux-scsi]
> > On Thu, 2013-07-25 at 23:33 +0200, John Kacur wrote:
> > > Hi
> > > 
> > > We're seeing this on a 3.6 kernel with the real-time patch applied, but it
> > > looks like it is relevant with the real-time patch in the latest kernel
> 
> This should read, "it looks like it is relevant WITHOUT the real-time patch 
> in the latest kernel".
> 
> 
> > > too.
> > > 
> > > [   49.688847] hpsa :03:00.0: hpsa0: <0x323a> at IRQ 67 using DAC
> > > [   49.749928] scsi0 : hpsa
> > > [   49.784437] BUG: using smp_processor_id() in preemptible [
> > > ] code: kworker/u:0/6
> > > [   49.784465] caller is enqueue_cmd_and_start_io+0x5a/0x100 [hpsa]
> > > [   49.784468] Pid: 6, comm: kworker/u:0 Not tainted
> > > 3.6.11.5-rt37.52.el6rt.x86_64.debug #1
> > > [   49.784471] Call Trace:
> > > [   49.784512]  [] debug_smp_processor_id+0x123/0x150
> > > [   49.784520]  [] enqueue_cmd_and_start_io+0x5a/0x100
> > > [hpsa]
> > > [   49.784529]  []
> > > hpsa_scsi_do_simple_cmd_core+0xeb/0x110 [hpsa]
> > > [   49.784537]  [] ? swiotlb_dma_mapping_error+0x18/0x30
> > > [   49.784544]  [] ? swiotlb_dma_mapping_error+0x18/0x30
> > > [   49.784553]  []
> > > hpsa_scsi_do_simple_cmd_with_retry+0x91/0x280 [hpsa]
> > > [   49.784562]  []
> > > hpsa_scsi_do_report_luns.clone.2+0xd8/0x130 [hpsa]
> > > [   49.784571]  []
> > > hpsa_gather_lun_info.clone.3+0x3a/0x1a0 [hpsa]
> > > [   49.784580]  [] hpsa_update_scsi_devices+0x11f/0x4f0
> > > [hpsa]
> > > [   49.784592]  [] ? sub_preempt_count+0xa9/0xe0
> > > [   49.784601]  [] hpsa_scan_start+0xfd/0x150 [hpsa]
> > > [   49.784613]  [] ? rt_spin_lock_slowunlock+0x78/0x90
> > > [   49.784626]  [] do_scsi_scan_host+0x37/0xa0
> > > [   49.784632]  [] do_scan_async+0x1a/0x30
> > > [   49.784643]  [] async_run_entry_fn+0x9b/0x1d0
> > > [   49.784655]  [] process_one_work+0x1f2/0x620
> > > [   49.784661]  [] ? process_one_work+0x180/0x620
> > > [   49.784668]  [] ? worker_thread+0x5e/0x3a0
> > > [   49.784674]  [] ? async_schedule+0x20/0x20
> > > [   49.784681]  [] worker_thread+0x133/0x3a0
> > > [   49.784688]  [] ? manage_workers+0x190/0x190
> > > [   49.784696]  [] kthread+0xa6/0xb0
> > > [   49.784707]  [] kernel_thread_helper+0x4/0x10
> > > [   49.784715]  [] ? finish_task_switch+0x8c/0x110
> > > [   49.784721]  [] ? _raw_spin_unlock_irq+0x3b/0x70
> > > [   49.784727]  [] ? retint_restore_args+0xe/0xe
> > > [   49.784734]  [] ? kthreadd+0x1e0/0x1e0
> > > [   49.784739]  [] ? gs_change+0xb/0xb
> > > 
> > > ---
> > > 
> > > When I look at the code I see this call chain
> > > enqueue_cmd_and_start_io()->
> > >   set_performant_mode()->
> > >   smp_processor_id()
> > > Which if you have debugging enabled calls debug_processor_id() and
> > > triggers the warning.
> > > 
> > > I'm not very familiar with the hpsa code, so I'm not entirely sure what
> > > the purpose of this line is
> > > 
> > > c->Header.ReplyQueue = smp_processor_id() % h->nreply_queues;
> > > 
> > > Is the purpose to simply try to get a range of ReplyQueue numbers, but
> > > somewhat arbitrary?  Or is it necessary that the current processor_id
> > > is used? If it is the former, and you're not accessing per cpu structures,
> > > or pinning a cpu, or anything like that then I would think it is safe to
> > > change this to a raw_smp_processor_id() to get rid of a false positive
> > > warning.

It's not critical that they match (will work if they don't) but for certain
workloads you can get more performance if you pin processes to cpus and
arrange msix interrupt vectors so that commands are likely to complete on
the same cpu they originated from.

In any case, I think your analysis is correct.  Thanks.

> > > 
> > > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > > index 7f4f790..4e19267 100644
> > > --- a/drivers/scsi/hpsa.c
> > > +++ b/drivers/scsi/hpsa.c
> > > @@ -583,7 +583,7 @@ static void set_performant_mode(struct ctlr_info *h,
> > > struct CommandList *c)
> > >   c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
> > >   if (likely(h->msix_vector))
> > >   c->Header.ReplyQueue =
> > > - smp_processor_id() % h->nreply_queues;
> > > + raw_smp_processor_id() % h->nreply_queues;
> > >   }
> > >  }

Ack.

-- steve

> > >  
> > > 
> > > Thanks
> > > 
> > > John Kacur
> > 
> > 
> > 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions about your scsi-mq prototype

2013-07-26 Thread scameron
On Thu, Jul 25, 2013 at 07:16:39PM -0700, Nicholas A. Bellinger wrote:
> On Tue, 2013-07-16 at 10:39 -0700, Nicholas A. Bellinger wrote:
> > On Mon, 2013-07-15 at 16:41 -0500, scame...@beardog.cce.hp.com wrote:
> > > Hi,
> > > 
> > > I have some questions about your scsi-mq prototype.  I have
> > > CC'ed Chayan Biswas from Sandisk who is the other guy working
> > > with me on the scsi-over-pcie driver.  BTW, if you would prefer
> > > I CC linux-kernel or linux-scsi or other people with such
> > > questions (or refrain from bothering you at all) just let me know.
> > > 
> > 
> > Thanks for asking.  Adding CC's to linux-scsi + Jens now..
> > 
> > > I wonder if you can provide a summary of how you envision
> > > a scsi LLD is expected to interact with the scsi-mq code.  I
> > > gather that:
> > > 
> > > 1. scsi_mq in scsi_host_template should be set to true.
> > 
> > Correct.
> > 
> > > 2. scsi LLD should provide queuecommand_mq function
> > >(in addition to normal queuecommand? or instead of?
> > > virtio_scsi.c seems to provide both.)
> > 
> > In addition to for the moment, correct.
> > 
> > > 3. LLD should tell scsi mid layer size of its commands via
> > >scsi_host_template cmd_size field.
> > >I guess midlayer will pre-allocate per queue to avoid 
> > >the midlayer threads contending to allocate commands maybe?
> > >and hands this space to LLD to use via sc->SCp.ptr (sc is struct 
> > > scsi_cmnd)
> > >(right? or am I off in the weeds here?)
> > 
> > This is all exactly correct.
> > 
> > The one extra point here wrt ->queuecommand_mq() is that it's using the
> > same parameters for the moment, but once nr_hw_queues > 1 is
> > implemented, we will expect to pass down the associated blk_mq_hw_ctx in
> > order for the LLD to determine which hardware context the struct
> > scsi_cmnd is being dispatched upon.
> > 
> > > 4. Wondering about how the multiple queue thing will work.
> > >I see this:
> > > 
> > >#warning FIXME: Need to pass nr_hw_queues into scsi-mq
> > > 
> > >and I don't see any code setting sdev_mq_reg.nr_hw_queues to
> > >anything other than 1.
> > > 
> > >So maybe I'm just looking too early and that part isn't done yet?
> > > 
> > 
> > Correct, not done yet..
> > 
> > However, so far with a single SCSI LUN with nr_hw_queues=1, 4k blocksize
> > randrw performance is on the order of 1M, vs. ~250K using legacy
> > scsi_request_fn().
> > 
> > > In any case, since we switched away from a scsi driver for the 
> > > scsi-over-pcie
> > > driver to a block driver, the original scsi version of the driver has lain
> > > fallow, and the first thing we would need to do before we could begin to 
> > > try
> > > to use your code is to update that driver to work with our current 
> > > hardware, and
> > > take advantage of other code improvements made in the block driver.
> > > 
> > 
> 
> Hi Stephen & Robert,
> 
> Just curious if you've been able to make any progress here..?

Only just started to try to make the scsi variant of the sop driver
work again yesterday and start bringing it up to date since it was
last touched in December, and so I haven't done anything with the
scsi-mq code yet.  Too many other things demanding my attention at
work lately, but at least I was able to spend nearly all day yesterday
working on the scsi SOP driver.

> 
> Also as per our earlier off-list discussion, you should be able to
> safely run existing scsi_request_fn() based LLDs along side new
> SHT->scsi_mq=true enabled code for prototyping purposes.

Thanks, good to know.

> 
> Btw, any chance to get an SOP sample to hack on..?  I'd like to connect
> one of these to IvyBridge-EP silicon for various scsi-mq and target mode
> performance testing.  ;)

You mean hardware?  Not for me to answer, but I'd say it's not very
likely at the moment.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] hpsa: remove unneeded loop

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 03:11:22PM +0200, Tomas Henzl wrote:
> From: Tomas Henzl 
> 
> The cmd_pool_bits is protected everywhere with a spinlock, 
> we don't need the test_and_set_bit, set_bit is enough and the loop
> can be removed too.
> 
> Signed-off-by: Tomas Henzl 
> ---
>  drivers/scsi/hpsa.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 796482b..d7df01e 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -2662,15 +2662,12 @@ static struct CommandList *cmd_alloc(struct ctlr_info 
> *h)
>   unsigned long flags;
>  
>   spin_lock_irqsave(&h->lock, flags);
> - do {
> - i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> - if (i == h->nr_cmds) {
> - spin_unlock_irqrestore(&h->lock, flags);
> - return NULL;
> - }
> - } while (test_and_set_bit
> -  (i & (BITS_PER_LONG - 1),
> -   h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
> + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> + if (i == h->nr_cmds) {
> + spin_unlock_irqrestore(&h->lock, flags);
> + return NULL;
> + }
> + set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / 
> BITS_PER_LONG));
>   h->nr_allocs++;
>   spin_unlock_irqrestore(&h->lock, flags);
>  
> -- 
> 1.8.3.1
> 

Would it be better instead to just not use the spinlock for protecting
cmd_pool_bits?  I have thought about doing this for awhile, but haven't
gotten around to it.

I think the while loop is safe without the spin lock.  And then it is
not needed in cmd_free either.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] hpsa: remove unneeded variable

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 03:14:52PM +0200, Tomas Henzl wrote:
> From: Tomas Henzl 
> 
> Remove unneeded variables.
> 
> Signed-off-by: Tomas Henzl 
> 
> ---
>  drivers/scsi/hpsa.c | 2 --
>  drivers/scsi/hpsa.h | 2 --
>  2 files changed, 4 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 48fa81e..e0f6b00 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -2668,7 +2668,6 @@ static struct CommandList *cmd_alloc(struct ctlr_info 
> *h)
>   return NULL;
>   }
>   set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / 
> BITS_PER_LONG));
> - h->nr_allocs++;
>   spin_unlock_irqrestore(&h->lock, flags);
>  
>   c = h->cmd_pool + i;
> @@ -2740,7 +2739,6 @@ static void cmd_free(struct ctlr_info *h, struct 
> CommandList *c)
>   spin_lock_irqsave(&h->lock, flags);
>   clear_bit(i & (BITS_PER_LONG - 1),
> h->cmd_pool_bits + (i / BITS_PER_LONG));
> - h->nr_frees++;
>   spin_unlock_irqrestore(&h->lock, flags);
>  }
>  
> diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
> index 9816479..bc85e72 100644
> --- a/drivers/scsi/hpsa.h
> +++ b/drivers/scsi/hpsa.h
> @@ -98,8 +98,6 @@ struct ctlr_info {
>   struct ErrorInfo*errinfo_pool;
>   dma_addr_t  errinfo_pool_dhandle;
>   unsigned long   *cmd_pool_bits;
> - int nr_allocs;
> - int nr_frees;
>   int scan_finished;
>   spinlock_t  scan_lock;
>   wait_queue_head_t   scan_wait_queue;
> -- 
> 1.8.3.1

Ack.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] hpsa: fix a race in cmd_free/scsi_done

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 03:14:00PM +0200, Tomas Henzl wrote:
> From: Tomas Henzl 
> 
> When the driver calls scsi_done and after that frees it's internal
> preallocated memory it can happen that a new job is enqueud before
> the memory is freed. The allocation fails and the message
> "cmd_alloc returned NULL" is shown.
> Patch below fixes it by moving cmd->scsi_done after cmd_free.
> 
> Signed-off-by: Tomas Henzl 
> ---
>  drivers/scsi/hpsa.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index d7df01e..48fa81e 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -1180,8 +1180,8 @@ static void complete_scsi_command(struct CommandList 
> *cp)
>   scsi_set_resid(cmd, ei->ResidualCnt);
>  
>   if (ei->CommandStatus == 0) {
> - cmd->scsi_done(cmd);
>   cmd_free(h, cp);
> + cmd->scsi_done(cmd);
>   return;
>   }
>  
> @@ -1353,8 +1353,8 @@ static void complete_scsi_command(struct CommandList 
> *cp)
>   dev_warn(&h->pdev->dev, "cp %p returned unknown status %x\n",
>   cp, ei->CommandStatus);
>   }
> - cmd->scsi_done(cmd);
>   cmd_free(h, cp);
> + cmd->scsi_done(cmd);
>  }
>  
>  static void hpsa_pci_unmap(struct pci_dev *pdev,
> -- 
> 1.8.3.1

Oh, nice catch!!!  Those mysterious and seemingly never reproducible
legends of the "cmd_alloc returned NULL" message that would show up
*extremely* infrequently are finally explained!

Ack.

-- steve
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] hpsa: remove unneeded loop

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 04:05:20PM +0200, Tomas Henzl wrote:
> On 08/01/2013 03:39 PM, scame...@beardog.cce.hp.com wrote:
> > On Thu, Aug 01, 2013 at 03:11:22PM +0200, Tomas Henzl wrote:
> >> From: Tomas Henzl 
> >>
> >> The cmd_pool_bits is protected everywhere with a spinlock, 
> >> we don't need the test_and_set_bit, set_bit is enough and the loop
> >> can be removed too.
> >>
> >> Signed-off-by: Tomas Henzl 
> >> ---
> >>  drivers/scsi/hpsa.c | 15 ++-
> >>  1 file changed, 6 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> >> index 796482b..d7df01e 100644
> >> --- a/drivers/scsi/hpsa.c
> >> +++ b/drivers/scsi/hpsa.c
> >> @@ -2662,15 +2662,12 @@ static struct CommandList *cmd_alloc(struct 
> >> ctlr_info *h)
> >>unsigned long flags;
> >>  
> >>spin_lock_irqsave(&h->lock, flags);
> >> -  do {
> >> -  i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> >> -  if (i == h->nr_cmds) {
> >> -  spin_unlock_irqrestore(&h->lock, flags);
> >> -  return NULL;
> >> -  }
> >> -  } while (test_and_set_bit
> >> -   (i & (BITS_PER_LONG - 1),
> >> -h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
> >> +  i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> >> +  if (i == h->nr_cmds) {
> >> +  spin_unlock_irqrestore(&h->lock, flags);
> >> +  return NULL;
> >> +  }
> >> +  set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / 
> >> BITS_PER_LONG));
> >>h->nr_allocs++;
> >>spin_unlock_irqrestore(&h->lock, flags);
> >>  
> >> -- 
> >> 1.8.3.1
> >>
> > Would it be better instead to just not use the spinlock for protecting
> > cmd_pool_bits?  I have thought about doing this for awhile, but haven't
> > gotten around to it.
> >
> > I think the while loop is safe without the spin lock.  And then it is
> > not needed in cmd_free either.
> 
> I was evaluating the same idea for a while too, a loop and inside just the 
> test_and_set_bit,
> maybe even a stored value to start with a likely empty bit from last time to 
> tune it a bit.
> But I know almost nothing about the use pattern, so I decided for the least 
> invasive change
> to the existing code, to not make it worse.

Only reason I haven't done it is I'm loathe to make such a change to the main 
i/o
path without testing it like crazy before unleashing it, and it's never been a 
convenient time to slide such a change in around here and get proper testing
done (and there are other rather large changes brewing).

However, we have been using a similar scheme with the SCSI over PCIe driver,
here: https://github.com/HPSmartStorage/scsi-over-pcie/blob/master/block/sop.c
in alloc_request() around line 1476 without problems, and nvme-core.c contains
similar code in alloc_cmdid(), so I am confident it's sound in principle.
I would want to beat on it though, in case it ends up exposing a firmware bug
or something (not that I think it will, but you never know.)

-- steve


 
> 
> 
> >
> > -- steve
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] hpsa: remove unneeded loop

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 04:59:45PM +0200, Tomas Henzl wrote:
> On 08/01/2013 04:21 PM, scame...@beardog.cce.hp.com wrote:
> > On Thu, Aug 01, 2013 at 04:05:20PM +0200, Tomas Henzl wrote:
> >> On 08/01/2013 03:39 PM, scame...@beardog.cce.hp.com wrote:
> >>> On Thu, Aug 01, 2013 at 03:11:22PM +0200, Tomas Henzl wrote:
>  From: Tomas Henzl 
> 
>  The cmd_pool_bits is protected everywhere with a spinlock, 
>  we don't need the test_and_set_bit, set_bit is enough and the loop
>  can be removed too.
> 
>  Signed-off-by: Tomas Henzl 
>  ---
>   drivers/scsi/hpsa.c | 15 ++-
>   1 file changed, 6 insertions(+), 9 deletions(-)
> 
>  diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
>  index 796482b..d7df01e 100644
>  --- a/drivers/scsi/hpsa.c
>  +++ b/drivers/scsi/hpsa.c
>  @@ -2662,15 +2662,12 @@ static struct CommandList *cmd_alloc(struct 
>  ctlr_info *h)
>   unsigned long flags;
>   
>   spin_lock_irqsave(&h->lock, flags);
>  -do {
>  -i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
>  -if (i == h->nr_cmds) {
>  -spin_unlock_irqrestore(&h->lock, flags);
>  -return NULL;
>  -}
>  -} while (test_and_set_bit
>  - (i & (BITS_PER_LONG - 1),
>  -  h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
>  +i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
>  +if (i == h->nr_cmds) {
>  +spin_unlock_irqrestore(&h->lock, flags);
>  +return NULL;
>  +}
>  +set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / 
>  BITS_PER_LONG));
>   h->nr_allocs++;
>   spin_unlock_irqrestore(&h->lock, flags);
>   
>  -- 
>  1.8.3.1
> 
> >>> Would it be better instead to just not use the spinlock for protecting
> >>> cmd_pool_bits?  I have thought about doing this for awhile, but haven't
> >>> gotten around to it.
> >>>
> >>> I think the while loop is safe without the spin lock.  And then it is
> >>> not needed in cmd_free either.
> >> I was evaluating the same idea for a while too, a loop and inside just the 
> >> test_and_set_bit,
> >> maybe even a stored value to start with a likely empty bit from last time 
> >> to tune it a bit.
> >> But I know almost nothing about the use pattern, so I decided for the 
> >> least invasive change
> >> to the existing code, to not make it worse.
> > Only reason I haven't done it is I'm loathe to make such a change to the 
> > main i/o
> > path without testing it like crazy before unleashing it, and it's never 
> > been a 
> > convenient time to slide such a change in around here and get proper testing
> > done (and there are other rather large changes brewing).
> >
> > However, we have been using a similar scheme with the SCSI over PCIe driver,
> > here: 
> > https://github.com/HPSmartStorage/scsi-over-pcie/blob/master/block/sop.c
> > in alloc_request() around line 1476 without problems, and nvme-core.c 
> > contains
> > similar code in alloc_cmdid(), so I am confident it's sound in principle.
> > I would want to beat on it though, in case it ends up exposing a firmware 
> > bug
> > or something (not that I think it will, but you never know.)
> 
> I think the code is sound, maybe it could hypothetically return -EBUSY, 
> because
> the find_first_zero_bit is not atomic, but this possibility is so low that it 
> doesn't matter.
> Btw. on line 1284 - isn't it similar to patch 2/3 ?

find_first_zero_bit is not atomic, but the test_and_set_bit, which is what
counts, is atomic.   That find_first_zero_bit is not atomic confused me about
this code for a long time, and is why the spin lock was there in the first
place.  But if there's a race on the find_first_zero_bit and it returns the
same bit to multiple concurrent threads, only one thread will win the
test_and_set_bit, and the other threads will go back around the loop to try
again, and get a different bit.

I don't think a thread can get stuck in there never winning until all the bits
are used up because there should always be enough bits for all the commands we
would ever try to send concurrently, so every thread that gets in there should
eventually get a bit.

Or, am I missing some subtlety?

> 
> Back to this patch - we can take it as it is, because of the spinlock it 
> should be safe,
> or omit it, you can then post a spinlock-less patch. I'll let the decision on 
> you.

I think I like the spin-lock-less variant better.  But I want to test it out
here for awhile first.

-- steve

> 
> tomash
> 
> 
> >
> > -- steve
> >
> >
> >  
> >>
> >>> -- steve
> >>>
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> >>> the body of a message to majord...@vger.kernel.org
> >>> More majordomo info at  h

Re: [PATCH 1/3] hpsa: remove unneeded loop

2013-08-01 Thread scameron
On Thu, Aug 01, 2013 at 05:39:36PM +0200, Tomas Henzl wrote:
> On 08/01/2013 05:19 PM, scame...@beardog.cce.hp.com wrote:

[...]

> >> Btw. on line 1284 - isn't it similar to patch 2/3 ?

^^^ Oh, missed this the first time around, the sop driver uses the 
make_request_fn()
interface, and it's not a stacked driver either, so there is no limit to the 
number
of bios the block layer can stuff in -- make_request_fn must succeed.
If we get full we just chain them together using pointers already in the struct
bio for that purpose, so storing them in the driver requires no memory 
allocation
on the driver's part.  So while it's somewhat similar, we already have to handle
the case of the block layer stuffing infinite bios into the driver, so getting
full is not terribly out of the ordinary in that driver.

That being said, I'm poking around other bits of code lying around here
looking for similar problems, so thanks again for that one.

> > find_first_zero_bit is not atomic, but the test_and_set_bit, which is what
> > counts, is atomic.   That find_first_zero_bit is not atomic confused me 
> > about
> > this code for a long time, and is why the spin lock was there in the first
> > place.  But if there's a race on the find_first_zero_bit and it returns the
> > same bit to multiple concurrent threads, only one thread will win the
> > test_and_set_bit, and the other threads will go back around the loop to try
> > again, and get a different bit.
> 
> Yes.
> But, let's expect just one zero bit at the end of the list. The 
> find_first_zero_bit(ffzb)
> starts now,  thread+1 zeroes a new bit at the beginning, ffzb continues,
> thread+2 takes the zero bit at the end. The result it that ffzb hasn't found 
> a zero bit
> even though that at every moment that bit was there.Ffter that the function 
> returns -EBUSY.
> rc = (u16) find_first_zero_bit(qinfo->request_bits, qinfo->qdepth);
> if (rc >= qinfo->qdepth-1)
>   return (u16) -EBUSY;
> Still, I think that this is almost impossible, and if it should happen
> a requeue is not so bad.

Oh, wow.  Didn't think of that.  Hmm, technically no guarantee that
any given thread would ever get a bit, if all the other threads keep
snatching them away just ahead of an unlucky thread.

Could we, instead of giving up, go back around and try again on the theory
that some bits should be free in there someplace and the thread shouldn't
be infinitely unlucky?

[...]

-- steve
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/3] hpsa: remove unneeded loop

2013-08-06 Thread scameron
On Fri, Aug 02, 2013 at 01:13:59PM +0200, Tomas Henzl wrote:
> On 08/01/2013 06:18 PM, scame...@beardog.cce.hp.com wrote:
> > On Thu, Aug 01, 2013 at 05:39:36PM +0200, Tomas Henzl wrote:
> >> On 08/01/2013 05:19 PM, scame...@beardog.cce.hp.com wrote:
> > [...]
> >
>  Btw. on line 1284 - isn't it similar to patch 2/3 ?
> > ^^^ Oh, missed this the first time around, the sop driver uses the 
> > make_request_fn()
> > interface, and it's not a stacked driver either, so there is no limit to 
> > the number
> > of bios the block layer can stuff in -- make_request_fn must succeed.
> > If we get full we just chain them together using pointers already in the 
> > struct
> > bio for that purpose, so storing them in the driver requires no memory 
> > allocation
> > on the driver's part.  So while it's somewhat similar, we already have to 
> > handle
> > the case of the block layer stuffing infinite bios into the driver, so 
> > getting
> > full is not terribly out of the ordinary in that driver.
> 
> OK.
> 
> >
> > That being said, I'm poking around other bits of code lying around here
> > looking for similar problems, so thanks again for that one.
> >
> >>> find_first_zero_bit is not atomic, but the test_and_set_bit, which is what
> >>> counts, is atomic.   That find_first_zero_bit is not atomic confused me 
> >>> about
> >>> this code for a long time, and is why the spin lock was there in the first
> >>> place.  But if there's a race on the find_first_zero_bit and it returns 
> >>> the
> >>> same bit to multiple concurrent threads, only one thread will win the
> >>> test_and_set_bit, and the other threads will go back around the loop to 
> >>> try
> >>> again, and get a different bit.
> >> Yes.
> >> But, let's expect just one zero bit at the end of the list. The 
> >> find_first_zero_bit(ffzb)
> >> starts now,  thread+1 zeroes a new bit at the beginning, ffzb continues,
> >> thread+2 takes the zero bit at the end. The result it that ffzb hasn't 
> >> found a zero bit
> >> even though that at every moment that bit was there.Ffter that the 
> >> function returns -EBUSY.
> >> rc = (u16) find_first_zero_bit(qinfo->request_bits, qinfo->qdepth);
> >> if (rc >= qinfo->qdepth-1)
> >>return (u16) -EBUSY;
> >> Still, I think that this is almost impossible, and if it should happen
> >> a requeue is not so bad.
> > Oh, wow.  Didn't think of that.  Hmm, technically no guarantee that
> > any given thread would ever get a bit, if all the other threads keep
> > snatching them away just ahead of an unlucky thread.
> >
> > Could we, instead of giving up, go back around and try again on the theory
> > that some bits should be free in there someplace and the thread shouldn't
> > be infinitely unlucky?
> 
> In theory that gives you also no guarantee, it's likely that for a guarantee 
> some
> kind of locking is needed, the spinlock, which already is there, gives you 
> that. 
> Otoh, a very high likelihood is probably enough and give better overall 
> throughput,
> maybe some statistics/testing is needed? I don't know how much faster is it
> without the spinlock.

On thinking about this a bit more, it would be a shame if we closed the
hole allowing the "cmd_alloc returned NULL" message (the scsi_done() cmd_free()
race) and then immediately opened up another different hole that permitted the
same problem to occur. 

So to be safe, I think we should go with your patch as is -- leave
the spin lock, but get rid of the unnecessary loop.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] hpsa: fix warning with smp_processor_id() in preemptible

2013-08-12 Thread scameron
On Mon, Aug 12, 2013 at 09:33:32AM -0400, John Kacur wrote:
> 
> 
> - Original Message -
> > * John Kacur | 2013-07-26 16:42:30 [+0200]:
> > 
> > >Signed-off-by: John Kacur 
> > >Acked-by: Stephen 
> > 
> > ping.
> > 
> > I checked the branches for-next, scsi-fixes, fixes and misc at [0] and I
> > didn't see it. I'm going to take this for 3.10-rt but please don't lose
> > it on its way to Linus :)
> > 
> > [0] git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git
> 
> 
> I hope it was clear to everyone that this patch was intended for upstream.
> It was discovered by running the real-time kernel, but it exposes a (minor) 
> problem
> that should be fixed in the mainline kernel. Please apply it there Stephen, 
> and
> push it upstream appropriately.

I don't have such pushing abilities.  Acking it as I've done is
all I can do.  Bug James. :)

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-09-27 Thread scameron
On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
> On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> >
> > SCSI mid layer doesn't seem to handle logical drives undergoing format
> > very well.  scsi_add_device on such devices seems to result in hitting
> > those devices with a TUR at a rate of 3Hz for awhile, transitioning
> > to hitting them with a READ(10) at a much higher rate indefinitely,
> > and at boot time, this prevents the system from coming up.  If we
> > do not expose such devices to the kernel, it isn't bothered by them.
> 
> Is the result of this patch that the drive is no more visible for the user
> and he can't follow the formatting progress? 

Yes (subsequent patch monitors the progress and brings the drive
online when it's ready).

> I think a better option is to fix the kernel to handle formatting devices 
> better

Yeah, you're probably right. (This is what comes of writing code for all
the distros then forward porting to kernel.org code.  Grumble-grumble-management
grumble-grumble real-world problems.)

> or harden the hpsa so it can cope with TURs or reads (ignore) from a 
> formatting
> device.

I don't think hpsa driver had any problem with the TURs or READs though,
they would be returned to the mid layer just fine (TUR returned sense data
indicating not ready, format in progress, I forget what the reads
returned, whatever the firmware filled in for the sense data, which
was reasonable), but the mid-layer was relentless and just never
really proceeded, iirc.

Since we were trying to make this work on existing OSes where fixing the
SCSI mid layer wasn't an option, we came up with this.

> 
> Also maybe a cmd_special_free is missing - see below

D'oh.  Ok, now that's just embarassing.  Thanks.

-- steve

> 
> Cheers, Tomas
> Signed-off-by: Stephen M. Cameron 
> ---
>  drivers/scsi/hpsa.c |   50 --
>  drivers/scsi/hpsa.h |1 +
>  2 files changed, 49 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index b7f405f..38e3af4 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -1010,6 +1010,20 @@ static void adjust_hpsa_scsi_table(struct ctlr_info 
> *h, int hostno,
>   for (i = 0; i < nsds; i++) {
>   if (!sd[i]) /* if already added above. */
>   continue;
> +
> + /* Don't add devices which are NOT READY, FORMAT IN PROGRESS
> +  * as the SCSI mid-layer does not handle such devices well.
> +  * It relentlessly loops sending TUR at 3Hz, then READ(10)
> +  * at 160Hz, and prevents the system from coming up.
> +  */
> + if (sd[i]->format_in_progress) {
> + dev_info(&h->pdev->dev,
> + "Logical drive format in progress, device 
> c%db%dt%dl%d offline.\n",
> + h->scsi_host->host_no,
> + sd[i]->bus, sd[i]->target, sd[i]->lun);
> + continue;
> + }
> +
>   device_change = hpsa_scsi_find_entry(sd[i], h->dev,
>   h->ndevices, &entry);
>   if (device_change == DEVICE_NOT_FOUND) {
> @@ -1715,6 +1729,34 @@ static inline void hpsa_set_bus_target_lun(struct 
> hpsa_scsi_dev_t *device,
>   device->lun = lun;
>  }
>  
> +static unsigned char hpsa_format_in_progress(struct ctlr_info *h,
> + unsigned char scsi3addr[])
> +{
> + struct CommandList *c;
> + unsigned char *sense, sense_key, asc, ascq;
> +#define ASC_LUN_NOT_READY 0x04
> +#define ASCQ_LUN_NOT_READY_FORMAT_IN_PROGRESS 0x04
> +
> +
> + c = cmd_special_alloc(h);
> + if (!c)
> + return 0;
> + fill_cmd(c, TEST_UNIT_READY, h, NULL, 0, 0, scsi3addr, TYPE_CMD);
> + hpsa_scsi_do_simple_cmd_core(h, c);
> + sense = c->err_info->SenseInfo;
> + sense_key = sense[2];
> + asc = sense[12];
> + ascq = sense[13];
> + if (c->err_info->CommandStatus == CMD_TARGET_STATUS &&
> + c->err_info->ScsiStatus == SAM_STAT_CHECK_CONDITION &&
> + sense_key == NOT_READY &&
> + asc == ASC_LUN_NOT_READY &&
> + ascq == ASCQ_LUN_NOT_READY_FORMAT_IN_PROGRESS)
> + return 1;
> return^ without cmd_special_free
> 
> + cmd_special_free(h, c);
> + return 0;
> +}
> +
>  static int hpsa_update_device_info(struct ctlr_info *h,
>   unsigned char scsi3addr[], struct hpsa_scsi_dev_t *this_device,
>   unsigned char *is_OBDR_device)
> @@ -1753,10 +1795,14 @@ static int hpsa_update_device_info(struct ctlr_info 
> *h,
>   sizeof(this_device->device_id));
>  
>   if (this_device->devtype == TYPE_DISK &&
> - is_logical_dev_addr_mode(scsi3addr))
> + is_logical_dev_addr_mode(scsi3addr)) {
>   hpsa_get_raid_level(h, scsi3addr, &this_device->raid_level);
> - e

Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-09-27 Thread scameron
On Fri, Sep 27, 2013 at 04:01:30PM +0200, Tomas Henzl wrote:
> On 09/27/2013 03:34 PM, scame...@beardog.cce.hp.com wrote:
> > On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
> >> On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> >>> From: Stephen M. Cameron 
> >>>
> >>> SCSI mid layer doesn't seem to handle logical drives undergoing format
> >>> very well.  scsi_add_device on such devices seems to result in hitting
> >>> those devices with a TUR at a rate of 3Hz for awhile, transitioning
> >>> to hitting them with a READ(10) at a much higher rate indefinitely,
> >>> and at boot time, this prevents the system from coming up.  If we
> >>> do not expose such devices to the kernel, it isn't bothered by them.
> >> Is the result of this patch that the drive is no more visible for the user
> >> and he can't follow the formatting progress? 
> > Yes (subsequent patch monitors the progress and brings the drive
> > online when it's ready).
> >
> >> I think a better option is to fix the kernel to handle formatting devices 
> >> better
> > Yeah, you're probably right. (This is what comes of writing code for all
> > the distros then forward porting to kernel.org code.  
> > Grumble-grumble-management
> > grumble-grumble real-world problems.)
> >
> >> or harden the hpsa so it can cope with TURs or reads (ignore) from a 
> >> formatting
> >> device.
> > I don't think hpsa driver had any problem with the TURs or READs though,
> > they would be returned to the mid layer just fine (TUR returned sense data
> > indicating not ready, format in progress, I forget what the reads
> > returned, whatever the firmware filled in for the sense data, which
> > was reasonable), but the mid-layer was relentless and just never
> > really proceeded, iirc.
> >
> > Since we were trying to make this work on existing OSes where fixing the
> > SCSI mid layer wasn't an option, we came up with this.
> 
> I'm actually glad that you care about existing OSes :)

And the pain of porting would be much the same regardless of
whether the port is forward or backward, I suppose.

> 
> Do you know whether the midlayer has similar problems with other drivers?

No, not sure.   One thing that's a bit unusual about hpsa is it uses
the scan_start and scan_finished members of scsi_host_template, so hpsa
does its own scanning, rather than let the midlayer do the scanning which
is due to Smart Array's weirdness around the vicinity of SCSI_REPORT_LUNS.

I suspect that a lld driver calling scsi_add_device() on something which
is NOT READY/FORMAT IN PROGRESS is what provokes the trouble.  Most drivers
do not call scsi_add_device() directly at all, so it's quite possible most
drivers do not experience such a problem. A few do call scsi_add_device()
directly, like ipr or pmcraid, so these might conceivably have a similar
problem.  

We ran into this problem with what we call "Rapid Parity Initialization", which
is what you get when the RAID controller leaves the logical volume in a NOT
READY/FORMAT IN PROGRESS state and devotes itself entirely to initializing
parity data and when that's done, then the volume starts acting normally.  

Initializing the parity data can take quite a long time (hours), but not as
long as initializing it on the fly under load, which, with very large,
relatively slow drives can take nigh on forever, hence the "rapid" parity
initialization moniker.  So, if those other RAID controllers don't have a
similar feature that produces a relatively long lived NOT READY/FORMAT IN
PROGRESS state, they may not bump into the problem.

It has been awhile since I've tried letting the driver call scsi_add_device()
on a device which is undergoing Rapid Parity Initialization, so I need to try
that with current code and see how it behaves.  I haven't thought about how to
fix it within the SCSI mid layer (presuming it still doesn't behave well)
since previously we only concerned ourselves with avoiding provoking the
undesirable behavior.

-- steve

> 
> Tomas
> 
> >
> >> Also maybe a cmd_special_free is missing - see below
> > D'oh.  Ok, now that's just embarassing.  Thanks.
> >
> > -- steve
> >
> >> Cheers, Tomas
> >> Signed-off-by: Stephen M. Cameron 
> >> ---
> >>  drivers/scsi/hpsa.c |   50 
> >> --
> >>  drivers/scsi/hpsa.h |1 +
> >>  2 files changed, 49 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> >> index b7f405f..38e3af4 100644
> >> --- a/drivers/scsi/hpsa.c
> >> +++ b/drivers/scsi/hpsa.c
> >> @@ -1010,6 +1010,20 @@ static void adjust_hpsa_scsi_table(struct ctlr_info 
> >> *h, int hostno,
> >>for (i = 0; i < nsds; i++) {
> >>if (!sd[i]) /* if already added above. */
> >>continue;
> >> +
> >> +  /* Don't add devices which are NOT READY, FORMAT IN PROGRESS
> >> +   * as the SCSI mid-layer does not handle such devices well.
> >> +   * It relentlessly loops sending TUR at 3Hz, then R

Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-09-27 Thread scameron
On Fri, Sep 27, 2013 at 12:54:39PM -0400, Douglas Gilbert wrote:
> On 13-09-27 10:41 AM, scame...@beardog.cce.hp.com wrote:
> >On Fri, Sep 27, 2013 at 04:01:30PM +0200, Tomas Henzl wrote:
> >>On 09/27/2013 03:34 PM, scame...@beardog.cce.hp.com wrote:
> >>>On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
> On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> >From: Stephen M. Cameron 
> >
> >SCSI mid layer doesn't seem to handle logical drives undergoing format
> >very well.  scsi_add_device on such devices seems to result in hitting
> >those devices with a TUR at a rate of 3Hz for awhile, transitioning
> >to hitting them with a READ(10) at a much higher rate indefinitely,
> >and at boot time, this prevents the system from coming up.  If we
> >do not expose such devices to the kernel, it isn't bothered by them.
> Is the result of this patch that the drive is no more visible for the 
> user
> and he can't follow the formatting progress?
> >>>Yes (subsequent patch monitors the progress and brings the drive
> >>>online when it's ready).
> >>>
> I think a better option is to fix the kernel to handle formatting 
> devices better
> >>>Yeah, you're probably right. (This is what comes of writing code for all
> >>>the distros then forward porting to kernel.org code.  
> >>>Grumble-grumble-management
> >>>grumble-grumble real-world problems.)
> >>>
> or harden the hpsa so it can cope with TURs or reads (ignore) from a 
> formatting
> device.
> >>>I don't think hpsa driver had any problem with the TURs or READs though,
> >>>they would be returned to the mid layer just fine (TUR returned sense 
> >>>data
> >>>indicating not ready, format in progress, I forget what the reads
> >>>returned, whatever the firmware filled in for the sense data, which
> >>>was reasonable), but the mid-layer was relentless and just never
> >>>really proceeded, iirc.
> >>>
> >>>Since we were trying to make this work on existing OSes where fixing the
> >>>SCSI mid layer wasn't an option, we came up with this.
> >>
> >>I'm actually glad that you care about existing OSes :)
> >
> >And the pain of porting would be much the same regardless of
> >whether the port is forward or backward, I suppose.
> >
> >>
> >>Do you know whether the midlayer has similar problems with other drivers?
> >
> >No, not sure.   One thing that's a bit unusual about hpsa is it uses
> >the scan_start and scan_finished members of scsi_host_template, so hpsa
> >does its own scanning, rather than let the midlayer do the scanning which
> >is due to Smart Array's weirdness around the vicinity of SCSI_REPORT_LUNS.
> >
> >I suspect that a lld driver calling scsi_add_device() on something which
> >is NOT READY/FORMAT IN PROGRESS is what provokes the trouble.  Most drivers
> >do not call scsi_add_device() directly at all, so it's quite possible most
> >drivers do not experience such a problem. A few do call scsi_add_device()
> >directly, like ipr or pmcraid, so these might conceivably have a similar
> >problem.
> >
> >We ran into this problem with what we call "Rapid Parity Initialization", 
> >which
> >is what you get when the RAID controller leaves the logical volume in a NOT
> >READY/FORMAT IN PROGRESS state and devotes itself entirely to initializing
> >parity data and when that's done, then the volume starts acting normally.
> >
> >Initializing the parity data can take quite a long time (hours), but not as
> >long as initializing it on the fly under load, which, with very large,
> >relatively slow drives can take nigh on forever, hence the "rapid" parity
> >initialization moniker.  So, if those other RAID controllers don't have a
> >similar feature that produces a relatively long lived NOT READY/FORMAT IN
> >PROGRESS state, they may not bump into the problem.
> 
> {0x04,0x04,"Logical unit not ready, format in progress"},
> {0x04,0x05,"Logical unit not ready, rebuild in progress"},
> {0x04,0x06,"Logical unit not ready, recalculation in progress"},
> {0x04,0x07,"Logical unit not ready, operation in progress"},
> ...
> {0x04,0x1b,"Logical unit not ready, sanitize in progress"},
> 
> Wouldn't perhaps 0x4,0x5 be more accurate?  If someone managed to
> send a FORMAT UNIT or SANITIZE to a physical drive behind your LV,
> that would be a completely different issue.

Perhaps, but 0x04/0x04 is what the firmware returns in this instance.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-09-27 Thread scameron
On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
> On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> >
> > SCSI mid layer doesn't seem to handle logical drives undergoing format
> > very well.  scsi_add_device on such devices seems to result in hitting
> > those devices with a TUR at a rate of 3Hz for awhile, transitioning
> > to hitting them with a READ(10) at a much higher rate indefinitely,
> > and at boot time, this prevents the system from coming up.  If we
> > do not expose such devices to the kernel, it isn't bothered by them.
> 
> Is the result of this patch that the drive is no more visible for the user
> and he can't follow the formatting progress? 
> I think a better option is to fix the kernel to handle formatting devices 
> better
> or harden the hpsa so it can cope with TURs or reads (ignore) from a 
> formatting device.

So here is the behavior I see with linux-3.12-rc2 when create a logical
drive with rapid parity initialization enabled and then reboot
before the drive finishes.  Note that scsi 0:0:0:1 is
the device that's in this state.  Interspersed are some notes from
me, prefixed "smc> "

Summary: First you see sd (I think) printing dots very slowly.
Then you see udev get angry.  Then a couple stack traces one
from modprobe and one from dmraid, and the system doesn't
boot up.  20-something minutes have elapsed at this point. It
may eventually boot when the RPI finally finishes, but at this
point, I don't care, because 20 minutes is too long to be holding
things up.


HP HPSA Driver (v 3.4.0-1)  
hpsa :02:00.0: can't disable ASPM; OS doesn't have ASPM control 
hpsa :02:00.0: MSIX 
hpsa :02:00.0: hpsa0: <0x323b> at IRQ 64 using DAC  
scsi0 : hpsa
hpsa :02:00.0: RAID  device c0b3t0l0 added. 
hpsa :02:00.0: Direct-Access device c0b0t0l0 added. 
hpsa :02:00.0: Direct-Access device c0b0t0l1 added. 
hpsa :02:00.0: Direct-Access device c0b0t0l2 added. 
usb 1-1.3: new low-speed USB device number 3 using ehci-pci 
scsi 0:3:0:0: RAID  HP   P420i5.19 PQ: 0 ANSI: 5
scsi 0:0:0:0: Direct-Access HP   LOGICAL VOLUME   5.19 PQ: 0 ANSI: 5
scsi 0:0:0:1: Direct-Access HP   LOGICAL VOLUME   5.19 PQ: 0 ANSI: 5
scsi 0:0:0:2: Direct-Access HP   LOGICAL VOLUME   5.19 PQ: 0 ANSI: 5
ata_piix :00:1f.2: MAP [
 P0 P2 P1 P3 ]  
usb 1-1.3: New USB device found, idVendor=0624, idProduct=0341  
usb 1-1.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0 
usb 1-1.3: Product: HP 336047-B21   
usb 1-1.3: Manufacturer: Avocent
input: Avocent HP 336047-B21 as /devices/pci:00/:00:1a.0/usb1/1-1/1-1.31
hid-generic 0003:0624:0341.0001: input,hidraw0: USB HID v1.10 Keyboard [Avocent0
scsi1 : ata_piix
scsi2 : ata_piix
ata1: SATA max UDMA/133 cmd 0x4000 ctl 0x4008 bmdma 0x4020 irq 17   
ata2: SATA max UDMA/133 cmd 0x4010 ctl 0x4018 bmdma 0x4028 irq 17   
input: Avocent HP 336047-B21 as /devices/pci:00/:00:1a.0/usb1/1-1/1-1.32
hid-generic 0003:0624:0341.0002: input,hidraw1: USB HID v1.10 Mouse [Avocent HP1
sd 0:0:0:0: [sda] 2344160432 512-byte logical blocks: (1.20 TB/1.09 TiB)
sd 0:0:0:1: [sdb] Spinning up disk...   
usb 2-1.3: new high-speed USB device number 3 using ehci-pci
sd 0:0:0:2: [sdc] 390651840 512-byte logical blocks: (200 GB/186 GiB)   
sd 0:0:0:0: [sda] Write Protect is off  
sd 0:0:0:2: [sdc] Write Protect is off  
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DA
sd 0:0:0:2: [sdc] Write cache: disabled, read cache: enabled, doesn't support DA
 sdc: unknown partition table   
sd 0:0:0:2: [sdc] Attached SCSI disk
 sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
usb 2-1.3: New USB device found, idVendor=0424, idProduct=2660  
usb 2-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0 
hub 2-1.3:1.0: USB hub found 

Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-09-30 Thread scameron
On Fri, Sep 27, 2013 at 04:58:41PM +0200, Tomas Henzl wrote:
> On 09/27/2013 04:41 PM, scame...@beardog.cce.hp.com wrote:
> > On Fri, Sep 27, 2013 at 04:01:30PM +0200, Tomas Henzl wrote:
> >> On 09/27/2013 03:34 PM, scame...@beardog.cce.hp.com wrote:
> >>> On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
>  On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> >
> > SCSI mid layer doesn't seem to handle logical drives undergoing format
> > very well.  scsi_add_device on such devices seems to result in hitting
> > those devices with a TUR at a rate of 3Hz for awhile, transitioning
> > to hitting them with a READ(10) at a much higher rate indefinitely,
> > and at boot time, this prevents the system from coming up.  If we
> > do not expose such devices to the kernel, it isn't bothered by them.
>  Is the result of this patch that the drive is no more visible for the 
>  user
>  and he can't follow the formatting progress? 
> >>> Yes (subsequent patch monitors the progress and brings the drive
> >>> online when it's ready).
> >>>
>  I think a better option is to fix the kernel to handle formatting 
>  devices better
> >>> Yeah, you're probably right. (This is what comes of writing code for all
> >>> the distros then forward porting to kernel.org code.  
> >>> Grumble-grumble-management
> >>> grumble-grumble real-world problems.)
> >>>
>  or harden the hpsa so it can cope with TURs or reads (ignore) from a 
>  formatting
>  device.
> >>> I don't think hpsa driver had any problem with the TURs or READs though,
> >>> they would be returned to the mid layer just fine (TUR returned sense data
> >>> indicating not ready, format in progress, I forget what the reads
> >>> returned, whatever the firmware filled in for the sense data, which
> >>> was reasonable), but the mid-layer was relentless and just never
> >>> really proceeded, iirc.
> >>>
> >>> Since we were trying to make this work on existing OSes where fixing the
> >>> SCSI mid layer wasn't an option, we came up with this.
> >> I'm actually glad that you care about existing OSes :)
> > And the pain of porting would be much the same regardless of
> > whether the port is forward or backward, I suppose.
> >
> >> Do you know whether the midlayer has similar problems with other drivers?
> > No, not sure.   One thing that's a bit unusual about hpsa is it uses
> > the scan_start and scan_finished members of scsi_host_template, so hpsa
> > does its own scanning, rather than let the midlayer do the scanning which
> > is due to Smart Array's weirdness around the vicinity of SCSI_REPORT_LUNS.
> >
> > I suspect that a lld driver calling scsi_add_device() on something which
> > is NOT READY/FORMAT IN PROGRESS is what provokes the trouble.  Most drivers
> > do not call scsi_add_device() directly at all, so it's quite possible most
> > drivers do not experience such a problem. A few do call scsi_add_device()
> > directly, like ipr or pmcraid, so these might conceivably have a similar
> > problem.  
> >
> > We ran into this problem with what we call "Rapid Parity Initialization", 
> > which
> > is what you get when the RAID controller leaves the logical volume in a NOT
> > READY/FORMAT IN PROGRESS state and devotes itself entirely to initializing
> > parity data and when that's done, then the volume starts acting normally.  
> >
> > Initializing the parity data can take quite a long time (hours), but not as
> > long as initializing it on the fly under load, which, with very large,
> > relatively slow drives can take nigh on forever, hence the "rapid" parity
> > initialization moniker.  So, if those other RAID controllers don't have a
> > similar feature that produces a relatively long lived NOT READY/FORMAT IN
> > PROGRESS state, they may not bump into the problem.
> >
> > It has been awhile since I've tried letting the driver call 
> > scsi_add_device()
> > on a device which is undergoing Rapid Parity Initialization, so I need to 
> > try
> > that with current code and see how it behaves.  I haven't thought about how 
> > to
> > fix it within the SCSI mid layer (presuming it still doesn't behave well)
> > since previously we only concerned ourselves with avoiding provoking the
> > undesirable behavior.
> >
> > -- steve
> 
> Thanks for the explanation. I hope I can look into this later. Sometimes 
> later. When my
> real-world problems go away...

I have taken a stab at it... still working on it a bit, but it worked at
least one time for me.  Before I go too far off into the weeds, let me
describe what I have done.

In the sd driver, sd_revalidate_disk() calls sd_spinup_disk().
I modified sd_spinup_disk() to detect the NOT READY/FORMAT IN PROGRESS
status and return a status indicating that this state has been detected:

defer_revalidation = sd_spinup_disk(sdkp);
...

then, similar to media not present condition, skip trying to get
a bunch of stuff.

 

Re: [PATCH 07/10] hpsa: hide logical drives with format in progress from linux

2013-10-10 Thread scameron
On Fri, Sep 27, 2013 at 08:34:51AM -0500, scame...@beardog.cce.hp.com wrote:
> On Fri, Sep 27, 2013 at 03:22:19PM +0200, Tomas Henzl wrote:
> > On 09/23/2013 08:34 PM, Stephen M. Cameron wrote:
> > > From: Stephen M. Cameron 
> > >
> > > SCSI mid layer doesn't seem to handle logical drives undergoing format
> > > very well.  scsi_add_device on such devices seems to result in hitting
> > > those devices with a TUR at a rate of 3Hz for awhile, transitioning
> > > to hitting them with a READ(10) at a much higher rate indefinitely,
> > > and at boot time, this prevents the system from coming up.  If we
> > > do not expose such devices to the kernel, it isn't bothered by them.
> > 
> > Is the result of this patch that the drive is no more visible for the user
> > and he can't follow the formatting progress? 
> 
> Yes (subsequent patch monitors the progress and brings the drive
> online when it's ready).
> 
> > I think a better option is to fix the kernel to handle formatting devices 
> > better
> 
> Yeah, you're probably right. (This is what comes of writing code for all
> the distros then forward porting to kernel.org code.  
> Grumble-grumble-management
> grumble-grumble real-world problems.)

[...]

So I took a stab at modifying sd (below) which I am now thinking may not
be the right approach.  Empirically, it seems to work, but there are a 
couple of things not quite right.

The gist of the patch is, when sd_spinup_device encounters a scsi disk which
is NOT READY/FORMAT IN PROGRESS, it adds it to a list, and starts up a
thread to monitor the list (if one isn't already running).  The thread
periodically polls the disks until they become ready, and revalidates them
and removes them from the list when they become ready.  When the list is
empty the thread exits.  There are a lot of complicated locking and
get_device/put_device reference counting to make sure various rugs don't
get ripped out from under the thread.

One problematic area is on rmmod of sd.  It's using kthread_stop() to stop
the thread, but what if the thread's already gone?  I didn't figure out
how to get around that, and came to the conclusion that having the thread
sometimes decide when to exit on it's own, and having it sometimes be told
to exit on rmmod of sd probably can't work (although that is exactly what
I would like to do.)  So, hmm, not satisfactory.

But then I started thinking that maybe I'm approaching it all wrong.  Maybe
this should be done (mostly) from userspace.  Maybe some small change to sd
just to keep it from waiting around too long for disks that are in 
NOT READY/FORMAT IN PROGRESS state, and some change to udevd(?) or udev
rules... or something udev-ey? so that it can notice when a device that is
in this state has been added to the system and launch a userspace daemon to
poll the device with TURs via SG_IO and when it becomes ready, trigger the
revalidation from userspace.  That would probably be better than what I've
done below.

But, I'm not really sure where to start to get something like that rolling,
presuming doing something like that would be a better approach.

-- steve


commit 6b4265426c8aff03e4c28372409a3f1c2c760c28
Author: Stephen M. Cameron 
Date:   Wed Oct 9 19:03:46 2013 -0500

scsi: improve probing of disks that are temporarily offline

The current handling of disks which are NOT READY/FORMAT
IN PROGRESS is not ideal.  Instead of waiting for a period of time
for the disk to become ready and then giving up in sd_spinup_disk(),
we can create a thread to wait for a list of disks which are
expected to eventually become ready which periodically polls such
disks and revalidates them when they become ready.  When the list
of not-ready disks becomes empty, the thread will exit.  This allows
other things to proceed, for example, allowing the system to boot
without a the usual delay that is mostly futile since the odds
are good that a format in progress won't finish during the time
that sd_spinup_disk would have waited.


diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index e62d17d..3158780 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -126,6 +127,19 @@ static DEFINE_MUTEX(sd_ref_mutex);
 static struct kmem_cache *sd_cdb_cache;
 static mempool_t *sd_cdb_pool;
 
+static DEFINE_SPINLOCK(sd_revalidation_lock);
+struct sd_deferred_disk_list_entry {
+   struct scsi_disk *sdkp;
+   struct list_head deferred_list;
+};
+static struct list_head sd_deferred_disk_list =
+   LIST_HEAD_INIT(sd_deferred_disk_list);
+#define REVALIDATION_THREAD_STOPPED 0
+#define REVALIDATION_THREAD_RUNNING 1
+#define REVALIDATION_THREAD_STOPPING 2
+static int sd_revalidation_thread_state;
+static struct task_struct *sd_revalidation_thread;
+
 static const char *sd_cache_types[] = {
"write through", "none", "write back",
"write back, no read (daft)"
@@ -17

Re: [PATCH V2] hpsa: refine the pci enable/disable handling

2014-08-14 Thread scameron
On Thu, Aug 14, 2014 at 04:12:39PM +0200, Tomas Henzl wrote:
> When a second(kdump) kernel starts and the hard reset method is used
> the driver calls pci_disable_device without previously enabling it,
> so the kernel shows a warning -
> [   16.876248] WARNING: at drivers/pci/pci.c:1431 
> pci_disable_device+0x84/0x90()
> [   16.882686] Device hpsa
> disabling already-disabled device
> ...
> This patch fixes it, in addition to this I tried to balance also some other 
> pairs
> of enable/disable device in the driver.
> Unfortunately I wasn't able to verify the functionality for the case of a sw 
> reset,
> because of a lack of proper hw.
> 
> Changes in v2:
> - fixed a checkpatch issue
> - removed a no more valid comment
> - added pci_enable/disable pair when kdump kernel starts
> - fixed a typo in subject line
> 
> Signed-off-by: Tomas Henzl 
> ---
>  drivers/scsi/hpsa.c | 42 --
>  1 file changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 6b35d0d..cf793c3 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -5971,10 +5971,6 @@ static int hpsa_kdump_hard_reset_controller(struct 
> pci_dev *pdev)
>  
>   /* Save the PCI command register */
>   pci_read_config_word(pdev, 4, &command_register);
> - /* Turn the board off.  This is so that later pci_restore_state()
> -  * won't turn the board on before the rest of config space is ready.
> -  */
> - pci_disable_device(pdev);
>   pci_save_state(pdev);
>  
>   /* find the first memory BAR, so we can find the cfg table */
> @@ -6022,11 +6018,6 @@ static int hpsa_kdump_hard_reset_controller(struct 
> pci_dev *pdev)
>   goto unmap_cfgtable;
>  
>   pci_restore_state(pdev);
> - rc = pci_enable_device(pdev);
> - if (rc) {
> - dev_warn(&pdev->dev, "failed to enable device.\n");
> - goto unmap_cfgtable;
> - }
>   pci_write_config_word(pdev, 4, command_register);
>  
>   /* Some devices (notably the HP Smart Array 5i Controller)
> @@ -6541,6 +6532,23 @@ static int hpsa_init_reset_devices(struct pci_dev 
> *pdev)
>   if (!reset_devices)
>   return 0;
>  
> + /* kdump kernel is loading, we don't know in which state is
> +  * the pci interface. The dev->enable_cnt is equal zero
> +  * so we call enable+disable, wait a while and switch it on.
> +  */
> + rc = pci_enable_device(pdev);
> + if (rc) {
> + dev_warn(&pdev->dev, "Failed to enable PCI device\n");
> + return -ENODEV;
> + }
> + pci_disable_device(pdev);
> + msleep(260);/* a randomly chosen number */
> + rc = pci_enable_device(pdev);
> + if (rc) {
> + dev_warn(&pdev->dev, "failed to enable device.\n");
> + return -ENODEV;
> + }
> +
>   /* Reset the controller with a PCI power-cycle or via doorbell */
>   rc = hpsa_kdump_hard_reset_controller(pdev);
>  
> @@ -6549,10 +6557,11 @@ static int hpsa_init_reset_devices(struct pci_dev 
> *pdev)
>* "performant mode".  Or, it might be 640x, which can't reset
>* due to concerns about shared bbwc between 6402/6404 pair.
>*/
> - if (rc == -ENOTSUPP)
> - return rc; /* just try to do the kdump anyhow. */
> - if (rc)
> - return -ENODEV;
> + if (rc) {
> + if (rc != -ENOTSUPP) /* just try to do the kdump anyhow. */
> + rc = -ENODEV;
> + goto out_disable;
> + }
>  
>   /* Now try to get the controller to respond to a no-op */
>   dev_warn(&pdev->dev, "Waiting for controller to respond to no-op\n");
> @@ -6563,7 +6572,11 @@ static int hpsa_init_reset_devices(struct pci_dev 
> *pdev)
>   dev_warn(&pdev->dev, "no-op failed%s\n",
>   (i < 11 ? "; re-trying" : ""));
>   }
> - return 0;
> +
> +out_disable:
> +
> + pci_disable_device(pdev);
> + return rc;
>  }
>  
>  static int hpsa_allocate_cmd_pool(struct ctlr_info *h)
> @@ -6743,6 +6756,7 @@ static void 
> hpsa_undo_allocations_after_kdump_soft_reset(struct ctlr_info *h)
>   iounmap(h->transtable);
>   if (h->cfgtable)
>   iounmap(h->cfgtable);
> + pci_disable_device(h->pdev);
>   pci_release_regions(h->pdev);
>   kfree(h);
>  }
> -- 
> 1.8.3.1

Ack.

Reviewed-by: Stephen M. Cameron 

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 RESEND 09/23] hpsa: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-08-15 Thread scameron
On Wed, Jul 16, 2014 at 08:05:13PM +0200, Alexander Gordeev wrote:
> As result of deprecation of MSI-X/MSI enablement functions
> pci_enable_msix() and pci_enable_msi_block() all drivers
> using these two interfaces need to be updated to use the
> new pci_enable_msi_range()  or pci_enable_msi_exact()
> and pci_enable_msix_range() or pci_enable_msix_exact()
> interfaces.
> 
> Cc: "Stephen M. Cameron" 
> Cc: iss_storage...@hp.com
> Cc: linux-scsi@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Signed-off-by: Alexander Gordeev 
> ---
>  drivers/scsi/hpsa.c |   27 ---
>  1 files changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 648dec2..5c4ab6e 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -6160,25 +6160,22 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
>   h->msix_vector = MAX_REPLY_QUEUES;
>   if (h->msix_vector > num_online_cpus())
>   h->msix_vector = num_online_cpus();
> - err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> -   h->msix_vector);
> - if (err > 0) {
> + err = pci_enable_msix_range(h->pdev, hpsa_msix_entries,
> + 1, h->msix_vector);
> + if (err < 0) {
> + dev_warn(&h->pdev->dev, "MSI-X init failed %d\n", err);
> + h->msix_vector = 0;
> + goto single_msi_mode;
> + } else if (err < h->msix_vector) {
>   dev_warn(&h->pdev->dev, "only %d MSI-X vectors "
>  "available\n", err);
> - h->msix_vector = err;
> - err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> -   h->msix_vector);
> - }
> - if (!err) {
> - for (i = 0; i < h->msix_vector; i++)
> - h->intr[i] = hpsa_msix_entries[i].vector;
> - return;
> - } else {
> - dev_warn(&h->pdev->dev, "MSI-X init failed %d\n",
> -err);
> - h->msix_vector = 0;
>   }
> + h->msix_vector = err;
> + for (i = 0; i < h->msix_vector; i++)
> + h->intr[i] = hpsa_msix_entries[i].vector;
> + return;
>   }
> +single_msi_mode:
>   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSI)) {
>   dev_info(&h->pdev->dev, "MSI\n");
>   if (!pci_enable_msi(h->pdev))
> -- 
> 1.7.7.6

Ack. 

I had previously acked an older, but seemingly semantically identical patch.
Was requested to ack this one as well.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question about scsi error handling and LLD

2014-08-19 Thread scameron


Is it permitted for a LLD to call scsi_done() for a command which
has been or is attempting to be aborted via the LLD abort error handler
function?

I am looking at the scsi mid layer code, and I'm a bit confused.

There seem to be a few ways the LLD abort handler function can
get called.

One path:

scsi_times_out() calls scsi_abort_command() which queues
cmd->abort_work in 10ms.  This will cause scmd_eh_abort_handler()
to be called, which can call scsi_try_to_abort_cmd() which
calls the LLD abort error handler function.  If the LLD abort
error handler function returns SUCCESS, then scmd_eh_abort_handler()
may requeue the command via scsi_queue_insert() re-using the same
scsi_cmnd pointer.  Or, it may call scsi_finish_cmnd() on that
scsi_cmnd.

Both of those actions, requeueing the same scsi command, or calling
scsi_finish_cmnd(), seem incompatible with the LLD also having
called scsi_done() on the command.

I think, scsi_done eventually frees the scsi_cmnd...

scsi_done() -> blk_complete_request -> __blk_complete_request ->
-> raise_softirq_irqoff -> ??? -> scsi_softirq_done() ->
-> scsi_finish_cmnd() -> scsi_io_completion -> scsi_end_request()
-> scsi_mq_uninit_cmd() ... etc.

It seems like either calling scsi_finish_cmnd() too many times (once via
LLD calling scsi_done, and once via scmd_eh_abort_handler calling
scsi_finish_cmnd()), or calling scsi_done() via LLD concurrently with
re-using that same scsi command (via scmd_eh_abort_handler calling
scsi_queue_insert()) would both be bad things.  Correct?

So, without completely understanding it, it seems like if the LLD
abort handler is going to return SUCCESS, it must be sure that it
hasn't already called scsi_done() on the command.  Or to put it
another way, the LLD must make sure there is not a race between
the abort handler and the interrupt handler, and the instant
the LLD abort handler is called for a command, calling scsi_done()
for that command is no longer an option for the LLD.  But, I do
not see how to synchronize that.

I hope I'm missing something here, because I am having a hard time
seeing how to eliminate the race between the LLD abort handler and
command completion in the LLD, esp. since now (due to my hardware's
specific weirdness) I may in the interrupt handler requeue the command
internally in a workqueue within the LLD for submission to HW a 2nd
time before completion back to the mid layer under some circumstances
(e.g. fast path failed, try slower but more robust RAID stack path.)

But, then there is this comment attached to scsi_times_out() which
makes me feel better:

 * Notes:
 * We do not need to lock this.  There is the potential for a race
 * only in that the normal completion handling might run, but if the
 * normal completion function determines that the timer has already
 * fired, then it mustn't do anything.
 */

So, perhaps that is the answer I'm looking for.  If the comment is
correct and my understanding of it is correct, then the LLD *is* free
to call scsi_done() even though it races with the LLD abort handler?

I'm not finding the code that implements what that comment claims though.
Is it in the block layer code?

-- steve


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Question about scsi error handling and LLD

2014-08-19 Thread scameron
On Tue, Aug 19, 2014 at 12:16:12PM -0500, James Bottomley wrote:
> On Tue, 2014-08-19 at 10:30 -0500, scame...@beardog.cce.hp.com wrote:
> > 
> > Is it permitted for a LLD to call scsi_done() for a command which
> > has been or is attempting to be aborted via the LLD abort error handler
> > function?
> > 
> > I am looking at the scsi mid layer code, and I'm a bit confused.
> 
> Hannes asked a similar question; does this help:
> 
> http://marc.info/?l=linux-scsi&m=140267838320488
[...]

Yes, that helps.  Thanks.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [SCSI] hpsa: fix return value check in start_controller_lockup_detector()

2013-10-29 Thread scameron
On Tue, Oct 29, 2013 at 02:43:56PM +0800, Wei Yongjun wrote:
> From: Wei Yongjun 
> 
> In case of error, the function kthread_run() returns ERR_PTR()
> and never returns NULL. The NULL test in the return value check
> should be replaced with IS_ERR().
> 
> Signed-off-by: Wei Yongjun 
> ---
>  drivers/scsi/hpsa.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 891c86b..f413b14 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -4757,7 +4757,8 @@ static void start_controller_lockup_detector(struct 
> ctlr_info *h)
>   kthread_run(detect_controller_lockup_thread,
>   NULL, HPSA);
>   }
> - if (!hpsa_lockup_detector) {
> + if (IS_ERR(hpsa_lockup_detector)) {
> + hpsa_lockup_detector = NULL;
>   dev_warn(&h->pdev->dev,
>   "Could not start lockup detector thread\n");
>   return;

Ack.

Thanks.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [SCSI] hpsa: fix return value check in start_controller_lockup_detector()

2013-10-30 Thread scameron
On Tue, Oct 29, 2013 at 02:52:38PM -0700, James Bottomley wrote:
> On Tue, 2013-10-29 at 11:09 -0500, scame...@beardog.cce.hp.com wrote:
> > On Tue, Oct 29, 2013 at 02:43:56PM +0800, Wei Yongjun wrote:
> > > From: Wei Yongjun 
> > > 
> > > In case of error, the function kthread_run() returns ERR_PTR()
> > > and never returns NULL. The NULL test in the return value check
> > > should be replaced with IS_ERR().
> > > 
> > > Signed-off-by: Wei Yongjun 
> > > ---
> > >  drivers/scsi/hpsa.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > > index 891c86b..f413b14 100644
> > > --- a/drivers/scsi/hpsa.c
> > > +++ b/drivers/scsi/hpsa.c
> > > @@ -4757,7 +4757,8 @@ static void start_controller_lockup_detector(struct 
> > > ctlr_info *h)
> > >   kthread_run(detect_controller_lockup_thread,
> > >   NULL, HPSA);
> > >   }
> > > - if (!hpsa_lockup_detector) {
> > > + if (IS_ERR(hpsa_lockup_detector)) {
> > > + hpsa_lockup_detector = NULL;
> > >   dev_warn(&h->pdev->dev,
> > >   "Could not start lockup detector thread\n");
> > >   return;
> > 
> > Ack.
> 
> Hmm, this whole lockup thread start/stop thing is horribly racy with
> hotplug (and bind and unbind).
> 
> Firstly, why do you need an actual thread, why not just use a workqueue?
> That way you don't need a list or a lock, you can just schedule work for
> each instance?  since they don't interfere, no global list and no lock.
> 
> If you insist on a thread, it should probably start at module insertion
> and be shut down at module removal, so no races.  You can have it sleep
> forever on list_empty(&hpsa_ctlr_list) and wake it when you add a
> controller.

Alright, I'll work on it.  Thanks.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 03/11] hpsa: add 5 second delay after doorbell reset

2013-11-08 Thread scameron
On Fri, Nov 08, 2013 at 02:51:37PM +0100, Tomas Henzl wrote:
> On 11/07/2013 05:45 PM, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> >
> > The hardware guys tell us that after initiating a software
> > reset via the doorbell register we need to wait 5 seconds before
> > attempting to talk to the board *at all*.  This means that we
> > cannot watch the board to verify it transitions from "ready" to
> > to "not ready" then back "ready", since this transition will
> > most likely happen during those 5 seconds (though we can still
> > verify the reset happens by watching the "driver version" field
> > get cleared.)
> >
> > Signed-off-by: Stephen M. Cameron 
> > ---
> >  drivers/scsi/hpsa.c |   32 +++-
> >  1 files changed, 23 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index 20fc598..fff5fd3 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -3781,6 +3781,13 @@ static int hpsa_controller_hard_reset(struct pci_dev 
> > *pdev,
> >  */
> > dev_info(&pdev->dev, "using doorbell to reset controller\n");
> > writel(use_doorbell, vaddr + SA5_DOORBELL);
> > +
> > +   /* PMC hardware guys tell us we need a 5 second delay after
> > +* doorbell reset and before any attempt to talk to the board
> > +* at all to ensure that this actually works and doesn't fall
> > +* over in some weird corner cases.
> > +*/
> > +   msleep(5000);
> > } else { /* Try to do it the PCI power state way */
> >  
> > /* Quoting from the Open CISS Specification: "The Power
> > @@ -3977,15 +3984,22 @@ static int hpsa_kdump_hard_reset_controller(struct 
> > pci_dev *pdev)
> >need a little pause here */
> > msleep(HPSA_POST_RESET_PAUSE_MSECS);
> 
> I know it's complicated with a lot of different devices and fw versions,
> but here^ we wait for 3sec - isn't the method - wait for 3s then wait for 
> board not ready
> a bit fragile, what if a board comes up faster?
> When the method "watching the "driver version"" works why don't you want to 
> use it  
> regardless of the reset method used?

The "watching the driver version" thing is only there to catch if
the firmware guys break things and turn the reset into a no-op
(which happened with the PCI power manaegment based reset and we
didn't catch it for a year or so because we didn't have that check)

We aren't supposed to look at the driver version field (or anything)
until we first verify the scratch pad register says the firmware is
ready.  In the case of those boards that use the "doorbell" reset,
we aren't supposed to look at *anything* for the first five seconds.

I have been bugging the firmware/hardware guys for a sane reset
procedure that actually works reliably for years with no luck.

For the SCSI over PCIe driver, being tired of this crap, I simply
unconditionally reset the device on driver load every single time,
and did this from the beginning.  This kind of forced the firmware
and hardware guys to make the reset on that thing work reliably
and quickly, and since I did that from the earliest days, they didn't
have a chance to screw it up without it being caught immediately.
For Smart Array, obviously it's too late for that approach.

-- steve

> 
> >  
> > -   /* Wait for board to become not ready, then ready. */
> > -   dev_info(&pdev->dev, "Waiting for board to reset.\n");
> > -   rc = hpsa_wait_for_board_state(pdev, vaddr, BOARD_NOT_READY);
> > -   if (rc) {
> > -   dev_warn(&pdev->dev,
> > -   "failed waiting for board to reset."
> > -   " Will try soft reset.\n");
> > -   rc = -ENOTSUPP; /* Not expected, but try soft reset later */
> > -   goto unmap_cfgtable;
> > +   if (!use_doorbell) {
> > +   /* Wait for board to become not ready, then ready.
> > +* (if we used the doorbell, then we already waited 5 secs
> > +* so the "not ready" state is already gone by so we
> > +* won't catch it.)
> > +*/
> > +   dev_info(&pdev->dev, "Waiting for board to reset.\n");
> > +   rc = hpsa_wait_for_board_state(pdev, vaddr, BOARD_NOT_READY);
> > +   if (rc) {
> > +   dev_warn(&pdev->dev,
> > +   "failed waiting for board to reset."
> > +   " Will try soft reset.\n");
> > +   /* Not expected, but try soft reset later */
> > +   rc = -ENOTSUPP;
> > +   goto unmap_cfgtable;
> > +   }
> > }
> > rc = hpsa_wait_for_board_state(pdev, vaddr, BOARD_READY);
> > if (rc) {
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this l

Re: [PATCH 03/11] hpsa: add 5 second delay after doorbell reset

2013-11-08 Thread scameron
On Fri, Nov 08, 2013 at 04:02:20PM +0100, Tomas Henzl wrote:
> On 11/08/2013 03:44 PM, scame...@beardog.cce.hp.com wrote:
> > On Fri, Nov 08, 2013 at 02:51:37PM +0100, Tomas Henzl wrote:
> >> On 11/07/2013 05:45 PM, Stephen M. Cameron wrote:
> >>> From: Stephen M. Cameron 
> >>>
> >>> The hardware guys tell us that after initiating a software
> >>> reset via the doorbell register we need to wait 5 seconds before
> >>> attempting to talk to the board *at all*.  This means that we
> >>> cannot watch the board to verify it transitions from "ready" to
> >>> to "not ready" then back "ready", since this transition will
> >>> most likely happen during those 5 seconds (though we can still
> >>> verify the reset happens by watching the "driver version" field
> >>> get cleared.)
> >>>
> >>> Signed-off-by: Stephen M. Cameron 
> >>> ---
> >>>  drivers/scsi/hpsa.c |   32 +++-
> >>>  1 files changed, 23 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> >>> index 20fc598..fff5fd3 100644
> >>> --- a/drivers/scsi/hpsa.c
> >>> +++ b/drivers/scsi/hpsa.c
> >>> @@ -3781,6 +3781,13 @@ static int hpsa_controller_hard_reset(struct 
> >>> pci_dev *pdev,
> >>>*/
> >>>   dev_info(&pdev->dev, "using doorbell to reset controller\n");
> >>>   writel(use_doorbell, vaddr + SA5_DOORBELL);
> >>> +
> >>> + /* PMC hardware guys tell us we need a 5 second delay after
> >>> +  * doorbell reset and before any attempt to talk to the board
> >>> +  * at all to ensure that this actually works and doesn't fall
> >>> +  * over in some weird corner cases.
> >>> +  */
> >>> + msleep(5000);
> >>>   } else { /* Try to do it the PCI power state way */
> >>>  
> >>>   /* Quoting from the Open CISS Specification: "The Power
> >>> @@ -3977,15 +3984,22 @@ static int 
> >>> hpsa_kdump_hard_reset_controller(struct pci_dev *pdev)
> >>>  need a little pause here */
> >>>   msleep(HPSA_POST_RESET_PAUSE_MSECS);
> >> I know it's complicated with a lot of different devices and fw versions,
> >> but here^ we wait for 3sec - isn't the method - wait for 3s then wait for 
> >> board not ready
> >> a bit fragile, what if a board comes up faster?
> >> When the method "watching the "driver version"" works why don't you want 
> >> to use it  
> >> regardless of the reset method used?
> > The "watching the driver version" thing is only there to catch if
> > the firmware guys break things and turn the reset into a no-op
> > (which happened with the PCI power manaegment based reset and we
> > didn't catch it for a year or so because we didn't have that check)
> >
> > We aren't supposed to look at the driver version field (or anything)
> > until we first verify the scratch pad register says the firmware is
> > ready.  In the case of those boards that use the "doorbell" reset,
> > we aren't supposed to look at *anything* for the first five seconds.
> >
> > I have been bugging the firmware/hardware guys for a sane reset
> > procedure that actually works reliably for years with no luck.
> >
> > For the SCSI over PCIe driver, being tired of this crap, I simply
> > unconditionally reset the device on driver load every single time,
> > and did this from the beginning.  This kind of forced the firmware
> > and hardware guys to make the reset on that thing work reliably
> > and quickly, and since I did that from the earliest days, they didn't
> > have a chance to screw it up without it being caught immediately.
> > For Smart Array, obviously it's too late for that approach.
> 
> OK, my question was more or less if this:
> msleep(HPSA_POST_RESET_PAUSE_MSECS);
> just before waiting for the board to enter BOARD_NOT_READY state
> isn't dangerous - when the board enters a ready state in the first 3sec
> it will wait indefinitely for the not_ready state
> thus whether the test for not ready state shouldn't be removed.
> The mechanism now works somehow and maybe it's better
> not to touch it, I just wanted to draw your attention to that
> potential problem.

Oh ok, I see.  Thanks, yes that does look questionable.  So you
are suggesting to skip the check for transition from NOT READY to 
READY in the scratch pad register in all cases, since we have all
these ridiculous delay requirements preventing us from watching the
board closely enough and so that may mean that we would miss such a
transition.

Let me talk it over with Mike Miller, but it seems reasonable.

-- steve

> 
> 
> >
> > -- steve
> >
> >>>  
> >>> - /* Wait for board to become not ready, then ready. */
> >>> - dev_info(&pdev->dev, "Waiting for board to reset.\n");
> >>> - rc = hpsa_wait_for_board_state(pdev, vaddr, BOARD_NOT_READY);
> >>> - if (rc) {
> >>> - dev_warn(&pdev->dev,
> >>> - "failed waiting for board to reset."
> >>> - " Will try soft reset.\n");
> >>> - rc = -ENOTSUPP; /* Not expected, but try soft reset lat

Re: [PATCH 01/11] hpsa: use workqueue instead of kernel thread for lockup detection

2013-12-04 Thread scameron
On Mon, Dec 02, 2013 at 10:00:02AM -0800, James Bottomley wrote:
> On Thu, 2013-11-07 at 10:45 -0600, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> > 
> > Much simpler and avoids races starting/stopping the thread.
> > 
> > Signed-off-by: Stephen M. Cameron 
> 
> Actually, perhaps you do need to resubmit the series: this is rejecting
> against scsi-misc because of your prior hpsa patches:

Ah, I did not realize you had taken these into your tree:

2f2ec0b [SCSI] hpsa: bring logical drives online when format completes
c99298f [SCSI] hpsa: hide logical drives with format in progress from linux

Since you had complained about the races, and suggested using a
workqueue rather than a kernel thread, I thought that meant you
had dropped them, so I had gone back to the drawing board.  I hope
to have a new version of the series for you this afternoon.

-- steve

> 
> patching file drivers/scsi/hpsa.c
> Hunk #1 succeeded at 170 (offset -2 lines).
> Hunk #2 succeeded at 4881 (offset 247 lines).
> Hunk #3 succeeded at 4899 (offset 247 lines).
> Hunk #4 succeeded at 4918 (offset 247 lines).
> Hunk #5 succeeded at 4947 (offset 247 lines).
> Hunk #6 FAILED at 4862.
> Hunk #7 FAILED at 4928.
> 2 out of 7 hunks FAILED -- saving rejects to file
> drivers/scsi/hpsa.c.rej
> patching file drivers/scsi/hpsa.h
> Hunk #1 FAILED at 130.
> 1 out of 1 hunk FAILED -- saving rejects to file drivers/scsi/hpsa.h.rej
> 
> James
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Question about submitting large patchset for hpsa driver

2014-01-10 Thread scameron

Hi James,

I have a large patch set for hpsa that's been brewing for awhile
and it's finally stabilized.  I notice you have a for-next
and misc branches that are slightly different, and was wondering
which branch of your tree you'd prefer the patches be tailored for,
or if you'd prefer I hold off until after 3.13 comes out, or
maybe something else?

Diffstat of the combined 40 or so patches I have brewing looks like:

 Documentation/ABI/testing/sysfs-class-scsi_host |   16 
 drivers/scsi/hpsa.c | 2544 
 drivers/scsi/hpsa.h |  138 +
 drivers/scsi/hpsa_cmd.h |  255 ++
 4 files changed, 2566 insertions(+), 387 deletions(-)

The thrust of the patch set is to enable a couple alternate ways
of submitting commands directly to physical devices, kind of
going around the normal Smart Array firmware RAID stack for
certain i/o's in order to get a pretty substantial performance
gain when using SSDs.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] hpsa: fixup MSI-X registration

2014-01-15 Thread scameron
On Wed, Jan 15, 2014 at 01:30:53PM +0100, Hannes Reinecke wrote:
> Commit 254f796b9f22b1944c64caabc356a56caaa2facd updated
> the driver to use 16 MSI-X vectors, despite the fact that
> older controllers would provide only 4.
> This was causing MSI-X registration to drop down to INTx
> mode. But as the controller support performant mode, the
> initialisation will become confused and cause the machine
> to stall during boot.
> 
> This patch fixes up the MSI-X registration to re-issue
> the pci_enable_msix() call with the correct number of
> MSI-X vectors. With that the hpsa driver continues to
> works on older controllers like the P200.
> 
> Cc: Stephen M. Cameron 
> Signed-off-by: Hannes Reinecke 

Ack.

Looks good to me, thanks.

-- steve

> ---
>  drivers/scsi/hpsa.c | 31 +--
>  1 file changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 868318a..40989ee 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -598,7 +598,7 @@ static void set_performant_mode(struct ctlr_info *h, 
> struct CommandList *c)
>  {
>   if (likely(h->transMethod & CFGTBL_Trans_Performant)) {
>   c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
> - if (likely(h->msix_vector))
> + if (likely(h->msix_vector > 0))
>   c->Header.ReplyQueue =
>   raw_smp_processor_id() % h->nreply_queues;
>   }
> @@ -4169,21 +4169,24 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
>   goto default_int_mode;
>   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSIX)) {
>   dev_info(&h->pdev->dev, "MSIX\n");
> + h->msix_vector = MAX_REPLY_QUEUES;
>   err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> - MAX_REPLY_QUEUES);
> - if (!err) {
> - for (i = 0; i < MAX_REPLY_QUEUES; i++)
> - h->intr[i] = hpsa_msix_entries[i].vector;
> - h->msix_vector = 1;
> - return;
> - }
> +   h->msix_vector);
>   if (err > 0) {
>   dev_warn(&h->pdev->dev, "only %d MSI-X vectors "
>  "available\n", err);
> - goto default_int_mode;
> + h->msix_vector = err;
> + err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> +   h->msix_vector);
> + }
> + if (!err) {
> + for (i = 0; i < h->msix_vector; i++)
> + h->intr[i] = hpsa_msix_entries[i].vector;
> + return;
>   } else {
>   dev_warn(&h->pdev->dev, "MSI-X init failed %d\n",
>  err);
> + h->msix_vector = 0;
>   goto default_int_mode;
>   }
>   }
> @@ -4597,15 +4600,15 @@ static int hpsa_request_irq(struct ctlr_info *h,
>   for (i = 0; i < MAX_REPLY_QUEUES; i++)
>   h->q[i] = (u8) i;
>  
> - if (h->intr_mode == PERF_MODE_INT && h->msix_vector) {
> + if (h->intr_mode == PERF_MODE_INT && h->msix_vector > 0) {
>   /* If performant mode and MSI-X, use multiple reply queues */
> - for (i = 0; i < MAX_REPLY_QUEUES; i++)
> + for (i = 0; i < h->msix_vector; i++)
>   rc = request_irq(h->intr[i], msixhandler,
>   0, h->devname,
>   &h->q[i]);
>   } else {
>   /* Use single reply pool */
> - if (h->msix_vector || h->msi_vector) {
> + if (h->msix_vector > 0 || h->msi_vector) {
>   rc = request_irq(h->intr[h->intr_mode],
>   msixhandler, 0, h->devname,
>   &h->q[h->intr_mode]);
> @@ -4658,7 +4661,7 @@ static void free_irqs(struct ctlr_info *h)
>   return;
>   }
>  
> - for (i = 0; i < MAX_REPLY_QUEUES; i++)
> + for (i = 0; i < h->msix_vector; i++)
>   free_irq(h->intr[i], &h->q[i]);
>  }
>  
> @@ -5178,7 +5181,7 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
> ctlr_info *h)
>   if (!(trans_support & PERFORMANT_MODE))
>   return;
>  
> - h->nreply_queues = h->msix_vector ? MAX_REPLY_QUEUES : 1;
> + h->nreply_queues = h->msix_vector > 0 ? h->msix_vector : 1;
>   hpsa_get_max_perf_mode_cmds(h);
>   /* Performant mode ring buffer and supporting data structures */
>   h->reply_pool_size = h->max_commands * sizeof(u64) * h->nreply_queues;
> -- 
> 1.7.12.4
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://

Re: [PATCH 38/41] hpsa: improve error messages for driver initiated commands

2014-01-16 Thread scameron
On Thu, Jan 16, 2014 at 09:42:52AM +0100, Hannes Reinecke wrote:
> On 01/15/2014 11:39 PM, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> > 
> > On encountering unexpected error conditions from driver initiated
> > commands, print something useful like CDB and sense data rather than
> > something useless like the kernel virtual address of the command buffer.
> > 
> > Signed-off-by: Stephen M. Cameron 
> > ---
> >  drivers/scsi/hpsa.c |   71 
> > ---
> >  1 files changed, 44 insertions(+), 27 deletions(-)
> > 
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index 6a50a83..d469974 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -1847,17 +1847,37 @@ static void 
> > hpsa_scsi_do_simple_cmd_with_retry(struct ctlr_info *h,
> > hpsa_pci_unmap(h->pdev, c, 1, data_direction);
> >  }
> >  
> > -static void hpsa_scsi_interpret_error(struct CommandList *cp)
> > +static void hpsa_print_cmd(struct ctlr_info *h, char *txt,
> > +   struct CommandList *c)
> >  {
> > -   struct ErrorInfo *ei;
> > +   const u8 *cdb = c->Request.CDB;
> > +   const u8 *lun = c->Header.LUN.LunAddrBytes;
> > +
> > +   dev_warn(&h->pdev->dev, "%s: LUN:%02x%02x%02x%02x%02x%02x%02x%02x"
> > +   " 
> > CDB:%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x%02x\n",
> > +   txt, lun[0], lun[1], lun[2], lun[3],
> > +   lun[4], lun[5], lun[6], lun[7],
> > +   cdb[0], cdb[1], cdb[2], cdb[3],
> > +   cdb[4], cdb[5], cdb[6], cdb[7],
> > +   cdb[8], cdb[9], cdb[10], cdb[11],
> > +   cdb[12], cdb[13], cdb[14], cdb[15]);
> > +}
> Do you _really_ have to do this?
> We do have scsi_logging_level, which would provide exactly 
> this information.

Not exactly, the mid layer could not print the 8-byte LUNID because
it does not and cannot know it ... 

> Can't you rely on the SCSI stack for logging here and drop all
> the redundancy in the driver?

and furthermore, hpsa_print_cmd() is only called by
hpsa_scsi_interpret_error(), which is only called from:

hpsa_scsi_do_inquiry
hpsa_send_reset
hpsa_get_raid_map
hpsa_scsi_do_report_luns
hpsa_send_abort

none of which are called from a place where this information
could be passed back to the scsi mid layer because the scsi mid
layer did not build the commands for these cases, they are commands
initiated by the driver (as noted in the changelog entry).

> 
> Thing is, the hpsa driver is already quite chatty:
> 
> hpsa :4f:00.0: Direct-Access device c0b0t0l0 added.
> hpsa :4f:00.0: RAID  device c0b3t0l0 added.
> scsi 0:0:0:0: Direct-Access HP   LOGICAL VOLUME   7.24 PQ
> : 0 ANSI: 5
> sd 0:0:0:0: [sda] 286677120 512-byte logical blocks: (146 GB/136
> GiB)
> scsi 0:3:0:0: RAID  HP   P400 7.24 PQ: 0
> ANSI: 0
> 
> So where is the point of the first two messages?
> The enumeration is internal anyway, so it has a very quesionable
> value to the user. And the devices are listed afterwards anyway.
> 
> Hence it _might_ be an idea to remove some of those messages ...

FWIW, I'm usually the one in my group arguing for less messages rather
than more.  I don't always win.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hpsa driver bug crack kernel down!

2014-04-10 Thread scameron
On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> > 
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso  wrote:
> > > On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> > >> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> > >> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > >> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > >> > > > [+linux-scsi]
> > >> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > >> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > The kernel is 3.14.0+ which is pulled just now.
> > >> > > > >
> > >> > > > > Cc'ing more people.
> > >> > > > >
> > >> > > > > While the hpsa driver appears to be involved in some way, I'm 
> > >> > > > > sure if
> > >> > > > > this is a related issue, but as of today's pull I'm getting 
> > >> > > > > another
> > >> > > > > problem that causes my DL980 not to come up.
> > >> > > > >
> > >> > > > > *Massive* amounts of:
> > >> > > > >
> > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > >> > > > > dmar: DRHD: handling fault status reg 602
> > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 
> > >> > > > > 7f61e000
> > >> > > > >
> > >> > > > > Then:
> > >> > > > >
> > >> > > > > hpsa :03:00.0: Controller lockup detected: 0x
> > >> > > > > ...
> > >> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > >> > > > > ...
> > >> > > > >
> > >> > > > > Screenshot of the actual LOCKUP:
> > >> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > >> > > > >
> > >> > > > > While I haven't bisected, things worked fine until at least 
> > >> > > > > until commit
> > >> > > > > 39de65aa2c3e (April 2nd).
> > >> > > > >
> > >> > > > > Any ideas?
> > >> > > >
> > >> > > > Well, it's either a DMA remapping issue or a hpsa one.  Your 
> > >> > > > assertion
> > >> > > > that everything worked fine until 39de65aa2c3e would tend to 
> > >> > > > vindicate
> > >> > > > hpsa,
> > >> >
> > >> > Hmm here you mean DMA, right?
> > >>
> > >> No, it vindicates the hpsa changes ... they don't seem to be causing
> > >> problems until something goes wrong with dma remapping.
> > >>
> > >> > > because all the hpsa changes went in before that under
> > >> > > Missing crucial info:
> > >> > >
> > >> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> > >> > >
> > >> > > > Merge: 3e75c6d b2bff6c
> > >> > > > Author: Linus Torvalds 
> > >> > > > Date:   Tue Apr 1 18:49:04 2014 -0700
> > >> > > >
> > >> > > > Merge tag 'scsi-misc' of
> > >> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > >> > > >
> > >> > > > can you revalidate that this commit works OK just to make sure?
> > >> >
> > >> > Ok so I don't see those DMA messages and system starts just fine. I'm
> > >> > thinking perhaps something broke after the IO mmu stuff in commit
> > >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> > >> > causing the CPU stalls and just blame hpsa in the path as a side 
> > >> > effect?
> > >> >
> > >> > /me goes out to try the commit.
> > >>
> > >> That's my guess.  The DMAR messages are DMA remapping issues caused in
> > >> the IOMMU.  If I had to guess, I'd say the DMAR fault message is
> > >> indicating the IOMMU is calling for a mapping address before it can
> > >> satisfy the driver read request, which is causing the hang apparently in
> > >> the hpsa driver.
> > >>
> > >> I've added linux-pci to the cc; I think they deal with iommu issues on
> > >> x86.
> > >
> > > So that merge commit appears to be the culprit, I see both the DMA
> > > messages and the lockup blaming hpsa...
> > 
> > My understanding so far (please correct me if I'm wrong):
> > 
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")

^^^ this one, 1a0b6abaea78, did not work for me, crashing in
hpsa_enter_performant mode() which was surprsing to me as I am
pretty sure I tried on this very same machine I'm using now
(DL360p with P420, P430 and P420i) with 3.14-rc-something plus
all the hpsa patches that I thought were merged in.

But now I am seeing:

 [] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
RSP: 0018:88042c515a78  EFLAGS: 00010297
RAX:  RBX: 88042c65 RCX: 0004
RDX:  RSI: 0001 RDI: 
RBP: 88042c515b48 R08:  R09: 8af03cc0
R10:  R11: 0001 R12: 88042c515a98
R13: 6104 R14: 88042c515ad8 R15: a0001630
FS:  7f86f7a38700() GS:88043f56() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
usb 1-1.6: new low-speed USB device number 3 using ehci-pci
CR2:  CR3: 00042c4c3000 CR4: 000407e0
Stack:
 00

Re: hpsa NULL pointer in hpsa_enter_performant_mode()

2014-04-10 Thread scameron
t;> > > >
> >> > >> > > > can you revalidate that this commit works OK just to make sure?
> >> > >> >
> >> > >> > Ok so I don't see those DMA messages and system starts just fine. 
> >> > >> > I'm
> >> > >> > thinking perhaps something broke after the IO mmu stuff in commit
> >> > >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> >> > >> > causing the CPU stalls and just blame hpsa in the path as a side 
> >> > >> > effect?
> >> > >> >
> >> > >> > /me goes out to try the commit.
> >> > >>
> >> > >> That's my guess.  The DMAR messages are DMA remapping issues caused in
> >> > >> the IOMMU.  If I had to guess, I'd say the DMAR fault message is
> >> > >> indicating the IOMMU is calling for a mapping address before it can
> >> > >> satisfy the driver read request, which is causing the hang apparently 
> >> > >> in
> >> > >> the hpsa driver.
> >> > >>
> >> > >> I've added linux-pci to the cc; I think they deal with iommu issues on
> >> > >> x86.
> >> > >
> >> > > So that merge commit appears to be the culprit, I see both the DMA
> >> > > messages and the lockup blaming hpsa...
> >> >
> >> > My understanding so far (please correct me if I'm wrong):
> >> >
> >> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> >> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> >
> > ^^^ this one, 1a0b6abaea78, did not work for me, crashing in
> > hpsa_enter_performant mode() which was surprsing to me as I am
> > pretty sure I tried on this very same machine I'm using now
> > (DL360p with P420, P430 and P420i) with 3.14-rc-something plus
> > all the hpsa patches that I thought were merged in.
> 
> I think we have to completely different issues mixed together in this
> thread, so I changed the subject here.

Thanks.

> 
> The reports above for 39de65aa2c3e, 1a0b6abaea78, were for a DMA fault.
> 
> The original message from Baoquan He was for a NULL pointer
> dereference in hpsa_enter_performant_mode(), which is very likely the
> same problem you're seeing, Steve.
> 
> I changed the subject to "hpsa NULL pointer in
> hpsa_enter_performant_mode()", so hopefully we can chase the NULL
> pointer issue there and leave the original, already long thread, for
> the DMA fault issue.
> 
> > But now I am seeing:
> >
> >  [] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
> > RSP: 0018:88042c515a78  EFLAGS: 00010297
> > RAX:  RBX: 88042c65 RCX: 0004
> > RDX:  RSI: 0001 RDI: 
> > RBP: 88042c515b48 R08:  R09: 8af03cc0
> > R10:  R11: 0001 R12: 88042c515a98
> > R13: 6104 R14: 88042c515ad8 R15: a0001630
> > FS:  7f86f7a38700() GS:88043f56() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > usb 1-1.6: new low-speed USB device number 3 using ehci-pci
> > CR2:  CR3: 00042c4c3000 CR4: 000407e0
> > Stack:
> >  8024 a0c0 abe0 
> >  00060005 00080007 000a0009 000c000b
> >  000e000d 001f 00120011 00040013
> > Call Trace:
> >  [] ? SA5_fifo_full+0x20/0x20 [hpsa]
> >  [] ? SA5_ioaccel_mode1_completed+0xd0/0xd0 [hpsa]
> >  [] hpsa_put_ctlr_into_performant_mode+0x186/0x320 [hpsa]
> >  [] ? hpsa_allocate_sg_chain_blocks+0xa2/0xd0 [hpsa]
> >  [] hpsa_init_one+0x43b/0x7d0 [hpsa]
> >  [] local_pci_probe+0x4c/0xb0
> >  [] pci_call_probe+0x89/0xb0
> >  [] ? pci_match_device+0xc4/0xd0
> >  [] pci_device_probe+0x79/0xa0
> >  [] ? driver_sysfs_add+0x82/0xb0
> >  [] really_probe+0x6c/0x320
> > usb 1-1.6: New USB device found, idVendor=0624, idProduct=0341
> > usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > usb 1-1.6: Product: HP 336047-B21
> > usb 1-1.6: Manufacturer: Avocent
> > input: Avocent HP 336047-B21 as
> > /devices/pci:00/:00:1a.0/usb1/1-1/1-1.64
> > hid-generic 0003:0624:0341.0001: input,hidraw0: USB HID v1.10 Keyboard
> > [Avocent0
> > input: Avocen

Re: hpsa NULL pointer in hpsa_enter_performant_mode()

2014-04-10 Thread scameron
t; > >  [] ? SA5_fifo_full+0x20/0x20 [hpsa]
> > >  [] ? SA5_ioaccel_mode1_completed+0xd0/0xd0 [hpsa]
> > >  [] hpsa_put_ctlr_into_performant_mode+0x186/0x320 
> > > [hpsa]
> > >  [] ? hpsa_allocate_sg_chain_blocks+0xa2/0xd0 [hpsa]
> > >  [] hpsa_init_one+0x43b/0x7d0 [hpsa]
> > >  [] local_pci_probe+0x4c/0xb0
> > >  [] pci_call_probe+0x89/0xb0
> > >  [] ? pci_match_device+0xc4/0xd0
> > >  [] pci_device_probe+0x79/0xa0
> > >  [] ? driver_sysfs_add+0x82/0xb0
> > >  [] really_probe+0x6c/0x320
> > > usb 1-1.6: New USB device found, idVendor=0624, idProduct=0341
> > > usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
> > > usb 1-1.6: Product: HP 336047-B21
> > > usb 1-1.6: Manufacturer: Avocent
> > > input: Avocent HP 336047-B21 as
> > > /devices/pci:00/:00:1a.0/usb1/1-1/1-1.64
> > > hid-generic 0003:0624:0341.0001: input,hidraw0: USB HID v1.10 Keyboard
> > > [Avocent0
> > > input: Avocent HP 336047-B21 as
> > > /devices/pci:00/:00:1a.0/usb1/1-1/1-1.65
> > > hid-generic 0003:0624:0341.0002: input,hidraw1: USB HID v1.10 Mouse 
> > > [Avocent
> > > HP1
> > >  [] driver_probe_device+0x47/0xa0
> > >  [] __driver_attach+0xab/0xb0
> > >  [] ? driver_probe_device+0xa0/0xa0
> > >  [] ? driver_probe_device+0xa0/0xa0
> > >  [] bus_for_each_dev+0x94/0xb0
> > >  [] driver_attach+0x1e/0x20
> > >  [] bus_add_driver+0x1b0/0x250
> > > usb 2-1.3: new high-speed USB device number 3 using ehci-pci
> > >  [] ? 0xa0015fff
> > >  [] driver_register+0x64/0xf0
> > >  [] ? 0xa0015fff
> > >  [] __pci_register_driver+0x4c/0x50
> > >  [] hpsa_init+0x1e/0x20 [hpsa]
> > >  [] do_one_initcall+0xd2/0x180
> > >  [] ? __blocking_notifier_call_chain+0x65/0x80
> > >  [] do_init_module+0x44/0x1b0
> > >  [] load_module+0x5a8/0x6f0
> > >  [] ? __unlink_module+0x30/0x30
> > >  [] ? __vmalloc_node+0x35/0x40
> > >  [] ? module_sect_show+0x30/0x30
> > >  [] SyS_init_module+0x96/0xc0
> > >  [] system_call_fastpath+0x16/0x1b
> > > Code: 89 45 8c 78 2c 31 f6 8d 4e 04 4c 89 e2 31 c0 0f 1f 40 00 39 0a 7d 
> > > 0c usb
> > > 0
> > > usb 2-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
> > > hub 2-1.3:1.0: USB hub found
> > > hub 2-1.3:1.0: 2 ports detected
> > >
> > > 83 c0 01 48 83 c2 04 83 f8 10 75 f0 48 63 d6 83 c6 01 39 f7 <41> 89 04 90 
> > > 7d
> > > d6
> > > RIP  [] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
> > >  RSP 
> > > CR2: 
> > > ---[ end trace ab56f106199a4971 ]---
> > >
> > >
> > >> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
> > >>
> > >> Yes, specifically (finally done bisecting):
> > >>
> > >> commit 2e45528930388658603ea24d49cf52867b928d3e
> > >> Author: Jiang Liu 
> > >> Date:   Wed Feb 19 14:07:36 2014 +0800
> > >>
> > >> iommu/vt-d: Unify the way to process DMAR device scope array
> > >>
> > >> Now we have a PCI bus notification based mechanism to update DMAR
> > >> device scope array, we could extend the mechanism to support boot
> > >> time initialization too, which will help to unify and simplify
> > >> the implementation.
> > >>
> > >> Signed-off-by: Jiang Liu 
> > >> Signed-off-by: Joerg Roedel 
> > >
> > > My git bisect appears to be converging on something else, something
> > > within the hpsa patches that I sent up recently, unfortunately for
> > > me.  Will let you all know when it converges.
> 
> Converged to this:
> 
> [scameron@localhost linux]$ git bisect good
> b9af4937e6f5b55b6ffb2a92ec580e79e1401825 is the first bad commit
> commit b9af4937e6f5b55b6ffb2a92ec580e79e1401825
> Author: Stephen M. Cameron 
> Date:   Tue Feb 18 13:56:29 2014 -0600
> 
> [SCSI] hpsa: initialize controller to perform io accelerator mode 2
> 
> Signed-off-by: Stephen M. Cameron 
> Signed-off-by: Scott Teel 
> Signed-off-by: James Bottomley 
> 
> :04 04 4fe4dce566eaa27dd6851c592e2d3123fbd4f306 
> 8db6e3648890745ad150341d00b586e0efcb322d Mdrivers
> 
> Which doesn't make the issue immediately obvious to me, but I will continue 
> looking into it.

Ok, I think I've got it.  trans_support is uninitialized in
hpsa_put_ctlr_into_performant_mode().  The compiler was warning about
this during my git bisect, which I should have picked up on because
I'm sure that the patches which I made were not giving me warnings
in my code base.  I diffed from my code base and immediately I see
the missing code.  Not sure what happenned to the code that
initializes that, but in any case a patch to fix it will be
following shortly.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()

2014-04-10 Thread scameron

Without this, you'll see a null pointer dereference in
hpsa_enter_performant_mode().

Signed-off-by: Stephen M. Cameron 
---
 drivers/scsi/hpsa.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8cf4a0c..ef4dfdd 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
ctlr_info *h)
if (hpsa_simple_mode)
return;
 
+   trans_support = readl(&(h->cfgtable->TransportSupport));
+   if (!(trans_support & PERFORMANT_MODE))
+   return;
+
/* Check for I/O accelerator mode support */
if (trans_support & CFGTBL_Trans_io_accel1) {
transMethod |= CFGTBL_Trans_io_accel1 |
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: HPSA related kernel panic on boot in 3.15 rc1 on Proliant with P420i

2014-04-14 Thread scameron
On Mon, Apr 14, 2014 at 2:33 PM, Darius D.  wrote:
> Hi,
>
> on P420i (2GB FBWC) with latest(5.22?) FW, and 2 SSD smart path
> enabled RAID0 arrays (1 and 3 SSD), i get panic on initialization.
> What is sad, i can't capture complete stack trace and it is deep in
> kernel worker, but top items are:
>
> SA5_performant_intr_mask+0x30/0x30
> spin_unlock_irqrestore
> hpsa_unit_one
> local_pci_probe
> ... kthread worker stuff
>
>
> RIP is calc_bucket_map+0x30/0x3a.
>
>
> Anyone got idea what is going on? 3.14 was working fine with exact
> config, so i suspect it could be HPSA smart path code related.

Try the patch below, and see also: http://marc.info/?t=13970113551&r=1&w=2

-- steve


>From f6b44f25c451d32cca5cf2e9825e0636132e69cf Mon Sep 17 00:00:00 2001
From: Stephen M. Cameron 
Date: Thu, 10 Apr 2014 16:44:59 -0500
Subject: [PATCH] hpsa: fix uninitialized variable in 
hpsa_put_ctlr_into_performant_mode()

Without this, you'll see a null pointer dereference in
hpsa_enter_performant_mode().

Signed-off-by: Stephen M. Cameron 
---
 drivers/scsi/hpsa.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8cf4a0c..ef4dfdd 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
ctlr_info *h)
if (hpsa_simple_mode)
return;
 
+   trans_support = readl(&(h->cfgtable->TransportSupport));
+   if (!(trans_support & PERFORMANT_MODE))
+   return;
+
/* Check for I/O accelerator mode support */
if (trans_support & CFGTBL_Trans_io_accel1) {
transMethod |= CFGTBL_Trans_io_accel1 |
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] hpsa: fix uninitialized trans_support in hpsa_put_ctlr_into_performant_mode()

2014-04-14 Thread scameron
On Mon, Apr 14, 2014 at 08:45:16AM -0700, James Bottomley wrote:
> Your subject line is very tame.  It should be the one line summary of
> why we apply the patch, so it should read something like
> 
>  hpsa: fix NULL deref in performant mode
> 
> On Thu, 2014-04-10 at 17:17 -0500, scame...@beardog.cce.hp.com wrote:
> > Without this, you'll see a null pointer dereference in
> > hpsa_enter_performant_mode().
> 
> The description should be more comprehensible.
> 
> I'm clear that the use before initialisation is a bug ... I'm less clear
> on why it causes an oops.
> 
> > Signed-off-by: Stephen M. Cameron 
> > ---
> >  drivers/scsi/hpsa.c |4 
> >  1 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index 8cf4a0c..ef4dfdd 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -7463,6 +7463,10 @@ static void 
> > hpsa_put_ctlr_into_performant_mode(struct ctlr_info *h)
> > if (hpsa_simple_mode)
> > return;
> >  
> > +   trans_support = readl(&(h->cfgtable->TransportSupport));
> > +   if (!(trans_support & PERFORMANT_MODE))
> > +   return;
> > +
> > /* Check for I/O accelerator mode support */
> > if (trans_support & CFGTBL_Trans_io_accel1) {
> > transMethod |= CFGTBL_Trans_io_accel1 |
> 
> Shouldn't you be moving this check from its previous location, rather
> than adding a new one that makes the original obsolete?

Oh... I didn't notice that.   So that's what happened to that hunk.
Yes, that is what should be done.

I will resend the patch after fixing up this and the commit message.

-- steve


> 
> James
> 
> ---
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 8cf4a0c..9a6e4a2 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
> ctlr_info *h)
>   if (hpsa_simple_mode)
>   return;
>  
> + trans_support = readl(&(h->cfgtable->TransportSupport));
> + if (!(trans_support & PERFORMANT_MODE))
> + return;
> +
>   /* Check for I/O accelerator mode support */
>   if (trans_support & CFGTBL_Trans_io_accel1) {
>   transMethod |= CFGTBL_Trans_io_accel1 |
> @@ -7479,10 +7483,6 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
> ctlr_info *h)
>   }
>  
>   /* TODO, check that this next line h->nreply_queues is correct */
> - trans_support = readl(&(h->cfgtable->TransportSupport));
> - if (!(trans_support & PERFORMANT_MODE))
> - return;
> -
>   h->nreply_queues = h->msix_vector > 0 ? h->msix_vector : 1;
>   hpsa_get_max_perf_mode_cmds(h);
>   /* Performant mode ring buffer and supporting data structures */
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] hpsa: fix NULL dereference in hpsa_put_ctlr_into_performant_mode()

2014-04-14 Thread scameron

Initialize local variable trans_support before it is used rather
than after.  It is supposed to contain the value of a register on the
controller containing bits that describe which transport modes the
controller supports (e.g. "performant", "ioaccel1",  "ioaccel2").  A
NULL pointer dereference will almost certainly occur if trans_support
is not initialized at the right point.  If for example the uninitialized
trans_support value does not have the bit set for ioaccel2 support when it
should be, then ioaccel2_alloc_cmds_and_bft() will not get called as it
should be and the h->ioaccel2_blockFetchTable array will remain NULL
instead of being allocated.  Too late, trans_support finally gets
initialized with the correct value with ioaccel2 mode bit set,
which later causes calc_bucket_map() to be called to fill in
h->ioaccel2_blockFetchTable[].  However h->ioaccel2_blockFetchTable
is NULL because it didn't get allocated because earlier trans_support
wasn't initialized at the right point.

Signed-off-by: Stephen M. Cameron 
---
 drivers/scsi/hpsa.c |8 
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 8cf4a0c..9a6e4a2 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -7463,6 +7463,10 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
ctlr_info *h)
if (hpsa_simple_mode)
return;
 
+   trans_support = readl(&(h->cfgtable->TransportSupport));
+   if (!(trans_support & PERFORMANT_MODE))
+   return;
+
/* Check for I/O accelerator mode support */
if (trans_support & CFGTBL_Trans_io_accel1) {
transMethod |= CFGTBL_Trans_io_accel1 |
@@ -7479,10 +7483,6 @@ static void hpsa_put_ctlr_into_performant_mode(struct 
ctlr_info *h)
}
 
/* TODO, check that this next line h->nreply_queues is correct */
-   trans_support = readl(&(h->cfgtable->TransportSupport));
-   if (!(trans_support & PERFORMANT_MODE))
-   return;
-
h->nreply_queues = h->msix_vector > 0 ? h->msix_vector : 1;
hpsa_get_max_perf_mode_cmds(h);
/* Performant mode ring buffer and supporting data structures */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [scsi:for-next 82/88] drivers/scsi/hpsa.c:3038 hpsa_update_scsi_devices() warn: impossible condition '(rescan_hba_mode < 0) => (0-255 < 0)'

2014-05-21 Thread scameron
On Wed, May 21, 2014 at 04:26:50PM +0200, Christoph Hellwig wrote:
> Stephen, can you look at these warnings?

Yeah, I'm looking at it right now.

-- steve

> 
> If the answer is this is all fixes in the other pending patches we'll just
> need to find another reviewer for them.
> 
> On Wed, May 21, 2014 at 10:13:44AM +0300, Dan Carpenter wrote:
> > 
> > tree:   git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git for-next
> > head:   0a8c8edf3a8c8c878c1e92ddb9c9df966b2fdefb
> > commit: 96444fbbbf3b55a9cd1e337cc5c8c7d04bb364b7 [82/88] hpsa: do not 
> > ignore failure of sense controller parameters command
> > 
> > New smatch warnings:
> > drivers/scsi/hpsa.c:3038 hpsa_update_scsi_devices() warn: impossible 
> > condition '(rescan_hba_mode < 0) => (0-255 < 0)'
> > 
> > Old smatch warnings:
> > drivers/scsi/hpsa.c:1042 hpsa_scsi_remove_entry() error: buffer overflow 
> > 'h->dev' 2081 <= 2081
> > drivers/scsi/hpsa.c:1192 hpsa_show_volume_status() warn: impossible 
> > condition '(sd->volume_offline == -1) => (0-255 == (-1))'
> > drivers/scsi/hpsa.c:2432 hpsa_get_device_id() warn: returning -1 instead of 
> > -ENOMEM is sloppy
> > 
> > git remote add scsi 
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git
> > git remote update scsi
> > git checkout 96444fbbbf3b55a9cd1e337cc5c8c7d04bb364b7
> > vim +3038 drivers/scsi/hpsa.c
> > 
> > 339b2b14 Stephen M. Cameron 2010-02-04  3022int raid_ctlr_position;
> > 316b221a Stephen M. Cameron 2014-02-21  3023u8 rescan_hba_mode;
> > aca4a520 Scott Teel 2012-01-19  3024
> > DECLARE_BITMAP(lunzerobits, MAX_EXT_TARGETS);
> > edd16368 Stephen M. Cameron 2009-12-08  3025  
> > cfe5badc Scott Teel 2011-10-26  3026currentsd = 
> > kzalloc(sizeof(*currentsd) * HPSA_MAX_DEVICES, GFP_KERNEL);
> > edd16368 Stephen M. Cameron 2009-12-08  3027physdev_list = 
> > kzalloc(reportlunsize, GFP_KERNEL);
> > edd16368 Stephen M. Cameron 2009-12-08  3028logdev_list = 
> > kzalloc(reportlunsize, GFP_KERNEL);
> > edd16368 Stephen M. Cameron 2009-12-08  3029tmpdevice = 
> > kzalloc(sizeof(*tmpdevice), GFP_KERNEL);
> > edd16368 Stephen M. Cameron 2009-12-08  3030  
> > 0b0e1d6c Stephen M. Cameron 2011-08-09  3031if (!currentsd || 
> > !physdev_list || !logdev_list || !tmpdevice) {
> > edd16368 Stephen M. Cameron 2009-12-08  3032
> > dev_err(&h->pdev->dev, "out of memory\n");
> > edd16368 Stephen M. Cameron 2009-12-08  3033goto out;
> > edd16368 Stephen M. Cameron 2009-12-08  3034}
> > edd16368 Stephen M. Cameron 2009-12-08  3035memset(lunzerobits, 0, 
> > sizeof(lunzerobits));
> > edd16368 Stephen M. Cameron 2009-12-08  3036  
> > 316b221a Stephen M. Cameron 2014-02-21  3037rescan_hba_mode = 
> > hpsa_hba_mode_enabled(h);
> > 96444fbb Joe Handzik2014-05-15 @3038if (rescan_hba_mode < 0)
> > 96444fbb Joe Handzik2014-05-15  3039goto out;
> > 316b221a Stephen M. Cameron 2014-02-21  3040  
> > 316b221a Stephen M. Cameron 2014-02-21  3041if 
> > (!h->hba_mode_enabled && rescan_hba_mode)
> > 316b221a Stephen M. Cameron 2014-02-21  3042
> > dev_warn(&h->pdev->dev, "HBA mode enabled\n");
> > 316b221a Stephen M. Cameron 2014-02-21  3043else if 
> > (h->hba_mode_enabled && !rescan_hba_mode)
> > 316b221a Stephen M. Cameron 2014-02-21  3044
> > dev_warn(&h->pdev->dev, "HBA mode disabled\n");
> > 316b221a Stephen M. Cameron 2014-02-21  3045  
> > 316b221a Stephen M. Cameron 2014-02-21  3046h->hba_mode_enabled = 
> > rescan_hba_mode;
> > 
> > ---
> > 0-DAY kernel build testing backend  Open Source Technology 
> > Center
> > http://lists.01.org/mailman/listinfo/kbuild Intel 
> > Corporation
> ---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2, repost] hpsa: remove unneeded loop

2014-05-29 Thread scameron
On Thu, May 29, 2014 at 04:02:19PM +0200, Tomas Henzl wrote:
> Originally this was first patch in a series, but while the  other patches
> were accepted, this one was forgotten.
> The code below is adapted for the latest sources.
> 
> The cmd_pool_bits is protected everywhere with a spinlock,
> we don't need the test_and_set_bit, set_bit is enough and the loop
> can be removed too.
> 
> Signed-off-by: Tomas Henzl 
> ---
>  drivers/scsi/hpsa.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 9a6e4a2..c8267f3 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -4680,15 +4680,12 @@ static struct CommandList *cmd_alloc(struct ctlr_info 
> *h)
>   unsigned long flags;
>  
>   spin_lock_irqsave(&h->lock, flags);
> - do {
> - i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> - if (i == h->nr_cmds) {
> - spin_unlock_irqrestore(&h->lock, flags);
> - return NULL;
> - }
> - } while (test_and_set_bit
> -  (i & (BITS_PER_LONG - 1),
> -   h->cmd_pool_bits + (i / BITS_PER_LONG)) != 0);
> + i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> + if (i == h->nr_cmds) {
> + spin_unlock_irqrestore(&h->lock, flags);
> + return NULL;
> + }
> + set_bit(i & (BITS_PER_LONG - 1), h->cmd_pool_bits + (i / 
> BITS_PER_LONG));
>   spin_unlock_irqrestore(&h->lock, flags);
>  
>   c = h->cmd_pool + i;
> -- 
> 1.8.3.1
> 

I don't quite remember, but I think it was forgotten because I became
more interested in removing all the locks from the main i/o path of hpsa
for performance reasons (I have had a rather large patch series in the
works for quite a while now to do that, maybe it will be ready for 3.17,
we'll see.) So this looks ok, but if things go according to plan it won't
be relevant for very long.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2 08/24] hpsa: choose number of reply queues more intelligently.

2014-06-02 Thread scameron
On Mon, Jun 02, 2014 at 02:27:51AM -0700, Christoph Hellwig wrote:
> On Thu, May 29, 2014 at 10:53:02AM -0500, Stephen M. Cameron wrote:
> > From: Stephen M. Cameron 
> > 
> > No sense having 8 or 16 reply queues if you only have 4 cpus,
> > and likewise no sense limiting to 8 reply queues if you have
> > many more cpus.
> 
> I've applied this as it looks good as-is, but shouldn't we also
> cap the number of MSI-X vectors in common code so that we avoid
> adding this as boilerplate code to lots of drivers?

Maybe so.  Thinking about CPU hotplug, is num_online_cpus()
the right cap?  Maybe num_possible_cpus() is better in case
additional cpus coming online are anticipated?

-- steve

> 
> > 
> > Signed-off-by: Stephen M. Cameron 
> > Reviewed-by: Mike Miller 
> > Reviewed-by: Scott Teel 
> > ---
> >  drivers/scsi/hpsa.c |2 ++
> >  drivers/scsi/hpsa_cmd.h |2 +-
> >  2 files changed, 3 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index b1ecfd8..b903e86 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -6157,6 +6157,8 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
> > if (pci_find_capability(h->pdev, PCI_CAP_ID_MSIX)) {
> > dev_info(&h->pdev->dev, "MSIX\n");
> > h->msix_vector = MAX_REPLY_QUEUES;
> > +   if (h->msix_vector > num_online_cpus())
> > +   h->msix_vector = num_online_cpus();
> > err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> >   h->msix_vector);
> > if (err > 0) {
> > diff --git a/drivers/scsi/hpsa_cmd.h b/drivers/scsi/hpsa_cmd.h
> > index db89245..104b67b 100644
> > --- a/drivers/scsi/hpsa_cmd.h
> > +++ b/drivers/scsi/hpsa_cmd.h
> > @@ -615,7 +615,7 @@ struct TransTable_struct {
> > u32RepQCount;
> > u32RepQCtrAddrLow32;
> > u32RepQCtrAddrHigh32;
> > -#define MAX_REPLY_QUEUES 8
> > +#define MAX_REPLY_QUEUES 64
> > struct vals32  RepQAddr[MAX_REPLY_QUEUES];
> >  };
> >  
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2 08/24] hpsa: choose number of reply queues more intelligently.

2014-06-02 Thread scameron
On Mon, Jun 02, 2014 at 09:52:06AM -0500, scame...@beardog.cce.hp.com wrote:
> On Mon, Jun 02, 2014 at 02:27:51AM -0700, Christoph Hellwig wrote:
> > On Thu, May 29, 2014 at 10:53:02AM -0500, Stephen M. Cameron wrote:
> > > From: Stephen M. Cameron 
> > > 
> > > No sense having 8 or 16 reply queues if you only have 4 cpus,
> > > and likewise no sense limiting to 8 reply queues if you have
> > > many more cpus.
> > 
> > I've applied this as it looks good as-is, but shouldn't we also
> > cap the number of MSI-X vectors in common code so that we avoid
> > adding this as boilerplate code to lots of drivers?
> 
> Maybe so.  Thinking about CPU hotplug, is num_online_cpus()
> the right cap?  Maybe num_possible_cpus() is better in case
> additional cpus coming online are anticipated?

On second thought, might there be PCI devices which have multiple
MSIX vectors to signal different things?  I know for smart array
the idea has occasionally been floated to have an additional vector
to signal something to the host that is not a command completion,
although we never implemented such a thing.  Had we done so,
we would then need num_online_cpus() + 1 vectors.

So maybe it's not a good idea to put such a limit in the generic
pci code in case some device works in such a way.

-- steve


> 
> > 
> > > 
> > > Signed-off-by: Stephen M. Cameron 
> > > Reviewed-by: Mike Miller 
> > > Reviewed-by: Scott Teel 
> > > ---
> > >  drivers/scsi/hpsa.c |2 ++
> > >  drivers/scsi/hpsa_cmd.h |2 +-
> > >  2 files changed, 3 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > > index b1ecfd8..b903e86 100644
> > > --- a/drivers/scsi/hpsa.c
> > > +++ b/drivers/scsi/hpsa.c
> > > @@ -6157,6 +6157,8 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
> > >   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSIX)) {
> > >   dev_info(&h->pdev->dev, "MSIX\n");
> > >   h->msix_vector = MAX_REPLY_QUEUES;
> > > + if (h->msix_vector > num_online_cpus())
> > > + h->msix_vector = num_online_cpus();
> > >   err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> > > h->msix_vector);
> > >   if (err > 0) {
> > > diff --git a/drivers/scsi/hpsa_cmd.h b/drivers/scsi/hpsa_cmd.h
> > > index db89245..104b67b 100644
> > > --- a/drivers/scsi/hpsa_cmd.h
> > > +++ b/drivers/scsi/hpsa_cmd.h
> > > @@ -615,7 +615,7 @@ struct TransTable_struct {
> > >   u32RepQCount;
> > >   u32RepQCtrAddrLow32;
> > >   u32RepQCtrAddrHigh32;
> > > -#define MAX_REPLY_QUEUES 8
> > > +#define MAX_REPLY_QUEUES 64
> > >   struct vals32  RepQAddr[MAX_REPLY_QUEUES];
> > >  };
> > >  
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > ---end quoted text---
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/10] hpsa: use safer test on the result of find_first_zero_bit

2014-06-04 Thread scameron
On Wed, Jun 04, 2014 at 11:07:56AM +0200, Julia Lawall wrote:
> From: Julia Lawall 
> 
> Find_first_zero_bit considers BITS_PER_LONG bits at a time, and thus may
> return a larger number than the maximum position argument if that position
> is not a multiple of BITS_PER_LONG.
> 
> The semantic match that finds this problem is as follows:
> (http://coccinelle.lip6.fr/)
> 
> // 
> @@
> expression e1,e2,e3;
> statement S1,S2;
> @@
> 
> e1 = find_first_zero_bit(e2,e3)
> ...
> if (e1 
> - ==
> + >=
>   e3)
> S1 else S2
> // 
> 
> Signed-off-by: Julia Lawall 
> 
> ---
>  drivers/scsi/hpsa.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff -u -p a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -4703,7 +4703,7 @@ static struct CommandList *cmd_alloc(str
>   spin_lock_irqsave(&h->lock, flags);
>   do {
>   i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> - if (i == h->nr_cmds) {
> + if (i >= h->nr_cmds) {
>   spin_unlock_irqrestore(&h->lock, flags);
>   return NULL;
>   }

Thanks, Ack.

You can add

Reviewed-by: Stephen M. Cameron 

to this patch if you want.

You might also consider adding "Cc: sta...@vger.kernel.org" to the sign-off 
area.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/10] hpsa: use safer test on the result of find_first_zero_bit

2014-06-04 Thread scameron
On Wed, Jun 04, 2014 at 05:06:44PM +0200, Julia Lawall wrote:
> 
> 
> On Wed, 4 Jun 2014, scame...@beardog.cce.hp.com wrote:
> 
> > On Wed, Jun 04, 2014 at 11:07:56AM +0200, Julia Lawall wrote:
> > > From: Julia Lawall 
> > >
> > > Find_first_zero_bit considers BITS_PER_LONG bits at a time, and thus may
> > > return a larger number than the maximum position argument if that position
> > > is not a multiple of BITS_PER_LONG.
> > >
> > > The semantic match that finds this problem is as follows:
> > > (http://coccinelle.lip6.fr/)
> > >
> > > // 
> > > @@
> > > expression e1,e2,e3;
> > > statement S1,S2;
> > > @@
> > >
> > > e1 = find_first_zero_bit(e2,e3)
> > > ...
> > > if (e1
> > > - ==
> > > + >=
> > >   e3)
> > > S1 else S2
> > > // 
> > >
> > > Signed-off-by: Julia Lawall 
> > >
> > > ---
> > >  drivers/scsi/hpsa.c |2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff -u -p a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > > --- a/drivers/scsi/hpsa.c
> > > +++ b/drivers/scsi/hpsa.c
> > > @@ -4703,7 +4703,7 @@ static struct CommandList *cmd_alloc(str
> > >   spin_lock_irqsave(&h->lock, flags);
> > >   do {
> > >   i = find_first_zero_bit(h->cmd_pool_bits, h->nr_cmds);
> > > - if (i == h->nr_cmds) {
> > > + if (i >= h->nr_cmds) {
> > >   spin_unlock_irqrestore(&h->lock, flags);
> > >   return NULL;
> > >   }
> >
> > Thanks, Ack.
> >
> > You can add
> >
> > Reviewed-by: Stephen M. Cameron 
> >
> > to this patch if you want.
> >
> > You might also consider adding "Cc: sta...@vger.kernel.org" to the sign-off 
> > area.
> 
> Actually, it seems that the function can never overshoot the specified
> limit.  So the change is not needed.

Well, that would explain why nobody has complained about it in all these years.
I figured that must have been the case at least on x86, but I thought maybe
other archictectures might behave differently, and the change seemed harmless
in any case. 

-- steve


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [SCSI] hpsa: fix non-x86 builds

2014-06-26 Thread scameron
On Thu, Jun 26, 2014 at 03:44:52PM +0200, Arnd Bergmann wrote:
> commit 28e134464734 "[SCSI] hpsa: enable unit attention reporting"
> turns on unit attention notifications, but got the change wrong for
> all architectures other than x86, which now store an uninitialized
> value into the device register.
> 
> Gcc helpfully warns about this:
> 
> ../drivers/scsi/hpsa.c: In function 'hpsa_set_driver_support_bits':
> ../drivers/scsi/hpsa.c:6373:17: warning: 'driver_support' is used 
> uninitialized in this function [-Wuninitialized]
>   driver_support |= ENABLE_UNIT_ATTN;
>  ^
> 
> This moves the #ifdef so only the prefetch-enable is conditional
> on x86, not also reading the initial register contents. 
> 
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 28e134464734 "[SCSI] hpsa: enable unit attention reporting"
> Cc: Stephen M. Cameron 
> Cc: sta...@vger.kernel.org # v3.14+
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 31184b3..d0e487c 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -6365,9 +6365,9 @@ static inline void hpsa_set_driver_support_bits(struct 
> ctlr_info *h)
>  {
>   u32 driver_support;
>  
> -#ifdef CONFIG_X86
> - /* Need to enable prefetch in the SCSI core for 6400 in x86 */
>   driver_support = readl(&(h->cfgtable->driver_support));
> + /* Need to enable prefetch in the SCSI core for 6400 in x86 */
> +#ifdef CONFIG_X86
>   driver_support |= ENABLE_SCSI_PREFETCH;
>  #endif
>   driver_support |= ENABLE_UNIT_ATTN;

Thanks.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [SCSI] hpsa: fix non-x86 builds

2014-07-03 Thread scameron
On Thu, Jul 03, 2014 at 01:43:48AM -0700, Christoph Hellwig wrote:
> On Thu, Jun 26, 2014 at 09:34:45AM -0500, scame...@beardog.cce.hp.com wrote:
> > Thanks.
> 
> Do you plan to include this with the next hpsa update, or should I take
> this as an ACK and apply it?

Take it as an ACK.  I should have been clearer.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] hpsa: work in progress "lockless monster" patches

2014-07-25 Thread scameron

hpsa: Work In Progress: "lockless monster" patches

To be clear, I am not suggesting that these patches be merged at this time.

This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
may be found here: git://git.infradead.org/users/hch/scsi.git

We've been working for a long time on a patchset for hpsa to remove
all the locks from the main i/o path in pursuit of high IOPS.  Some
of that work is already upstream, but a lot more of it is not quite
yet ready to be merged.  However, I think we've "gone dark" for a bit
too long on this, and even though the patches aren't really ready to
be merged just yet, I thought I should let other people who might be
interested have a look anyway, as things are starting to be at least
more stable than unstable.  Be warned though, there are still some
problems, esp. around error recovery.

That being said, with the right hardware (fast SAS SSDs, recent Smart
Arrays e.g. P430, with up-to-date firmware, attached to recent disk enclosures)
with these patches and the scsi-mq patches, it is possible to get around
~970k IOPS from a single logical drive on a single controller.

There are about 150 patches in this set.  Rather than bomb the list
with that, here is a link to a tarball of the patches in the form of
an stgit patch series:

https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true

In some cases, I have probably erred on the side of having too many too
small patches, on the theory that it is easier to bake a cake than to 
unbake a cake.  Before these are submitted "for reals", there will be
some squashing of patches, along with other cleaning up.

There are some patches in this set which are already upstream in
James's tree which do not happen to be in Christoph's tree
(most of which are named "*_sent_to_james").  There are also
quite a few patches which are strictly for debugging and are not
ever intended to be merged. 


Here is a list of all the patches sorted by author:

Arnd Bergmann (1):
  [SCSI] hpsa: fix non-x86 builds

Christoph Hellwig (1):
  reserve block tags in scsi host

Joe Handzik (9):
  hpsa: add masked physical devices into h->dev[] array
  hpsa: add hpsa_bmic_id_physical_device function
  hpsa: get queue depth for physical devices
  hpsa: use ioaccel2 path to submit IOs to physical drives in HBA mode.
  hpsa: Get queue depth from identify physical bmic for physical disks.
  hpsa: add ioaccel sg chaining for the ioaccel2 path
  Set the phys_disk value for a CommandList structure that is
submitted. Squash
  hpsa: unmap ioaccel2 commands before, not after adding to resubmit
workqueue
  hpsa: add more ioaccel2 error handling, including underrun statuses.

Nicholas Bellinger (1):
  hpsa: Convert SCSI LLD ->queuecommand() for host_lock less operation

Robert Elliott (44):
  Change scsi.c scsi_log_completion() to print strings for QUEUED,
  Set scsi_logging_level to be more verbose to get better messages
  hpsa: do not unconditionally copy sense data
  hpsa: optimize cmd_alloc function by remembering last allocation
  From a154642aeed291c7cfe4b9ea9da932156030f7d1 Mon Sep 17 00:00:00
2001
  hpsa: Clean up host, channel, target, lun prints
  hpsa: do not check cmd_alloc return value - it cannnot return NULL
  hpsa: return -1 rather than -ENOMEM in hpsa_get_raid_map if fill_cmd
fails
  hpsa: return -ENOMEM not 1 from ioaccel2_alloc_cmds_and_bft on
allocation failure
  hpsa: return -ENOMEM not 1 from hpsa_alloc_ioaccel_cmd_and_bft on
allocation failure
  hpsa: return -ENOMEM not -EFAULT from hpsa_passthru_ioctl on kmalloc
failure
  hpsa: pass error from pci_set_consistent_dma_mask intact from
hpsa_message
  hpsa: detect and report failures changing controller transport modes
  hpsa: add hpsa_disable_interrupt_mode
  hpsa: rename free_irqs to hpsa_free_irqs and move before
hpsa_request_irq
  hpsa: rename hpsa_request_irq to hpsa_request_irqs
  hpsa: on failure of request_irq, free the irqs that we did get
  hpsa: make hpsa_pci_init disable interrupts and pci_disable_device on
critical failures
  hpsa: fix a string constant broken into two lines
  hpsa: avoid unneccesary calls to resource freeing functions in
hpsa_init_one
  hpsa: add hpsa_free_cfgtables function to undo what
hpsa_find_cfgtables does
  hpsa: report failure to ioremap config table
  hpsa: add hpsa_free_pci_init function
  hpsa: clean up error handling in hpsa_pci_init
  hpsa: make hpsa_remove_one use hpsa_pci_free_init
  hpsa: rename hpsa_alloc_ioaccel_cmd_and_bft to
hpsa_alloc_ioaccel1_cmd_and_bft
  hpsa: rename hpsa_allocate_cmd_pool to hpsa_alloc_cmd_pool
  hpsa: rename ioaccel2_alloc_cmds_and_bft to
hpsa_alloc_ioaccel2_cmd_and_bft
  hpsa: fix memory leak in hpsa_alloc_cmd_pool
  hpsa: return status not void from hpsa_put_ctlr_into_performant_mode
  hpsa: clean up error handling in hpsa_init_one
  hpsa: report allocation failures while alloca

Re: [PATCH] scsi: hpsa.c: Cleaning up code clarification using strlcpy

2014-07-31 Thread scameron
On Wed, Jul 30, 2014 at 11:46:52PM +0200, Rickard Strandqvist wrote:
> Code clarification using strlcpy instead of strncpy.
> And removed unnecessary memset
> 
> Signed-off-by: Rickard Strandqvist 
> ---
>  drivers/scsi/hpsa.c |   16 +++-
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 31184b3..814d64d 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -315,16 +315,15 @@ static ssize_t 
> host_store_hp_ssd_smart_path_status(struct device *dev,
>struct device_attribute *attr,
>const char *buf, size_t count)
>  {
> - int status, len;
> + int status;
>   struct ctlr_info *h;
>   struct Scsi_Host *shost = class_to_shost(dev);
>   char tmpbuf[10];
>  
>   if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
>   return -EACCES;
> - len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count;
> - strncpy(tmpbuf, buf, len);
> - tmpbuf[len] = '\0';
> + strlcpy(tmpbuf, buf,
> + count > sizeof(tmpbuf) ? sizeof(tmpbuf) : count);

Are we doing the strlcpy thing?  Linus and GregKH didn't seem to like this 
kind of stuff all that much in some google+ comments I saw recently.

https://plus.google.com/111049168280159033135/posts/1amLbuhWbh5

"... strlcpy requires the source to be 0 terminated, even if
its longer than the target size."

which I don't know that we really want to rely on buf being
null terminated here.

Part of Linus's comment was:

If you're actually copying from user space in the kernel, do

   ret = strncpy_from_user(buf, userptr, len);
   if (ret < 0) return ret;
   if (ret == len) return -ETOOLONG;

and you're ok (and 'ret' will be the length of the properly 
NUL-terminated string).

Don't use strncpy(), don't use strlcpy(). 


>   if (sscanf(tmpbuf, "%d", &status) != 1)
>   return -EINVAL;
>   h = shost_to_hba(shost);
> @@ -339,16 +338,15 @@ static ssize_t host_store_raid_offload_debug(struct 
> device *dev,
>struct device_attribute *attr,
>const char *buf, size_t count)
>  {
> - int debug_level, len;
> + int debug_level;
>   struct ctlr_info *h;
>   struct Scsi_Host *shost = class_to_shost(dev);
>   char tmpbuf[10];
>  
>   if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SYS_RAWIO))
>   return -EACCES;
> - len = count > sizeof(tmpbuf) - 1 ? sizeof(tmpbuf) - 1 : count;
> - strncpy(tmpbuf, buf, len);
> - tmpbuf[len] = '\0';
> + strlcpy(tmpbuf, buf,
> + count > sizeof(tmpbuf) ? sizeof(tmpbuf) : count);
>   if (sscanf(tmpbuf, "%d", &debug_level) != 1)
>   return -EINVAL;
>   if (debug_level < 0)
> @@ -5881,8 +5879,8 @@ static int hpsa_controller_hard_reset(struct pci_dev 
> *pdev,
>  
>  static void init_driver_version(char *driver_version, int len)
>  {
> - memset(driver_version, 0, len);
>   strncpy(driver_version, HPSA " " HPSA_DRIVER_VERSION, len - 1);
> + driver_version[len - 1] = '\0';

So this depends on strncpy actually putting those zeros on the end of that
string so as not to leak a partially uninitialized kernel buffer out into a pci
config space register on a device that could potentially get read later
from user space.  (using kzalloc for that particular buffer could alleviate
that dependency easily enough though) Plenty of places that are using strncpy
probably do not care whether those zeroes are padded on, but this one does
care, but that is not signaled to the reader of the code in any way here.
(a comment indicating the zero padding is actually wanted would suffice.)

None of this stuff is in performance critical paths.

-- steve

>  }
>  
>  static int write_driver_ver_to_cfgtable(struct CfgTable __iomem *cfgtable)
> -- 
> 1.7.10.4
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-31 Thread scameron
On Thu, Jul 31, 2014 at 03:56:13PM +0200, Tomas Henzl wrote:
> Hi Steve, Webb,
> 
> let me start with the part most important for me - from my previous mail
> "And please have a look at "[PATCH] hpsa: refine the pci enble/disable 
> handling"
> I posted in June and add it to your series or review in the list. Thanks."

Ok.  I will try to get to this.  
(some customer issues have been eating a lot of my time this week).

When I first saw that patch I knew that Rob Elliott had a lot of patches in the
works that did a lot of cleanup to the initialization code, so I expect some
conflicts.

> 
> other comments see below
> 
> > Hi Tomas,
> >
> > Thanks for taking a look and for the feedback.  I'm actually the one 
> > responsible for the changes you refer to, so I'll try to address your 
> > comments.
> >
> >
> > On 7/30/14 10:55 AM, Tomas Henzl wrote:
> >> I'm not sure if I got it right,
> >> it seems it uses the same cmd_pool for both alloc function,
> >> from (0 - reserved_cmds) it's cmd_alloc and above that it's handled
> >> by cmd_tagged_alloc.
> >> The logic in both functions seems to be valid for me.
> > Good.  With Christoph's proposed changes, the block layer will select a 
> > tag for I/O requests, and it provides for the driver to reserve part of 
> > the tag space for its own use.  Thus, HPSA uses a common pool for both 
> > I/O requests and internally-driven requests (IOCTLs, aborts, etc.), and 
> > we index the pool using the tag.  The code is this way mostly to 
> > minimize change from the previous version, but it's also handy in terms 
> > of managing the pool (i.e., there is a single pool to administer, 
> > instead of two, and all the items are initialized the same way, etc.).  
> > The portion of the pool which is reserved to the driver (the 
> > low-numbered tags) continues to be managed through a bitmap, just as it 
> > was prior to this change; the bitmap does cover the whole pool (which is 
> > a historical artifact), but the higher bits are never set, since 
> > allocation of those entries is determined by the tag from the block 
> > layer -- nevertheless, it is convenient to have the map extend over the 
> > whole pool in certain corner cases (such as resets).
> >
> >
> >> In cmd_tagged_alloc "Thus, there should never be a collision here between
> >> two requests" if this is true you don't need the refcount and just a simple
> >> flag were sufficient for your other needs. (And maybe you get to ~971k 
> >> IOPS..)
> > :-)  The code previously had the reference count in there to close 
> > certain race conditions' asynchronous accesses to the command block 
> > (e.g., between an abort and a completing command).  We hope that, using 
> > the block layer tags, those races no longer occur, but there didn't seem 
> > to be any point in removing the reference count until we'd had more 
> > experience with the code...and, in a fit of healthy paranoia I added the 
> > check to which you refer.  Unfortunately, in some of Rob Elliott's 
> > recent tests, he managed to drive a case where the check triggers.  So, 
> > until we reconcile that, I'm inclined to leave the check in place 
> > (hopefully it's not costing a full 1k IOPS :-) ).
> 
> Let us assume that the block layer never sends duplicate tags to the driver,
> I think there are some weaknesses in the way how it's implemented,
> for example here - from fail_all_outstanding_cmds:
> refcount = atomic_inc_return(&c->refcount);
> if (refcount > 1) {
> ...
> finish_cmd(c); - this finishes the command
>   and the tag might be now reused,
>   when that happens you'll see a race
> atomic_dec(&h->commands_outstanding);
> failcount++;
> }
> else {what happens when right now comes a call from block 
> layer?}
> cmd_free(h, c);
> 
> When references are used it usually means, that when refcount==0 that
> a release function frees the resources, in this case it is the tag number.
> In your implementation you should ensure that when you signal
> to the upper layer that the tag is free, that nothing else holds the 
> reference.
> If this is true the conclusion is that you don't need this kind of references
> and a simple flag just to debug not yet fixed races is enough.

Part of the motivation is we want to have this code be workable for distros
that may not have the necessary kernel changes for us to be able to use the
block layer tagging (e.g. tags reserved for driver use).  So I am floating
the block layer tag patches, keeping them at the top of my stack of patches
so that all this lockless stuff can still work even without using the block
layer tags.  So one reason the reference counting needs to go in is to
accomodate backporting the LLD without touching an older kernel's mid layer
or block layer code, which is import

Re: [RFC] hpsa: work in progress "lockless monster" patches

2014-07-31 Thread scameron
On Sun, Jul 27, 2014 at 05:24:46PM +0200, Hannes Reinecke wrote:
> On 07/25/2014 09:28 PM, scame...@beardog.cce.hp.com wrote:
> >
> >hpsa: Work In Progress: "lockless monster" patches
> >
> >To be clear, I am not suggesting that these patches be merged at this time.
> >
> >This patchset is vs. Christoph Hellwig's scsi-mq.4 branch which
> >may be found here: git://git.infradead.org/users/hch/scsi.git
> >
> >We've been working for a long time on a patchset for hpsa to remove
> >all the locks from the main i/o path in pursuit of high IOPS.  Some
> >of that work is already upstream, but a lot more of it is not quite
> >yet ready to be merged.  However, I think we've "gone dark" for a bit
> >too long on this, and even though the patches aren't really ready to
> >be merged just yet, I thought I should let other people who might be
> >interested have a look anyway, as things are starting to be at least
> >more stable than unstable.  Be warned though, there are still some
> >problems, esp. around error recovery.
> >
> >That being said, with the right hardware (fast SAS SSDs, recent Smart
> >Arrays e.g. P430, with up-to-date firmware, attached to recent disk 
> >enclosures)
> >with these patches and the scsi-mq patches, it is possible to get around
> >~970k IOPS from a single logical drive on a single controller.
> >
> >There are about 150 patches in this set.  Rather than bomb the list
> >with that, here is a link to a tarball of the patches in the form of
> >an stgit patch series:
> >
> >https://github.com/smcameron/hpsa-lockless-patches-work-in-progress/blob/master/hpsa-lockless-vs-hch-scsi-mq.4-2014-07-25-1415CDT.tar.bz2?raw=true
> >
> >In some cases, I have probably erred on the side of having too many too
> >small patches, on the theory that it is easier to bake a cake than to
> >unbake a cake.  Before these are submitted "for reals", there will be
> >some squashing of patches, along with other cleaning up.
> >
> >There are some patches in this set which are already upstream in
> >James's tree which do not happen to be in Christoph's tree
> >(most of which are named "*_sent_to_james").  There are also
> >quite a few patches which are strictly for debugging and are not
> >ever intended to be merged.
> >
> Hmm. While you're about to engage on a massive rewrite _and_ we're 
> having 64bit LUN support now, what about getting rid of the weird 
> internal LUN mapping? That way you would get rid of this LUN rescan 
> thingie and the driver would look more sane.

Sorry for my slow reply to this.

I can sympathize with this desire, the device scanning code in the driver
is not my favorite, and it can undoubtedly be improved.  However, there
are some problems.

> Plus the REPORT LUN command would actually return the correct data ...

No, it wouldn't.  Smart Arrays are unfortunately pretty weird.

SCSI REPORT LUNS issued to a smart array will report only the logical
drives.  No physical devices (tape drives, etc.) will be reported.

Instead of SCSI REPORT LUNS, the driver uses a couple of proprietary
variants of this, CISS_REPORT_LOGICAL and CISS_REPORT_PHYSICAL, which
report logical drives and physical devices, and have their own oddities.

And it gets weirder.  This will require some exposition, so bear with me.

What's driving this giant patchball to remove locks from the driver is SSDs.
Suddenly, the assumption that disks are dirt slow relative to the host is not
quite so true anymore.  Smart Array looks something like this:

   +---+
   | host  |
   +-+-+
 |
 | PCI bus
 |
   +-|--+
   | |  |
   | +--RAID stack  |<--- smart array
   |running on  |
   |embedded system |
   |on PCI board|
   | |  |
   |-|--|
   | |  |
   |  Back end firmware |
   ||   |
   +|---+
|
| SAS
|
   +--+
   | physical devices (disks, etc.)   |
   +--+

with the advent of SSDs, it turns out the RAID stack firmware
running on the little embedded system on the PCI board becomes
a bottleneck and hurts performance.

It turns out that with a (significant) firmware change, we
can do something like this:

   +---+
   | host  |
   ++--+
|
| PCI bus
|
   +|---+
   ||  

Re: [PATCH v2 RESEND 08/23] hpsa: Fallback to MSI rather than to INTx if MSI-X failed

2014-08-13 Thread scameron
On Wed, Jul 16, 2014 at 08:05:12PM +0200, Alexander Gordeev wrote:
> Currently the driver falls back to INTx mode when MSI-X
> initialization failed. This is a suboptimal behaviour
> for chips that also support MSI. This update changes that
> behaviour and falls back to MSI mode in case MSI-X mode
> initialization failed.
> 
> Signed-off-by: Alexander Gordeev 
> Cc: "Stephen M. Cameron" 
> Cc: iss_storage...@hp.com
> Cc: linux-scsi@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> ---
>  drivers/scsi/hpsa.c |1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 31184b3..648dec2 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -6177,7 +6177,6 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
>   dev_warn(&h->pdev->dev, "MSI-X init failed %d\n",
>  err);
>   h->msix_vector = 0;
> - goto default_int_mode;
>   }
>   }
>   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSI)) {
> -- 
> 1.7.7.6

Ack.

Sorry for the slow reply, this got lost in the tornadic maelstrom
I call my inbox.

I also tested this with a Smart Array P420, P420i, and P430 in
the system by ifdef'ing out the code for MSI-X and setting
h->msix_vector = 0, to force it down the MSI path, and that all
appears to work. 

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 RESEND 10/23] hpsa: Use pci_enable_msix_range() instead of pci_enable_msix()

2014-08-13 Thread scameron
On Mon, Apr 14, 2014 at 10:05:34AM +0200, Alexander Gordeev wrote:
> As result of deprecation of MSI-X/MSI enablement functions
> pci_enable_msix() and pci_enable_msi_block() all drivers
> using these two interfaces need to be updated to use the
> new pci_enable_msi_range()  or pci_enable_msi_exact()
> and pci_enable_msix_range() or pci_enable_msix_exact()
> interfaces.
> 
> Signed-off-by: Alexander Gordeev 
> Cc: "Stephen M. Cameron" 
> Cc: iss_storage...@hp.com
> Cc: linux-scsi@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> ---
>  drivers/scsi/hpsa.c |   29 +
>  1 files changed, 13 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 3284edb..36bb1e4 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -6144,26 +6144,23 @@ static void hpsa_interrupt_mode(struct ctlr_info *h)
>   goto default_int_mode;
>   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSIX)) {
>   dev_info(&h->pdev->dev, "MSIX\n");
> - h->msix_vector = MAX_REPLY_QUEUES;
> - err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> -   h->msix_vector);
> - if (err > 0) {
> + err = pci_enable_msix_range(h->pdev, hpsa_msix_entries,
> + 1, MAX_REPLY_QUEUES);
> + if (err < 0) {
> + dev_warn(&h->pdev->dev, "MSI-X init failed %d\n", err);
> + h->msix_vector = 0;
> + goto single_msi_mode;
> + } else if (err < MAX_REPLY_QUEUES) {
>   dev_warn(&h->pdev->dev, "only %d MSI-X vectors "
>  "available\n", err);
> - h->msix_vector = err;
> - err = pci_enable_msix(h->pdev, hpsa_msix_entries,
> -   h->msix_vector);
> - }
> - if (!err) {
> - for (i = 0; i < h->msix_vector; i++)
> - h->intr[i] = hpsa_msix_entries[i].vector;
> - return;
> - } else {
> - dev_warn(&h->pdev->dev, "MSI-X init failed %d\n",
> -err);
> - h->msix_vector = 0;
>   }
> +
> + h->msix_vector = err;
> + for (i = 0; i < h->msix_vector; i++)
> + h->intr[i] = hpsa_msix_entries[i].vector;
> + return;
>   }
> +single_msi_mode:
>   if (pci_find_capability(h->pdev, PCI_CAP_ID_MSI)) {
>   dev_info(&h->pdev->dev, "MSI\n");
>   if (!pci_enable_msi(h->pdev))
> -- 
> 1.7.7.6

Ack.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] hpsa: refine the pci enble/disable handling]

2014-08-13 Thread scameron

Let me try again, this time not from gmail.

On Thu, Jun 12, 2014 at 10:29 AM, Tomas Henzl  wrote:
> When a second(kdump) kernel starts and the hard reset method is used
> the driver calls pci_disable_device without previously enabling it,
> so the kernel shows a warning -
> [   16.876248] WARNING: at drivers/pci/pci.c:1431
> pci_disable_device+0x84/0x90()
> [   16.882686] Device hpsa
> disabling already-disabled device
> ...
> This patch fixes it, in addition to this I tried to balance also some
> other pairs
> of enable/disable device in the driver.
> Unfortunately I wasn't able to verify the functionality for the case of a
> sw reset,
> because of a lack of proper hw.
>
> Signed-off-by: Tomas Henzl 
> ---
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 5858600..67c41b9 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -5983,7 +5983,6 @@ static int hpsa_kdump_hard_reset_controller(struct
> pci_dev *pdev)
> /* Turn the board off.  This is so that later pci_restore_state()
>  * won't turn the board on before the rest of config space is
> ready.
>  */
> -   pci_disable_device(pdev);
> pci_save_state(pdev);
>
> /* find the first memory BAR, so we can find the cfg table */
> @@ -6031,11 +6030,6 @@ static int hpsa_kdump_hard_reset_controller(struct
> pci_dev *pdev)
> goto unmap_cfgtable;
>
> pci_restore_state(pdev);
> -   rc = pci_enable_device(pdev);
> -   if (rc) {
> -   dev_warn(&pdev->dev, "failed to enable device.\n");
> -   goto unmap_cfgtable;
> -   }
> pci_write_config_word(pdev, 4, command_register);
>
> /* Some devices (notably the HP Smart Array 5i Controller)
> @@ -6548,6 +6542,12 @@ static int hpsa_init_reset_devices(struct pci_dev
> *pdev)
> if (!reset_devices)
> return 0;
>
> +   rc = pci_enable_device(pdev);
> +   if (rc) {
> +   dev_warn(&pdev->dev, "failed to enable device.\n");
> +   return -ENODEV;
> +   }
> +
> /* Reset the controller with a PCI power-cycle or via doorbell */
> rc = hpsa_kdump_hard_reset_controller(pdev);
>
> @@ -6556,10 +6556,11 @@ static int hpsa_init_reset_devices(struct pci_dev
> *pdev)
>  * "performant mode".  Or, it might be 640x, which can't reset
>  * due to concerns about shared bbwc between 6402/6404 pair.
>  */
> -   if (rc == -ENOTSUPP)
> -   return rc; /* just try to do the kdump anyhow. */
> -   if (rc)
> -   return -ENODEV;
> +   if (rc) {
> +   if (rc != -ENOTSUPP) /* just try to do the kdump anyhow. */
> +   rc = -ENODEV;
> +   goto out_disable;
>


Checkpatch complained of trailing whitespace here. ^^^

Other than that, looks good.

-- steve


> +   }
>
> /* Now try to get the controller to respond to a no-op */
> dev_warn(&pdev->dev, "Waiting for controller to respond to
> no-op\n");
> @@ -6570,7 +6571,11 @@ static int hpsa_init_reset_devices(struct pci_dev
> *pdev)
> dev_warn(&pdev->dev, "no-op failed%s\n",
> (i < 11 ? "; re-trying" : ""));
> }
> -   return 0;
> +
> +out_disable:
> +
> +   pci_disable_device(pdev);
> +   return rc;
>  }
>
>  static int hpsa_allocate_cmd_pool(struct ctlr_info *h)
> @@ -6722,6 +6727,7 @@ static void
> hpsa_undo_allocations_after_kdump_soft_reset(struct ctlr_info *h)
> iounmap(h->transtable);
> if (h->cfgtable)
> iounmap(h->cfgtable);
> +   pci_disable_device(h->pdev);
> pci_release_regions(h->pdev);
> kfree(h);
>  }
> --
> 1.8.3.1
>
>

- End forwarded message -
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


SCSI mid layer and high IOPS capable devices

2012-12-10 Thread scameron

Is there any work going on to improve performance of the SCSI layer
to better support devices capable of high IOPS?

I've been playing around with some flash-based devices and have
a block driver that uses the make_request interface (calls
blk_queue_make_request() rather than blk_init_queue()) and a
SCSI LLD variant of the same driver.  The block driver is similar
in design and performance to the nvme driver.

If I compare the performance, the block driver gets about 3x the
performance as the SCSI LLD.  The SCSI LLD spends a lot of time
(according to perf) contending for locks in scsi_request_fn(),
presumably the host lock or the queue lock, or perhaps both.

All other things being equal, a SCSI LLD would be preferable to
me, but, with performance differing by a factor of around 3x, all
other things are definitely not equal.

I tried using scsi_debug with fake_rw and also the scsi_ram driver
that was recently posted to get some idea of what the maximum IOPS
that could be pushed through the SCSI midlayer might be, and the
numbers were a little disappointing (was getting around 150k iops
with scsi_debug with reads and writes faked, and around 3x that
with the block driver actually doing the i/o).

Essentially, what I've been finding out is consistent with what's in
this slide deck: 
http://static.usenix.org/event/lsf08/tech/IO_Carlson_Accardi_SATA.pdf

The driver, like nvme, has a submit and reply queue per cpu.
I'm sort of guessing that funnelling all the requests through
a single request queue per device that only one cpu can touch
at a time as the scsi mid layer does is a big part of what's
killing performance.  Looking through the scsi code, if I read
it correctly, the assumption that each device has a request queue
seems to be all over the code, so how exactly one might go about
attempting to improve the situation is not really obvious to me.

Anyway, just wondering if anybody is looking into doing some
improvements in this area.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-11 Thread scameron
On Tue, Dec 11, 2012 at 09:21:46AM +0100, Bart Van Assche wrote:
> On 12/11/12 01:00, scame...@beardog.cce.hp.com wrote:
> >I tried using scsi_debug with fake_rw and also the scsi_ram driver
> >that was recently posted to get some idea of what the maximum IOPS
> >that could be pushed through the SCSI midlayer might be, and the
> >numbers were a little disappointing (was getting around 150k iops
> >with scsi_debug with reads and writes faked, and around 3x that
> >with the block driver actually doing the i/o).
> 
> With which request size was that ? 

4k (I'm thinking the request size should not matter too much since
fake_rw=1 causes the i/o not to actually be done -- there's no data 
transferred.  Similarly with scsi_ram there's a flag to discard 
reads and writes that I was using.)

> I see about 330K IOPS @ 4 KB and 
> about 540K IOPS @ 512 bytes with the SRP protocol, a RAM disk at the 
> target side, a single SCSI LUN and a single IB cable. These results have 
> been obtained on a setup with low-end CPU's. Had you set rq_affinity to 
> 2 in your tests ?

No, hadn't done anything with rq_affinity.  I had spread interrupts
around by turning off irqbalance and echoing things into /proc/irq/*, and
running a bunch of dd processes (one per cpu) like this: 

taskset -c $cpu dd if=/dev/blah of=/dev/null bs=4k iflag=direct &

And the hardware in this case should route the interrupts back to the processor
which submitted the i/o (the submitted command contains info that lets the hw
know which msix vector we want the io to come back on.)

I would be curious to see what kind of results you would get with scsi_debug
with fake_rw=1.  I am sort of suspecting that trying to put an "upper limit"
on scsi LLD IOPS performance by seeing what scsi_debug will do with fake_rw=1
is not really valid (or, maybe I'm doing it wrong) as I know of one case in
which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very
same system, which seems like it shouldn't be possible.  Kind of mysterious.

Another mystery I haven't been able to clear up -- I'm using code like
this to set affinity hints 

int i, cpu;

cpu = cpumask_first(cpu_online_mask);
for (i = 0; i < h->noqs; i++) {
int idx = i ? i + 1 : i;
int rc;
rc = irq_set_affinity_hint(h->qinfo[idx].msix_vector,
get_cpu_mask(cpu));

if (rc)
dev_warn(&h->pdev->dev, "Failed to hint affinity of 
vector %d to cpu %d\n",
h->qinfo[idx].msix_vector, cpu);
cpu = cpumask_next(cpu, cpu_online_mask);
}

and those hints are set (querying /proc/irq/*/affinity_hint shows that my hints
are in there) but the hints are not "taken", that is /proc/irq/smp_affinity
does not match the hints.

doing this:

for x in `seq $first_irq $last_irq`
do
cat /proc/irq/$x/affinity_hint > /proc/irq/$x/smp_affinity
done

(where first_irq and last_irq specify the range of irqs assigned
to my driver) makes the hints be "taken".

I noticed nvme doesn't seem to suffer from this, somehow the hints are
taken automatically (er, I don't recall if /proc/irq/*/smp_affinity matches
affinity_hints for nvme, but interrupts seem spread around without doing
anything special).   I haven't seen anything in the nvme code related to 
affinity
that I'm not already doing as well in my driver, so it is a mystery to me why
that difference in behavior occurs.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-13 Thread scameron
On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote:
> On 12/11/12 01:00, scame...@beardog.cce.hp.com wrote:
> >The driver, like nvme, has a submit and reply queue per cpu.
> 
> This is interesting. If my interpretation of the POSIX spec is correct 
> then aio_write() allows to queue overlapping writes and all writes 
> submitted by the same thread have to be performed in the order they were 
> submitted by that thread. What if a thread submits a first write via 
> aio_write(), gets rescheduled on another CPU and submits a second 
> overlapping write also via aio_write() ? If a block driver uses one 
> queue per CPU, does that mean that such writes that were issued in order 
> can be executed in a different order by the driver and/or hardware than 
> the order in which the writes were submitted ?
> 
> See also the aio_write() man page, The Open Group Base Specifications 
> Issue 7, IEEE Std 1003.1-2008 
> (http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html).

It is my understanding that the low level driver is free to re-order the
i/o's any way it wants, as is the hardware.  It is up to the layers above
to enforce any ordering requirements.  For a long time there was a bug
in the cciss driver that all i/o's submitted to the driver got reversed
in order -- adding to head of a list instead of to the tail, or vice versa,
I forget which -- and it caused no real problems (apart from some slight
performance issues that were mostly masked by the Smart Array's cache.
It was caught by firmware guys noticing LBAs coming in in weird orders
for supposedly sequential workloads.

So in your scenario, I think the overlapping writes should not be submitted
by the block layer to the low level driver concurrently, as the block layer
is aware that the lld is free to re-order things.  (I am very certain
that this is the case for scsi low level drivers and block drivers using a
request_fn interface -- less certain about block drivers using the
make_request interface to submit i/o's, as this interface is pretty new
to me.

If I am wrong about any of that, that would be very interesting to know.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-13 Thread scameron
On Thu, Dec 13, 2012 at 12:40:27PM +0100, Bart Van Assche wrote:
> On 12/11/12 23:46, scame...@beardog.cce.hp.com wrote:
> >I would be curious to see what kind of results you would get with 
> >scsi_debug
> >with fake_rw=1.  I am sort of suspecting that trying to put an "upper 
> >limit"
> >on scsi LLD IOPS performance by seeing what scsi_debug will do with 
> >fake_rw=1
> >is not really valid (or, maybe I'm doing it wrong) as I know of one case in
> >which a real HW scsi driver beats scsi_debug fake_rw=1 at IOPS on the very
> >same system, which seems like it shouldn't be possible.  Kind of 
> >mysterious.
> 
> The test
> 
> # disable-frequency-scaling
> # modprobe scsi_debug delay=0 fake_rw=1
> # echo 2 > /sys/block/sdc/queue/rq_affinity
> # echo noop > /sys/block/sdc/queue/scheduler
> # echo 0 > /sys/block/sdc/queue/add_random
> 
> results in about 800K IOPS for random reads on the same setup (with a 
> request size of 4 KB; CPU: quad core i5-2400).
> 
> Repeating the same test with fake_rw=0 results in about 651K IOPS.

What are your system specs?


Here's what I'm seeing.

I have one 6-core processor.

[root@localhost scameron]# grep 'model name' /proc/cpuinfo
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
model name  : Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz

hyperthreading is disabled.

Here is the script I'm running.

[root@localhost scameron]# cat do-dds
#!/bin/sh

do_dd()
{
device="$1"
cpu="$2"

taskset -c "$cpu" dd if="$device" of=/dev/null bs=4k iflag=direct
}

do_six()
{
for x in `seq 0 5`
do
do_dd "$1" $x &
done
}

do_120()
{
for z in `seq 1 20` 
do
do_six "$1"
done
wait
}

time do_120 "$1"

I don't have "disable-frequency-scaling" on rhel6, but I think if I send
SIGUSR1 to all the cpuspeed processes, this does the same thing.

 ps aux | grep cpuspeed | grep -v grep | awk '{ printf("kill -USR1 %s\n", 
$2);}' | sh

[root@localhost scameron]# find /sys -name 'scaling_cur_freq' -print | xargs cat
200
200
200
200
200
200
[root@localhost scameron]#

Now, using scsi-debug (300mb size) with delay=0 and fake_rw=1, with
rq_affinity set to 2, and add_random set to 0 and noop i/o scheduler
I get ~216k iops.

With my scsi lld (actually doing the i/o) , I now get ~190k iops.
rq_affinity set to 2, add_random 0, noop i/o scheduler, irqs
manually spread across cpus (irqbalance turned off).

With my block lld (actually doing the i/o), I get ~380k iops.
rq_affinity set to 2, add_random 0, i/o scheduler "none"
(there is no i/o scheduler with the make_request interface),
irqs manually spread across cpus (irqbalance turned off).

So the block driver seems to beat the snot out of the scsi lld
by a factor of 2x now, rather than 3x, so I guess that's some
improvement, but still.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-13 Thread scameron
On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote:
> On 12/13/12 18:25, scame...@beardog.cce.hp.com wrote:
> >On Thu, Dec 13, 2012 at 04:22:33PM +0100, Bart Van Assche wrote:
> >>On 12/11/12 01:00, scame...@beardog.cce.hp.com wrote:
> >>>The driver, like nvme, has a submit and reply queue per cpu.
> >>
> >>This is interesting. If my interpretation of the POSIX spec is correct
> >>then aio_write() allows to queue overlapping writes and all writes
> >>submitted by the same thread have to be performed in the order they were
> >>submitted by that thread. What if a thread submits a first write via
> >>aio_write(), gets rescheduled on another CPU and submits a second
> >>overlapping write also via aio_write() ? If a block driver uses one
> >>queue per CPU, does that mean that such writes that were issued in order
> >>can be executed in a different order by the driver and/or hardware than
> >>the order in which the writes were submitted ?
> >>
> >>See also the aio_write() man page, The Open Group Base Specifications
> >>Issue 7, IEEE Std 1003.1-2008
> >>(http://pubs.opengroup.org/onlinepubs/9699919799/functions/aio_write.html).
> >
> >It is my understanding that the low level driver is free to re-order the
> >i/o's any way it wants, as is the hardware.  It is up to the layers above
> >to enforce any ordering requirements.  For a long time there was a bug
> >in the cciss driver that all i/o's submitted to the driver got reversed
> >in order -- adding to head of a list instead of to the tail, or vice versa,
> >I forget which -- and it caused no real problems (apart from some slight
> >performance issues that were mostly masked by the Smart Array's cache.
> >It was caught by firmware guys noticing LBAs coming in in weird orders
> >for supposedly sequential workloads.
> >
> >So in your scenario, I think the overlapping writes should not be submitted
> >by the block layer to the low level driver concurrently, as the block layer
> >is aware that the lld is free to re-order things.  (I am very certain
> >that this is the case for scsi low level drivers and block drivers using a
> >request_fn interface -- less certain about block drivers using the
> >make_request interface to submit i/o's, as this interface is pretty new
> >to me.
> 
> As far as I know there are basically two choices:
> 1. Allow the LLD to reorder any pair of write requests. The only way
>for higher layers to ensure the order of (overlapping) writes is then
>to separate these in time. Or in other words, limit write request
>queue depth to one.
>
> 2. Do not allow the LLD to reorder overlapping write requests. This
>allows higher software layers to queue write requests (queue depth
>> 1).
> 
> From my experience with block and SCSI drivers option (1) doesn't look 
> attractive from a performance point of view. From what I have seen 
> performance with QD=1 is several times lower than performance with QD > 
> 1. But maybe I overlooked something ?

I don't think 1 is how it works, and I know 2 is not how it works.
LLD's are definitely allowed to re-order i/o's arbitrarily (and so is
the hardware (e.g. array controller or disk drive)).  

If you need an i/o to complete before another begins, 
don't give the 2nd i/o to the LLD before the 1st completes, but be smarter
than limiting all writes to queue depth of 1 by knowing when you care
about the order.  If my understanding is correct, The buffer cache will,
for the most part, make sure there generally aren't many overlapping or
order-dependent i/o's by essentially combining multiple overlapping writes
into a single write, but for filesystem meta data, or direct i/o, there
may of course be application specific ordering requirements, and the answer
is, I think, the application (e.g. filesystem) needs to know when it care
 about the order, and wait for completions as necessary when it does care, and
take pains that it should not care about the order most of the time if 
performance is important (one of the reasons the buffer cache exists.)

(I might be wrong though.)

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-14 Thread scameron
On Fri, Dec 14, 2012 at 10:44:34AM +0100, Bart Van Assche wrote:
> On 12/13/12 17:49, Christoph Hellwig wrote:
> >On Thu, Dec 13, 2012 at 05:47:14PM +0100, Bart Van Assche wrote:
> >> From my experience with block and SCSI drivers option (1) doesn't
> >>look attractive from a performance point of view. From what I have
> >>seen performance with QD=1 is several times lower than performance
> >>with QD > 1. But maybe I overlooked something ?
> >
> >What you might be missing is that at least on Linux no one who cares
> >about performance uses the Posix AIO inferface anyway, as the
> >implementation in glibc always has been horrible.  The Linux-native
> >aio interface or the various thread pool implementations don't imply
> >useless ordering and thus can be used to fill up large queues.
> 
> Some applications need write ordering without having a need for 
> enforcing durability as fsync() does [1]. What I'm wondering about is 
> whether an operating system kernel like the Linux kernel should penalize 
> application performance when using block drivers and storage hardware 
> that preserve the order of write requests because there exist other 
> drivers and storage devices that do not preserve the order of write 
> requests ?

Which devices don't re-order requests?  So far as I know every single
disk drive ever made that is capable of handling multiple requests will
also re-order requests as it sees fit.

I expect the flash devices re-order requests as well, simply because
to feed requests to the things at a sufficient rate, you have to pump
requests into them concurrently on multiple hardware queues -- a single
cpu jamming requests into them as fast as it can is still not fast enough
to keep them busy.  Consequently, they *can't* care about ordering, as the
relative order requests on different hardware queues are submitted into them
is not even really controlled, so the OS *can't* count on concurrent requests
not to be essentially "re-ordered", just because of the nature of the way
requests get into the device.

So I think the property that devices and drivers are free to reorder
concurrent requests is not going away.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-14 Thread scameron
On Fri, Dec 14, 2012 at 05:15:37PM +0100, Bart Van Assche wrote:
> On 12/14/12 17:44, scame...@beardog.cce.hp.com wrote:
> >I expect the flash devices re-order requests as well, simply because
> >to feed requests to the things at a sufficient rate, you have to pump
> >requests into them concurrently on multiple hardware queues -- a single
> >cpu jamming requests into them as fast as it can is still not fast enough
> >to keep them busy.  Consequently, they *can't* care about ordering, as the
> >relative order requests on different hardware queues are submitted into 
> >them
> >is not even really controlled, so the OS *can't* count on concurrent 
> >requests
> >not to be essentially "re-ordered", just because of the nature of the way
> >requests get into the device.
> 
> Why should a flash device have to reorder write requests ? These devices 
> typically use a log-structured file system internally.

It's not so much that they are re-ordered as that there is no controlled
ordering to begin with because multiple cpus are submitting to multiple
hardware queues concurrently.  If you have 12 requests coming in on 12
cpus to 12 hardware queues to the device, it's going to be racy as to
which request is processed first by the device -- and this is fine, the
hardware queues are independent of one another and do not need to worry
about each other.  This is all to provide a means of getting enough commands
on the device to actually keep it busy.  A single cpu can't do it, the
device is too fast.  If you have ordering dependencies such that request
A must complete before request B completes, then don't submit A and B
concurrently, because if you do submit them concurrently, you cannot tell
whether A or B will arrive into the device first because they may go into
it via different hardware queues.

Note, in case it isn't obvious, the hardware queues I'm talking about here
are not the struct scsi_device, sdev->request_queue queues, they are
typically ring buffers in host memory from which the device DMAs 
commands/responses
to/from depending on if it's a submit queue or a completion queue and with
producer/consumer indexes one of which is in host memory and one of which
is a register on the device (which is which depends on the direction of the
queue, from device (pi = host memory, ci = device register), or to device
(pi = device register, ci = host memory))

-- steve
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SCSI mid layer and high IOPS capable devices

2012-12-14 Thread scameron
On Fri, Dec 14, 2012 at 08:28:56PM +0100, Bart Van Assche wrote:
> On 12/14/12 20:55, scame...@beardog.cce.hp.com wrote:
> >It's not so much that they are re-ordered as that there is no controlled
> >ordering to begin with because multiple cpus are submitting to multiple
> >hardware queues concurrently.  If you have 12 requests coming in on 12
> >cpus to 12 hardware queues to the device, it's going to be racy as to
> >which request is processed first by the device -- and this is fine, the
> >hardware queues are independent of one another and do not need to worry
> >about each other.  This is all to provide a means of getting enough 
> >commands
> >on the device to actually keep it busy.  A single cpu can't do it, the
> >device is too fast.  If you have ordering dependencies such that request
> >A must complete before request B completes, then don't submit A and B
> >concurrently, because if you do submit them concurrently, you cannot tell
> >whether A or B will arrive into the device first because they may go into
> >it via different hardware queues.
> 
> It depends on how these multiple queues are used. If each queue would 
> e.g. be associated with a disjoint LBA range of the storage device then 
> there wouldn't be a risk of request reordering due to using multiple 
> hardware queues.

They are not associated with disjoint LBA ranges. They are associated
with CPUs on the submission side, there's a submit queue per cpu, and
msix vectors on the completion side (also a completion queue per cpu).

The point of the queues is only to provide a wide enough highway to
allow enough requests to be shoved down to the device fast enough
and completed back to the host fast enough that the device can be kept
reasonably busy, instead of being starved for work to do.  There is
no distinction about what the requests may do based on what hardware
i/o queue they come in on (e.g. no lba range partitioning). All the
i/o queues are equivalent.

Pretty much all current storage devices, disk drives and the devices I'm
talking about in particular do depend on the low level driver and storage
devices being permitted to re-order requests.  So I don't think the discussion
about drivers and devices that *do not* reorder requests (which drivers and 
devices
would those be?) is very related to the topic of how to get the scsi mid layer
to provide a wide enough highway for requests destined for very low latency
devices.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[LSF/MM TOPIC] SCSI Express driver

2013-01-31 Thread scameron

If there is interest, I would like to discuss the SCSI Express driver
Chayan and I have been working on for the last few months.

-- steve

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/1] scsi: fix a few misleading comments in scsi.c

2007-10-26 Thread scameron

Fix a few misleading comments in scsi.c

Signed-off-by: Stephen M. Cameron <[EMAIL PROTECTED]>

---


diff -puN linux-2.6.24-rc1/drivers/scsi/scsi.c~fix_scsi_comments 
linux-2.6.24-rc1/drivers/scsi/scsi.c
--- dynamic_scsi/linux-2.6.24-rc1/drivers/scsi/scsi.c~fix_scsi_comments 
2007-10-25 08:52:01.0 -0400
+++ dynamic_scsi-root/linux-2.6.24-rc1/drivers/scsi/scsi.c  2007-10-25 
08:53:45.0 -0400
@@ -897,7 +897,7 @@ EXPORT_SYMBOL(__scsi_iterate_devices);
  * @starget:   target whose devices we want to iterate over.
  *
  * This traverses over each devices of @shost.  The devices have
- * a reference that must be released by scsi_host_put when breaking
+ * a reference that must be released by scsi_device_put when breaking
  * out of the loop.
  */
 void starget_for_each_device(struct scsi_target *starget, void * data,
@@ -949,7 +949,7 @@ EXPORT_SYMBOL(__scsi_device_lookup_by_ta
  *
  * Looks up the scsi_device with the specified @channel, @id, @lun for a
  * give host.  The returned scsi_device has an additional reference that
- * needs to be release with scsi_host_put once you're done with it.
+ * needs to be release with scsi_device_put once you're done with it.
  **/
 struct scsi_device *scsi_device_lookup_by_target(struct scsi_target *starget,
 uint lun)
@@ -1008,7 +1008,7 @@ EXPORT_SYMBOL(__scsi_device_lookup);
  *
  * Looks up the scsi_device with the specified @channel, @id, @lun for a
  * give host.  The returned scsi_device has an additional reference that
- * needs to be release with scsi_host_put once you're done with it.
+ * needs to be release with scsi_device_put once you're done with it.
  **/
 struct scsi_device *scsi_device_lookup(struct Scsi_Host *shost,
uint channel, uint id, uint lun)
_
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 1/1] cciss: notify scsi mid layer of device changes

2007-11-06 Thread scameron

Make the cciss driver notify the scsi mid layer of device
changes (e.g. hot pluggable tape drives mostly).  Also 
made it register with the scsi mid layer even if no 
devices are present so adding the first device is
not different than adding 2nd or nth device.

Signed-off-by: Stephen M. Cameron <[EMAIL PROTECTED]>
Acked-by: Mike Miller <[EMAIL PROTECTED]>

---


diff -puN linux-2.6.24-rc1/drivers/block/cciss_scsi.c~dynamic_scsi 
linux-2.6.24-rc1/drivers/block/cciss_scsi.c
--- dynamic_scsi/linux-2.6.24-rc1/drivers/block/cciss_scsi.c~dynamic_scsi   
2007-10-24 08:31:21.0 -0400
+++ dynamic_scsi-root/linux-2.6.24-rc1/drivers/block/cciss_scsi.c   
2007-11-06 08:32:19.0 -0500
@@ -359,9 +359,15 @@ find_bus_target_lun(int ctlr, int *bus, 
return (!found);
 }
 
+struct scsi2map {
+   char scsi3addr[8];
+int bus, target, lun;
+};
+
 static int 
-cciss_scsi_add_entry(int ctlr, int hostno, 
-   unsigned char *scsi3addr, int devtype)
+cciss_scsi_add_entry(int ctlr, int hostno,
+   unsigned char *scsi3addr, int devtype,
+   struct scsi2map *added, int *nadded)
 {
/* assumes hba[ctlr]->scsi_ctlr->lock is held */ 
int n = ccissscsi[ctlr].ndevices;
@@ -375,6 +381,13 @@ cciss_scsi_add_entry(int ctlr, int hostn
sd = &ccissscsi[ctlr].dev[n];
if (find_bus_target_lun(ctlr, &sd->bus, &sd->target, &sd->lun) != 0)
return -1;
+
+   memcpy(added[*nadded].scsi3addr, scsi3addr, 8);
+   added[*nadded].bus = sd->bus;
+   added[*nadded].target = sd->target;
+   added[*nadded].lun = sd->lun;
+   (*nadded)++;
+
memcpy(&sd->scsi3addr[0], scsi3addr, 8);
sd->devtype = devtype;
ccissscsi[ctlr].ndevices++;
@@ -390,7 +403,8 @@ cciss_scsi_add_entry(int ctlr, int hostn
 }
 
 static void
-cciss_scsi_remove_entry(int ctlr, int hostno, int entry)
+cciss_scsi_remove_entry(int ctlr, int hostno, int entry,
+   struct scsi2map *removed, int *nremoved)
 {
/* assumes hba[ctlr]->scsi_ctlr->lock is held */ 
int i;
@@ -398,6 +412,11 @@ cciss_scsi_remove_entry(int ctlr, int ho
 
if (entry < 0 || entry >= CCISS_MAX_SCSI_DEVS_PER_HBA) return;
sd = ccissscsi[ctlr].dev[entry];
+   memcpy(removed[*nremoved].scsi3addr, sd.scsi3addr, 8);
+   removed[*nremoved].bus= sd.bus;
+   removed[*nremoved].target = sd.target;
+   removed[*nremoved].lun= sd.lun;
+   (*nremoved)++;
for (i=entry;iscsi_ctlr)->scsi_host;
+
/* find any devices in ccissscsi[] that are not in 
   sd[] and remove them from ccissscsi[] */
 
i = 0;
+   nremoved = 0;
+   nadded = 0;
while(idevtype), hostno,
csd->bus, csd->target, csd->lun); */
-   cciss_scsi_remove_entry(ctlr, hostno, i);
-   /* note, i not incremented */
+   cciss_scsi_remove_entry(ctlr, hostno, i,
+   removed, &nremoved);
+   /* remove ^^^, hence i not incremented */
} 
else if (found == 1) { /* device is different kind */
changes++;
@@ -464,8 +520,15 @@ adjust_cciss_scsi_table(int ctlr, int ho
"(device type now %s).\n",
ctlr, hostno, csd->bus, csd->target, csd->lun,
scsi_device_type(csd->devtype));
+   cciss_scsi_remove_entry(ctlr, hostno, i,
+   removed, &nremoved);
+   /* remove ^^^, hence i not incremented */
+   if (cciss_scsi_add_entry(ctlr, hostno,
+   &sd[j].scsi3addr[0], sd[j].devtype,
+   added, &nadded) != 0)
+   /* we just removed one, so add can't fail. */
+   BUG();
csd->devtype = sd[j].devtype;
-   i++;/* so just move along. */
} else  /* device is same as it ever was, */
i++;/* so just move along. */
}
@@ -489,7 +552,8 @@ adjust_cciss_scsi_table(int ctlr, int ho
if (!found) {
changes++;
if (cciss_scsi_add_entry(ctlr, hostno, 
-   &sd[i].scsi3addr[0], sd[i].devtype) != 0)
+   &sd[i].scsi3addr[0], sd[i].devtype,
+   added, &nadded) != 0)
break;
} else if (found == 1) {
/* should never happen... */
@@ -501,10 +565,49 @@ adjust_cciss_scsi_table(int ctlr, int ho
}
CPQ_TAPE_UNLOCK(ctlr, flags);
 
-   if (!changes) 
-   printk("cciss%d: No device changes detected.\n", ctlr);
+ 

[patch 3/3] cciss: Add support for multi-lun devices (tape libraries)

2008-02-04 Thread scameron


Add support for multi-lun tape devices to the cciss driver.

Signed-off-by: Stephen M. Cameron <[EMAIL PROTECTED]>
---

 linux-2.6.24/drivers/block/cciss_scsi.c |   40 +---
 1 files changed, 37 insertions(+), 3 deletions(-)

diff -puN linux-2.6.24/drivers/block/cciss_scsi.c~cciss_multilun_tape 
linux-2.6.24/drivers/block/cciss_scsi.c
--- kernel.org2/linux-2.6.24/drivers/block/cciss_scsi.c~cciss_multilun_tape 
2008-02-04 07:52:48.0 -0600
+++ kernel.org2-root/linux-2.6.24/drivers/block/cciss_scsi.c2008-02-04 
07:52:48.0 -0600
@@ -371,16 +371,50 @@ cciss_scsi_add_entry(int ctlr, int hostn
/* assumes hba[ctlr]->scsi_ctlr->lock is held */ 
int n = ccissscsi[ctlr].ndevices;
struct cciss_scsi_dev_t *sd;
+   int i, bus, target, lun;
+   unsigned char addr1[8], addr2[8];
 
if (n >= CCISS_MAX_SCSI_DEVS_PER_HBA) {
printk("cciss%d: Too many devices, "
"some will be inaccessible.\n", ctlr);
return -1;
}
-   sd = &ccissscsi[ctlr].dev[n];
-   if (find_bus_target_lun(ctlr, &sd->bus, &sd->target, &sd->lun) != 0)
-   return -1;
 
+   bus = target = -1;
+   lun = 0;
+   /* Is this device a non-zero lun of a multi-lun device */
+   /* byte 4 of the 8-byte LUN addr will contain the logical unit no. */
+   if (scsi3addr[4] != 0) {
+   /* Search through our list and find the device which */
+   /* has the same 8 byte LUN address, excepting byte 4. */
+   /* Assign the same bus and target for this new LUN. */
+   /* Use the logical unit number from the firmware. */
+   memcpy(addr1, scsi3addr, 8);
+   addr1[4] = 0;
+   for (i = 0; i < n; i++) {
+   sd = &ccissscsi[ctlr].dev[i];
+   memcpy(addr2, sd->scsi3addr, 8);
+   addr2[4] = 0;
+   /* differ only in byte 4? */
+   if (memcmp(addr1, addr2, 8) == 0) {
+   bus = sd->bus;
+   target = sd->target;
+   lun = scsi3addr[4];
+   break;
+   }
+   }
+   }
+
+   sd = &ccissscsi[ctlr].dev[n];
+   if (lun == 0) {
+   if (find_bus_target_lun(ctlr,
+   &sd->bus, &sd->target, &sd->lun) != 0)
+   return -1;
+   } else {
+   sd->bus = bus;
+   sd->target = target;
+   sd->lun = lun;
+   }
added[*nadded].bus = sd->bus;
added[*nadded].target = sd->target;
added[*nadded].lun = sd->lun;
_
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[patch 2/3] cciss: auto notify scsi midlayer of device changes

2008-02-04 Thread scameron



Make the cciss driver automaticaly notify the SCSI mid layer when
it finds out about added or removed devices.

Signed-off-by: Stephen M. Cameron <[EMAIL PROTECTED]>
---

 linux-2.6.24/Documentation/cciss.txt|   21 +---
 linux-2.6.24/drivers/block/cciss_scsi.c |  157 
 2 files changed, 124 insertions(+), 54 deletions(-)

diff -puN linux-2.6.24/drivers/block/cciss_scsi.c~dynamic_scsi 
linux-2.6.24/drivers/block/cciss_scsi.c
--- kernel.org2/linux-2.6.24/drivers/block/cciss_scsi.c~dynamic_scsi
2008-02-04 07:52:44.0 -0600
+++ kernel.org2-root/linux-2.6.24/drivers/block/cciss_scsi.c2008-02-04 
07:52:44.0 -0600
@@ -358,10 +358,15 @@ find_bus_target_lun(int ctlr, int *bus, 
}
return (!found);
 }
+struct scsi2map {
+   char scsi3addr[8];
+   int bus, target, lun;
+};
 
 static int 
 cciss_scsi_add_entry(int ctlr, int hostno, 
-   unsigned char *scsi3addr, int devtype)
+   unsigned char *scsi3addr, int devtype,
+   struct scsi2map *added, int *nadded)
 {
/* assumes hba[ctlr]->scsi_ctlr->lock is held */ 
int n = ccissscsi[ctlr].ndevices;
@@ -375,6 +380,12 @@ cciss_scsi_add_entry(int ctlr, int hostn
sd = &ccissscsi[ctlr].dev[n];
if (find_bus_target_lun(ctlr, &sd->bus, &sd->target, &sd->lun) != 0)
return -1;
+
+   added[*nadded].bus = sd->bus;
+   added[*nadded].target = sd->target;
+   added[*nadded].lun = sd->lun;
+   (*nadded)++;
+
memcpy(&sd->scsi3addr[0], scsi3addr, 8);
sd->devtype = devtype;
ccissscsi[ctlr].ndevices++;
@@ -390,7 +401,8 @@ cciss_scsi_add_entry(int ctlr, int hostn
 }
 
 static void
-cciss_scsi_remove_entry(int ctlr, int hostno, int entry)
+cciss_scsi_remove_entry(int ctlr, int hostno, int entry,
+   struct scsi2map *removed, int *nremoved)
 {
/* assumes hba[ctlr]->scsi_ctlr->lock is held */ 
int i;
@@ -398,6 +410,10 @@ cciss_scsi_remove_entry(int ctlr, int ho
 
if (entry < 0 || entry >= CCISS_MAX_SCSI_DEVS_PER_HBA) return;
sd = ccissscsi[ctlr].dev[entry];
+   removed[*nremoved].bus= sd.bus;
+   removed[*nremoved].target = sd.target;
+   removed[*nremoved].lun= sd.lun;
+   (*nremoved)++;
for (i=entry;iscsi_ctlr)->scsi_host;
+
/* find any devices in ccissscsi[] that are not in 
   sd[] and remove them from ccissscsi[] */
 
i = 0;
+   nremoved = 0;
+   nadded = 0;
while(idevtype), hostno,
csd->bus, csd->target, csd->lun); */
-   cciss_scsi_remove_entry(ctlr, hostno, i);
-   /* note, i not incremented */
+   cciss_scsi_remove_entry(ctlr, hostno, i,
+   removed, &nremoved);
+   /* remove ^^^, hence i not incremented */
} 
else if (found == 1) { /* device is different kind */
changes++;
@@ -464,8 +521,15 @@ adjust_cciss_scsi_table(int ctlr, int ho
"(device type now %s).\n",
ctlr, hostno, csd->bus, csd->target, csd->lun,
scsi_device_type(csd->devtype));
+   cciss_scsi_remove_entry(ctlr, hostno, i,
+   removed, &nremoved);
+   /* remove ^^^, hence i not incremented */
+   if (cciss_scsi_add_entry(ctlr, hostno,
+   &sd[j].scsi3addr[0], sd[j].devtype,
+   added, &nadded) != 0)
+   /* we just removed one, so add can't fail. */
+   BUG();
csd->devtype = sd[j].devtype;
-   i++;/* so just move along. */
} else  /* device is same as it ever was, */
i++;/* so just move along. */
}
@@ -489,7 +553,9 @@ adjust_cciss_scsi_table(int ctlr, int ho
if (!found) {
changes++;
if (cciss_scsi_add_entry(ctlr, hostno, 
-   &sd[i].scsi3addr[0], sd[i].devtype) != 0)
+
+   &sd[i].scsi3addr[0], sd[i].devtype,
+   added, &nadded) != 0)
break;
} else if (found == 1) {
/* should never happen... */
@@ -501,9 +567,50 @@ adjust_cciss_scsi_table(int ctlr, int ho
}
CPQ_TAPE_UNLOCK(ctlr, flags);
 
-   if (!changes) 
-   printk("cciss%d: No device changes detected.\n", ctlr);
+   /* Don't notify scsi mid layer of any changes the first time through */
+   /* (or if there are no changes) scsi_scan_host will do it later the */
+   /* first time through.

[patch 1/3] cciss: Don't call pci_free_consistent with irqs disabled

2008-02-04 Thread scameron


Don't call pci_free_consistent with irqs disabled
(Was triggering a warning in arch/x86/kernel/pci-dma_32.c
in dma_free_coherent)

Signed-off-by: Stephen M. Cameron <[EMAIL PROTECTED]>
---

 linux-2.6.24/drivers/block/cciss_scsi.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)

diff -puN linux-2.6.24/drivers/block/cciss_scsi.c~fix_pci_free_irq_bug 
linux-2.6.24/drivers/block/cciss_scsi.c
--- kernel.org2/linux-2.6.24/drivers/block/cciss_scsi.c~fix_pci_free_irq_bug
2008-02-04 07:52:39.0 -0600
+++ kernel.org2-root/linux-2.6.24/drivers/block/cciss_scsi.c2008-02-04 
07:52:39.0 -0600
@@ -1349,9 +1349,9 @@ cciss_unregister_scsi(int ctlr)
/* set scsi_host to NULL so our detect routine will 
   find us on register */
sa->scsi_host = NULL;
+   spin_unlock_irqrestore(CCISS_LOCK(ctlr), flags);
scsi_cmd_stack_free(ctlr);
kfree(sa);
-   spin_unlock_irqrestore(CCISS_LOCK(ctlr), flags);
 }
 
 static int 
_
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [SCSI] hpsa: dial down lockup detection during firmware flash

2012-10-10 Thread scameron

Yes, I think so.  Thanks.

-- steve

On Wed, Oct 10, 2012 at 04:47:38AM +0100, Ben Hutchings wrote:
> Should this fix for hpsa be included in stable updates?  It looks like
> it would be needed in 3.2.y and 3.4.y, as lockup detection was
> introduced in 3.2 and the fix went into 3.5.
> 
> commit e85c59746957fd6e3595d02cf614370056b5816e
> Author: Stephen M. Cameron 
> Date:   Tue May 1 11:43:42 2012 -0500
> 
> [SCSI] hpsa: dial down lockup detection during firmware flash
> 
> Ben.
> 
> -- 
> Ben Hutchings
> Who are all these weirdos? - David Bowie, about L-Space IRC channel #afp


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html