RE: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD

2019-10-23 Thread Kashyap Desai
> RE: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD
>
> On Fri, Oct 18, 2019 at 12:00:07AM +0530, Kashyap Desai wrote:
> > > On 10/9/19 2:32 AM, Ming Lei wrote:
> > > > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device
> > > > *sdev,
> > > struct scsi_cmnd *cmd)
> > > > if (starget->can_queue > 0)
> > > > atomic_dec(&starget->target_busy);
> > > >
> > > > -   atomic_dec(&sdev->device_busy);
> > > > +   if (!blk_queue_nonrot(sdev->request_queue))
> > > > +   atomic_dec(&sdev->device_busy);
> > > >   }
> > > >
> > >
> > > Hi Ming,
> > >
> > > Does this patch impact the meaning of the queue_depth sysfs
> > > attribute (see also sdev_store_queue_depth()) and also the queue
> > > depth ramp up/down mechanism (see also
> scsi_handle_queue_ramp_up())?
> > > Have you considered to enable/disable busy tracking per LUN
> > > depending on whether or not sdev-
> > > >queue_depth < shost->can_queue?
> > >
> > > The megaraid and mpt3sas drivers read sdev->device_busy directly. Is
> > > the current version of this patch compatible with these drivers?
> >
> > We need to know per scsi device outstanding in mpt3sas and
> > megaraid_sas driver.
>
> Is the READ done in fast path or slow path? If it is on slow path, it should be
> easy to do via blk_mq_in_flight_rw().

READ is done in fast path.

>
> > Can we get supporting API from block layer (through SML)  ? something
> > similar to "atomic_read(&hctx->nr_active)" which can be derived from
> > sdev->request_queue->hctx ?
> > At least for those driver which is nr_hw_queue = 1, it will be useful
> > and we can avoid sdev->device_busy dependency.
>
> If you mean to add new atomic counter, we just move the .device_busy into
> blk-mq, that can become new bottleneck.

How about the below? We define and use the API below instead of
"atomic_read(&scp->device->device_busy)", and it returns the expected
value. I have not yet measured the performance impact on a max-IOPS
profile.

static inline unsigned long sdev_nr_inflight_request(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned long nr_requests = 0;
	int i;

	queue_for_each_hw_ctx(q, hctx, i)
		nr_requests += atomic_read(&hctx->nr_active);

	return nr_requests;
}
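
A minimal sketch of how a LLD might consume this proposed helper in place of
the direct sdev->device_busy read. Note the helper above is only a proposal,
not an existing block-layer API, and the wrapper name and queue-depth
parameter below are made up for illustration (assumes <scsi/scsi_device.h>):

/*
 * Hypothetical caller: report whether a scsi_device already has 'qd'
 * commands in flight, using the proposed sdev_nr_inflight_request()
 * instead of atomic_read(&sdev->device_busy).
 */
static bool drv_sdev_is_saturated(struct scsi_device *sdev, unsigned int qd)
{
	return sdev_nr_inflight_request(sdev->request_queue) >= qd;
}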

Kashyap

>
>
> thanks,
> Ming


RE: [RFC PATCH V4 2/2] scsi: core: don't limit per-LUN queue depth for SSD

2019-10-17 Thread Kashyap Desai
> On 10/9/19 2:32 AM, Ming Lei wrote:
> > @@ -354,7 +354,8 @@ void scsi_device_unbusy(struct scsi_device *sdev,
> struct scsi_cmnd *cmd)
> > if (starget->can_queue > 0)
> > atomic_dec(&starget->target_busy);
> >
> > -   atomic_dec(&sdev->device_busy);
> > +   if (!blk_queue_nonrot(sdev->request_queue))
> > +   atomic_dec(&sdev->device_busy);
> >   }
> >
>
> Hi Ming,
>
> Does this patch impact the meaning of the queue_depth sysfs attribute (see
> also sdev_store_queue_depth()) and also the queue depth ramp up/down
> mechanism (see also scsi_handle_queue_ramp_up())? Have you considered to
> enable/disable busy tracking per LUN depending on whether or not sdev-
> >queue_depth < shost->can_queue?
>
> The megaraid and mpt3sas drivers read sdev->device_busy directly. Is the
> current version of this patch compatible with these drivers?

We need to know the per-scsi-device outstanding I/O count in the mpt3sas and
megaraid_sas drivers.
Can we get a supporting API from the block layer (through SML)? Something
similar to "atomic_read(&hctx->nr_active)", which can be derived from
sdev->request_queue->hctx.
At least for drivers with nr_hw_queue = 1 it will be useful, and we can
avoid the sdev->device_busy dependency.


Kashyap
>
> Thanks,
>
> Bart.


Re: [V3 00/10] mpt3sas: Aero/Sea HBA feature addition

2019-06-18 Thread Kashyap Desai
On Fri, Jun 7, 2019 at 10:19 PM Martin K. Petersen wrote:
>
>
> Kashyap,
>
> > AMD EPYC is not efficient w.r.t QPI transaction.
> [...]
> > Same test on Intel architecture provides better result
>
> Heuristics are always hard.
>
> However, you are making assumptions based on observed performance of
> current Intel offerings vs. current AMD offerings. This results in what
> is inevitably going to be a short-lived heuristic in the kernel. Things
> could easily be reversed in next generation platforms from these
> vendors.
>
> So while I appreciate that the logic works given the machines you are
> currently testing, I think CPU manufacturer is a horrible heuristic. You
> are stating "This will be the right choice for all future processors
> manufactured by Intel". That's a bit of a leap of faith.
>
> Instead of predicting the future I prefer to make decisions based on
> things we know. Measured negative impact on current EPYC family, for
> instance. That's a fairly well-defined and narrow scope.
>
> That said, I am still not a big fan of platform-specific tweaks in
> drivers. While I prefer the kernel to do the right thing out of the box,
> I think the module parameter is probably the better choice in this case.

Martin,
If we decide to remove the cpu arch check later, it will be unnecessarily
complex to explain the default driver behavior, since we may end up with
two different behaviors.
We are going to remove the cpu architecture detection logic. It is better
to have the module-parameter-based dependency from day one.
We will be sending the relevant patch soon.

Kashyap


>
> --
> Martin K. Petersen  Oracle Linux Engineering


RE: [V3 00/10] mpt3sas: Aero/Sea HBA feature addition

2019-06-07 Thread Kashyap Desai
>
> Suganath,
>
> I applied this series to 5.3/scsi-queue.
>
> However, I remain unconvinced of the merits of the config page putback. Why
> even bother if a controller reset causes the defaults to be loaded from
> NVRAM?
>
> Also, triggering on X86 for selecting performance mode seems questionable. I
> would like to see a follow-on patch that comes up with a better heuristic.

Martin -

AMD EPYC is not efficient w.r.t. QPI transactions. I tested performance on
the AMD EPYC 7601 chipset, which has 128 logical CPUs in total.
The Aero/Sea controller supports at most 128 MSI-x vectors, so in the good
case we have a 1:1 CPU-to-MSI-x mapping. I can get 2.4M IOPS in this case.

Just to simulate the performance issue, I reduced the controller MSI-x
vector count to 64, which makes the CPU-to-MSI-x mapping 2:1. Indirectly, I
am trying to generate completions that have to be delivered to a remote CPU
(via call_function_single_interrupt).
In this case, I get 1.7M IOPS.

The same test on Intel architecture provides a better result (negligible
performance impact). This patch set maps the high-IOPS queues (queues with
interrupt coalescing turned on) to the local NUMA node.
The high-IOPS queue count is limited, and their IO completion depends on
QPI. We have enabled this feature only for Intel, where we have seen an
improvement. Not having this feature is not bad, but if we enable it we may
see a negative impact where QPI overhead is high (as on AMD).

Kashyap

>
> --
> Martin K. Petersen  Oracle Linux Engineering


RE: [PATCH] mpt3sas: Fix kernel panic occurs during expander reset

2019-03-21 Thread Kashyap Desai
> > > > >>> Hannes & Christoph: Please comment on Sreekanth's proposed
> approach.
> > > > >>
> > > > >> Iterating over all tags from the driver is always wrong.  We've
> > > > >> been though this a few times.
> > > > >
> > > > > Current issue is very easy to be reproduced and it is widely
> > > > > impacted.
> > > > > We proposed this approach i.e. invoking scsi_host_find_tag() for
> > > > > only those tags which are outstanding at the driver level; as
> > > > > this  has very minimal code changes without impacting any design
> > > > > and also it will work in both non-mq + mq mode.
> > > > > We can rework on those code sections where driver is iterating
> > > > > over all tags. I understood from your reply that - "Low level
> > > > > driver should not have any requirement to loop outstanding IOs".
> > > > > Not sure if such things can be done without SML support. AFAIK,
> > > > > similar issue is very generic and many low level scsi driver has
> > > > > similar
> requirement.
> > > > >
> > > > > Can we go with current solution assuming any new interface as
> > > > > you requested can be done as separate activity?


Hi Martin, Christopher,

Can you please consider the latest fix, since this issue is being hit in the
field at multiple sites and it is critical? We can work on further
improvements as Christopher mentioned in his last comment.
Having the driver find the total number of commands outstanding at the
firmware is very common across many SCSI drivers, so we may have to figure
out what the mid layer can best provide to avoid that per-driver work.

Kashyap

> > > > >
> > > > > Thanks,
> > > > > Sreekanth
> > > > >
> > > >
> > > > In context of this issue (in my case kernel panics on shutdown
> > > > that I mentioned in another mail some time ago) - which patch
> > > > should I be using (even if temporarily) ? Currently I'm on
> > > > https://patchwork.kernel.org/patch/10829927/ .
> > >
> > > Please use below patch,
> > > https://patchwork.kernel.org/patch/1083/
> > >
> > > Chris, Hannes,
> > > Just a gentle ping..
> > > This patch will just fix this kernel panic which are observed during
> > > expander resets, system shutdown or driver unload operation.
> > > It has a very minimal code change without impacting any design.
> > > Many customers are observing this issue. Please consider this patch.
> > >
> > > As I mentioned in the above mail that we can rework on those code
> > > sections where driver is iterating over all tags.
> >
> > Just a gentle ping..
>
> Hi All, Any update here.
>
> >
> > Regards,
> > Sreekanth
> >
> > >
> > > Thanks,
> > > Sreekanth
> > >
> > > >


RE: Proof of concept NDOB support

2019-03-04 Thread Kashyap Desai
>
>
> I rebased my old NDOB patch on top of 5.1/scsi-queue. It requires the
device
> to implement REPORT SUPPORTED OPERATION CODES to determine whether
> NDOB is supported. If it is, no zeroed payload will be attached to the
I/O. Only
> superficially tested with scsi_debug.

Hi Martin, I will test this patch set with some SSDs which support NDOB.
BTW, I have SSDs (HGST S300) which do not expose NDOB in RSOC, but if I
send WRITE SAME with NDOB through sgutils, the command does not fail. I am
trying to figure out whether an SSD can support NDOB but not expose it in
RSOC.

Kashyap

>
> --
> Martin K. Petersen  Oracle Linux Engineering


RE: [scsi] write same with NDOB (no data-out buffer) support

2019-02-27 Thread Kashyap Desai
Adding Bob Sheffield from Broadcom.

>
> Hi Kashyap,
>
> > I was going through below discussion as well as going through linux
> > scsi code to know if linux scsi stack support NDOB.
>
> Last time NDOB came up there were absolutely no benefits to it from the
> kernel perspective. These days we can save the buffer memory allocation so
> there may be a small win. I do have a patch we can revive.


We can test it if you have a patch.

>
> However, I am not aware of any devices that actually support NDOB. Plus it's
> hard to detect since we need to resort to RSOC masks. And blindly sending
> RSOC is risky business. That's why my patch never went anywhere. It was a lot
> of heuristics churn to set a single bit flag.
>
> Since the benefits are modest (basically saves a memory compare on the
> device), what is the reason you are looking at this?

SCSI SBC-4 requires that any drive that supports WRITE SAME (unmap=1) also
support ndob=1. So a drive supports it if it reports LBWS=1 in the Block
Provisioning VPD page. If not, the drive violates the SBC-4 standard, so
it's a drive problem.
Issuing WRITE SAME (unmap=1, ndob=0) only achieves block unmapping if the
pattern in the data-out buffer matches the "provisioning initialization
pattern" implemented by the drive.
AFAIK, the current scsi stack does not determine what that pattern is for
each drive, so the current method of using WRITE SAME (unmap=1, ndob=0) is
likely ineffective at unmapping blocks on media.

On the other hand, if the drive supports WRITE SAME (unmap=1, ndob=1) - as
required by SBC-4 - then Linux can use it to reliably cause LBAs to be
unmapped on media.

Perhaps this is considered a minor issue for direct-attached drives, but in
the RAID world it is a big enough issue that relying on it is the only way
we can reliably maintain coherency across stripes in redundant data
mappings.
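
For reference, a rough sketch of what a WRITE SAME (16) CDB with unmap=1 and
ndob=1 looks like, following the byte/bit layout described above. This is
illustrative only, not code taken from the kernel sd driver, and the helper
name is made up:

#include <linux/string.h>
#include <linux/types.h>
#include <asm/unaligned.h>

/* Build a WRITE SAME (16) CDB with UNMAP=1 and NDOB=1 (no data-out buffer). */
static void build_write_same16_unmap_ndob(u8 *cdb, u64 lba, u32 nr_blocks)
{
	memset(cdb, 0, 16);
	cdb[0] = 0x93;				/* WRITE SAME (16) opcode */
	cdb[1] = (1 << 3) | (1 << 0);		/* UNMAP (bit 3) | NDOB (bit 0) */
	put_unaligned_be64(lba, &cdb[2]);	/* starting LBA, bytes 2-9 */
	put_unaligned_be32(nr_blocks, &cdb[10]);/* number of blocks, bytes 10-13 */
	/* With NDOB set, no zeroed payload has to be attached to the command. */
}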

>
> > One more question. What happens if WS w/ UNMAP command is passed to
> > the device without zeroed data out buffer in current scsi stack ? Will
> > it permanently disable WS on that device ?
>
> Depends how the device responds.
>
> --
> Martin K. Petersen  Oracle Linux Engineering


[scsi] write same with NDOB (no data-out buffer) support

2019-02-26 Thread Kashyap Desai
Hi Martin/Chris,

I was going through the discussion below, as well as the Linux SCSI code,
to find out whether the Linux SCSI stack supports NDOB. It looks like the
current stack does not support NDOB. From the thread below, it seems NDOB
was not considered worth supporting, given that device-level support was
lacking or unreliable. It is a fairly old discussion, so I want to
understand whether the current situation is good enough to introduce NDOB
support (assuming the underlying device works as expected with NDOB)?

https://lore.kernel.org/patchwork/patch/673046/

One more question: what happens if a WRITE SAME w/ UNMAP command is passed
to the device without a zeroed data-out buffer in the current SCSI stack?
Will it permanently disable WRITE SAME on that device?

Thanks, Kashyap


Re: [PATCH v1] mpt3sas: Use driver scsi lookup to track outstanding IOs

2019-02-26 Thread Kashyap Desai
On Tue, Feb 26, 2019 at 8:23 PM Hannes Reinecke  wrote:
>
> On 2/26/19 3:33 PM, Christoph Hellwig wrote:
> > On Tue, Feb 26, 2019 at 02:49:30PM +0100, Hannes Reinecke wrote:
> >> Attached is a patch to demonstrate my approach.
> >> I am aware that it'll only be useful for latest upstream where the legacy
> >> I/O path has been dropped completely, so we wouldn't need to worry about 
> >> it.
> >> For older releases indeed you would need to with something like your
> >> proposed patch, but for upstream I really would like to switch to
> >> blk_mq_tagset_busy() iter.
> >
> > While this is better than the driver private tracking we really should
> > not have to iterate all outstanding command, because if we have any
> > there is a bug we need to fix in the higher layers instead of working
> > around it in the drivers.

Hi Chris, looking at other drivers' code, I think a similar issue affects
many SCSI HBAs (like fnic, qla4xxx, snic, etc.). They also need similar
logic to traverse outstanding scsi commands.
One example: at the time of an HBA reset the driver would like to release
scsi commands back to SML for retry, and for that the driver has to loop
over all possible smids to figure out which scsi commands are outstanding
at the firmware/SML level.
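
A minimal sketch of how such a loop could be replaced by busy-tag iteration
on a scsi-mq host whose queues have been quiesced first. The callback and
helper names are hypothetical, not code from any of the drivers named above,
and the callback signature is the one used by the 5.x kernels discussed here:

#include <linux/blk-mq.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

/* Called once per busy (outstanding) request instead of looping every smid. */
static bool drv_flush_one_cmd(struct request *rq, void *data, bool reserved)
{
	struct scsi_cmnd *scmd = blk_mq_rq_to_pdu(rq);

	/* Return the command to SML for retry after the HBA reset. */
	scmd->result = DID_RESET << 16;
	scmd->scsi_done(scmd);
	return true;		/* keep iterating */
}

static void drv_flush_outstanding_cmds(struct Scsi_Host *shost)
{
	blk_mq_tagset_busy_iter(&shost->tag_set, drv_flush_one_cmd, NULL);
}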

> >
> Ah-ha.
>
> But what else should we be doing here?
> The driver needs to abort all outstanding commands; I somewhat fail to
> see how we could be doing it otherwise ...

Hannes, the primary drawback of using blk_mq_tagset_busy_iter is that the
API is only for blk-mq and is not available in every kernel with blk-mq
support. We have seen multiple failures from customers, and those kernels
do not have blk_mq_tagset_busy_iter. In fact, the blk-mq and non-mq stacks
are both alive in many Linux distributions and customers are using them. If
we scope the fix to current 5.x kernels only (which no longer have non-mq
support), we can opt for blk_mq_tagset_busy_iter(). Earlier I requested
upstream to accept driver changes without blk_mq_tagset_busy_iter() because
the issue is hitting many customers, so we are looking for a generic fix
which can serve both blk-mq and non-mq.

We will certainly enhance and optimize how this area works (the driver
finding outstanding scsi commands) with the scsi mailing list as an
incremental approach.

Kashyap

>
> Cheers,
>
> Hannes
>


RE: [PATCH] megaraid_sas: enable blk-mq for fusion

2019-01-11 Thread Kashyap Desai
> Fusion adapters can steer completions to individual queues, so we can enable
> blk-mq for those adapters.
> And in doing so we can rely on the interrupt affinity from the block layer and
> drop the hand-crafted 'reply_map' construct.

Hannes, I understand the intent of this patch, as we have discussed this
topic a couple of times in the past. We really don't need such an interface
for MR/IT HBAs because we have a single submission queue (h/w register).
Mimicking multiple h/w submission queues really does not help performance.
Relevant discussion at the link below -

https://marc.info/?l=linux-scsi&m=153078454611933&w=2

We can hit the max h/w and storage performance limit using nr_hw_queues = 1.
We could not find any reason to set up more than one hw queue.

Kashyap


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
>
> I actually took a look at scsi_host_find_tag() - what I think needs fixing
> here is
> that it should not return a tag that isn't allocated.
> You're just looking up random stuff, that is a recipe for disaster.
> But even with that, there's no guarantee that the tag isn't going away.

Got your point. Let us fix it in the driver.

>
> The mpt3sas use case is crap. It's iterating every tag, just in case it
> needs to do
> something to it.

Many drivers in the scsi layer have similar trouble; maybe they are just
less exposed. That was the main reason I thought to provide a common fix in
the block layer.

>
> My suggestion would be to scrap that bad implementation and have
> something available for iterating busy tags instead. That'd be more
> appropriate and a lot more efficient that a random loop from 0..depth.
> If you are flushing running commands, looking up tags that aren't even
> active
> is silly and counterproductive.

We will address this issue through driver changes in two steps.
1. I can use the driver's internal memory and not rely on the request/scsi
command. The tag 0..depth loop is not in the main IO path, so what we need
is contention-free access to the list. Having the driver's own memory and
array will provide that control (a minimal sketch follows this list).
2. As you suggested, the best way is to use busy tag iteration (only for
the blk-mq stack).
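
A minimal sketch of the step-1 idea, with made-up structure and function
names (this is not the actual mpt3sas scsiio_tracker code): the driver keeps
its own per-smid bookkeeping so reset paths never touch block-layer tag
state.

#include <scsi/scsi_cmnd.h>

/* One slot per possible smid; scmd is NULL while the slot is free. */
struct drv_io_tracker {
	struct scsi_cmnd *scmd;
};

struct drv_instance {
	struct drv_io_tracker *trackers;	/* array of size max_tags */
	u16 max_tags;
};

/* Walk only the driver's own bookkeeping to count outstanding commands. */
static u32 drv_count_outstanding(struct drv_instance *inst)
{
	u32 count = 0;
	u16 smid;

	for (smid = 0; smid < inst->max_tags; smid++)
		if (inst->trackers[smid].scmd)	/* slot is outstanding */
			count++;
	return count;
}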


Thanks for your feedback.

Kashyap

>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> >
> > At the time of device removal,  it requires reverse traversing.  Find
> > out if each requests associated with sdev is part of hctx->tags->rqs()
> > and clear that entry.
> > Not sure about atomic traverse if more than one device removal is
> > happening in parallel.  May be more error prone. ?
> >
> > Just wondering both the way we will be removing invalid request from
> > array.
> > Are you suspecting any performance issue if we do it per IO ?
>
> It's an extra store, and it's a store to an area that's then now shared
> between
> issue and completion. Those are never a good idea. Besides, it's the kind
> of
> issue you solve in the SLOW path, not in the fast path. Since that's
> doable, it
> would be silly to do it for every IO.
>
> This might not matter on mpt3sas, but on more efficient hw it definitely
> will.

Understood. Your primary concern is to avoid doing it per IO, and to do it
that way only if there is no better approach.

> I'm still trying to convince myself that this issue even exists. I can see
> having
> stale entries, but those should never be busy. Why are you finding them
> with
> the tag iteration? It must be because the tag is reused, and you are
> finding it
> before it's re-assigned?


Stale entries will remain forever if we remove scsi devices; it is not a
timing issue. If the memory associated with a request (freed due to device
removal) is reused, a kernel panic occurs.
We have 24 drives behind an expander and then perform an expander reset,
which removes all 24 drives and adds them back. Adding and removing all the
drives happens quickly.
As part of the expander reset, the driver processes the broadcast primitive
event, and that requires finding all outstanding scsi commands.

In some cases we need a firmware restart, and that path also requires tag
iteration.


>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> On 12/18/18 10:48 AM, Kashyap Desai wrote:
> >>
> >> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>>>
> >>>>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>>>> reliably.
> >>>>> It has been explained to you that the bug that you reported can be
> >>>>> fixed by modifying the mpt3sas driver. So why to fix this by
> >>>>> modifying the block layer? Additionally, what prevents that a race
> >>>>> condition occurs between the block layer clearing
> >>>>> hctx->tags->rqs[rq->tag] and
> >>>>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>>>> that this is an attempt to paper over a real problem instead of
> >>>>> fixing the root
> >>>> cause.
> >>>>
> >>>> I have to agree with Bart here, I just don't see how the mpt3sas
> >>>> use case is special. The change will paper over the issue in any
> >>>> case.
> >>>
> >>> Hi Jens, Bart
> >>>
> >>> One of the key requirement for iterating whole tagset  using
> >>> scsi_host_find_tag is to block scsi host. Once we are done that, we
> >>> should be good. No race condition is possible if that part is taken
> >>> care.
> >>> Without this patch, if driver still receive scsi command from the
> >>> hctx->tags->rqs which is really not outstanding.  I am finding this is
> >>> common issue for many scsi low level drivers.
> >>>
> >>> Just for example  - fnic_is_abts_pending() function has below
> >>> code -
> >>>
> >>> for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> >>> sc = scsi_host_find_tag(fnic->lport->host, tag);
> >>> /*
> >>>  * ignore this lun reset cmd or cmds that do not
> >>> belong to
> >>>  * this lun
> >>>  */
> >>> if (!sc || (lr_sc && (sc->device != lun_dev || sc ==
> >>> lr_sc)))
> >>> continue;
> >>>
> >>> Above code also has similar exposure of kernel panic like 
> >>> driver while accessing sc->device.
> >>>
> >>> Panic is more obvious if we have add/removal of scsi device before
> >>> looping through scsi_host_find_tag().
> >>>
> >>> Avoiding block layer changes is also attempted in  but our
> >>> problem is to convert that code common for non-mq and mq.
> >>> Temporary to unblock this issue, We have fixed  using
> >>> driver internals scsiio_tracker() instead of piggy back in
> >>> scsi_command.
> >>
> >> For mq, the requests never go out of scope, they are always valid. So
> >> the key question here is WHY they have been freed. If the queue gets
> >> killed, then one potential solution would be to clear pointers in the
> >> tag map belonging to that queue. That also takes it out of the hot
> >> path.
> >
> > At driver load whenever driver does scsi_add_host_with_dma(), it
> > follows below code path in block layer.
> >
> > scsi_mq_setup_tags
> >   ->blk_mq_alloc_tag_set
> >   -> blk_mq_alloc_rq_maps
> >  -> __blk_mq_alloc_rq_maps
> >
> > SML create two set of request pool. One is per HBA and other is per
> > SDEV. I was confused why SML creates request pool per HBA.
> >
> > Example - IF HBA queue depth is 1K and there are 8 device behind that
> > HBA, total request pool is created is 1K + 8 * scsi_device queue
> > depth. 1K will be always static, but other request pool is managed
> > whenever scsi device is added/removed.
> >
> > I never observe requests allocated per HBA is used in IO path. It is
> > always request allocated per scsi device is what active.
> > Also, what I observed is whenever scsi_device is deleted, associated
> > request is also deleted. What is missing is - "Deleted request still
> > available in
> > hctx->tags->rqs[rq->tag]."
>
> So that sounds like the issue. If the device is deleted and its requests
> go away,
> those pointers should be cleared. That's what your patch should do, not do
> it
> for each IO.

At the time of device removal it requires reverse traversal: finding out
whether each request associated with the sdev is present in
hctx->tags->rqs[] and clearing that entry.
I am not sure about the atomicity of that traversal if more than one device
removal is happening in parallel; it may be more error prone.

Just wondering, since either way we will be removing invalid requests from
the array: are you suspecting a performance issue if we do it per IO?

Kashyap

>
>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
>
> On 12/18/18 10:08 AM, Kashyap Desai wrote:
> >>>
> >>> Other block drivers (e.g. ib_srp, skd) do not need this to work
> >>> reliably.
> >>> It has been explained to you that the bug that you reported can be
> >>> fixed by modifying the mpt3sas driver. So why to fix this by
> >>> modifying the block layer? Additionally, what prevents that a race
> >>> condition occurs between the block layer clearing
> >>> hctx->tags->rqs[rq->tag] and
> >>> scsi_host_find_tag() reading that same array element? I'm afraid
> >>> that this is an attempt to paper over a real problem instead of
> >>> fixing the root
> >> cause.
> >>
> >> I have to agree with Bart here, I just don't see how the mpt3sas use
> >> case is special. The change will paper over the issue in any case.
> >
> > Hi Jens, Bart
> >
> > One of the key requirement for iterating whole tagset  using
> > scsi_host_find_tag is to block scsi host. Once we are done that, we
> > should be good. No race condition is possible if that part is taken
> > care.
> > Without this patch, if driver still receive scsi command from the
> > hctx->tags->rqs which is really not outstanding.  I am finding this is
> > common issue for many scsi low level drivers.
> >
> > Just for example  - fnic_is_abts_pending() function has below
> > code -
> >
> > for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
> > sc = scsi_host_find_tag(fnic->lport->host, tag);
> > /*
> >  * ignore this lun reset cmd or cmds that do not belong
> > to
> >  * this lun
> >  */
> > if (!sc || (lr_sc && (sc->device != lun_dev || sc ==
> > lr_sc)))
> > continue;
> >
> > Above code also has similar exposure of kernel panic like 
> > driver while accessing sc->device.
> >
> > Panic is more obvious if we have add/removal of scsi device before
> > looping through scsi_host_find_tag().
> >
> > Avoiding block layer changes is also attempted in  but our
> > problem is to convert that code common for non-mq and mq.
> > Temporary to unblock this issue, We have fixed  using driver
> > internals scsiio_tracker() instead of piggy back in scsi_command.
>
> For mq, the requests never go out of scope, they are always valid. So the
> key
> question here is WHY they have been freed. If the queue gets killed, then
> one
> potential solution would be to clear pointers in the tag map belonging to
> that
> queue. That also takes it out of the hot path.

At driver load, whenever the driver calls scsi_add_host_with_dma(), it
follows the code path below in the block layer.

scsi_mq_setup_tags
  -> blk_mq_alloc_tag_set
    -> blk_mq_alloc_rq_maps
      -> __blk_mq_alloc_rq_maps

SML creates two sets of request pools: one per HBA and another per SDEV. I
was confused about why SML creates a request pool per HBA.

Example - if the HBA queue depth is 1K and there are 8 devices behind that
HBA, the total request pool created is 1K + 8 * scsi_device queue depth.
The 1K pool is always static, while the other pools are managed whenever a
scsi device is added/removed.

I never observed the requests allocated per HBA being used in the IO path;
it is always the requests allocated per scsi device that are active.
Also, what I observed is that whenever a scsi_device is deleted, its
associated requests are also deleted. What is missing is: "Deleted requests
are still reachable through hctx->tags->rqs[rq->tag]."

If there is an assurance that all requests stay valid as long as the hctx
is available, this patch is not correct. I posted the patch based on the
assumption that requests per hctx can be removed whenever a scsi device is
removed.

Kashyap

>
> --
> Jens Axboe


RE: [PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-18 Thread Kashyap Desai
> >
> > Other block drivers (e.g. ib_srp, skd) do not need this to work
> > reliably.
> > It has been explained to you that the bug that you reported can be
> > fixed by modifying the mpt3sas driver. So why to fix this by modifying
> > the block layer? Additionally, what prevents that a race condition
> > occurs between the block layer clearing hctx->tags->rqs[rq->tag] and
> > scsi_host_find_tag() reading that same array element? I'm afraid that
> > this is an attempt to paper over a real problem instead of fixing the
> > root
> cause.
>
> I have to agree with Bart here, I just don't see how the mpt3sas use case
> is
> special. The change will paper over the issue in any case.

Hi Jens, Bart

One of the key requirements for iterating the whole tagset using
scsi_host_find_tag is to block the scsi host. Once we have done that, we
should be good; no race condition is possible if that part is taken care of.
Without this patch, the driver can still receive a scsi command from
hctx->tags->rqs which is really not outstanding. I am finding this is a
common issue for many scsi low level drivers.

Just as an example, the fnic_is_abts_pending() function has the code below -

for (tag = 0; tag < fnic->fnic_max_tag_id; tag++) {
	sc = scsi_host_find_tag(fnic->lport->host, tag);
	/*
	 * ignore this lun reset cmd or cmds that do not belong to
	 * this lun
	 */
	if (!sc || (lr_sc && (sc->device != lun_dev || sc == lr_sc)))
		continue;

The above code has a similar exposure to kernel panic as the mpt3sas driver
while accessing sc->device.

The panic is more obvious if we have add/removal of a scsi device before
looping through scsi_host_find_tag().

Avoiding block layer changes was also attempted in mpt3sas, but our problem
is to make that code common for non-mq and mq.
Temporarily, to unblock this issue, we have fixed mpt3sas using the
driver-internal scsiio_tracker() instead of piggybacking on the
scsi_command.

Kashyap

>
> --
> Jens Axboe


[PATCH V2] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-17 Thread Kashyap Desai
V1 -> V2
Added fix in __blk_mq_finish_request around blk_mq_put_tag() for
non-internal tags

Problem statement:
Whenever the driver tries to get an outstanding request via
scsi_host_find_tag, the block layer will return stale entries instead of
the actual outstanding requests. A kernel panic occurs if a stale entry is
inaccessible or its memory has been reused.
Fix:
Undo the request mapping in blk_mq_put_driver_tag once the request is
returned.

More detail:
Whenever each SDEV entry is created, the block layer allocates separate
tags and static requests. Those requests are not valid after the SDEV is
deleted from the system. On the fly, the block layer maps static rqs to rqs
as below from blk_mq_get_driver_tag():

data.hctx->tags->rqs[rq->tag] = rq;

The above mapping covers active in-use requests, and it is the same mapping
that is referred to in scsi_host_find_tag().
After running some IOs, "data.hctx->tags->rqs[rq->tag]" will have some
entries which are never reset in the block layer.

There would be a kernel panic if a request pointed to by
"data.hctx->tags->rqs[rq->tag]" belongs to an "sdev" which has been removed,
since all the memory allocated for requests associated with that sdev might
be reused or be inaccessible to the driver.
Kernel panic snippet -

BUG: unable to handle kernel paging request at ff800010
IP: [] mpt3sas_scsih_scsi_lookup_get+0x6c/0xc0 [mpt3sas]
PGD aa4414067 PUD 0
Oops:  [#1] SMP
Call Trace:
 [] mpt3sas_get_st_from_smid+0x1f/0x60 [mpt3sas]
 [] scsih_shutdown+0x55/0x100 [mpt3sas]

Cc: 
Signed-off-by: Kashyap Desai 
Signed-off-by: Sreekanth Reddy 

---
 block/blk-mq.c | 4 +++-
 block/blk-mq.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6a75662..88d1e92 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
-	if (rq->tag != -1)
+	if (rq->tag != -1) {
+		hctx->tags->rqs[rq->tag] = NULL;
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	}
 	if (sched_tag != -1)
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 9497b47..57432be 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -175,6 +175,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
+	hctx->tags->rqs[rq->tag] = NULL;
 	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = -1;

-- 
1.8.3.1


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-13 Thread Kashyap Desai
> > > On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > > > >
> > > > > If the 'tag' passed to scsi_host_find_tag() is valid, I think
there
> > > > > shouldn't have such issue.
> > > > >
> > > > > If you want to find outstanding IOs, maybe you can try
> > > > > blk_mq_queue_tag_busy_iter()
> > > > > or blk_mq_tagset_busy_iter(), because you may not know if the
passed
> > > > 'tag'
> > > > > to
> > > > > scsi_host_find_tag() is valid or not.
> > > >
> > > > We tried quick change in mpt3sas driver using
blk_mq_tagset_busy_iter
> > and
> > > > it returns/callback for valid requests (no stale entries are
returned).
> > > > Expected.
> > > > Above two APIs are only for blk-mq.  What about non-mq case ?
Driver
> > > > should use scsi_host_find_tag for non-mq and
blk_mq_tagset_busy_iter
> > for
> > > > blk-mq case ?
> > >
> > > But your patch is only for blk-mq, is there same issue on non-mq
case?
> >
> > Problematic part from below function is code path which goes from "
> > shost_use_blk_mq(shost))".
> > Non-mq path works fine because every IO completion set bqt-
> >tag_index[tag]
> > = NULL from blk_queue_end_tag().
> >
> > I did similar things for mq path in this patch.
> >
> > static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host
*shost,
> > int tag)
> > {
> > struct request *req = NULL;
> >
> > if (tag == SCSI_NO_TAG)
> > return NULL;
> >
> > if (shost_use_blk_mq(shost)) {
> > u16 hwq = blk_mq_unique_tag_to_hwq(tag);
> >
> > if (hwq < shost->tag_set.nr_hw_queues) {
> > req =
blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
> > blk_mq_unique_tag_to_tag(tag));
> > }
> > } else {
> > req = blk_map_queue_find_tag(shost->bqt, tag);
> > }
>
>
> Hi Jens,
>
> Any conclusion/feedback on this topic/patch ?  As discussed, This is a
safe
> change and  good to have if no design issue.

Hi, since we have not concluded on a fix, we can resume the discussion on
the next revision of the patch. I will be posting a V2 patch with the
complete fix.

Kashyap

>
> Kashyap


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-11 Thread Kashyap Desai
> > On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > > >
> > > > If the 'tag' passed to scsi_host_find_tag() is valid, I think
there
> > > > shouldn't have such issue.
> > > >
> > > > If you want to find outstanding IOs, maybe you can try
> > > > blk_mq_queue_tag_busy_iter()
> > > > or blk_mq_tagset_busy_iter(), because you may not know if the
passed
> > > 'tag'
> > > > to
> > > > scsi_host_find_tag() is valid or not.
> > >
> > > We tried quick change in mpt3sas driver using
blk_mq_tagset_busy_iter
> and
> > > it returns/callback for valid requests (no stale entries are
returned).
> > > Expected.
> > > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter
> for
> > > blk-mq case ?
> >
> > But your patch is only for blk-mq, is there same issue on non-mq case?
>
> Problematic part from below function is code path which goes from "
> shost_use_blk_mq(shost))".
> Non-mq path works fine because every IO completion set
bqt->tag_index[tag]
> = NULL from blk_queue_end_tag().
>
> I did similar things for mq path in this patch.
>
> static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host
*shost,
> int tag)
> {
> struct request *req = NULL;
>
> if (tag == SCSI_NO_TAG)
> return NULL;
>
> if (shost_use_blk_mq(shost)) {
> u16 hwq = blk_mq_unique_tag_to_hwq(tag);
>
> if (hwq < shost->tag_set.nr_hw_queues) {
> req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
> blk_mq_unique_tag_to_tag(tag));
> }
> } else {
> req = blk_map_queue_find_tag(shost->bqt, tag);
> }


Hi Jens,

Any conclusion/feedback on this topic/patch? As discussed, this is a safe
change and good to have if there is no design issue.

Kashyap


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-07 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, December 7, 2018 3:50 PM
> To: Kashyap Desai
> Cc: Bart Van Assche; linux-block; Jens Axboe; linux-scsi; Suganath Prabu
> Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in
> blk_mq_put_driver_tag
>
> On Thu, Dec 06, 2018 at 11:15:13AM +0530, Kashyap Desai wrote:
> > >
> > > If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> > > shouldn't have such issue.
> > >
> > > If you want to find outstanding IOs, maybe you can try
> > > blk_mq_queue_tag_busy_iter()
> > > or blk_mq_tagset_busy_iter(), because you may not know if the passed
> > 'tag'
> > > to
> > > scsi_host_find_tag() is valid or not.
> >
> > We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter
and
> > it returns/callback for valid requests (no stale entries are
returned).
> > Expected.
> > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter
for
> > blk-mq case ?
>
> But your patch is only for blk-mq, is there same issue on non-mq case?

The problematic part of the function below is the code path taken when
shost_use_blk_mq(shost) is true.
The non-mq path works fine because every IO completion sets
bqt->tag_index[tag] = NULL from blk_queue_end_tag().

I did a similar thing for the mq path in this patch.

static inline struct scsi_cmnd *scsi_host_find_tag(struct Scsi_Host *shost,
						   int tag)
{
	struct request *req = NULL;

	if (tag == SCSI_NO_TAG)
		return NULL;

	if (shost_use_blk_mq(shost)) {
		u16 hwq = blk_mq_unique_tag_to_hwq(tag);

		if (hwq < shost->tag_set.nr_hw_queues) {
			req = blk_mq_tag_to_rq(shost->tag_set.tags[hwq],
					blk_mq_unique_tag_to_tag(tag));
		}
	} else {
		req = blk_map_queue_find_tag(shost->bqt, tag);
	}

Kashyap
>
> Thanks,
> Ming


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-06 Thread Kashyap Desai
> On 12/5/18 10:45 PM, Kashyap Desai wrote:
> >>
> >> If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> >> shouldn't have such issue.
> >>
> >> If you want to find outstanding IOs, maybe you can try
> >> blk_mq_queue_tag_busy_iter()
> >> or blk_mq_tagset_busy_iter(), because you may not know if the passed
> > 'tag'
> >> to
> >> scsi_host_find_tag() is valid or not.
> >
> > We tried quick change in mpt3sas driver using blk_mq_tagset_busy_iter
> > and
> > it returns/callback for valid requests (no stale entries are returned).
> > Expected.
> > Above two APIs are only for blk-mq.  What about non-mq case ? Driver
> > should use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
> > blk-mq case ?
> > I don't see that will be good interface. Also, blk_mq_tagset_busy_iter
> > API
> > does not provide control if driver wants to quit in-between or do some
> > retry logic etc.
> >
> > Why can't we add single API which provides the correct output.
>
> From 4.21 and forward, there will only be blk/scsi-mq. This is exactly
> the problem with having to maintain two stacks, it's a huge pain.

Hi Jens, the fix for this issue also needs to be backported to stable
kernels (which still have both the non-mq and mq stacks).
We have multiple choices to fix this.

1. Use blk_mq_tagset_busy_iter() in *all* the affected drivers. This API
has certain limitations as explained, and it only fixes the blk-mq part.
Using this API may need more code in low level drivers to handle non-mq and
mq separately.
2. The driver can use internal memory for a scsiio_tracker (driver private)
and track all the outstanding IO within the driver. This is mostly a scsi
mid layer interface; all the affected drivers require changes.
3. Fix the blk-mq code around blk_mq_put_tag so the driver can still use
scsi_host_find_tag(). No driver changes are required.
This is smooth to backport to stable kernels, and Linux distributions which
normally pick critical fixes from stable can pick up the fix. This is the
better fix and does not change any design.
In fact, it mimics the same flow as the non-mq code.

I don't see any design or functional issue with #3 (the patch provided in
this thread). What is your feedback on this patch?

Kashyap

>
> --
> Jens Axboe


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-05 Thread Kashyap Desai
>
> If the 'tag' passed to scsi_host_find_tag() is valid, I think there
> shouldn't have such issue.
>
> If you want to find outstanding IOs, maybe you can try
> blk_mq_queue_tag_busy_iter()
> or blk_mq_tagset_busy_iter(), because you may not know if the passed
'tag'
> to
> scsi_host_find_tag() is valid or not.

We tried a quick change in the mpt3sas driver using blk_mq_tagset_busy_iter
and it calls back only for valid requests (no stale entries are returned).
Expected.
The above two APIs are only for blk-mq. What about the non-mq case? Should
the driver use scsi_host_find_tag for non-mq and blk_mq_tagset_busy_iter for
the blk-mq case?
I don't see that being a good interface. Also, the blk_mq_tagset_busy_iter
API does not give the driver control to quit in between or do some retry
logic, etc.

Why can't we add a single API which provides the correct output?

The scsi_host_find_tag() API works well in the non-mq case because
blk_queue_end_tag() sets bqt->tag_index[tag] = NULL.
We are missing a similar reset upon request completion in the blk-mq case.
This patch takes the same approach as non-mq, and there is no race
condition I can foresee.

BTW - my original patch is only half the fix. We also need the changes
below -

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3f91c6e..d8f53ac 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -477,8 +477,10 @@ static void __blk_mq_free_request(struct request *rq)
 	const int sched_tag = rq->internal_tag;
 
 	blk_pm_mark_last_busy(rq);
-	if (rq->tag != -1)
+	if (rq->tag != -1) {
+		hctx->tags->rqs[rq->tag] = NULL;
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+	}
 	if (sched_tag != -1)
 		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);


>
>
> Thanks,
> Ming


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bvanass...@acm.org]
> Sent: Tuesday, December 4, 2018 10:45 PM
> To: Kashyap Desai; linux-block; Jens Axboe; Ming Lei; linux-scsi
> Cc: Suganath Prabu Subramani; Sreekanth Reddy; Sathya Prakash Veerichetty
> Subject: Re: [PATCH] blk-mq: Set request mapping to NULL in
> blk_mq_put_driver_tag
>
> On Tue, 2018-12-04 at 22:17 +0530, Kashyap Desai wrote:
> > + Linux-scsi
> >
> > > > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > > > index 9497b47..57432be 100644
> > > > --- a/block/blk-mq.h
> > > > +++ b/block/blk-mq.h
> > > > @@ -175,6 +175,7 @@ static inline bool
> > > > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> > > >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx
> *hctx,
> > > >  struct request *rq)
> > > >   {
> > > > +hctx->tags->rqs[rq->tag] = NULL;
> > > >   blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> > > >   rq->tag = -1;
> > >
> > > No SCSI driver should call scsi_host_find_tag() after a request has
> > > finished. The above patch introduces yet another race and hence can't
> > > be
> > > a proper fix.
> >
> > Bart, many scsi drivers use scsi_host_find_tag() to traverse max tag_id
> > to
> > find out pending IO in firmware.
> > One of the use case is -  HBA firmware recovery.  In case of firmware
> > recovery, driver may require to traverse the list and return back
> > pending
> > scsi command to SML for retry.
> > I quickly grep the scsi code and found that snic_scsi, qla4xxx, fnic,
> > mpt3sas are using API scsi_host_find_tag for the same purpose.
> >
> > Without this patch, we hit very basic kernel panic due to page fault.
> > This
> > is not an issue in non-mq code path. Non-mq path use
> > blk_map_queue_find_tag() and that particular API does not provide stale
> > requests.
>
> As I wrote before, your patch doesn't fix the race you described but only
> makes the race window smaller.
Hi Bart,

Let me explain the issue. It is not a race but a very straightforward
issue. Let's say we have one scsi_device, /dev/sda, and the total number of
IOs submitted and completed is 100.
All 100 IOs are *completed*. Now, as part of firmware recovery, the driver
tries to find its outstanding IOs using scsi_host_find_tag().
The block layer will return all 100 commands to the driver, but those 100
commands are really not outstanding. With this patch, only the *actually*
outstanding commands are returned.
If scsi_device /dev/sda has not been removed from the OS, the driver
accessing the scmd of those 100 commands is still a safe memory access.

Now consider a case where scsi_device /dev/sda is removed and the driver
performs firmware recovery. This time the driver will crash while accessing
the scmd (randomly, depending on how the memory was reused).

Along with this patch, the low level driver should make sure that all
request queues at the block layer are quiesced.
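
A rough sketch of that quiescing step, assuming a blk-mq host. The helper
name is made up, and depending on the kernel scsi_device_quiesce() or host
blocking could be used instead:

#include <linux/blk-mq.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* Stop dispatch on every scsi_device queue before walking host tags. */
static void drv_quiesce_all_queues(struct Scsi_Host *shost)
{
	struct scsi_device *sdev;

	shost_for_each_device(sdev, shost)
		blk_mq_quiesce_queue(sdev->request_queue);
}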

> If you want an example of how to use
> scsi_host_find_tag() properly, have a look at the SRP initiator driver
> (drivers/infiniband/ulp/srp). That driver uses scsi_host_find_tag()
> without
> triggering any NULL pointer dereferences.

I was not able to find the exact context in srp, but I checked the srp code
and it looks like that driver only looks up scmds via scsi_host_find_tag()
for live commands.

> The approach used in that driver
> also works when having to support HBA firmware recovery.
>
> Bart.


RE: [PATCH] blk-mq: Set request mapping to NULL in blk_mq_put_driver_tag

2018-12-04 Thread Kashyap Desai
+ Linux-scsi

> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 9497b47..57432be 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -175,6 +175,7 @@ static inline bool
> > blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
> >   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
> >  struct request *rq)
> >   {
> > +hctx->tags->rqs[rq->tag] = NULL;
> >   blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
> >   rq->tag = -1;
>
> No SCSI driver should call scsi_host_find_tag() after a request has
> finished. The above patch introduces yet another race and hence can't be
> a proper fix.

Bart, many scsi drivers use scsi_host_find_tag() to traverse up to the max
tag_id to find pending IO in firmware.
One of the use cases is HBA firmware recovery. In case of firmware
recovery, the driver may need to traverse the list and return pending scsi
commands back to SML for retry.
I quickly grepped the scsi code and found that snic_scsi, qla4xxx, fnic and
mpt3sas use the scsi_host_find_tag API for the same purpose.

Without this patch, we hit a very basic kernel panic due to a page fault.
This is not an issue in the non-mq code path: the non-mq path uses
blk_map_queue_find_tag(), and that API does not return stale requests.

Kashyap

>
> Bart.


RE: Performance drop due to "blk-mq-sched: improve sequential I/O performance"

2018-05-02 Thread Kashyap Desai
> > I have created internal code changes based on below RFC and using irq
> > poll CPU lockup issue is resolved.
> > https://www.spinics.net/lists/linux-scsi/msg116668.html
>
> Could we use the 1:1 mapping and not apply out-of-tree irq poll in the
> following test? So that we can keep at same page easily.

Above RFC changes are not used in my testing.  I used same inbox driver
from 4.17-rc.

>
> Thanks,
> Ming


RE: Performance drop due to "blk-mq-sched: improve sequential I/O performance"

2018-05-02 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, May 2, 2018 3:17 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; linux-bl...@vger.kernel.org
> Subject: Re: Performance drop due to "blk-mq-sched: improve sequential
I/O
> performance"
>
> On Wed, May 02, 2018 at 01:13:34PM +0530, Kashyap Desai wrote:
> > Hi Ming,
> >
> > I was running some performance test on latest 4.17-rc and figure out
> > performance drop (approximate 15% drop) due to below patch set.
> > https://marc.info/?l=linux-block&m=150802309522847&w=2
> >
> > I observed drop on latest 4.16.6 stable and 4.17-rc kernel as well.
> > Taking bisect approach, figure out that Issue is not observed using
> > last stable kernel 4.14.38.
> > I pick 4.14.38 stable kernel  as base line and applied above patch to
> > confirm the behavior.
> >
> > lscpu output -
> >
> > Architecture:  x86_64
> > CPU op-mode(s):32-bit, 64-bit
> > Byte Order:Little Endian
> > CPU(s):72
> > On-line CPU(s) list:   0-71
> > Thread(s) per core:2
> > Core(s) per socket:18
> > Socket(s): 2
> > NUMA node(s):  2
> > Vendor ID: GenuineIntel
> > CPU family:6
> > Model: 85
> > Model name:Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> > Stepping:  4
> > CPU MHz:   1457.182
> > CPU max MHz:   2701.
> > CPU min MHz:   1200.
> > BogoMIPS:  5400.00
> > Virtualization:VT-x
> > L1d cache: 32K
> > L1i cache: 32K
> > L2 cache:  1024K
> > L3 cache:  25344K
> > NUMA node0 CPU(s): 0-17,36-53
> > NUMA node1 CPU(s): 18-35,54-71
> >
> > I am having 16 SSDs - "SDLL1DLR400GCCA1". Created two R0 VD (each VD
> > consist of 8 SSDs) using MegaRaid Ventura series adapter.
> >
> > fio script -
> > numactl -N 1 fio 2vd.fio --bs=4k --iodepth=128 -rw=randread
> > --group_report --ioscheduler=none --numjobs=4
> >
> >
> > | v4.14.38-stable   | patched
> > v4.14.38-stable
> > | mq-none   | mq-none
> > -
> > randread"iops"   | 1597k| 1377k
> >
> >
> > Below is perf tool report without patch set. ( Looks like lock
> > contention is causing this drop, so provided relevant snippet)
> >
> > -3.19% 2.89%  fio  [kernel.vmlinux][k]
> > _raw_spin_lock
> >- 2.43% io_submit
> >   - 2.30% entry_SYSCALL_64
> >  - do_syscall_64
> > - 2.18% do_io_submit
> >- 1.59% blk_finish_plug
> >   - 1.59% blk_flush_plug_list
> >  - 1.59% blk_mq_flush_plug_list
> > - 1.00% __blk_mq_delay_run_hw_queue
> >- 0.99% blk_mq_sched_dispatch_requests
> >   - 0.63% blk_mq_dispatch_rq_list
> >0.60% scsi_queue_rq
> > - 0.57% blk_mq_sched_insert_requests
> >- 0.56% blk_mq_insert_requests
> > 0.51% _raw_spin_lock
> >
> > Below is perf tool report after applying patch set.
> >
> > -4.10% 3.51%  fio  [kernel.vmlinux][k]
> > _raw_spin_lock
> >- 3.09% io_submit
> >   - 2.97% entry_SYSCALL_64
> >  - do_syscall_64
> > - 2.85% do_io_submit
> >- 2.35% blk_finish_plug
> >   - 2.35% blk_flush_plug_list
> >  - 2.35% blk_mq_flush_plug_list
> > - 1.83% __blk_mq_delay_run_hw_queue
> >- 1.83% __blk_mq_run_hw_queue
> >   - 1.83% blk_mq_sched_dispatch_requests
> >  - 1.82% blk_mq_do_dispatch_ctx
> > - 1.14% blk_mq_dequeue_from_ctx
> >- 1.11% dispatch_rq_from_ctx
> > 1.03% _raw_spin_lock
> >   0.50% blk_mq_sched_insert_requests
> >
> > Let me know if you want more data or is this something a known
> > implic

Performance drop due to "blk-mq-sched: improve sequential I/O performance"

2018-05-02 Thread Kashyap Desai
Hi Ming,

I was running some performance tests on the latest 4.17-rc and found a
performance drop (approximately 15%) due to the patch set below.
https://marc.info/?l=linux-block&m=150802309522847&w=2

I observed the drop on the latest 4.16.6 stable and 4.17-rc kernels as
well. Taking a bisect approach, I figured out that the issue is not
observed with the last stable kernel, 4.14.38.
I picked the 4.14.38 stable kernel as a baseline and applied the above
patch set to confirm the behavior.

lscpu output -

Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):72
On-line CPU(s) list:   0-71
Thread(s) per core:2
Core(s) per socket:18
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 85
Model name:Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Stepping:  4
CPU MHz:   1457.182
CPU max MHz:   2701.
CPU min MHz:   1200.
BogoMIPS:  5400.00
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  1024K
L3 cache:  25344K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71

I have 16 SSDs - "SDLL1DLR400GCCA1". I created two R0 VDs (each VD
consisting of 8 SSDs) using a MegaRAID Ventura series adapter.

fio script -
numactl -N 1 fio 2vd.fio --bs=4k --iodepth=128 -rw=randread --group_report
--ioscheduler=none --numjobs=4


                  | v4.14.38-stable | patched v4.14.38-stable
                  | mq-none         | mq-none
------------------------------------------------------------
randread "iops"   | 1597k           | 1377k


Below is the perf tool report without the patch set. (Lock contention looks
like the cause of the drop, so only the relevant snippet is provided.)

-3.19% 2.89%  fio  [kernel.vmlinux][k]
_raw_spin_lock
   - 2.43% io_submit
  - 2.30% entry_SYSCALL_64
 - do_syscall_64
- 2.18% do_io_submit
   - 1.59% blk_finish_plug
  - 1.59% blk_flush_plug_list
 - 1.59% blk_mq_flush_plug_list
- 1.00% __blk_mq_delay_run_hw_queue
   - 0.99% blk_mq_sched_dispatch_requests
  - 0.63% blk_mq_dispatch_rq_list
   0.60% scsi_queue_rq
- 0.57% blk_mq_sched_insert_requests
   - 0.56% blk_mq_insert_requests
0.51% _raw_spin_lock

Below is the perf tool report after applying the patch set.

-4.10% 3.51%  fio  [kernel.vmlinux][k]
_raw_spin_lock
   - 3.09% io_submit
  - 2.97% entry_SYSCALL_64
 - do_syscall_64
- 2.85% do_io_submit
   - 2.35% blk_finish_plug
  - 2.35% blk_flush_plug_list
 - 2.35% blk_mq_flush_plug_list
- 1.83% __blk_mq_delay_run_hw_queue
   - 1.83% __blk_mq_run_hw_queue
  - 1.83% blk_mq_sched_dispatch_requests
 - 1.82% blk_mq_do_dispatch_ctx
- 1.14% blk_mq_dequeue_from_ctx
   - 1.11% dispatch_rq_from_ctx
1.03% _raw_spin_lock
  0.50% blk_mq_sched_insert_requests

Let me know if you want more data, or whether this is a known implication
of the patch set.

Thanks, Kashyap


RE: MegaCli fails to communicate with Raid-Controller

2018-04-26 Thread Kashyap Desai
> -Original Message-
> From: Volker Schwicking [mailto:volker.schwick...@godaddy.com]
> Sent: Thursday, April 26, 2018 8:22 PM
> To: Kashyap Desai
> Cc: Martin K. Petersen; linux-scsi@vger.kernel.org; Sumit Saxena;
> Shivasharan
> Srikanteshwara
> Subject: Re: MegaCli fails to communicate with Raid-Controller
>
> On 23. Apr 2018, at 11:03, Volker Schwicking
>  wrote:
> >
> > I will add the printk to dma_alloc_coherent() as well to see, which
> > request
> actually fails. But i have to be a bit patient since its a production
> system and
> the customers aren’t to happy about reboots.
>
> Alright, here are some results.
>
> Looking at my debug lines i can tell, that requesting either 2048 or 4
> regularly
> fail. Other values don’t ever show up as failed, but there are several  as
> you
> can see in the attached log.
>
> The failed requests:
> ###
> $ grep 'GD IOV-len FAILED' /var/log/kern.log  | awk '{ print $9, $10 }' |
> sort |
> uniq -c
>  59 FAILED: 2048
>  64 FAILED: 4
> ###

Thanks! This helps in understanding the problem. A few questions -

What is the frequency of this failure? Can you reproduce it on demand?
Do you see no failures at all on the 4.6 kernel?
What does your setup look like? Are you running a VM, or does this failure
happen on the host OS? Can you share the full dmesg logs?

>
> I attached full debugging output from several executions of
> “megacli -ldpdinfo
> -a0” in 5 second intervals, successful and failed and content from
> /proc/buddyinfo again.
>
>  Can you make any sense of that? Where should i go from here?

It may be better to find the call trace of dma_alloc_coherent using ftrace.
Depending on which DMA engine is configured, the failure may be related to
code changes in that DMA engine.
Can you get those ftrace logs as well? You may have to set an ftrace filter
around dma_alloc_coherent().

I quickly grepped arch/x86/xen for anything related to memory allocation
and found that pci_xen_swiotlb_detect() has some methods to enable/disable
certain features, and one of the key factors is whether the DMA range is
32-bit or 64-bit. Since the older controller requests DMA buffers below the
4GB region, some code change in that area between 4.6 and 4.14.x might be a
possible reason for the frequent memory allocation failures. This is my
wild guess, based on the information that 4.6 is *not at all* exposed to
memory failures at the same frequency as 4.14.

Kashyap


RE: MegaCli fails to communicate with Raid-Controller

2018-04-23 Thread Kashyap Desai
>
> Interesting. What is considered old and new? I have a third machine "Dell
> R515, MegaRAID SAS 2108”, is that considered new? Its running the same
> Xen/Kernel/Megacli-versions as the other two, but the error does not
> occur.

No, this is also an old controller. When I say new controller, I mean the
parts under active development, such as SAS3.0 and SAS3.5. Driver-level
changes related to DMA mask settings are FW dependent, so we cannot enable
them for all controllers.

>
> > There can be a two possibilities.
> >
> > 1. This is actual memory allocation failure due to system resource
> > issue.
>
> I have not seen any OOMs on the two machines when/where the SGL-error
> occurs. According to "xl info” and our munin-graphs it all looks ok with a
> couple 100 MiB “free".
>
>
> > 2. IOCLT provided large memory length in iov and dma buffer allocation
> > from below API failed due to large memory chunk requested.
> >
> >kbuff_arr[i] = dma_alloc_coherent(&instance->pdev->dev,
> >ioc->sgl[i].iov_len,
> >&buf_handle,
> > GFP_KERNEL);
> >
> > Can you change driver code *printk* to dump iov_len ? Just to confirm.
>
> Just did that on the “Dell R730xd, MegaRAID SAS-3 3108” and get the
> following output when the megacli works fine.
>
> ###
> Apr 23 09:31:37 xh643 kernel: [  368.319092] GD IOV-len: 2048 Apr 23
> 09:31:37
> xh643 kernel: [  368.319426] GD IOV-len: 32 Apr 23 09:31:37 xh643 kernel:
> [
> 368.319563] GD IOV-len: 320 Apr 23 09:31:37 xh643 kernel: [  368.319698]
> GD
> IOV-len: 616 Apr 23 09:31:37 xh643 kernel: [  368.319887] GD IOV-len: 1664
> Apr 23 09:31:37 xh643 kernel: [  368.320040] GD IOV-len: 32 Apr 23
> 09:31:37
> xh643 kernel: [  368.320174] GD IOV-len: 8 … ###
>
> Full output is attached in iov_len_megacli_works.txt, it also contains the
> output of /proc/buddyinfo which might be important based in my research so
> far.

We need similar output whenever there is a dma_alloc_coherent() failure.
Did you add the new print on the dma_alloc_coherent() failure path, or is
it a generic print for all cases?


RE: MegaCli fails to communicate with Raid-Controller

2018-04-18 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, April 18, 2018 10:43 PM
> To: volker.schwick...@godaddy.com
> Cc: linux-scsi@vger.kernel.org; Kashyap Desai; Sumit Saxena
> Subject: Re: MegaCli fails to communicate with Raid-Controller
>
>
> Volker,
>
> > after our latest kernel-update from 4.6. to 4.14.14 we are having
> > trouble getting data out of our LSI-raid-controllers using the
> > megacli-binary.
> >
> > For every execution of the megacli-binary a line shows up in the
> > kern.log
> >
> > ###
> > [547216.425556] megaraid_sas :03:00.0: Failed to alloc kernel SGL
> > buffer for IOCTL ###
>
> Well, that explains why things aren't working. The kernel is unable to
allocate
> a DMA buffer for the ioctl.
>
> There really hasn't been any changes to this code since 4.6. The only
thing
> that springs to mind is some mucking around with the DMA mask in a
previous
> megaraid update. But given how old your controller is, I'd expect this
mask to
> be 32 bits both before and after.

I think you may see the issue with the 4.6 kernel as well. This is a
run-time memory allocation failure. Older controllers use a 32-bit coherent
DMA mask, so the possibility of a memory allocation failure is higher than
with a 64-bit coherent DMA mask. Newer controllers have a fix in this area,
but you are using a gen-1 controller ("Dell R710, MegaRAID SAS 1078").
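
Just to illustrate the mask difference, this is roughly how a driver picks
its coherent DMA mask (a generic sketch, not the exact megaraid_sas code):

	/* Prefer a 64-bit coherent DMA mask; fall back to 32 bit, in
	 * which case every coherent allocation must come from below
	 * 4GB, which is much easier to exhaust. */
	if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));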

There can be two possibilities.

1. This is an actual memory allocation failure due to a system resource
issue.
2. The IOCTL provided a large memory length in the iov, and the DMA buffer
allocation from the API below failed because of the large chunk requested.

kbuff_arr[i] = dma_alloc_coherent(&instance->pdev->dev,
ioc->sgl[i].iov_len,
&buf_handle,
GFP_KERNEL);

Can you change the driver code to add a *printk* that dumps iov_len? Just
to confirm.
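
Something along these lines would do (just a sketch on top of the snippet
above, not a tested patch; the exact message format is up to you):

	kbuff_arr[i] = dma_alloc_coherent(&instance->pdev->dev,
					ioc->sgl[i].iov_len,
					&buf_handle, GFP_KERNEL);
	if (!kbuff_arr[i]) {
		/* dump the requested length only on the failure path;
		 * the existing error handling continues as before */
		dev_warn(&instance->pdev->dev,
			"GD IOV-len FAILED: %lu\n",
			(unsigned long)ioc->sgl[i].iov_len);
	}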

One wild guess: you are using a Xen flavor, which reserves less memory for
Dom0, and there may be a way to increase the dom0 memory. Can you tune that
as well and see? I am not sure how to do it in your case, but at Citrix we
used to see this kind of issue frequently compared to *default* Linux.
Adding some tuning in grub to increase the dom0 memory made things better
compared to the default settings.


>
> Kashyap? Sumit?
>
> --
> Martin K. PetersenOracle Linux Engineering


smp affinity and kworker io submission

2018-03-22 Thread Kashyap Desai
Hi,

I am running an FIO script on Linux 4.15. This is generic behavior on 3.x
kernels as well. I wanted to know whether my observation is correct.

Here is FIO command -

numactl -C 0-2  fio single --bs=4k  --iodepth=64 --rw=randread
--ioscheduler=none --group_report --numjobs=2

If the driver provides an affinity_hint, the kernel chooses only kworkers
0, 1 and 2 for IO submission from the delayed context (it looks like the
kworker binding is handled smartly by the kernel because I am running FIO
from CPUs 0-2).

14140 root  15  -5  519296   1560612 R  87.7  0.0   0:20.91 fio

 14138 root  15  -5  519292   1556608 R  76.1  0.0   0:21.79 fio

 14142 root  15  -5  519308   1560612 R  66.8  0.0   0:19.69 fio

 14141 root  15  -5  519304   1564616 R  54.5  0.0   0:20.51 fio

   923 root   0 -20   0  0  0 S   6.3  0.0   0:09.73
kworker/1:1H

  1075 root   0 -20   0  0  0 S   5.3  0.0   0:08.69
kworker/0:1H

   924 root   0 -20   0  0  0 S   3.3  0.0   0:12.82
kworker/2:1H

If the driver does not provide an affinity_hint, the kernel chooses *any*
kworker from the local NUMA node for IO submission from the delayed
context. In the snippet below, you can see kworker/3, kworker/4 and
kworker/5 participating in IO submission.

14281 root  15  -5  519308   1556612 R  87.0  0.0   0:16.16 fio

 14280 root  15  -5  519304   1560616 R  74.1  0.0   0:14.62 fio

 14279 root  15  -5  519296   1556612 R  71.8  0.0   0:15.02 fio

 14277 root  15  -5  519292   1552608 R  66.8  0.0   0:15.06 fio

  1887 root   0 -20   0  0  0 R  15.3  0.0   0:40.91
kworker/4:1H

  3856 root   0 -20   0  0  0 S  13.6  0.0   0:38.90
kworker/5:1H

  3646 root   0 -20   0  0  0 S  13.0  0.0   0:40.17
kworker/3:1H

Which kernel component is making this decision? Is this behavior tied to
the block layer or the IRQ subsystem?

I am trying to see which behavior is most suitable for my test. I see that
performance is not improving because the workload is CPU bound, and if I
choose not to set the SMP affinity hint in the driver, it helps, as
explained above.


Thanks, Kashyap


RE: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue

2018-03-19 Thread Kashyap Desai
> -Original Message-
> From: Artem Bityutskiy [mailto:dedeki...@gmail.com]
> Sent: Monday, March 19, 2018 8:12 PM
> To: h...@lst.de; Thomas Gleixner
> Cc: linux-bl...@vger.kernel.org; snit...@redhat.com; h...@suse.de;
> mr...@linux.ee; linux-scsi@vger.kernel.org; don.br...@microsemi.com;
> pbonz...@redhat.com; lober...@redhat.com;
> kashyap.de...@broadcom.com; Jens Axboe; martin.peter...@oracle.com;
> james.bottom...@hansenpartnership.com; ming@redhat.com
> Subject: Re: [PATCH V5 1/5] scsi: hpsa: fix selection of reply queue
>
> On Mon, 2018-03-19 at 08:31 -0600, Jens Axboe wrote:
> > I'm assuming that Martin will eventually queue this up. But probably
> > for 4.17, then we can always flag it for a backport to stable once
> > it's been thoroughly tested.
>
> Jens, thanks for reply.
>
> I wonder if folks agree that in this case we should revert
>
> 84676c1f21e8 genirq/affinity: assign vectors to all possible CPUs
>
> for v4.16.
>
> If this was a minor niche use-case regression the -stable scenario would
> probably be OK. But the patch seem to miss the fact that kernel's
> "possible
> CPUs" notion may be way off and side effects are bad.

There is also a performance issue, as posted at the link below, if we just
use "84676c1f21e8 genirq/affinity: assign vectors to all possible CPUs".

https://www.spinics.net/lists/linux-scsi/msg118301.html

The performance drop was resolved using the patch set under discussion
(available at the link below), posted by Ming.

https://marc.info/?l=linux-block&m=152050646332092&w=2

Kashyap

>
> Christoph, Thomas, what do you think?
>
> Thanks,
> Artem.


RE: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue

2018-03-13 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, March 13, 2018 3:13 PM
> To: James Bottomley; Jens Axboe; Martin K . Petersen
> Cc: Christoph Hellwig; linux-scsi@vger.kernel.org; linux-
> bl...@vger.kernel.org; Meelis Roos; Don Brace; Kashyap Desai; Laurence
> Oberman; Mike Snitzer; Paolo Bonzini; Ming Lei; Hannes Reinecke; James
> Bottomley; Artem Bityutskiy
> Subject: [PATCH V5 2/5] scsi: megaraid_sas: fix selection of reply queue
>
> From 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs),
one
> msix vector can be created without any online CPU mapped, then command
> may be queued, and won't be notified after its completion.
>
> This patch setups mapping between cpu and reply queue according to irq
> affinity info retrived by pci_irq_get_affinity(), and uses this info to
choose
> reply queue for queuing one command.
>
> Then the chosen reply queue has to be active, and fixes IO hang caused
by
> using inactive reply queue which doesn't have any online CPU mapped.
>
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Mike Snitzer 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible
CPUs")
> Signed-off-by: Ming Lei 


Side note: for maximum performance, your proposed patch/series below is
required. Without it, performance drops because fewer reply queues are
utilized once this particular patch is included.
genirq/affinity: irq vector spread  among online CPUs as far as possible

Acked-by: Kashyap Desai 
Tested-by: Kashyap Desai 


RE: [PATCH V4 2/4] scsi: megaraid_sas: fix selection of reply queue

2018-03-09 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, March 9, 2018 5:33 PM
> To: Kashyap Desai
> Cc: James Bottomley; Jens Axboe; Martin K . Petersen; Christoph Hellwig;
> linux-scsi@vger.kernel.org; linux-bl...@vger.kernel.org; Meelis Roos;
Don
> Brace; Laurence Oberman; Mike Snitzer; Hannes Reinecke; Artem Bityutskiy
> Subject: Re: [PATCH V4 2/4] scsi: megaraid_sas: fix selection of reply
queue
>
> On Fri, Mar 09, 2018 at 04:37:56PM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Friday, March 9, 2018 9:02 AM
> > > To: James Bottomley; Jens Axboe; Martin K . Petersen
> > > Cc: Christoph Hellwig; linux-scsi@vger.kernel.org; linux-
> > > bl...@vger.kernel.org; Meelis Roos; Don Brace; Kashyap Desai;
> > > Laurence Oberman; Mike Snitzer; Ming Lei; Hannes Reinecke; James
> > > Bottomley; Artem Bityutskiy
> > > Subject: [PATCH V4 2/4] scsi: megaraid_sas: fix selection of reply
> > > queue
> > >
> > > From 84676c1f21 (genirq/affinity: assign vectors to all possible
> > > CPUs),
> > one
> > > msix vector can be created without any online CPU mapped, then
> > > command may be queued, and won't be notified after its completion.
> > >
> > > This patch setups mapping between cpu and reply queue according to
> > > irq affinity info retrived by pci_irq_get_affinity(), and uses this
> > > info to
> > choose
> > > reply queue for queuing one command.
> > >
> > > Then the chosen reply queue has to be active, and fixes IO hang
> > > caused
> > by
> > > using inactive reply queue which doesn't have any online CPU mapped.
> >
> > Also megaraid FW will use reply queue 0 for any async notification.
> > We want to set pre_vectors = 1 and make sure reply queue 0 is not part
> > of affinity hint.
> > To meet that requirement, I have to make some more changes like add
> > extra queue.
> > Example if reply queue supported by FW is 96 and online CPU is 16,
> > current driver will allocate 16 msix vector. We may have to allocate
> > 17 msix vector and reserve reply queue 0 for async reply from FW.
> >
> > I will be sending follow up patch soon.
>
> OK, but the above extra change shouldn't belong to this patch, which
focuses
> on fixing IO hang because of reply queue selection.

Fine. That will be a separate patch to handle the reply queue 0 affinity
case.
>
> >
> > >
> > > Cc: Hannes Reinecke 
> > > Cc: "Martin K. Petersen" ,
> > > Cc: James Bottomley ,
> > > Cc: Christoph Hellwig ,
> > > Cc: Don Brace 
> > > Cc: Kashyap Desai 
> > > Cc: Laurence Oberman 
> > > Cc: Mike Snitzer 
> > > Cc: Meelis Roos 
> > > Cc: Artem Bityutskiy 
> > > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all
> > > possible
> > CPUs")
> > > Signed-off-by: Ming Lei 
> > > ---
> > >  drivers/scsi/megaraid/megaraid_sas.h|  2 +-
> > >  drivers/scsi/megaraid/megaraid_sas_base.c   | 34
> > > -
> > >  drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 --
> > >  3 files changed, 38 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> > > b/drivers/scsi/megaraid/megaraid_sas.h
> > > index ba6503f37756..a644d2be55b6 100644
> > > --- a/drivers/scsi/megaraid/megaraid_sas.h
> > > +++ b/drivers/scsi/megaraid/megaraid_sas.h
> > > @@ -2127,7 +2127,7 @@ enum MR_PD_TYPE {
> > >  #define MR_NVME_PAGE_SIZE_MASK   0x00FF
> > >
> > >  struct megasas_instance {
> > > -
> > > + unsigned int *reply_map;
> > >   __le32 *producer;
> > >   dma_addr_t producer_h;
> > >   __le32 *consumer;
> > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > index a71ee67df084..065956cb2aeb 100644
> > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > @@ -5165,6 +5165,26 @@ megasas_setup_jbod_map(struct
> > > megasas_instance *instance)
> > >   instance->use_seqnum_jbod_fp = false;  }
> > >
> > > +static void megasas_setup_reply_map(struct megasas_instance
> > > +*instance) {
> > > + const struct cpumask *mask;
> > > + unsigned int queue, cpu;
> >

RE: [PATCH V4 2/4] scsi: megaraid_sas: fix selection of reply queue

2018-03-09 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, March 9, 2018 9:02 AM
> To: James Bottomley; Jens Axboe; Martin K . Petersen
> Cc: Christoph Hellwig; linux-scsi@vger.kernel.org; linux-
> bl...@vger.kernel.org; Meelis Roos; Don Brace; Kashyap Desai; Laurence
> Oberman; Mike Snitzer; Ming Lei; Hannes Reinecke; James Bottomley; Artem
> Bityutskiy
> Subject: [PATCH V4 2/4] scsi: megaraid_sas: fix selection of reply queue
>
> From 84676c1f21 (genirq/affinity: assign vectors to all possible CPUs),
one
> msix vector can be created without any online CPU mapped, then command
> may be queued, and won't be notified after its completion.
>
> This patch setups mapping between cpu and reply queue according to irq
> affinity info retrived by pci_irq_get_affinity(), and uses this info to
choose
> reply queue for queuing one command.
>
> Then the chosen reply queue has to be active, and fixes IO hang caused
by
> using inactive reply queue which doesn't have any online CPU mapped.

Also, the megaraid FW will use reply queue 0 for any async notification.
We want to set pre_vectors = 1 and make sure reply queue 0 is not part of
the affinity hint.
To meet that requirement, I have to make some more changes, such as adding
an extra queue.
For example, if the FW supports 96 reply queues and there are 16 online
CPUs, the current driver will allocate 16 MSI-x vectors. We may have to
allocate 17 MSI-x vectors and reserve reply queue 0 for async replies from
the FW.

I will be sending a follow-up patch soon.
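
Roughly what I have in mind, only as an illustration of the pre_vectors
idea (assuming the driver moves to pci_alloc_irq_vectors_affinity(); this
is not the actual follow-up patch):

	/* Reserve vector 0 for FW async notifications: it is excluded
	 * from affinity spreading, and the remaining vectors are spread
	 * across the online CPUs by the IRQ core. */
	struct irq_affinity desc = { .pre_vectors = 1 };
	int nvec;

	nvec = pci_alloc_irq_vectors_affinity(instance->pdev, 1,
					instance->msix_vectors + 1,
					PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					&desc);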

>
> Cc: Hannes Reinecke 
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Mike Snitzer 
> Cc: Meelis Roos 
> Cc: Artem Bityutskiy 
> Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all possible
CPUs")
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/megaraid/megaraid_sas.h|  2 +-
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 34
> -
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 --
>  3 files changed, 38 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> b/drivers/scsi/megaraid/megaraid_sas.h
> index ba6503f37756..a644d2be55b6 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -2127,7 +2127,7 @@ enum MR_PD_TYPE {
>  #define MR_NVME_PAGE_SIZE_MASK   0x00FF
>
>  struct megasas_instance {
> -
> + unsigned int *reply_map;
>   __le32 *producer;
>   dma_addr_t producer_h;
>   __le32 *consumer;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index a71ee67df084..065956cb2aeb 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -5165,6 +5165,26 @@ megasas_setup_jbod_map(struct
> megasas_instance *instance)
>   instance->use_seqnum_jbod_fp = false;  }
>
> +static void megasas_setup_reply_map(struct megasas_instance *instance)
> +{
> + const struct cpumask *mask;
> + unsigned int queue, cpu;
> +
> + for (queue = 0; queue < instance->msix_vectors; queue++) {
> + mask = pci_irq_get_affinity(instance->pdev, queue);
> + if (!mask)
> + goto fallback;
> +
> + for_each_cpu(cpu, mask)
> + instance->reply_map[cpu] = queue;
> + }
> + return;
> +
> +fallback:
> + for_each_possible_cpu(cpu)
> + instance->reply_map[cpu] = 0;

The fallback would be better spreading CPUs across the reply queues instead
of assigning everything to a single reply queue. Maybe something like
below:

	for_each_possible_cpu(cpu)
		instance->reply_map[cpu] = cpu % instance->msix_vectors;

If the smp_affinity_enable module parameter is set to 0, I see a
performance drop because all IO goes to a single reply queue.
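
Putting the affinity-based mapping from the patch together with this
fallback, the whole helper would look roughly like below (a sketch only,
reusing the names from the hunk above):

	static void megasas_setup_reply_map(struct megasas_instance *instance)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < instance->msix_vectors; queue++) {
			mask = pci_irq_get_affinity(instance->pdev, queue);
			if (!mask)
				goto fallback;

			for_each_cpu(cpu, mask)
				instance->reply_map[cpu] = queue;
		}
		return;

	fallback:
		/* spread possible CPUs round-robin over the reply queues
		 * instead of funnelling all IO to reply queue 0 */
		for_each_possible_cpu(cpu)
			instance->reply_map[cpu] = cpu % instance->msix_vectors;
	}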

> +}
> +
>  /**
>   * megasas_init_fw - Initializes the FW
>   * @instance:Adapter soft state
> @@ -5343,6 +5363,8 @@ static int megasas_init_fw(struct megasas_instance
> *instance)
>   goto fail_setup_irqs;
>   }
>
> + megasas_setup_reply_map(instance);
> +
>   dev_info(&instance->pdev->dev,
>   "firmware supports msix\t: (%d)", fw_msix_count);
>   dev_info(&instance->pdev->dev,
> @@ -6448,6 +6470,11 @@ static int megasas_probe_one(struct pci_dev
> *pdev,
>   memset(instance, 0, sizeof(*instance));
>   atomic_set(&instance->fw_reset_no_pci_access, 0);
>
> + instance->reply_map = kzalloc(sizeof(unsigned int) * nr_cpu_ids,
> + 

RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, March 8, 2018 4:54 PM
> To: Kashyap Desai
> Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
Snitzer;
> linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Laurence Oberman
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
via
> .host_tagset
>
> On Thu, Mar 08, 2018 at 07:06:25PM +0800, Ming Lei wrote:
> > On Thu, Mar 08, 2018 at 03:34:31PM +0530, Kashyap Desai wrote:
> > > > -Original Message-
> > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > Sent: Thursday, March 8, 2018 6:46 AM
> > > > To: Kashyap Desai
> > > > Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig;
> > > > Mike
> > > Snitzer;
> > > > linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar
> > > > Sandoval; Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > Don Brace;
> > > Peter
> > > > Rivera; Laurence Oberman
> > > > Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq
> > > > performance
> > > via
> > > > .host_tagset
> > > >
> > > > On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > > > > >
> > > > > > > Also one observation using V3 series patch. I am seeing
> > > > > > > below Affinity mapping whereas I have only 72 logical CPUs.
> > > > > > > It means we are really not going to use all reply queues.
> > > > > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one
> > > > > > > reply queue is used and that may lead to performance drop as
well.
> > > > > >
> > > > > > If the mapping is in such shape, I guess it should be quite
> > > > > > difficult to
> > > > > figure out
> > > > > > one perfect way to solve this situation because one reply
> > > > > > queue has to
> > > > > handle
> > > > > > IOs submitted from 4~5 CPUs at average.
> > > > >
> > > > > 4.15.0-rc1 kernel has below mapping - I am not sure which commit
> > > > > id in
> > > "
> > > > > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to
CPU.
> > > > > It
> > > >
> > > > I guess the mapping you posted is read from
/proc/irq/126/smp_affinity.
> > > >
> > > > If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change
> > > > IRQ
> > > affinity
> > > > code, which is done in irq_create_affinity_masks(), as you saw, no
> > > > any
> > > patch
> > > > in linux_4.16-rc-host-tags-v3.2 touches that code.
> > > >
> > > > Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
> > > against
> > > > 4.15-rc1 kernel and see any difference?
> > > >
> > > > > will be really good if we can fall back to below mapping once
again.
> > > > > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of
> > > > > random mapping of CPU - MSIx. And that will be problematic in
> > > > > performance
> > > run.
> > > > >
> > > > > As I posted earlier, latest repo will only allow us to use *18*
> > > > > reply
> > > >
> > > > Looks not see this report before, could you share us how you
> > > > conclude
> > > that?
> > > > The only patch changing reply queue is the following one:
> > > >
> > > > https://marc.info/?l=linux-block&m=151972611911593&w=2
> > > >
> > > > But not see any issue in this patch yet, can you recover to 72
> > > > reply
> > > queues
> > > > after reverting the patch in above link?
> > > Ming -
> > >
> > > While testing, my system went bad. I debug further and understood
> > > that affinity mapping was changed due to below commit -
> > > 84676c1f21e8ff54befe985f4f14dc1edc10046b
> > >
> > > [PATCH] genirq/affinity: assign vectors to all possible CPUs
> > >
> > > Because of above change, we end up using very less reply queue. Many
> > > reply queues on my setup was mapped to offline/not-available CPUs.
> > > T

RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-08 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, March 8, 2018 6:46 AM
> To: Kashyap Desai
> Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
Snitzer;
> linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Laurence Oberman
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
via
> .host_tagset
>
> On Wed, Mar 07, 2018 at 10:58:34PM +0530, Kashyap Desai wrote:
> > > >
> > > > Also one observation using V3 series patch. I am seeing below
> > > > Affinity mapping whereas I have only 72 logical CPUs.  It means we
> > > > are really not going to use all reply queues.
> > > > e.a If I bind fio jobs on CPU 18-20, I am seeing only one reply
> > > > queue is used and that may lead to performance drop as well.
> > >
> > > If the mapping is in such shape, I guess it should be quite
> > > difficult to
> > figure out
> > > one perfect way to solve this situation because one reply queue has
> > > to
> > handle
> > > IOs submitted from 4~5 CPUs at average.
> >
> > 4.15.0-rc1 kernel has below mapping - I am not sure which commit id in
"
> > linux_4.16-rc-host-tags-v3.2" is changing the mapping of IRQ to CPU.
> > It
>
> I guess the mapping you posted is read from /proc/irq/126/smp_affinity.
>
> If yes, no any patch in linux_4.16-rc-host-tags-v3.2 should change IRQ
affinity
> code, which is done in irq_create_affinity_masks(), as you saw, no any
patch
> in linux_4.16-rc-host-tags-v3.2 touches that code.
>
> Could you simply apply the patches in linux_4.16-rc-host-tags-v3.2
against
> 4.15-rc1 kernel and see any difference?
>
> > will be really good if we can fall back to below mapping once again.
> > Current repo linux_4.16-rc-host-tags-v3.2 is giving lots of random
> > mapping of CPU - MSIx. And that will be problematic in performance
run.
> >
> > As I posted earlier, latest repo will only allow us to use *18* reply
>
> Looks not see this report before, could you share us how you conclude
that?
> The only patch changing reply queue is the following one:
>
>   https://marc.info/?l=linux-block&m=151972611911593&w=2
>
> But not see any issue in this patch yet, can you recover to 72 reply
queues
> after reverting the patch in above link?
Ming -

While testing, my system went bad. I debugged further and understood that
the affinity mapping was changed due to the commit below:
84676c1f21e8ff54befe985f4f14dc1edc10046b

[PATCH] genirq/affinity: assign vectors to all possible CPUs

Because of the above change, we end up using very few reply queues. Many
reply queues on my setup were mapped to offline/not-available CPUs. This
may be the primary contributor to the odd performance impact, and it may
not be truly due to the V3/V4 patch series.

I am planning to check your V3 and V4 series after removing the above
commit (for the performance impact).

It is fine to spread possible CPUs (instead of online CPUs) over all IRQ
vectors, as long as we have at least *one* online CPU mapped to each
vector.

>
> > queue instead of *72*.  Lots of performance related issue can be pop
> > up on different setup due to inconsistency in CPU - MSIx mapping. BTW,
> > changes in this area is intentional @" linux_4.16-rc-host-tags-v3.2".
?
>
> As you mentioned in the following link, you didn't see big performance
drop
> with linux_4.16-rc-host-tags-v3.2, right?
>
>   https://marc.info/?l=linux-block&m=151982993810092&w=2
>
>
> Thanks,
> Ming


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-07 Thread Kashyap Desai
> >
> > Also one observation using V3 series patch. I am seeing below Affinity
> > mapping whereas I have only 72 logical CPUs.  It means we are really
> > not going to use all reply queues.
> > e.a If I bind fio jobs on CPU 18-20, I am seeing only one reply queue
> > is used and that may lead to performance drop as well.
>
> If the mapping is in such shape, I guess it should be quite difficult to
figure out
> one perfect way to solve this situation because one reply queue has to
handle
> IOs submitted from 4~5 CPUs at average.

The 4.15.0-rc1 kernel has the mapping below. I am not sure which commit ID
in "linux_4.16-rc-host-tags-v3.2" is changing the IRQ-to-CPU mapping. It
would be really good if we could fall back to the mapping below once again.
The current linux_4.16-rc-host-tags-v3.2 repo gives a lot of random CPU to
MSI-x mappings, and that will be problematic in performance runs.

As I posted earlier, the latest repo will only allow us to use *18* reply
queues instead of *72*. Lots of performance-related issues can pop up on
different setups due to inconsistency in the CPU to MSI-x mapping. BTW, are
the changes in this area intentional in "linux_4.16-rc-host-tags-v3.2"?

irq 218, cpu list 0
irq 219, cpu list 1
irq 220, cpu list 2
irq 221, cpu list 3
irq 222, cpu list 4
irq 223, cpu list 5
irq 224, cpu list 6
irq 225, cpu list 7
irq 226, cpu list 8
irq 227, cpu list 9
irq 228, cpu list 10
irq 229, cpu list 11
irq 230, cpu list 12
irq 231, cpu list 13
irq 232, cpu list 14
irq 233, cpu list 15
irq 234, cpu list 16
irq 235, cpu list 17
irq 236, cpu list 36
irq 237, cpu list 37
irq 238, cpu list 38
irq 239, cpu list 39
irq 240, cpu list 40
irq 241, cpu list 41
irq 242, cpu list 42
irq 243, cpu list 43
irq 244, cpu list 44
irq 245, cpu list 45
irq 246, cpu list 46
irq 247, cpu list 47
irq 248, cpu list 48
irq 249, cpu list 49
irq 250, cpu list 50
irq 251, cpu list 51
irq 252, cpu list 52
irq 253, cpu list 53
irq 254, cpu list 18
irq 255, cpu list 19
irq 256, cpu list 20
irq 257, cpu list 21
irq 258, cpu list 22
irq 259, cpu list 23
irq 260, cpu list 24
irq 261, cpu list 25
irq 262, cpu list 26
irq 263, cpu list 27
irq 264, cpu list 28
irq 265, cpu list 29
irq 266, cpu list 30
irq 267, cpu list 31
irq 268, cpu list 32
irq 269, cpu list 33
irq 270, cpu list 34
irq 271, cpu list 35
irq 272, cpu list 54
irq 273, cpu list 55
irq 274, cpu list 56
irq 275, cpu list 57
irq 276, cpu list 58
irq 277, cpu list 59
irq 278, cpu list 60
irq 279, cpu list 61
irq 280, cpu list 62
irq 281, cpu list 63
irq 282, cpu list 64
irq 283, cpu list 65
irq 284, cpu list 66
irq 285, cpu list 67
irq 286, cpu list 68
irq 287, cpu list 69
irq 288, cpu list 70
irq 289, cpu list 71


>
> The application should have the knowledge to avoid this kind of usage.
>
>
> Thanks,
> Ming


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-03-07 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, March 7, 2018 10:58 AM
> To: Kashyap Desai
> Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
Snitzer;
> linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Laurence Oberman
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
via
> .host_tagset
>
> On Wed, Feb 28, 2018 at 08:28:48PM +0530, Kashyap Desai wrote:
> > Ming -
> >
> > Quick testing on my setup -  Performance slightly degraded (4-5%
> > drop)for megaraid_sas driver with this patch. (From 1610K IOPS it goes
> > to 1544K) I confirm that after applying this patch, we have #queue =
#numa
> node.
> >
> > ls -l
> >
>
/sys/devices/pci:80/:80:02.0/:83:00.0/host10/target10:2:23/10:
> > 2:23:0/block/sdy/mq
> > total 0
> > drwxr-xr-x. 18 root root 0 Feb 28 09:53 0 drwxr-xr-x. 18 root root 0
> > Feb 28 09:53 1
> >
> >
> > I would suggest to skip megaraid_sas driver changes using
> > shared_tagset until and unless there is obvious gain. If overall
> > interface of using shared_tagset is commit in kernel tree, we will
> > investigate (megaraid_sas
> > driver) in future about real benefit of using it.
>
> Hi Kashyap,
>
> Now I have put patches for removing operating on scsi_host->host_busy in
> V4[1], especially which are done in the following 3 patches:
>
>   9221638b9bc9 scsi: avoid to hold host_busy for scsi_mq
>   1ffc8c0ffbe4 scsi: read host_busy via scsi_host_busy()
>   e453d3983243 scsi: introduce scsi_host_busy()
>
>
> Could you run your test on V4 and see if IOPS can be improved on
> megaraid_sas?
>
>
> [1] https://github.com/ming1/linux/commits/v4.16-rc-host-tags-v4

I will be doing testing soon.

BTW, the performance impact is due to the patch below only:
"[PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via
.host_tagset"

The patch below is really needed:
"[PATCH V3 2/8] scsi: megaraid_sas: fix selection of reply queue"

I am currently reviewing it on my setup. I think the above patch fixes a
real performance issue (for megaraid_sas), as the driver may not be sending
IO to the optimal reply queue.
Having a CPU to MSI-x mapping will solve that. The megaraid_sas driver
always creates the maximum number of MSI-x vectors as min(online CPUs,
MSI-x supported by HW).
I will do more review and testing of that particular patch as well.
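
Just to spell out that sizing rule (an illustration only;
fw_supported_msix_count is a placeholder for whatever count the FW
reports, not the actual driver variable):

	/* illustrative: number of MSI-x vectors the driver requests */
	instance->msix_vectors = min_t(unsigned int, num_online_cpus(),
				fw_supported_msix_count);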

Also, one observation using the V3 series patch: I am seeing the affinity
mapping below, whereas I have only 72 logical CPUs. It means we are really
not going to use all reply queues.
E.g. if I bind fio jobs to CPUs 18-20, I see only one reply queue being
used, and that may lead to a performance drop as well.

PCI name is 86:00.0, dump its irq affinity:
irq 218, cpu list 0-2,36-37
irq 219, cpu list 3-5,39-40
irq 220, cpu list 6-8,42-43
irq 221, cpu list 9-11,45-46
irq 222, cpu list 12-13,48-49
irq 223, cpu list 14-15,50-51
irq 224, cpu list 16-17,52-53
irq 225, cpu list 38,41,44,47
irq 226, cpu list 72,74,76,78
irq 227, cpu list 80,82,84,86
irq 228, cpu list 88,90,92,94
irq 229, cpu list 96,98,100,102
irq 230, cpu list 104,106,108,110
irq 231, cpu list 112,114,116,118
irq 232, cpu list 120,122,124,126
irq 233, cpu list 128,130,132,134
irq 234, cpu list 136,138,140,142
irq 235, cpu list 144,146,148,150
irq 236, cpu list 152,154,156,158
irq 237, cpu list 160,162,164,166
irq 238, cpu list 168,170,172,174
irq 239, cpu list 176,178,180,182
irq 240, cpu list 184,186,188,190
irq 241, cpu list 192,194,196,198
irq 242, cpu list 200,202,204,206
irq 243, cpu list 208,210,212,214
irq 244, cpu list 216,218,220,222
irq 245, cpu list 224,226,228,230
irq 246, cpu list 232,234,236,238
irq 247, cpu list 240,242,244,246
irq 248, cpu list 248,250,252,254
irq 249, cpu list 256,258,260,262
irq 250, cpu list 264,266,268,270
irq 251, cpu list 272,274,276,278
irq 252, cpu list 280,282,284,286
irq 253, cpu list 288,290,292,294
irq 254, cpu list 18-20,54-55
irq 255, cpu list 21-23,57-58
irq 256, cpu list 24-26,60-61
irq 257, cpu list 27-29,63-64
irq 258, cpu list 30-31,66-67
irq 259, cpu list 32-33,68-69
irq 260, cpu list 34-35,70-71
irq 261, cpu list 56,59,62,65
irq 262, cpu list 73,75,77,79
irq 263, cpu list 81,83,85,87
irq 264, cpu list 89,91,93,95
irq 265, cpu list 97,99,101,103
irq 266, cpu list 105,107,109,111
irq 267, cpu list 113,115,117,119
irq 268, cpu list 121,123,125,127
irq 269, cpu list 129,131,133,135
irq 270, cpu list 137,139,141,143
irq 271, cpu list 145,147,149,151
irq 272, cpu list 153,155,157,159
irq 273, cpu list 161,163,165,167
irq 274, cpu list 169,171,173,175
irq 275, cpu list 177,179,181,183
irq 276, cpu list 185,187,189,191
irq 277, cpu list 193,195,197,199
irq 278, cpu list 201,203,20

RE: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-04 Thread Kashyap Desai
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Saturday, March 3, 2018 3:23 AM
> To: Don Brace; Ming Lei
> Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
> Snitzer;
> linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Kashyap Desai;
> Peter
> Rivera; Meelis Roos
> Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
>
> On Fri, 2018-03-02 at 15:03 +, Don Brace wrote:
> > > -Original Message-
> > > From: Laurence Oberman [mailto:lober...@redhat.com]
> > > Sent: Friday, March 02, 2018 8:09 AM
> > > To: Ming Lei 
> > > Cc: Don Brace ; Jens Axboe  > > k>;
> > > linux-bl...@vger.kernel.org; Christoph Hellwig ;
> > > Mike Snitzer ; linux-scsi@vger.kernel.org;
> > > Hannes Reinecke ; Arun Easi ;
> > > Omar Sandoval ; Martin K . Petersen
> > > ; James Bottomley
> > > ; Christoph Hellwig
> > > ; Kashyap Desai ; Peter
> > > Rivera ; Meelis Roos 
> > > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
> > >
> > > EXTERNAL EMAIL
> > >
> > >
> > > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote:
> > > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote:
> > > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> > > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > > > > > > To: Jens Axboe ; linux-block@vger.kernel
> > > > > > > > .org ; Christoph Hellwig ; Mike Snitzer
> > > > > > > >  > > > > > > > >
> > > > > > > >
> > > > > > > > Cc: linux-scsi@vger.kernel.org; Hannes Reinecke  > > > > > > > e.de
> > > > > > > > > ;
> > > > > > > >
> > > > > > > > Arun Easi
> > > > > > > > ; Omar Sandoval ;
> > > > > > > > Martin K .
> > > > > > > > Petersen ; James Bottomley
> > > > > > > > ; Christoph Hellwig
> > > > > > > > ; Don Brace ;
> > > > > > > > Kashyap Desai ; Peter Rivera
> > > > > > > >  > > > > > > > om>;
> > > > > > > > Laurence Oberman ; Ming Lei
> > > > > > > > ; Meelis Roos 
> > > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply
> > > > > > > > queue
> > > > > > > >
> > > >
> > > > Seems Don run into IO failure without blk-mq, could you run your
> > > > tests again in legacy mode?
> > > >
> > > > Thanks,
> > > > Ming
> > >
> > > Hello Ming
> > > I ran multiple passes on Legacy and still see no issues in my test
> > > bed
> > >
> > > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789-
> > > a28e-
> > > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M
> > > console=ttyS1,115200n8
> > >
> > > HEAD of the git kernel I am using
> > >
> > > 694e16f scsi: megaraid: improve scsi_mq performance via .host_tagset
> > > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset
> > > 60d5b36 block: null_blk: introduce module parameter of 'g_host_tags'
> > > 8847067 scsi: Add template flag 'host_tagset'
> > > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS 4710fab blk-mq:
> > > introduce 'start_tag' field to 'struct blk_mq_tags'
> > > 09bb153 scsi: megaraid_sas: fix selection of reply queue
> > > 52700d8 scsi: hpsa: fix selection of reply queue
> >
> > I checkout out Linus's tree (4.16.0-rc3+) and re-applied the above
> > patches.
> > I  and have been running 24 hours with no issues.
> > Evidently my forked copy was corrupted.
> >
> > So, my I/O testing has gone well.
> >
> > I'll run some performance numbers next.
> >
> > Thanks,
> > Don
>
> Unless Kashyap is not happy we need to consider getting this in to Linus
> now
> because we are seeing HPE servers that keep hanging now with the original
> commit now upstream.
>
> Kashyap, are you good with the v3 patchset or still concerned with
> performance. I was getting pretty good IOPS/sec to individual SSD drives
> set
> up as jbod devices on the megaraid_sas.

Laurence -
Did you find a difference with and without the patch? What were the IOPS
numbers with and without it?
It is not an urgent feature, so I would like to take some time to get
BRCM's performance team involved, do a full analysis of the performance
runs, and find the pros and cons.

Kashyap
>
> With larger I/O sizes like 1MB I was getting good MB/sec and not seeing a
> measurable performance impact.
>
> I dont have the hardware you have to mimic your configuration.
>
> Thanks
> Laurence


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-02-28 Thread Kashyap Desai
> -Original Message-
> From: Laurence Oberman [mailto:lober...@redhat.com]
> Sent: Wednesday, February 28, 2018 9:52 PM
> To: Ming Lei; Kashyap Desai
> Cc: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
> Snitzer;
> linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace; Peter
> Rivera
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
> via
> .host_tagset
>
> On Wed, 2018-02-28 at 23:21 +0800, Ming Lei wrote:
> > On Wed, Feb 28, 2018 at 08:28:48PM +0530, Kashyap Desai wrote:
> > > Ming -
> > >
> > > Quick testing on my setup -  Performance slightly degraded (4-5%
> > > drop)for megaraid_sas driver with this patch. (From 1610K IOPS it
> > > goes to
> > > 1544K)
> > > I confirm that after applying this patch, we have #queue = #numa
> > > node.
> > >
> > > ls -l
> > > /sys/devices/pci:80/:80:02.0/:83:00.0/host10/target10:2
> > > :23/10:
> > > 2:23:0/block/sdy/mq
> > > total 0
> > > drwxr-xr-x. 18 root root 0 Feb 28 09:53 0 drwxr-xr-x. 18 root root 0
> > > Feb 28 09:53 1
> >
> > OK, thanks for your test.
> >
> > As I mentioned to you, this patch should have improved performance on
> > megaraid_sas, but the current slight degrade might be caused by
> > scsi_host_queue_ready() in scsi_queue_rq(), I guess.
> >
> > With .host_tagset enabled and use per-numa-node hw queue, request can
> > be queued to lld more frequently/quick than single queue, then the
> > cost of
> > atomic_inc_return(&host->host_busy) may be increased much meantime,
> > think about millions of such operations, and finally slight IOPS drop
> > is observed when the hw queue depth becomes half of .can_queue.
> >
> > >
> > >
> > > I would suggest to skip megaraid_sas driver changes using
> > > shared_tagset until and unless there is obvious gain. If overall
> > > interface of using shared_tagset is commit in kernel tree, we will
> > > investigate (megaraid_sas
> > > driver) in future about real benefit of using it.
> >
> > I'd suggest to not merge it until it is proved that performance can be
> > improved in real device.

Noted.

> >
> > I will try to work to remove the expensive atomic_inc_return(&host-
> > >host_busy)
> > from scsi_queue_rq(), since it isn't needed for SCSI_MQ, once it is
> > done, will ask you to test again.

Ming, do you mean that removing the host_busy accounting from
scsi_queue_rq() will still provide a correct host_busy value whenever IO
reaches the LLD?

> >
> >
> > Thanks,
> > Ming
>
> I will test this here as well
> I just put the Megaraid card in to my system here
>
> Kashyap, do you have ssd's on the back-end and are you you using jbods or
> virtual devices. Let me have your config.
> I only have 6G sas shelves though.

Laurence -
I am using 12 SSD drives in JBOD mode or single-drive R0 mode. A single SSD
is capable of ~138K IOPS (4K random read).
With all 12 SSDs, performance scales linearly and goes up to ~1610K IOPS.

I think if you have 6G SAS fully loaded, you may need more drives to reach
1600K IOPS (for HDDs, a sequential load with nomerges=2 is required to
avoid IO merging at the block layer).

SSD model I am using is -  HGST  - " HUSMH8020BSS200"
Here is lscpu output of my setup -

lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):32
On-line CPU(s) list:   0-31
Thread(s) per core:2
Core(s) per socket:8
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 79
Model name:Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:  1
CPU MHz:   1726.217
BogoMIPS:  4199.37
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31

>
> Regards
> Laurence


RE: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-02-28 Thread Kashyap Desai
Ming -

Quick testing on my setup: performance is slightly degraded (a 4-5% drop)
for the megaraid_sas driver with this patch (from 1610K IOPS it goes to
1544K).
I confirm that after applying this patch, we have #queues = #NUMA nodes.

ls -l
/sys/devices/pci:80/:80:02.0/:83:00.0/host10/target10:2:23/10:
2:23:0/block/sdy/mq
total 0
drwxr-xr-x. 18 root root 0 Feb 28 09:53 0
drwxr-xr-x. 18 root root 0 Feb 28 09:53 1


I would suggest skipping the megaraid_sas driver changes that use
shared_tagset unless there is an obvious gain. If the overall shared_tagset
interface is committed to the kernel tree, we will investigate (for the
megaraid_sas driver) the real benefit of using it in the future.

Without patch -

  4.64%  [megaraid_sas]   [k] complete_cmd_fusion
   3.23%  [kernel] [k] irq_entries_start
   3.18%  [kernel] [k] _raw_spin_lock
   3.06%  [kernel] [k] syscall_return_via_sysret
   2.74%  [kernel] [k] bt_iter
   2.55%  [kernel] [k] scsi_queue_rq
   2.21%  [megaraid_sas]   [k] megasas_build_io_fusion
   1.80%  [megaraid_sas]   [k] megasas_queue_command
   1.59%  [kernel] [k] __audit_syscall_exit
   1.55%  [kernel] [k] _raw_spin_lock_irqsave
   1.38%  [megaraid_sas]   [k] megasas_build_and_issue_cmd_fusion
   1.34%  [kernel] [k] do_io_submit
   1.33%  [kernel] [k] gup_pgd_range
   1.26%  [kernel] [k] scsi_softirq_done
   1.20%  fio  [.] __fio_gettime
   1.20%  [kernel] [k] switch_mm_irqs_off
   1.00%  [megaraid_sas]   [k] megasas_build_ldio_fusion
   0.97%  fio  [.] get_io_u
   0.89%  [kernel] [k] lookup_ioctx
   0.80%  [kernel] [k] scsi_dec_host_busy
   0.78%  [kernel] [k] blkdev_direct_IO
   0.78%  [megaraid_sas]   [k] MR_GetPhyParams
   0.73%  [kernel] [k] aio_read_events
   0.70%  [megaraid_sas]   [k] MR_BuildRaidContext
   0.64%  [kernel] [k] blk_mq_complete_request
   0.64%  fio  [.] thread_main
   0.63%  [kernel] [k] blk_queue_split
   0.63%  [kernel] [k] blk_mq_get_request
   0.61%  [kernel] [k] read_tsc
   0.59%  [kernel] [k] kmem_cache_a


With patch -

   4.36%  [megaraid_sas]   [k] complete_cmd_fusion
   3.24%  [kernel] [k] irq_entries_start
   3.00%  [kernel] [k] syscall_return_via_sysret
   2.41%  [kernel] [k] scsi_queue_rq
   2.41%  [kernel] [k] _raw_spin_lock
   2.22%  [megaraid_sas]   [k] megasas_build_io_fusion
   1.92%  [kernel] [k] bt_iter
   1.74%  [megaraid_sas]   [k] megasas_queue_command
   1.48%  [kernel] [k] gup_pgd_range
   1.44%  [kernel] [k] __audit_syscall_exit
   1.33%  [megaraid_sas]   [k] megasas_build_and_issue_cmd_fusion
   1.29%  [kernel] [k] _raw_spin_lock_irqsave
   1.25%  fio  [.] get_io_u
   1.24%  fio  [.] __fio_gettime
   1.22%  [kernel] [k] do_io_submit
   1.18%  [megaraid_sas]   [k] megasas_build_ldio_fusion
   1.02%  [kernel] [k] blk_mq_get_request
   0.91%  [kernel] [k] lookup_ioctx
   0.91%  [kernel] [k] scsi_softirq_done
   0.88%  [kernel] [k] scsi_dec_host_busy
   0.87%  [kernel] [k] blkdev_direct_IO
   0.77%  [megaraid_sas]   [k] MR_BuildRaidContext
   0.76%  [megaraid_sas]   [k] MR_GetPhyParams
   0.73%  [kernel] [k] __fget
   0.70%  [kernel] [k] switch_mm_irqs_off
   0.70%  fio  [.] thread_main
   0.69%  [kernel] [k] aio_read_events
   0.68%  [kernel] [k] note_interrupt
   0.65%  [kernel] [k] do_syscal

Kashyap

> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 27, 2018 3:38 PM
> To: Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
Snitzer
> Cc: linux-scsi@vger.kernel.org; Hannes Reinecke; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Kashyap
> Desai; Peter Rivera; Laurence Oberman; Ming Lei
> Subject: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via
> .host_tagset
>
> It is observed on null_blk that IOPS can be improved much by simply
making
> hw queue per NUMA node, so this patch applies the introduced
.host_tagset
> for improving performance.
>
> In reality, .can_queue is quite big, and NUMA node number is often
small, so
> each hw queue's depth should be high enough to saturate device.
>
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> C

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-13 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 13, 2018 6:11 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Sunday, February 11, 2018 11:01 AM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > > Hi Kashyap,
> > > >
> > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > > -Original Message-
> > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > > > To: Kashyap Desai
> > > > > > Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > Arun Easi; Omar
> > > > > Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace;
> > > > > Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > > -Original Message-
> > > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > > > To: Hannes Reinecke
> > > > > > > > Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > Sandoval;
> > > > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > > Don Brace;
> > > > > > > Peter
> > > > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > > tags & introduce force_blk_mq
> > > > > > > >
> > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
> > wrote:
> > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > > >> -Original Message-
> > > > > > > > > >> From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > > > >> To: Hannes Reinecke
> > > > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > > > >> linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > > > > Sandoval;
> > > > > > > > > >> Martin K . Petersen; James Bottomley; Christoph
> > > > > > > > > >> Hellwig; Don Brace;
> > > > > > > > > > Peter
> > > > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support
> > > > > > > > > &g

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-12 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Sunday, February 11, 2018 11:01 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > Hi Kashyap,
> >
> > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > -Original Message-
> > > > From: Ming Lei [mailto:ming....@redhat.com]
> > > > Sent: Friday, February 9, 2018 11:01 AM
> > > > To: Kashyap Desai
> > > > Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org;
> > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > > Easi; Omar
> > > Sandoval;
> > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > Brace;
> > > Peter
> > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > introduce force_blk_mq
> > > >
> > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > -Original Message-
> > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > Sent: Thursday, February 8, 2018 10:23 PM
> > > > > > To: Hannes Reinecke
> > > > > > Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org;
> > > > > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > > > Arun Easi; Omar
> > > > > Sandoval;
> > > > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > > > Brace;
> > > > > Peter
> > > > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > > > introduce force_blk_mq
> > > > > >
> > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke
wrote:
> > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > >> -Original Message-
> > > > > > > >> From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > > > > >> To: Hannes Reinecke
> > > > > > > >> Cc: Kashyap Desai; Jens Axboe;
> > > > > > > >> linux-bl...@vger.kernel.org; Christoph Hellwig; Mike
> > > > > > > >> Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
> > > > > > > > Sandoval;
> > > > > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig;
> > > > > > > >> Don Brace;
> > > > > > > > Peter
> > > > > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global
> > > > > > > >> tags & introduce force_blk_mq
> > > > > > > >>
> > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke
> > > wrote:
> > > > > > > >>> Hi all,
> > > > > > > >>>
> > > > > > > >>> [ .. ]
> > > > > > > >>>>>
> > > > > > > >>>>> Could you share us your patch for enabling
> > > > > > > >>>>> global_tags/MQ on
> > > > > > > >>>> megaraid_sas
> > > > > > > >>>>> so that I can reproduce your test?
> > > > > > > >>>>>
> > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4
> > > > > > > >>>>>> times more
> > > > > CPU.
> > > > > > > >>>>>
> > > > > > > >>>>> Could you share us what the IOPS/CPU utilization
> > > > > > > >>>>> effect is after
> > > > >

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-09 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, February 9, 2018 11:01 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Thursday, February 8, 2018 10:23 PM
> > > To: Hannes Reinecke
> > > Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > >> -Original Message-
> > > > >> From: Ming Lei [mailto:ming@redhat.com]
> > > > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > > > >> To: Hannes Reinecke
> > > > >> Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org;
> > > > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org;
> > > > >> Arun Easi; Omar
> > > > > Sandoval;
> > > > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don
> > > > >> Brace;
> > > > > Peter
> > > > >> Rivera; Paolo Bonzini; Laurence Oberman
> > > > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > >> introduce force_blk_mq
> > > > >>
> > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke
wrote:
> > > > >>> Hi all,
> > > > >>>
> > > > >>> [ .. ]
> > > > >>>>>
> > > > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > > > >>>> megaraid_sas
> > > > >>>>> so that I can reproduce your test?
> > > > >>>>>
> > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times
> > > > >>>>>> more
> > CPU.
> > > > >>>>>
> > > > >>>>> Could you share us what the IOPS/CPU utilization effect is
> > > > >>>>> after
> > > > >>>> applying the
> > > > >>>>> patch V2? And your test script?
> > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > >>>> Currently system is in used.
> > > > >>>>
> > > > >>>> I run below fio test on total 24 SSDs expander attached.
> > > > >>>>
> > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > > >>>> --ioengine=libaio --rw=randread
> > > > >>>>
> > > > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > > > >>>>
> > > > >>> This is basically what we've seen with earlier iterations.
> > > > >>
> > > > >> Hi Hannes,
> > > > >>
> > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > >> issue,
> > > > > which
> > > > >> causes only reply queue 0 used.
> > > > >>
> > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > >>
> > > > >> So could you guys run your performance test again after fixing
> > > > >> the
> > > > > patch?
> > > > >
> > > > > Ming -
> > > > >
> > > > > I tried after change you requested.  Performance drop is still
> > unresolved.
> > > > > From 1.6 M IOPS to 770K IOPS.
> > > 

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-08 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Thursday, February 8, 2018 10:23 PM
> To: Hannes Reinecke
> Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > >> -Original Message-
> > >> From: Ming Lei [mailto:ming@redhat.com]
> > >> Sent: Wednesday, February 7, 2018 5:53 PM
> > >> To: Hannes Reinecke
> > >> Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org;
> > >> Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > >> Easi; Omar
> > > Sandoval;
> > >> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > Peter
> > >> Rivera; Paolo Bonzini; Laurence Oberman
> > >> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > >> introduce force_blk_mq
> > >>
> > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > >>> Hi all,
> > >>>
> > >>> [ .. ]
> > >>>>>
> > >>>>> Could you share us your patch for enabling global_tags/MQ on
> > >>>> megaraid_sas
> > >>>>> so that I can reproduce your test?
> > >>>>>
> > >>>>>> See below perf top data. "bt_iter" is consuming 4 times more
CPU.
> > >>>>>
> > >>>>> Could you share us what the IOPS/CPU utilization effect is after
> > >>>> applying the
> > >>>>> patch V2? And your test script?
> > >>>> Regarding CPU utilization, I need to test one more time.
> > >>>> Currently system is in used.
> > >>>>
> > >>>> I run below fio test on total 24 SSDs expander attached.
> > >>>>
> > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > >>>> --ioengine=libaio --rw=randread
> > >>>>
> > >>>> Performance dropped from 1.6 M IOPs to 770K IOPs.
> > >>>>
> > >>> This is basically what we've seen with earlier iterations.
> > >>
> > >> Hi Hannes,
> > >>
> > >> As I mentioned in another mail[1], Kashyap's patch has a big issue,
> > > which
> > >> causes only reply queue 0 used.
> > >>
> > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > >>
> > >> So could you guys run your performance test again after fixing the
> > > patch?
> > >
> > > Ming -
> > >
> > > I tried after change you requested.  Performance drop is still
unresolved.
> > > From 1.6 M IOPS to 770K IOPS.
> > >
> > > See below data. All 24 reply queue is in used correctly.
> > >
> > > IRQs / 1 second(s)
> > > IRQ#  TOTAL  NODE0   NODE1  NAME
> > >  360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
> > >  364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
> > >  362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
> > >  345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
> > >  341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
> > >  369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
> > >  359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
> > >  358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
> > >  350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
> > >  342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
> > >  344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
> > >  346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
> > >  361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
> > >  348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
> > >  368  15463  0   15463  IR-PCI-MSI 70254661-edge megasas
> > >  354  15420  0   15420  IR-PCI-MSI 70254647-edge megasas
> > >  351  15378  0   15378  IR-PCI-MSI 70254644-edge megasas
> > >  352  15377  0   15377  IR-PCI-

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-07 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Wednesday, February 7, 2018 5:53 PM
> To: Hannes Reinecke
> Cc: Kashyap Desai; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > Hi all,
> >
> > [ .. ]
> > >>
> > >> Could you share us your patch for enabling global_tags/MQ on
> > > megaraid_sas
> > >> so that I can reproduce your test?
> > >>
> > >>> See below perf top data. "bt_iter" is consuming 4 times more CPU.
> > >>
> > >> Could you share us what the IOPS/CPU utilization effect is after
> > > applying the
> > >> patch V2? And your test script?
> > > Regarding CPU utilization, I need to test one more time. Currently
> > > system is in used.
> > >
> > > I run below fio test on total 24 SSDs expander attached.
> > >
> > > numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> > > --ioengine=libaio --rw=randread
> > >
> > > Performance dropped from 1.6 M IOPs to 770K IOPs.
> > >
> > This is basically what we've seen with earlier iterations.
>
> Hi Hannes,
>
> As I mentioned in another mail[1], Kashyap's patch has a big issue,
which
> causes only reply queue 0 used.
>
> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
>
> So could you guys run your performance test again after fixing the
patch?

Ming -

I tried after the change you requested. The performance drop is still
unresolved: from 1.6 M IOPS to 770K IOPS.

See the data below. All 24 reply queues are in use correctly.

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1  NAME
 360  16422  0   16422  IR-PCI-MSI 70254653-edge megasas
 364  15980  0   15980  IR-PCI-MSI 70254657-edge megasas
 362  15979  0   15979  IR-PCI-MSI 70254655-edge megasas
 345  15696  0   15696  IR-PCI-MSI 70254638-edge megasas
 341  15659  0   15659  IR-PCI-MSI 70254634-edge megasas
 369  15656  0   15656  IR-PCI-MSI 70254662-edge megasas
 359  15650  0   15650  IR-PCI-MSI 70254652-edge megasas
 358  15596  0   15596  IR-PCI-MSI 70254651-edge megasas
 350  15574  0   15574  IR-PCI-MSI 70254643-edge megasas
 342  15532  0   15532  IR-PCI-MSI 70254635-edge megasas
 344  15527  0   15527  IR-PCI-MSI 70254637-edge megasas
 346  15485  0   15485  IR-PCI-MSI 70254639-edge megasas
 361  15482  0   15482  IR-PCI-MSI 70254654-edge megasas
 348  15467  0   15467  IR-PCI-MSI 70254641-edge megasas
 368  15463  0   15463  IR-PCI-MSI 70254661-edge megasas
 354  15420  0   15420  IR-PCI-MSI 70254647-edge megasas
 351  15378  0   15378  IR-PCI-MSI 70254644-edge megasas
 352  15377  0   15377  IR-PCI-MSI 70254645-edge megasas
 356  15348  0   15348  IR-PCI-MSI 70254649-edge megasas
 337  15344  0   15344  IR-PCI-MSI 70254630-edge megasas
 343  15320  0   15320  IR-PCI-MSI 70254636-edge megasas
 355  15266  0   15266  IR-PCI-MSI 70254648-edge megasas
 335  15247  0   15247  IR-PCI-MSI 70254628-edge megasas
 363  15233  0   15233  IR-PCI-MSI 70254656-edge megasas


Average:    CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
Average:     18   3.80   0.00  14.78    10.08    0.00   0.00   4.01    0.00    0.00  67.33
Average:     19   3.26   0.00  15.35    10.62    0.00   0.00   4.03    0.00    0.00  66.74
Average:     20   3.42   0.00  14.57    10.67    0.00   0.00   3.84    0.00    0.00  67.50
Average:     21   3.19   0.00  15.60    10.75    0.00   0.00   4.16    0.00    0.00  66.30
Average:     22   3.58   0.00  15.15    10.66    0.00   0.00   3.51    0.00    0.00  67.11
Average:     23   3.34   0.00  15.36    10.63    0.00   0.00   4.17    0.00    0.00  66.50
Average:     24   3.50   0.00  14.58    10.93    0.00   0.00   3.85    0.00    0.00  67.13
Average:     25   3.20   0.00  14.68    10.86    0.00   0.00   4.31    0.00    0.00  66.95
Average:     26   3.27   0.00  14.80    10.70    0.00   0.00   3.68    0.00    0.00  67.55
Average:     27   3.58   0.00  15.36    10.80    0.00   0.00   3.79    0.00    0.00  66.48
Average:     28   3.46   0.00  15.17    10.46    0.00   0.00   3.32

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 6, 2018 6:02 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Tuesday, February 6, 2018 1:35 PM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > Hi Kashyap,
> > >
> > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > > We still have more than one reply queue ending up completion
> > > > > > one
> > CPU.
> > > > >
> > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that
> > > > > means smp_affinity_enable has to be set as 1, but seems it is
> > > > > the default
> > > > setting.
> > > > >
> > > > > Please see kernel/irq/affinity.c, especially
> > > > > irq_calc_affinity_vectors()
> > > > which
> > > > > figures out an optimal number of vectors, and the computation is
> > > > > based
> > > > on
> > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > > mapped to some of reply queues, these queues won't be active(no
> > > > > request submitted
> > > > to
> > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > > sure
> > > > that
> > > > > more than one irq vector won't be handled by one same CPU, and
> > > > > the irq vector spread is done in irq_create_affinity_masks().
> > > > >
> > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver
> > > > > > via module parameter to simulate the issue. We need more
> > > > > > number of Online CPU than reply-queue.
> > > > >
> > > > > IMO, you don't need to simulate the issue,
> > > > > pci_alloc_irq_vectors(
> > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > > returned
> > > > irq
> > > > > vector number, num_possible_cpus()/num_online_cpus() and each
> > > > > irq vector's affinity assignment.
> > > > >
> > > > > > We may see completion redirected to original CPU because of
> > > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep
> > > > > > one CPU busy in local ISR routine.
> > > > >
> > > > > Could you dump each irq vector's affinity assignment of your
> > > > > megaraid in
> > > > your
> > > > > test?
> > > >
> > > > To quickly reproduce, I restricted to single MSI-x vector on
> > > > megaraid_sas driver.  System has total 16 online CPUs.
> > >
> > > I suggest you don't do the restriction of single MSI-x vector, and
> > > just
> > use the
> > > device supported number of msi-x vector.
> >
> > Hi Ming,  CPU lock up is seen even though it is not single msi-x
vector.
> > Actual scenario need some specific topology and server for overnight
test.
> > Issue can be seen on servers which has more than 16 logical CPUs and
> > Thunderbolt series MR controller which supports at max 16 MSIx
vectors.
> > >
> > > >
> > > > Output of affinity hints.
> > > > kernel version:
> > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018
> > > > x86_64
> > > > x86_64
> > > > x86_64 GNU/Linux
> > > > PCI name is 83:00.0, dump its irq affi

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 6, 2018 1:35 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-bl...@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-scsi@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > We still have more than one reply queue ending up completion one
CPU.
> > >
> > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > > smp_affinity_enable has to be set as 1, but seems it is the default
> > setting.
> > >
> > > Please see kernel/irq/affinity.c, especially
> > > irq_calc_affinity_vectors()
> > which
> > > figures out an optimal number of vectors, and the computation is
> > > based
> > on
> > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > mapped to some of reply queues, these queues won't be active(no
> > > request submitted
> > to
> > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > sure
> > that
> > > more than one irq vector won't be handled by one same CPU, and the
> > > irq vector spread is done in irq_create_affinity_masks().
> > >
> > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > > module parameter to simulate the issue. We need more number of
> > > > Online CPU than reply-queue.
> > >
> > > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > returned
> > irq
> > > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > > vector's affinity assignment.
> > >
> > > > We may see completion redirected to original CPU because of
> > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one
> > > > CPU busy in local ISR routine.
> > >
> > > Could you dump each irq vector's affinity assignment of your
> > > megaraid in
> > your
> > > test?
> >
> > To quickly reproduce, I restricted to single MSI-x vector on
> > megaraid_sas driver.  System has total 16 online CPUs.
>
> I suggest you don't do the restriction of single MSI-x vector, and just
use the
> device supported number of msi-x vector.

Hi Ming, CPU lockup is seen even when it is not a single MSI-x vector.
The actual scenario needs a specific topology and server for an overnight test.
The issue can be seen on servers which have more than 16 logical CPUs and a
Thunderbolt-series MR controller, which supports at most 16 MSI-x vectors.
>
> >
> > Output of affinity hints.
> > kernel version:
> > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64
> > x86_64
> > x86_64 GNU/Linux
> > PCI name is 83:00.0, dump its irq affinity:
> > irq 105, cpu list 0-3,8-11
>
> In this case, which CPU is selected for handling the interrupt is
decided by
> interrupt controller, and it is easy to cause CPU overload if interrupt
controller
> always selects one same CPU to handle the irq.
>
> >
> > Affinity mask is created properly, but only CPU-0 is overloaded with
> > interrupt processing.
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 8 9 10 11
> > node 0 size: 47861 MB
> > node 0 free: 46516 MB
> > node 1 cpus: 4 5 6 7 12 13 14 15
> > node 1 size: 64491 MB
> > node 1 free: 62805 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > Output of  system activities (sar).  (gnice is 100% and it is consumed
> > in megaraid_sas ISR routine.)
> >
> >
> > 12:44:40 PM CPU  %usr %nice  %sys   %iowait%steal
> > %irq %soft%guest%gnice %idle
> > 12:44:41 PM all 6.03  0.0029.98  0.00
> > 0.00 0.000.000.000.00 63.99
> > 12:44:41 PM   0 0.00  0.00 0.000.00
> > 0.00 0.000.000.00   100.00 0
> >
> >
> > In my test, I used rq_affinity is set to 2. (QUEUE_FLAG_SAME_FORCE). I
> > also used " host_tagset" V2 patch set for megaraid_s

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-05 Thread Kashyap Desai
> > We still have more than one reply queue ending up completion one CPU.
>
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> smp_affinity_enable has to be set as 1, but seems it is the default
setting.
>
> Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
which
> figures out an optimal number of vectors, and the computation is based
on
> cpumask_weight(cpu_possible_mask) now. If all offline CPUs are mapped to
> some of reply queues, these queues won't be active(no request submitted
to
> these queues). The mechanism of PCI_IRQ_AFFINITY basically makes sure
that
> more than one irq vector won't be handled by one same CPU, and the irq
> vector spread is done in irq_create_affinity_masks().
>
> > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > module parameter to simulate the issue. We need more number of Online
> > CPU than reply-queue.
>
> IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
irq
> vector number, num_possible_cpus()/num_online_cpus() and each irq
> vector's affinity assignment.
>
> > We may see completion redirected to original CPU because of
> > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU
> > busy in local ISR routine.
>
> Could you dump each irq vector's affinity assignment of your megaraid in
your
> test?

To reproduce quickly, I restricted the megaraid_sas driver to a single MSI-x
vector. The system has a total of 16 online CPUs.

Output of affinity hints.
kernel version:
Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
x86_64 GNU/Linux
PCI name is 83:00.0, dump its irq affinity:
irq 105, cpu list 0-3,8-11

Affinity mask is created properly, but only CPU-0 is overloaded with
interrupt processing.

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 47861 MB
node 0 free: 46516 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 64491 MB
node 1 free: 62805 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

Output of system activity (sar). (%gnice is 100% and it is consumed in the
megaraid_sas ISR routine.)


12:44:40 PM  CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
12:44:41 PM  all   6.03   0.00  29.98     0.00    0.00   0.00   0.00    0.00    0.00  63.99
12:44:41 PM    0   0.00   0.00   0.00     0.00    0.00   0.00   0.00    0.00  100.00   0.00


In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I also used the
"host_tagset" V2 patch set for megaraid_sas.

Using the RFC posted at
https://marc.info/?l=linux-scsi&m=151601833418346&w=2, the lockup is avoided
(you can notice that %gnice has shifted to softirq). Even though the CPU is
100% consumed, there is always an exit from the completion loop due to the
irqpoll_weight passed to irq_poll_init().

Average:     CPU   %usr  %nice   %sys  %iowait  %steal   %irq  %soft  %guest  %gnice  %idle
Average:     all   4.25   0.00  21.61     0.00    0.00   0.00   6.61    0.00    0.00  67.54
Average:       0   0.00   0.00   0.00     0.00    0.00   0.00 100.00    0.00    0.00   0.00


Hope this clarifies. We need a different fix to avoid the lockups. Can we
consider using the irq poll interface if the CPU count is more than the
reply queue/MSI-x count?
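
Below is a minimal sketch of the irq poll pattern I have in mind, written
against the generic include/linux/irq_poll.h API with hypothetical names
(my_reply_queue, my_process_replies), not the actual megaraid_sas/mpt3sas
code. The hard-IRQ handler only schedules the poll object, and the poll
callback processes at most 'budget' completions per softirq pass, so the
completion loop always exits and the watchdog task can run:

#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/cpumask.h>

#define MY_IRQ_POLL_WEIGHT 64	/* max completions per softirq pass */

/* Hypothetical per-reply-queue context; not the real driver structure. */
struct my_reply_queue {
	struct irq_poll iop;
	bool use_irq_poll;
	void *hw_ring;		/* reply descriptor ring, HBA specific */
};

/* Assumed helper: drains up to 'budget' reply descriptors and returns
 * how many were completed. The real work stays in the LLD. */
int my_process_replies(struct my_reply_queue *rq, int budget);

static int my_irqpoll_handler(struct irq_poll *iop, int budget)
{
	struct my_reply_queue *rq = container_of(iop, struct my_reply_queue, iop);
	int done = my_process_replies(rq, budget);

	/* Fewer completions than the budget: the ring is drained, go back
	 * to interrupt-driven mode for this reply queue. */
	if (done < budget)
		irq_poll_complete(iop);
	return done;
}

static irqreturn_t my_isr(int irq, void *data)
{
	struct my_reply_queue *rq = data;

	if (rq->use_irq_poll) {
		/* Defer to softirq so one CPU cannot be held in the ISR. */
		irq_poll_sched(&rq->iop);
		return IRQ_HANDLED;
	}
	/* 1:1 CPU to reply queue mapping: drain inline as today. */
	return my_process_replies(rq, INT_MAX) ? IRQ_HANDLED : IRQ_NONE;
}

/* Arm polling only when online CPUs outnumber reply queues, i.e. when one
 * MSI-x vector has to serve several submitting CPUs. */
static void my_setup_reply_queue(struct my_reply_queue *rq, int nr_reply_queues)
{
	rq->use_irq_poll = num_online_cpus() > nr_reply_queues;
	if (rq->use_irq_poll)
		irq_poll_init(&rq->iop, MY_IRQ_POLL_WEIGHT, my_irqpoll_handler);
}

With this shape the 1:1 case keeps the existing interrupt-driven path and
only the oversubscribed case pays the softirq detour.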

>
> And the following script can do it easily, and the pci path (the 1st
column of
> lspci output) need to be passed, such as: 00:1c.4,
>
> #!/bin/sh
> if [ $# -ge 1 ]; then
> PCID=$1
> else
> PCID=`lspci | grep "Non-Volatile memory" | cut -c1-7` fi PCIP=`find
> /sys/devices -name *$PCID | grep pci` IRQS=`ls $PCIP/msi_irqs`
>
> echo "kernel version: "
> uname -a
>
> echo "PCI name is $PCID, dump its irq affinity:"
> for IRQ in $IRQS; do
> CPUS=`cat /proc/irq/$IRQ/smp_affinity_list`
> echo "\tirq $IRQ, cpu list $CPUS"
> done
>
>
> Thanks,
> Ming


RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-04 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Monday, February 5, 2018 12:28 PM
> To: Ming Lei; Jens Axboe; linux-bl...@vger.kernel.org; Christoph Hellwig;
> Mike Snitzer
> Cc: linux-scsi@vger.kernel.org; Arun Easi; Omar Sandoval; Martin K .
> Petersen;
> James Bottomley; Christoph Hellwig; Don Brace; Kashyap Desai; Peter
> Rivera;
> Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> On 02/03/2018 05:21 AM, Ming Lei wrote:
> > Hi All,
> >
> > This patchset supports global tags which was started by Hannes
> > originally:
> >
> > https://marc.info/?l=linux-block&m=149132580511346&w=2
> >
> > Also inroduce 'force_blk_mq' to 'struct scsi_host_template', so that
> > driver can avoid to support two IO paths(legacy and blk-mq),
> > especially recent discusion mentioned that SCSI_MQ will be enabled at
> default soon.
> >
> > https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> >
> > With the above two changes, it should be easier to convert SCSI drivers'
> > reply queue into blk-mq's hctx, then the automatic irq affinity issue
> > can be solved easily, please see detailed descrption in commit log.
> >
> > Also drivers may require to complete request on the submission CPU for
> > avoiding hard/soft deadlock, which can be done easily with blk_mq too.
> >
> > https://marc.info/?t=15160185141&r=1&w=2
> >
> > The final patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> > so that IO hang issue can be avoided inside legacy IO path, this issue
> > is a bit generic, at least HPSA/virtio-scsi are found broken with
> > v4.15+.
> >
> > Thanks
> > Ming
> >
> > Ming Lei (5):
> >   blk-mq: tags: define several fields of tags as pointer
> >   blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
> >   block: null_blk: introduce module parameter of 'g_global_tags'
> >   scsi: introduce force_blk_mq
> >   scsi: virtio_scsi: fix IO hang by irq vector automatic affinity
> >
> >  block/bfq-iosched.c|  4 +--
> >  block/blk-mq-debugfs.c | 11 
> >  block/blk-mq-sched.c   |  2 +-
> >  block/blk-mq-tag.c | 67
> > +-
> >  block/blk-mq-tag.h | 15 ---
> >  block/blk-mq.c | 31 -
> >  block/blk-mq.h |  3 ++-
> >  block/kyber-iosched.c  |  2 +-
> >  drivers/block/null_blk.c   |  6 +
> >  drivers/scsi/hosts.c   |  1 +
> >  drivers/scsi/virtio_scsi.c | 59
> > +++-
> >  include/linux/blk-mq.h |  2 ++
> >  include/scsi/scsi_host.h   |  3 +++
> >  13 files changed, 105 insertions(+), 101 deletions(-)
> >
> Thanks Ming for picking this up.
>
> I'll give it a shot and see how it behaves on other hardware.

Ming -

There is no way we can enable global tags from the SCSI stack in this patch
series. I still think this patch series has no solution for the issue
described below:
https://marc.info/?t=15160185141&r=1&w=2

What we will be doing is just using a global tag pool HBA-wide instead of per
hardware queue. We still have more than one reply queue ending up completing
on one CPU.
Try reducing the MSI-x vectors of the megaraid_sas or mpt3sas driver via a
module parameter to simulate the issue; we need more online CPUs than reply
queues.
We may see completion redirected to the original CPU because of
"QUEUE_FLAG_SAME_FORCE", but the ISR of the low-level driver can keep one CPU
busy in the local ISR routine.


Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG Nürnberg)


RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-02-02 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Friday, February 2, 2018 3:44 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load
balancing of
> reply queue
>
> Hi Kashyap,
>
> On Mon, Jan 15, 2018 at 05:42:05PM +0530, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen cpu lock up issue from fields if system has greater (more
> > than 96) logical cpu count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
>
> As I mentioned in another thread, this issue may be solved by SCSI_MQ
via
> mapping reply queue into hctx of blk_mq, together with
> QUEUE_FLAG_SAME_FORCE, especially you have set 'smp_affinity_enable' as
> 1 at default already, then pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) can
do IRQ
> vectors spread on CPUs perfectly for you.
>
> But the following Hannes's patch is required for the conversion.
>
>   https://marc.info/?l=linux-block&m=149130770004507&w=2
>

Hi Ming -

I went through the thread discussing "support host-wide tagset". The link
below has the latest reply on that thread.
https://marc.info/?l=linux-block&m=149132580511346&w=2

I think there is some confusion over the mpt3sas and megaraid_sas h/w
behavior. Broadcom/LSI HBA and MR hardware have only one h/w queue for
submission, but there are multiple reply queues.
Even if I include Hannes' patch for a host-wide tagset, the problem described
in this RFC will not be resolved. In fact, a tagset can also produce the same
results if the completion queue count is less than the online CPU count.
Don't you think? Or am I missing something?

We don't have a problem in the submission path. The current problem is that
one MSI-x vector mapped to more than one CPU can cause an I/O loop. This is
visible if we have a higher number of online CPUs.

> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
>
> Now you have used pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) to get msix
> vector number, the irq affinity can't be changed by userspace any more.
>
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > bet

RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-02-02 Thread Kashyap Desai
> > > > Today I am looking at one megaraid_sas related issue, and found
> > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) is used in the driver, so
> > > > looks each reply queue has been handled by more than one CPU if
> > > > there are more CPUs than MSIx vectors in the system, which is done
> > > > by generic irq affinity code, please see kernel/irq/affinity.c.
> >
> > Yes. That is a problematic area. If CPU and MSI-x(reply queue) is 1:1
> > mapped, we don't have any issue.
>
> I guess the problematic area is similar with the following link:
>
>   https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Hi Ming,

The above-mentioned link is a different discussion and looks like a generic
issue. megaraid_sas/mpt3sas will show the same symptoms if an irq affinity
mask contains only offline CPUs.
Just for info - in such a condition, we can ask users to disable affinity
hints via the module parameter "smp_affinity_enable".

>
> otherwise could you explain a bit about the area?

Please check below post for more details.

https://marc.info/?l=linux-scsi&m=151601833418346&w=2


RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-02-01 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Thursday, February 1, 2018 9:50 PM
> To: Ming Lei
> Cc: lsf...@lists.linux-foundation.org; linux-scsi@vger.kernel.org; linux-
> n...@lists.infradead.org; Kashyap Desai
> Subject: Re: [LSF/MM TOPIC] irq affinity handling for high CPU count
> machines
>
> On 02/01/2018 04:05 PM, Ming Lei wrote:
> > Hello Hannes,
> >
> > On Mon, Jan 29, 2018 at 10:08:43AM +0100, Hannes Reinecke wrote:
> >> Hi all,
> >>
> >> here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
> >> mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').
> >>
> >> When doing I/O tests on a machine with more CPUs than MSIx vectors
> >> provided by the HBA we can easily setup a scenario where one CPU is
> >> submitting I/O and the other one is completing I/O. Which will result
> >> in the latter CPU being stuck in the interrupt completion routine for
> >> basically ever, resulting in the lockup detector kicking in.
> >
> > Today I am looking at one megaraid_sas related issue, and found
> > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) is used in the driver, so
> > looks each reply queue has been handled by more than one CPU if there
> > are more CPUs than MSIx vectors in the system, which is done by
> > generic irq affinity code, please see kernel/irq/affinity.c.

Yes, that is a problematic area. If CPUs and MSI-x vectors (reply queues) are
mapped 1:1, we don't have any issue.

> >
> > Also IMO each reply queue may be treated as blk-mq's hw queue, then
> > megaraid may benefit from blk-mq's MQ framework, but one annoying
> > thing is that both legacy and blk-mq path need to be handled inside
> > driver.

Both the MR and IT drivers are (due to the h/w design) using the blk-mq
framework, but it is really a single h/w queue: IT and MR HBAs have a single
submission queue and multiple reply queues.

> >
> The megaraid driver is a really strange beast;, having layered two
> different
> interfaces (the 'legacy' MFI interface and that from from
> mpt3sas) on top of each other.
> I had been thinking of converting it to scsi-mq, too (as my mpt3sas patch
> finally went in), but I'm not sure if we can benefit from it as we're
> still be
> bound by the HBA-wide tag pool.
> It's on my todo list, albeit pretty far down :-)

Hannes, this is essentially the same in both MR (megaraid_sas) and IT
(mpt3sas). Both drivers use a shared HBA-wide tag pool, and both use
request->tag to get a command from the free pool.
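
To illustrate that last point, here is a small sketch with hypothetical names
(my_hba, my_cmd), not the actual megaraid_sas/mpt3sas lookup code, written
against the v4.15-era scmd->request field:

#include <linux/types.h>
#include <linux/blkdev.h>
#include <scsi/scsi_cmnd.h>

/* Hypothetical internal command, standing in for the driver's real one. */
struct my_cmd {
	u16 index;
};

/* Hypothetical host structure with one HBA-wide command pool. */
struct my_hba {
	struct my_cmd *cmd_pool;	/* can_queue entries, indexed by tag */
};

/*
 * With a single hardware queue (or a host-wide tag set), the blk-mq tag
 * attached to scmd->request is unique across the whole host, so it can
 * index the shared pool directly; no per-reply-queue free list is needed.
 */
static struct my_cmd *my_get_cmd_from_tag(struct my_hba *hba,
					   struct scsi_cmnd *scmd)
{
	int tag = scmd->request->tag;	/* 0 .. can_queue - 1 */

	return &hba->cmd_pool[tag];
}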

>
> >>
> >> How should these situations be handled?
> >> Should it be made the responsibility of the drivers, ensuring that
> >> the interrupt completion routine is terminated after a certain time?
> >> Should it be made the resposibility of the upper layers?
> >> Should it be the responsibility of the interrupt mapping code?
> >> Can/should interrupt polling be used in these situations?
> >
> > Yeah, I guess interrupt polling may improve these situations,
> > especially KPTI introduces some extra cost in interrupt handling.
> >
> The question is not so much if one should be doing irq polling, but rather
> if we
> can come up with some guidance or even infrastructure to make this happen
> automatically.
> Having to rely on individual drivers to get this right is probably not the
> best
> option.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG Nürnberg)


RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-29 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Monday, January 29, 2018 2:29 PM
> To: Kashyap Desai; linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing
> of
> reply queue
>
> On 01/15/2018 01:12 PM, Kashyap Desai wrote:
> > Hi All -
> >
> > We have seen cpu lock up issue from fields if system has greater (more
> > than 96) logical cpu count.
> > SAS3.0 controller (Invader series) supports at max 96 msix vector and
> > SAS3.5 product (Ventura) supports at max 128 msix vectors.
> >
> > This may be a generic issue (if PCI device support  completion on
> > multiple reply queues). Let me explain it w.r.t to mpt3sas supported
> > h/w just to simplify the problem and possible changes to handle such
> > issues. IT HBA
> > (mpt3sas) supports multiple reply queues in completion path. Driver
> > creates MSI-x vectors for controller as "min of ( FW supported Reply
> > queue, Logical CPUs)". If submitter is not interrupted via completion
> > on same CPU, there is a loop in the IO path. This behavior can cause
> > hard/soft CPU lockups, IO timeout, system sluggish etc.
> >
> > Example - one CPU (e.g. CPU A) is busy submitting the IOs and another
> > CPU (e.g. CPU B) is busy with processing the corresponding IO's reply
> > descriptors from reply descriptor queue upon receiving the interrupts
> > from HBA. If the CPU A is continuously pumping the IOs then always CPU
> > B (which is executing the ISR) will see the valid reply descriptors in
> > the reply descriptor queue and it will be continuously processing
> > those reply descriptor in a loop without quitting the ISR handler.
> > Mpt3sas driver will exit ISR handler if it finds unused reply
> > descriptor in the reply descriptor queue. Since CPU A will be
> > continuously sending the IOs, CPU B may always see a valid reply
> > descriptor (posted by HBA Firmware after processing the IO) in the
> > reply descriptor queue. In worst case, driver will not quit from this
> > loop in the ISR handler. Eventually, CPU lockup will be detected by
> watchdog.
> >
> > Above mentioned behavior is not common if "rq_affinity" set to 2 or
> > affinity_hint is honored by irqbalance as "exact".
> > If rq_affinity is set to 2, submitter will be always interrupted via
> > completion on same CPU.
> > If irqbalance is using "exact" policy, interrupt will be delivered to
> > submitter CPU.
> >
> > Problem statement -
> > If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio
> > is not 1:1, we still have  exposure of issue explained above and for
> > that we don't have any solution.
> >
> > Exposure of soft/hard lockup if CPU count is more than MSI-x supported
> > by device.
> >
> > If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if
> > CPU counts to MSI-x vector count ratio is something like X:1, where X
> > > 1) then 'exact' irqbalance policy OR rq_affinity = 2 won't help to
> > avoid CPU hard/soft lockups. There won't be any one to one mapping
> > between CPU to MSI-x vector instead one MSI-x interrupt (or reply
> > descriptor queue) is shared with group/set of CPUs and there is a
> > possibility of having a loop in the IO path within that CPU group and
> > may
> observe lockups.
> >
> > For example: Consider a system having two NUMA nodes and each node
> > having four logical CPUs and also consider that number of MSI-x
> > vectors enabled on the HBA is two, then CPUs count to MSI-x vector count
> ratio as 4:1.
> > e.g.
> > MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node
> > 0 and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
> > NUMA node 1.
> >
> > numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3-->
> > MSI-x 0
> > node 0 size: 65536 MB
> > node 0 free: 63176 MB
> > node 1 cpus: 4 5 6 7
> > -->MSI-x 1
> > node 1 size: 65536 MB
> > node 1 free: 63176 MB
> >
> > Assume that user started an application which uses all the CPUs of
> > NUMA node 0 for issuing the IOs.
> > Only one CPU from affinity list (it can be any cpu since this behavior
> > depends upon irqbalance) CPU0 will receive the interrupts from MSIx
> > vector
> > 0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
> > decreasing and ISR processing percentage will be increasi

RE: [LSF/MM TOPIC] irq affinity handling for high CPU count machines

2018-01-29 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@wdc.com]
> Sent: Monday, January 29, 2018 10:08 PM
> To: Elliott, Robert (Persistent Memory); Hannes Reinecke;
> lsf-pc@lists.linux-
> foundation.org
> Cc: linux-scsi@vger.kernel.org; linux-n...@lists.infradead.org; Kashyap
> Desai
> Subject: Re: [LSF/MM TOPIC] irq affinity handling for high CPU count
> machines
>
> On 01/29/18 07:41, Elliott, Robert (Persistent Memory) wrote:
> >> -Original Message-
> >> From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On
> >> Behalf Of Hannes Reinecke
> >> Sent: Monday, January 29, 2018 3:09 AM
> >> To: lsf...@lists.linux-foundation.org
> >> Cc: linux-n...@lists.infradead.org; linux-scsi@vger.kernel.org;
> >> Kashyap Desai 
> >> Subject: [LSF/MM TOPIC] irq affinity handling for high CPU count
> >> machines
> >>
> >> Hi all,
> >>
> >> here's a topic which came up on the SCSI ML (cf thread '[RFC 0/2]
> >> mpt3sas/megaraid_sas: irq poll and load balancing of reply queue').
> >>
> >> When doing I/O tests on a machine with more CPUs than MSIx vectors
> >> provided by the HBA we can easily setup a scenario where one CPU is
> >> submitting I/O and the other one is completing I/O. Which will result
> >> in the latter CPU being stuck in the interrupt completion routine for
> >> basically ever, resulting in the lockup detector kicking in.
> >>
> >> How should these situations be handled?
> >> Should it be made the responsibility of the drivers, ensuring that
> >> the interrupt completion routine is terminated after a certain time?
> >> Should it be made the responsibility of the upper layers?
> >> Should it be the responsibility of the interrupt mapping code?
> >> Can/should interrupt polling be used in these situations?
> >
> > Back when we introduced scsi-mq with hpsa, the best approach was to
> > route interrupts and completion handling so each CPU core handles its
> > own submissions; this way, they are self-throttling.


The ideal scenario is to make sure the submitter is interrupted for the
completion. That is not possible to achieve through tuning such as
rq_affinity=2 (and the 'exact' irqbalance policy) if we have more CPUs than
MSI-x vectors supported by the controllers. If we use the irq poll interface
with a reasonable weight in the irq poll API, we will no longer see CPU
lockups, because the low-level driver will quit the ISR routine after each
weighted batch of completions. There will always be a chance of back-to-back
completion pressure on the same CPU, but the irq poll design lets the
watchdog task run and its timestamp be updated. Using irq poll we may see
close to 100% CPU consumption, but the lockup detector will not trigger.

>
> That approach may work for the hpsa adapter but I'm not sure whether it
> works for all adapter types. It has already been observed with the SRP
> initiator
> driver running inside a VM that a single core spent all its time
> processing IB
> interrupts.
>
> Additionally, only initiator workloads are self-throttling. Target style
> workloads are not self-throttling.
>
> In other words, I think it's worth to discuss this topic further.
>
> Bart.
>


RE: [RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-22 Thread Kashyap Desai
>
> In Summary,
> CPU completing IO which is not contributing to IO submission, may cause
cpu
> lockup.
> If CPUs count to MSI-X vector count ratio is X:1 (where X > 1) then
using irq poll
> interface, we can avoid the CPU lockups and by equally distributing the
> interrupts among the enabled MSI-x interrupts we can avoid performance
> issues.
>
> We are planning to use both the fixes only if cpu count is more than FW
> supported MSI-x vector.
> Please review and provide your feedback. I have appended both the
patches.

Hi -
Assuming the method explained here is in line with the Linux SCSI subsystem
and there is no better way to fix such an issue, I am planning to put the
same solution through internal testing, and the maintainers of the respective
drivers (mpt3sas and megaraid_sas) will post the final patches upstream based
on the results.
As of now, the PoC results look promising with the above-mentioned solution
and no CPU lockup was observed.

>
> Thanks, Kashyap
>
>


RE: [RFC 1/2] mpt3sas/megaraid_sas : irq poll to avoid CPU hard and soft lockups

2018-01-15 Thread Kashyap Desai
> -Original Message-
> From: Johannes Thumshirn [mailto:jthumsh...@suse.de]
> Sent: Monday, January 15, 2018 5:49 PM
> To: Kashyap Desai
> Cc: linux-scsi@vger.kernel.org; Peter Rivera
> Subject: Re: [RFC 1/2] mpt3sas/megaraid_sas : irq poll to avoid CPU hard
and
> soft lockups
>
> On Mon, Jan 15, 2018 at 05:42:35PM +0530, Kashyap Desai wrote:
> > Patch for Fix-1 explained in PATCH 0.
>
> Ahm, PATCH 0 a.k.a the cover letter doesn't get merged so the git
history
> won't have an explanation at all. Please write a proper commit message.

For now I am looking for input from the Linux community on the approach used
in my solution. I will convert the RFC into a PATCH and attach a proper
commit message to each patch. Hope that is OK.

The current RFC includes changes for mpt3sas only; the same logic will be
followed for megaraid_sas as well. Just for reference I have posted the
mpt3sas changes and not included megaraid_sas.


>
> Thanks,
>   Johannes
>
> --
> Johannes Thumshirn  Storage
> jthumsh...@suse.de+49 911 74053 689
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG
> Nürnberg) Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76
> 0850


RE: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-15 Thread Kashyap Desai
This patch is not yet included because of the ongoing discussion.

Chris H, Martin et al. - how are we moving forward with this patch?

Thanks, Kashyap

> -Original Message-
> From: Sathya Prakash Veerichetty [mailto:sathya.prak...@broadcom.com]
> Sent: Thursday, January 11, 2018 11:37 PM
> To: Keith Busch
> Cc: dgilb...@interlog.com; Bart Van Assche; h...@infradead.org; Kashyap
> Desai; Shivasharan Srikanteshwara; Sumit Saxena; linux-
> n...@lists.infradead.org; Peter Rivera; linux-scsi@vger.kernel.org
> Subject: RE: [PATCH 13/14] megaraid_sas: NVME passthru command support
>
> >>So even when used as a RAID member, there will be a device handle at
> /dev/sdX for each NVMe device the megaraid controller manages?
> In megaraid controller, you can expose bare NVMe drives and RAID volumes
> created out of NVMe drives, when the RAID volume is created underlying
> member drives will not have /dev/sdX entries associated with them, however
> for bare NVMe drives there will be associated /dev/sdX entries.


[RFC 2/2] mpt3sas/megaraid_sas : reply queue load balancing

2018-01-15 Thread Kashyap Desai
Patch for Fix-2 explained in PATCH 0.

Signed-off-by: Kashyap Desai < kashyap.de...@broadcom.com>
---
 mpt3sas/mpt3sas_base.c | 5 -
 mpt3sas/mpt3sas_base.h | 1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/mpt3sas/mpt3sas_base.c b/mpt3sas/mpt3sas_base.c index
0b351d4..20bf2ad 100644
--- a/mpt3sas/mpt3sas_base.c
+++ b/mpt3sas/mpt3sas_base.c
@@ -2811,7 +2811,8 @@ mpt3sas_base_get_reply_virt_addr(struct
MPT3SAS_ADAPTER *ioc, u32 phys_addr)  static inline u8
_base_get_msix_index(struct MPT3SAS_ADAPTER *ioc)  {
-   return ioc->cpu_msix_table[raw_smp_processor_id()];
+   return ioc->reply_queue_count ? (atomic64_add_return(1,
+   &ioc->total_io_cnt) % ioc->reply_queue_count) : 0;
 }

 /**
@@ -5775,6 +5776,8 @@ _base_make_ioc_operational(struct MPT3SAS_ADAPTER
*ioc)
dinitprintk(ioc, pr_info(MPT3SAS_FMT "%s\n", ioc->name,
__func__));

+   atomic64_set(&ioc->total_io_cnt, 0);
+
/* clean the delayed target reset list */
list_for_each_entry_safe(delayed_tr, delayed_tr_next,
&ioc->delayed_tr_list, list) {
diff --git a/mpt3sas/mpt3sas_base.h b/mpt3sas/mpt3sas_base.h index
456d928..1c39107 100644
--- a/mpt3sas/mpt3sas_base.h
+++ b/mpt3sas/mpt3sas_base.h
@@ -1357,6 +1357,7 @@ struct MPT3SAS_ADAPTER {
u8  is_gen35_ioc;
u8  atomic_desc_capable;
u32 irqpoll_weight;
+   atomic64_t  total_io_cnt;
PUT_SMID_IO_FP_HIP put_smid_scsi_io;
PUT_SMID_IO_FP_HIP put_smid_fast_path;
PUT_SMID_IO_FP_HIP put_smid_hi_priority;
--
2.5.5


[RFC 1/2] mpt3sas/megaraid_sas : irq poll to avoid CPU hard and soft lockups

2018-01-15 Thread Kashyap Desai
Patch for Fix-1 explained in PATCH 0.

Signed-off-by: Kashyap Desai < kashyap.de...@broadcom.com>
---
 mpt3sas/mpt3sas_base.c | 67
++
 mpt3sas/mpt3sas_base.h |  4 +++
 2 files changed, 55 insertions(+), 17 deletions(-)

diff --git a/mpt3sas/mpt3sas_base.c b/mpt3sas/mpt3sas_base.c index
08237b8..0b351d4 100644
--- a/mpt3sas/mpt3sas_base.c
+++ b/mpt3sas/mpt3sas_base.c
@@ -963,17 +963,15 @@ union reply_descriptor {  };

 /**
- * _base_interrupt - MPT adapter (IOC) specific interrupt handler.
- * @irq: irq number (not used)
- * @bus_id: bus identifier cookie == pointer to MPT_ADAPTER structure
- * @r: pt_regs pointer (not used)
+ * mpt3sas_process_reply_queue - Process the RDs from reply descriptor
+ queue
+ * @ reply_q- reply queue
+ * @ bugget- command completion budget
  *
- * Return IRQ_HANDLE if processed, else IRQ_NONE.
+ * Returns number of RDs processed.
  */
-static irqreturn_t
-_base_interrupt(int irq, void *bus_id)
+int
+mpt3sas_process_reply_queue(struct adapter_reply_queue *reply_q, u32
+budget)
 {
-   struct adapter_reply_queue *reply_q = bus_id;
union reply_descriptor rd;
u32 completed_cmds;
u8 request_desript_type;
@@ -985,18 +983,15 @@ _base_interrupt(int irq, void *bus_id)
Mpi2ReplyDescriptorsUnion_t *rpf;
u8 rc;

-   if (ioc->mask_interrupts)
-   return IRQ_NONE;
-
if (!atomic_add_unless(&reply_q->busy, 1, 1))
-   return IRQ_NONE;
+   return 0;

rpf = &reply_q->reply_post_free[reply_q->reply_post_host_index];
request_desript_type = rpf->Default.ReplyFlags
 & MPI2_RPY_DESCRIPT_FLAGS_TYPE_MASK;
if (request_desript_type == MPI2_RPY_DESCRIPT_FLAGS_UNUSED) {
atomic_dec(&reply_q->busy);
-   return IRQ_NONE;
+   return 0;
}

completed_cmds = 0;
@@ -1072,7 +1067,7 @@ _base_interrupt(int irq, void *bus_id)
 * So that FW can find enough entries to post the Reply
 * Descriptors in the reply descriptor post queue.
 */
-   if (completed_cmds > ioc->hba_queue_depth/3) {
+   if (completed_cmds == budget) {
if (ioc->combined_reply_queue) {
writel(reply_q->reply_post_host_index |
((msix_index  & 7) <<
@@ -1084,6 +1079,8 @@ _base_interrupt(int irq, void *bus_id)

MPI2_RPHI_MSIX_INDEX_SHIFT),

&ioc->chip->ReplyPostHostIndex);
}
+   if (ioc->irqpoll_weight)
+   break;
completed_cmds = 1;
}
if (request_desript_type ==
MPI2_RPY_DESCRIPT_FLAGS_UNUSED) @@ -1098,14 +1095,14 @@
_base_interrupt(int irq, void *bus_id)

if (!completed_cmds) {
atomic_dec(&reply_q->busy);
-   return IRQ_NONE;
+   return 0;
}

if (ioc->is_warpdrive) {
writel(reply_q->reply_post_host_index,
ioc->reply_post_host_index[msix_index]);
atomic_dec(&reply_q->busy);
-   return IRQ_HANDLED;
+   return completed_cmds;
}

/* Update Reply Post Host Index.
@@ -1132,6 +1129,27 @@ _base_interrupt(int irq, void *bus_id)
MPI2_RPHI_MSIX_INDEX_SHIFT),
&ioc->chip->ReplyPostHostIndex);
atomic_dec(&reply_q->busy);
+   return completed_cmds;
+}
+
+/**
+ * _base_interrupt - MPT adapter (IOC) specific interrupt handler.
+ * @irq: irq number (not used)
+ * @bus_id: bus identifier cookie == pointer to MPT_ADAPTER structure
+ * @r: pt_regs pointer (not used)
+ *
+ * Return IRQ_HANDLE if processed, else IRQ_NONE.
+ */
+static irqreturn_t
+_base_interrupt(int irq, void *bus_id)
+{
+   struct adapter_reply_queue *reply_q = bus_id;
+   struct MPT3SAS_ADAPTER *ioc = reply_q->ioc;
+
+   if (ioc->mask_interrupts)
+   return IRQ_NONE;
+
+   irq_poll_sched(&reply_q->irqpoll);
return IRQ_HANDLED;
 }

@@ -2285,6 +2303,20 @@ _base_check_enable_msix(struct MPT3SAS_ADAPTER
*ioc)
return 0;
 }

+int mpt3sas_irqpoll(struct irq_poll *irqpoll, int budget) {
+   struct adapter_reply_queue *reply_q;
+   int num_entries = 0;
+
+   reply_q = container_of(irqpoll, struct adapter_reply_queue,
irqpoll);
+
+   num_entries = mpt3sas_process_reply_queue(reply_q, budget);
+   if (num_entries < budget)
+   irq_poll_complete(irqpoll);
+
+   return num_entries;
+}
+
 /**
  * _base_free_irq - free irq
  * @ioc: per adapter object
@@ -2301,6 +2333,7 @@ _base_free_irq(struct MPT3SAS_ADAPTER *ioc)

list_for_each_entry_safe(reply_q,

[RFC 0/2] mpt3sas/megaraid_sas : irq poll and load balancing of reply queue

2018-01-15 Thread Kashyap Desai
Hi All -

We have seen CPU lockup issues in the field when the system has a large
(more than 96) logical CPU count.
The SAS3.0 controller (Invader series) supports at most 96 MSI-x vectors and
the SAS3.5 product (Ventura) supports at most 128 MSI-x vectors.

This may be a generic issue (if PCI device support  completion on multiple
reply queues). Let me explain it w.r.t to mpt3sas supported h/w just to
simplify the problem and possible changes to handle such issues. IT HBA
(mpt3sas) supports multiple reply queues in completion path. Driver
creates MSI-x vectors for controller as "min of ( FW supported Reply
queue, Logical CPUs)". If submitter is not interrupted via completion on
same CPU, there is a loop in the IO path. This behavior can cause
hard/soft CPU lockups, IO timeout, system sluggish etc.

Example - one CPU (e.g. CPU A) is busy submitting the IOs and another CPU
(e.g. CPU B) is busy with processing the corresponding IO's reply
descriptors from reply descriptor queue upon receiving the interrupts from
HBA. If the CPU A is continuously pumping the IOs then always CPU B (which
is executing the ISR) will see the valid reply descriptors in the reply
descriptor queue and it will be continuously processing those reply
descriptor in a loop without quitting the ISR handler.  Mpt3sas driver
will exit ISR handler if it finds unused reply descriptor in the reply
descriptor queue. Since CPU A will be continuously sending the IOs, CPU B
may always see a valid reply descriptor (posted by HBA Firmware after
processing the IO) in the reply descriptor queue. In worst case, driver
will not quit from this loop in the ISR handler. Eventually, CPU lockup
will be detected by watchdog.

Above mentioned behavior is not common if "rq_affinity" set to 2 or
affinity_hint is honored by irqbalance as "exact".
If rq_affinity is set to 2, submitter will be always interrupted via
completion on same CPU.
If irqbalance is using "exact" policy, interrupt will be delivered to
submitter CPU.

Problem statement -
If CPU counts to MSI-X vectors (reply descriptor Queues) count ratio is
not 1:1, we still have  exposure of issue explained above and for that we
don't have any solution.

Exposure of soft/hard lockup if CPU count is more than MSI-x supported by
device.

If CPUs count to MSI-x vectors count ratio is not 1:1, (Other way, if CPU
counts to MSI-x vector count ratio is something like X:1, where X > 1)
then 'exact' irqbalance policy OR rq_affinity = 2 won't help to avoid CPU
hard/soft lockups. There won't be any one to one mapping between CPU to
MSI-x vector instead one MSI-x interrupt (or reply descriptor queue) is
shared with group/set of CPUs and there is a possibility of having a loop
in the IO path within that CPU group and may observe lockups.

For example: Consider a system having two NUMA nodes and each node having
four logical CPUs and also consider that number of MSI-x vectors enabled
on the HBA is two, then CPUs count to MSI-x vector count ratio as 4:1.
e.g.
MSIx vector 0 is affinity to  CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA node 0
and MSI-x vector 1 is affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of NUMA node
1.

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3-->
MSI-x 0
node 0 size: 65536 MB
node 0 free: 63176 MB
node 1 cpus: 4 5 6 7
-->MSI-x 1
node 1 size: 65536 MB
node 1 free: 63176 MB

Assume that user started an application which uses all the CPUs of NUMA
node 0 for issuing the IOs.
Only one CPU from affinity list (it can be any cpu since this behavior
depends upon irqbalance) CPU0 will receive the interrupts from MSIx vector
0 for all the IOs. Eventually, CPU 0 IO submission percentage will be
decreasing and ISR processing percentage will be increasing as it is more
busy with processing the interrupts. Gradually IO submission percentage on
CPU 0 will be zero and it's ISR processing percentage will be 100
percentage as IO loop has already formed within the NUMA node 0, i.e. CPU
1, CPU 2 & CPU 3 will be continuously busy with submitting the heavy IOs
and only CPU 0 is busy in the ISR path as it always find the valid reply
descriptor in the reply descriptor queue. Eventually, we will observe the
hard lockup here.

Chances of occurring of hard/soft lockups are directly proportional to
value of X. If value of X is high, then chances of observing CPU lockups
is high.

Solution -
Fix - 1 Use IRQ poll interface defined in " irq_poll.c". mpt3sas driver
will execute ISR routine in Softirq context and it will always quit the
loop based on budget provided in IRQ poll interface.

In these scenarios (i.e. where the CPU count to MSI-x vector count ratio is
X:1, with X > 1), the IRQ poll interface will avoid CPU hard lockups due to
the voluntary exit from reply queue processing based on the budget. Note -
only one MSI-x vector is busy doing the processing. irqstat output -

IRQs / 1 second(s)
IRQ#  TOTAL  NODE0   NODE1   NODE2   NODE3  NAME
  44122871   122871   0   0   0  IR-PCI-MSI-edge
m

RE: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-10 Thread Kashyap Desai
> -Original Message-
> From: Douglas Gilbert [mailto:dgilb...@interlog.com]
> Sent: Wednesday, January 10, 2018 2:21 AM
> To: Christoph Hellwig; Kashyap Desai
> Cc: Shivasharan Srikanteshwara; linux-scsi@vger.kernel.org; Sumit Saxena;
> linux-n...@lists.infradead.org; Peter Rivera
> Subject: Re: [PATCH 13/14] megaraid_sas: NVME passthru command support
>
> On 2018-01-09 11:45 AM, Christoph Hellwig wrote:
> > On Tue, Jan 09, 2018 at 10:07:28PM +0530, Kashyap Desai wrote:
> >> Chris -
> >>
> >> Overall NVME support behind MR controller is really a SCSI device. On
> >> top of that, for MegaRaid, NVME device can be part of Virtual Disk
> >> and those drive will not be exposed to the driver. User application
> >> may like to talk to hidden NVME devices (part of VDs). This patch
> >> will extend the existing interface for megaraid product in the same
> >> way it is currently supported for other protocols like SMP, SATA pass-
> through.
> >>
> >> Example - Current smartmon is using megaraid.h (MFI headers) to send
> >> SATA pass-through.
> >>
> >> https://github.com/mirror/smartmontools/blob/master/megaraid.h
> >
> > And that is exactly the example of why we should have never allowed
> > megaraid any private passthrough ioctls to start with.
>
> Christoph,
> Have you tried to do any serious work with  and say
> compared it with FreeBSD and Microsoft's approach? No prize for guessing
> which one is worst (and least extensible). Looks like the Linux
> pass-through
> was at the end of a ToDo list and was "designed"
> at 5 a.m in the morning.
>
> RAID cards need a pass-through that allows them to address one of many
> physical disks behind the virtual disk presented to OS.
> Pass-throughs need to have uncommited room for extra parameters that will
> be passed through as-is to the RAID LLD.

Doug - As you mentioned, I notice the same. This type of issue is common
for all RAID controller vendors.
What Christoph mentioned about using an NVMe-type API is possible, but it
may need an extra hit on the firmware side to convert the Linux NVME API to
the FW-specific one, or the driver would have to handle the same
conversion.
That comes with its own pros/cons and may not fulfil the end goal either;
for other platforms we would still have to depend upon specialized
pass-through code.
So, having said that, RAID firmware cannot rely on only one interface for
pass-through, and vendors have to keep specialized pass-through code.

The NVME-CLI interface is designed for NVME drives attached to the block
layer. The MegaRaid product is designed to keep the NVME protocol
abstracted (much like SATA drives behind a SAS controller) and attach those
drives/virtual disks to the SCSI layer.

>
> So until Christoph gives an example of how that can be done with
>  then I would like to see Christoph's objection
> ignored.
>
>
> And as a maintainer of smartmontools, I would like to point out that
> pretty
> well all supported RAIDs, on all platforms need specialized pass-through
> code.

If the upstream community would like an enhanced nvme-cli type interface
in the megaraid_sas driver, we may have to come up with one more layer in
the megaraid_sas driver to convert the NVME API to the specialized
pass-through code.
It is really not simple to fit into the existing design, as NVME-CLI/API
assumes NVME drives associated with the nvme.ko module (/dev/nvmeX). We
also do not have many of the sysfs entries nvme-cli looks for on an NVME
device, and we have no way to address physical disks which are part of a
VD, etc.

It is better to extend the specialized pass-through code in applications
like smartmontools.

> Start by looking at os_linux.cpp and then at the other OSes. And now
> smartmontools supports NVMe on most platforms and at the pass-through
> level, it is just another one, and not a particularly clean one.
>
> IMO Intel had their chance on the pass-through front, and blew it.
> It is now too late to fix it and that job (impossible ?) should not fall
> to
> MegaRaid maintainers.
>
> Douglas Gilbert


RE: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-10 Thread Kashyap Desai
> -Original Message-
> From: Keith Busch [mailto:keith.bu...@intel.com]
> Sent: Wednesday, January 10, 2018 4:53 AM
> To: Douglas Gilbert
> Cc: Christoph Hellwig; Kashyap Desai; Shivasharan Srikanteshwara; Sumit
> Saxena; Peter Rivera; linux-n...@lists.infradead.org; linux-
> s...@vger.kernel.org
> Subject: Re: [PATCH 13/14] megaraid_sas: NVME passthru command support
>
> On Tue, Jan 09, 2018 at 03:50:44PM -0500, Douglas Gilbert wrote:
> > Have you tried to do any serious work with  and
> > say compared it with FreeBSD and Microsoft's approach? No prize for
> > guessing which one is worst (and least extensible). Looks like the
> > Linux pass-through was at the end of a ToDo list and was "designed"
> > at 5 a.m in the morning.
>
> What the heck are you talking about? FreeBSD's NVMe passthrough is near
> identical to Linux, and Linux's existed years prior.
>
> You're not even touching the nvme subsystem, so why are you copying the
> linux-nvme mailing list to help you with a non-NVMe device? Please take
your
> ignorant and dubious claims elsewhere.

Keith -

As we discussed for mpt3sas NVME support, there was a request to add
linux-n...@lists.infradead.org for NVME-related discussion.
https://marc.info/?l=linux-kernel&m=149874673729467&w=2

As you mentioned, we are not touching the NVME subsystem, so we can skip
adding the NVME mailing list for future submissions w.r.t. NVME drives
behind MR (megaraid_sas) and HBA (mpt3sas).
All NVME drives behind a MegaRaid controller are SCSI devices irrespective
of transport.

Kashyap


RE: [PATCH 13/14] megaraid_sas: NVME passthru command support

2018-01-09 Thread Kashyap Desai
Chris -

Overall, an NVME device behind an MR controller is really a SCSI device. On
top of that, for MegaRaid, an NVME device can be part of a Virtual Disk and
those drives will not be exposed to the driver. User applications may want
to talk to the hidden NVME devices (part of VDs). This patch extends the
existing interface for the megaraid product in the same way it is currently
supported for other protocols like SMP and SATA pass-through.

Example - Current smartmon is using megaraid.h (MFI headers) to send SATA
pass-through.

https://github.com/mirror/smartmontools/blob/master/megaraid.h

Any open-source application aware of the above interface can extend similar
support to NVME drives. I agree that the current nvme-cli type interface is
not going to be supported using this method. In the current patch, driver
processing is very limited since most of the work is handled in the
application + FW.

An NVME drive behind an MR controller is not really an NVME device to the
operating system at the block layer. Considering this, do you agree, or do
you still foresee any issues?

Kashyap

-Original Message-
From: Christoph Hellwig [mailto:h...@infradead.org]
Sent: Monday, January 8, 2018 3:36 PM
To: Shivasharan S
Cc: linux-scsi@vger.kernel.org; sumit.sax...@broadcom.com;
linux-n...@lists.infradead.org; kashyap.de...@broadcom.com
Subject: Re: [PATCH 13/14] megaraid_sas: NVME passthru command support

NAK.  Please implement the same ioctl interfaces as the nvme driver
instead of inventing your own incomaptible one.


RE: system hung up when offlining CPUs

2017-09-13 Thread Kashyap Desai
>
> On 09/12/2017 08:15 PM, YASUAKI ISHIMATSU wrote:
> > + linux-scsi and maintainers of megasas
> >
> > When offlining CPU, I/O stops. Do you have any ideas?
> >
> > On 09/07/2017 04:23 PM, YASUAKI ISHIMATSU wrote:
> >> Hi Mark and Christoph,
> >>
> >> Sorry for the late reply. I appreciated that you fixed the issue on kvm
> environment.
> >> But the issue still occurs on physical server.
> >>
> >> Here ares irq information that I summarized megasas irqs from
> >> /proc/interrupts and /proc/irq/*/smp_affinity_list on my server:
> >>
> >> ---
> >> IRQ  affinity_list  IRQ_TYPE
> >>  42  0-5    IR-PCI-MSI 1048576-edge megasas
> >>  43  0-5    IR-PCI-MSI 1048577-edge megasas
> >>  44  0-5    IR-PCI-MSI 1048578-edge megasas
> >>  45  0-5    IR-PCI-MSI 1048579-edge megasas
> >>  46  0-5    IR-PCI-MSI 1048580-edge megasas
> >>  47  0-5    IR-PCI-MSI 1048581-edge megasas
> >>  48  0-5    IR-PCI-MSI 1048582-edge megasas
> >>  49  0-5    IR-PCI-MSI 1048583-edge megasas
> >>  50  0-5    IR-PCI-MSI 1048584-edge megasas
> >>  51  0-5    IR-PCI-MSI 1048585-edge megasas
> >>  52  0-5    IR-PCI-MSI 1048586-edge megasas
> >>  53  0-5    IR-PCI-MSI 1048587-edge megasas
> >>  54  0-5    IR-PCI-MSI 1048588-edge megasas
> >>  55  0-5    IR-PCI-MSI 1048589-edge megasas
> >>  56  0-5    IR-PCI-MSI 1048590-edge megasas
> >>  57  0-5    IR-PCI-MSI 1048591-edge megasas
> >>  58  0-5    IR-PCI-MSI 1048592-edge megasas
> >>  59  0-5    IR-PCI-MSI 1048593-edge megasas
> >>  60  0-5    IR-PCI-MSI 1048594-edge megasas
> >>  61  0-5    IR-PCI-MSI 1048595-edge megasas
> >>  62  0-5    IR-PCI-MSI 1048596-edge megasas
> >>  63  0-5    IR-PCI-MSI 1048597-edge megasas
> >>  64  0-5    IR-PCI-MSI 1048598-edge megasas
> >>  65  0-5    IR-PCI-MSI 1048599-edge megasas
> >>  66  24-29  IR-PCI-MSI 1048600-edge megasas
> >>  67  24-29  IR-PCI-MSI 1048601-edge megasas
> >>  68  24-29  IR-PCI-MSI 1048602-edge megasas
> >>  69  24-29  IR-PCI-MSI 1048603-edge megasas
> >>  70  24-29  IR-PCI-MSI 1048604-edge megasas
> >>  71  24-29  IR-PCI-MSI 1048605-edge megasas
> >>  72  24-29  IR-PCI-MSI 1048606-edge megasas
> >>  73  24-29  IR-PCI-MSI 1048607-edge megasas
> >>  74  24-29  IR-PCI-MSI 1048608-edge megasas
> >>  75  24-29  IR-PCI-MSI 1048609-edge megasas
> >>  76  24-29  IR-PCI-MSI 1048610-edge megasas
> >>  77  24-29  IR-PCI-MSI 1048611-edge megasas
> >>  78  24-29  IR-PCI-MSI 1048612-edge megasas
> >>  79  24-29  IR-PCI-MSI 1048613-edge megasas
> >>  80  24-29  IR-PCI-MSI 1048614-edge megasas
> >>  81  24-29  IR-PCI-MSI 1048615-edge megasas
> >>  82  24-29  IR-PCI-MSI 1048616-edge megasas
> >>  83  24-29  IR-PCI-MSI 1048617-edge megasas
> >>  84  24-29  IR-PCI-MSI 1048618-edge megasas
> >>  85  24-29  IR-PCI-MSI 1048619-edge megasas
> >>  86  24-29  IR-PCI-MSI 1048620-edge megasas
> >>  87  24-29  IR-PCI-MSI 1048621-edge megasas
> >>  88  24-29  IR-PCI-MSI 1048622-edge megasas
> >>  89  24-29  IR-PCI-MSI 1048623-edge megasas
> >> ---
> >>
> >> In my server, IRQ#66-89 are sent to CPU#24-29. And if I offline
> >> CPU#24-29, I/O does not work, showing the following messages.
> >>
> >> ---
> >> [...] sd 0:2:0:0: [sda] tag#1 task abort called for
> >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 CDB: Read(10) 28
> >> 00 0d e8 cf 78 00 00 08 00 [...] sd 0:2:0:0: task abort: FAILED
> >> scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#0 task abort
> >> called for scmd(882057426560) [...] sd 0:2:0:0: [sda] tag#0 CDB:
> >> Write(10) 2a 00 0d 58 37 00 00 00 08 00 [...] sd 0:2:0:0: task abort:
> >> FAILED scmd(882057426560) [...] sd 0:2:0:0: target reset called
> >> for scmd(8820574d7560) [...] sd 0:2:0:0: [sda] tag#1 megasas:
> >> target
> reset FAILED!!
> >> [...] sd 0:2:0:0: [sda] tag#0 Controller reset is requested due to IO
> >> timeout
> >> [...] SCSI command pointer: (882057426560)   SCSI host state: 5
> >> SCSI
> >> [...] IO request frame:
> >> [...]
> >> 
> >> [...]
> >> [...] megaraid_sas :02:00.0: [ 0]waiting for 2 commands to
> >> complete for scsi0 [...] INFO: task auditd:1200 blocked for more than
> >> 120
> seconds.
> >> [...]   Not tainted 4.13.0+ #15
> >> [...] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> >> [...] auditd  D0  1200  1 0x
> >> [...] Call Trace:
> >> [...]  __schedule+0x28d/0x890
> >> [...]  schedule+0x36/0x80
> >> [...]  io_schedule+0x16/0x40
> >> [...]  wait_on_page_bit_common+0x109/0x1c0
> >> [...]  ? page_cache_tree_insert+0xf0/0xf0 [...]
> >> __filemap_fdatawait_range+0x127/0x190
> >> [...]  ? __filemap_fdatawrite_range+0xd1/0x100
> >> [...]  file_write_and_wait_range+0x60/0xb0
> >> [...]  xfs_file_fsync+0x67/0x1d0 [xfs] [...]
> >> vfs_fsync_range+0x3d/0x

RE: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: Monday, August 07, 2017 7:48 PM
> To: Kashyap Desai; Christoph Hellwig; Hannes Reinecke
> Cc: Suganath Prabu Subramani; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org; Sathya Prakash Veerichetty; linux-
> ker...@vger.kernel.org; Chaitra Basappa; Sreekanth Reddy; linux-
> n...@lists.infradead.org
> Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
>
> On Mon, 2017-08-07 at 19:26 +0530, Kashyap Desai wrote:
> > >
> > > -Original Message-
> > > From: James Bottomley [mailto:james.bottom...@hansenpartnership.com
> > > ]
> > > Sent: Saturday, August 05, 2017 8:12 PM
> > > To: Christoph Hellwig; Hannes Reinecke
> > > Cc: Suganath Prabu S; martin.peter...@oracle.com; linux-
> > > s...@vger.kernel.org; sathya.prak...@broadcom.com;
> > > kashyap.de...@broadcom.com; linux-ker...@vger.kernel.org;
> > > chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux-
> > > n...@lists.infradead.org
> > > Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
> > >
> > > On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote:
> > > >
> > > > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote:
> > > > >
> > > > >
> > > > > I'm not happy with this approach.
> > > > > NVMe devices should _not_ appear as SCSI devices; this will just
> > > > > confuse matters _and_ will be incompatible with 'normal' NVMe
> > > > > devices.
> > > > >
> > > > > Rather I would like to see the driver to hook into the existing
> > > > > NVMe framework (which essentially means to treat the mpt3sas as
> > > > > a weird NVMe-over-Fabrics HBA), and expose the NVMe devices like
> > > > > any other NVMe HBA.
> > > >
> > > > That doesn't make any sense.  The devices behind the mpt adapter
> > > > don't look like NVMe devices at all for the hosts - there are no
> > > > NVMe commands or queues involved at all, they hide behind the same
> > > > somewhat leaky scsi abstraction as other devices behind the mpt
> > > > controller.
> > >
> > > You might think about what we did for SAS: split the generic handler
> > > into two pieces, libsas for driving the devices, which mpt didn't
> > > need because of the fat firmware and the SAS transport class so mpt
> > > could at least show the same sysfs files as everything else for SAS
> > > devices.
> >
> >  Ventura generation of controllers are adding connectivity of NVME
> >  drives seamlessly and protocol handling is in Firmware.
> >  Same as SCSI to ATA translation done in firmware, Ventura controller
> >  is doing SCSI to NVME translation and for end user protocol handling
> >  is abstracted.
> >
> >  This product handles new Transport protocol (NVME) same as ATA and
> >  transport is abstracted for end user.
> >
> > NVME pass-through related driver code, it is just a big tunnel for
> > user space application. It is just a basic framework like SATA PASS-
> > Through in existing mpt3sas driver.
>
> I know how it works ... and I'm on record as not liking your SATL approach
> because we keep tripping across bugs in the SATL that we have to fix in
> the
> driver.

We discussed NVME device support behind  with Hannes and he suggested
describing the product behavior to a wider audience so people are aware.
Just wanted to share the notes.

>
> However, at least for bot SAS and SATA they appear to the system as SCSI
> devices regardless of HBA, so we've largely smoothed over any problems if
> you
> transfer from mp3sas to another SAS/SATA controller.
>
> I believe your current proposal is to have NVMe devices appear as SCSI,
> which
> isn't how the native NVMe driver handles them at all.  This is going to
> have to
> be special cased in any tool designed to handle nvme devices and it's
> going to
> cause big problems if someone changes controller (or moves the
> device).  What's the proposal for making this as painless as possible?

We have to attempt this use case and see how it behaves. I have not tried
it, so I am not sure whether things are really bad or whether some tuning
may be enough. I will get back to you on this.

I understood the request as: we need udev rules that work well for the
*same* NVME drive whether it sits behind  or behind the native .
Example - if the user has the OS installed on an NVME drive which is
exposed by the  driver as a SCSI disk, it should still be able to boot
when the same NVME drive is hooked up so that it is detected by the native
 driver (and vice versa).

>
> James


RE: [PATCH v2 00/13] mpt3sas driver NVMe support:

2017-08-07 Thread Kashyap Desai
> -Original Message-
> From: James Bottomley [mailto:james.bottom...@hansenpartnership.com]
> Sent: Saturday, August 05, 2017 8:12 PM
> To: Christoph Hellwig; Hannes Reinecke
> Cc: Suganath Prabu S; martin.peter...@oracle.com; linux-
> s...@vger.kernel.org; sathya.prak...@broadcom.com;
> kashyap.de...@broadcom.com; linux-ker...@vger.kernel.org;
> chaitra.basa...@broadcom.com; sreekanth.re...@broadcom.com; linux-
> n...@lists.infradead.org
> Subject: Re: [PATCH v2 00/13] mpt3sas driver NVMe support:
>
> On Sat, 2017-08-05 at 06:53 -0700, Christoph Hellwig wrote:
> > On Wed, Aug 02, 2017 at 10:14:40AM +0200, Hannes Reinecke wrote:
> > >
> > > I'm not happy with this approach.
> > > NVMe devices should _not_ appear as SCSI devices; this will just
> > > confuse matters _and_ will be incompatible with 'normal' NVMe
> > > devices.
> > >
> > > Rather I would like to see the driver to hook into the existing NVMe
> > > framework (which essentially means to treat the mpt3sas as a weird
> > > NVMe-over-Fabrics HBA), and expose the NVMe devices like any other
> > > NVMe HBA.
> >
> > That doesn't make any sense.  The devices behind the mpt adapter don't
> > look like NVMe devices at all for the hosts - there are no NVMe
> > commands or queues involved at all, they hide behind the same somewhat
> > leaky scsi abstraction as other devices behind the mpt controller.
>
> You might think about what we did for SAS: split the generic handler into
> two
> pieces, libsas for driving the devices, which mpt didn't need because of
> the fat
> firmware and the SAS transport class so mpt could at least show the same
> sysfs
> files as everything else for SAS devices.

 The Ventura generation of controllers adds connectivity for NVME drives
 seamlessly, and the protocol handling is in firmware.
 Just as SCSI-to-ATA translation is done in firmware, the Ventura
 controller does SCSI-to-NVME translation, and the protocol handling is
 abstracted from the end user.

 This product handles the new transport protocol (NVME) the same way as
 ATA, and the transport is abstracted from the end user.

The NVME pass-through related driver code is just a big tunnel for
user-space applications.
It is just a basic framework, like the SATA pass-through in the existing
mpt3sas driver.

>
> Fortunately for NVMe it's very simple at the moment its just a couple of
> host
> files and wwid on the devices.
>
> James
>
>
> > The only additional leak is that the controller now supports NVMe-
> > like PRPs in additions to its existing two SGL formats.
> >


RE: [PATCH v2 11/15] megaraid_sas: Set device queue_depth same as HBA can_queue value in scsi-mq mode

2017-07-11 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Tuesday, July 11, 2017 7:28 PM
> To: Shivasharan S
> Cc: linux-scsi@vger.kernel.org; martin.peter...@oracle.com;
> the...@redhat.com; j...@linux.vnet.ibm.com;
> kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com; h...@lst.de
> Subject: Re: [PATCH v2 11/15] megaraid_sas: Set device queue_depth same
as
> HBA can_queue value in scsi-mq mode
>
> On Wed, Jul 05, 2017 at 05:00:25AM -0700, Shivasharan S wrote:
> > Currently driver sets default queue_depth for VDs at 256 and JBODs
> > based on interface type, ie., for SAS JBOD QD will be 64, for SATA
JBOD QD
> will be 32.
> > During performance runs with scsi-mq enabled, we are seeing better
> > results by setting QD same as HBA queue_depth.
>
> Please no scsi-mq specifics.  just do this unconditionally.

Chris - the intent of the mq-specific check is mainly that sequential
workloads on HDDs take a penalty due to an mq scheduler issue.
We did this exercise prior to mq-deadline support.

Making the change generic for both non-mq and mq would be good, but some
users may not like to see a regression.
E.g. with QD = 32 for a SATA PD, file system creation may be faster
compared to a large QD. There can be soft merging at the block layer due to
queue depth throttling; FS creation then goes faster because of the IO
merges, but the same will not be true if we change the queue depth logic
(i.e. increase the device queue depth to the HBA QD).

We have the choice to completely drop this patch and ask users to tune
/sys/block/sdX/device/queue_depth themselves if they hit the scsi-mq
performance issue with HDD sequential workloads.
With this patch, we want to provide better QD settings as the default from
the driver.


Thanks, Kashyap


RE: [PATCH 13/47] megaraid: pass in NULL scb for host reset

2017-06-28 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Wednesday, June 28, 2017 9:00 PM
> To: Sumit Saxena; Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> Hannes
> Reinecke
> Subject: Re: [PATCH 13/47] megaraid: pass in NULL scb for host reset
>
> On 06/28/2017 03:41 PM, Sumit Saxena wrote:
> >> -Original Message-
> >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> >> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> >> Sent: Wednesday, June 28, 2017 2:03 PM
> >> To: Christoph Hellwig
> >> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> > Hannes
> >> Reinecke; Hannes Reinecke
> >> Subject: [PATCH 13/47] megaraid: pass in NULL scb for host reset
> >>
> >> When calling a host reset we shouldn't rely on the command triggering
> >> the reset, so allow megaraid_abort_and_reset() to be called with a NULL
> scb.
> >> And drop the pointless 'bus_reset' and 'target_reset' handlers, which
> > just call
> >> the same function as host_reset.
> >
> > If this patch address any functional issue, then we should consider
> > this.
> > If it's code optimization, can we ignore this as this is being very
> > old driver and no more maintained by Broadcom/LSI ?
> >
> Sadly, ignoring is not an option.
> I'm planning to update the calling convention for SCSI EH, to resolve the
> long-
> standing problem with sg_reset ioctls.
> sg_reset ioctl will allocate an out-of-band SCSI command, which does no
> longer
> work with the new command allocation scheme in multiqueue.
> So it's not possible to just 'ignore' it, as then SCSI EH will cease to
> function with
> that driver.

Hannes - we are in the process of sending megaraid and 3ware driver removal
patches, as LSI/Broadcom has stopped supporting those products.
I agree we should review this closely, but the lack of test coverage and
the end-of-life status of the product is why we are asking about the
rationale.
For now, please consider this a NACK for this patch. We will be removing
the old megaraid (mbox) driver and the 3ware drivers soon.

>
> Sorry.
>
> >>
> >> Signed-off-by: Hannes Reinecke 
> >> ---
> >> drivers/scsi/megaraid.c | 42
> >> --
> >> 1 file changed, 16 insertions(+), 26 deletions(-)
> >>
> >> diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c index
> >> 3c63c29..7e504d3 100644
> >> --- a/drivers/scsi/megaraid.c
> >> +++ b/drivers/scsi/megaraid.c
> >> @@ -1909,7 +1909,7 @@ static DEF_SCSI_QCMD(megaraid_queue)
> >>
> >>spin_lock_irq(&adapter->lock);
> >>
> >> -  rval =  megaraid_abort_and_reset(adapter, cmd, SCB_RESET);
> >> +  rval =  megaraid_abort_and_reset(adapter, NULL, SCB_RESET);
> >
> > If cmd=NULL is passed, it will crash inside function
> > megaraid_abort_and_reset() while dereferencing "cmd" pointer.
> > Below is the code of function  megaraid_abort_and_reset() where it
> > will
> > crash-
> >
> > static int
> > megaraid_abort_and_reset(adapter_t *adapter, Scsi_Cmnd *cmd, int aor)
> > {
> > struct list_head*pos, *next;
> > scb_t   *scb;
> >
> > dev_warn(&adapter->dev->dev, "%s cmd=%x \n",
> >  (aor == SCB_ABORT)? "ABORTING":"RESET",
> >  cmd->cmnd[0],
> > cmd->device->channel,it should
> > cmd->device->crash
> > here
> >  cmd->device->id, (u32)cmd->device->lun);
> >
> > Please correct if I am missing something here.Ah, correct. Will be
> > fixing it up.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)


RE: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor Post Queue (RDPQ) Array support

2017-04-27 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, April 27, 2017 3:55 AM
> To: Sreekanth Reddy
> Cc: Martin K. Petersen; j...@kernel.org; linux-scsi@vger.kernel.org;
linux-
> ker...@vger.kernel.org; Christoph Hellwig
> Subject: Re: [RESEND][PATCH 07/10][SCSI]mpt2sas: Added Reply Descriptor
> Post Queue (RDPQ) Array support
>
>
> Sreekanth,
>
> > We need to satisfy this condition on those system where 32 bit dma
> > consistent mask is not supported and it only supports 64 bit dma
> > consistent mask. So on these system we can't set
> > pci_set_consistent_dma_mask() to DMA_BIT_MASK(32).
>
> Which systems are you talking about?
>
> It seems a bit unrealistic to require all devices to support 64-bit DMA.

Martin - We have found that on a certain ARM64 platform all devices are
required to support 64-bit DMA. I discussed this on linux-arm-kernel; below
is the thread.

http://marc.info/?l=linux-arm-kernel&m=148880763816046&w=2

That ARM64 platform does not support SWIOTLB, and that is the reason we
need to place all DMA pools above 4GB.
E.g. if I map the crash kernel above 4GB on an x86_64 platform, devices can
still honor a 32-bit DMA mask, since the arch-specific code in x86_64
supports SWIOTLB. The same settings on the ARM64 platform fail with a
32-bit DMA mask.

On one particular ARM64 setup, I also see that the range below 4GB is
mapped to the SoC and the kernel components are mapped above the 4GB
region.

Can we add the below logic in the MR/IT drivers to meet this requirement?

- The driver will attempt to allocate the DMA buffer above 4GB and check
the start and end physical addresses.
If the DMA buffer crosses a 4GB boundary (i.e. the high 32 bits of the
address are not constant across the buffer), the driver will hold on to
that region and attempt one more allocation.
If the second allocation also does not fit within the same 4GB region, we
will give up on the driver load.

Before we attempt the above logic, we would like to understand whether
there is any other reliable way to handle this in Linux.

Most of the time we are going to get the "same 4GB region", so we are OK
with detecting this corner case and bailing out of the driver load. There
is no report of an issue from the field, but we want to protect against
this failure in the future.
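
For illustration, a minimal sketch of the retry logic described above
(assuming dma_alloc_coherent(); the helper name is hypothetical and this is
not actual driver code):

#include <linux/dma-mapping.h>
#include <linux/kernel.h>

/* Allocate a coherent buffer that does not cross a 4GB boundary: accept
 * the first attempt if its start and end share the same upper 32 bits,
 * otherwise hold it, try once more, and give up (driver load failure) if
 * the second attempt also crosses a 4GB boundary. */
static void *alloc_within_same_4gb(struct device *dev, size_t sz,
				   dma_addr_t *handle)
{
	dma_addr_t h1, h2;
	void *buf1, *buf2;

	buf1 = dma_alloc_coherent(dev, sz, &h1, GFP_KERNEL);
	if (!buf1)
		return NULL;
	if (upper_32_bits(h1) == upper_32_bits(h1 + sz - 1)) {
		*handle = h1;
		return buf1;
	}

	/* First buffer crosses 4GB: keep holding it so the second attempt
	 * cannot be handed the same region, then retry once. */
	buf2 = dma_alloc_coherent(dev, sz, &h2, GFP_KERNEL);
	dma_free_coherent(dev, sz, buf1, h1);
	if (!buf2)
		return NULL;
	if (upper_32_bits(h2) != upper_32_bits(h2 + sz - 1)) {
		dma_free_coherent(dev, sz, buf2, h2);
		return NULL;	/* caller fails the driver load */
	}

	*handle = h2;
	return buf2;
}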

Thanks, Kashyap

>
> --
> Martin K. PetersenOracle Linux Engineering


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Wednesday, March 08, 2017 10:03 PM
> To: Kashyap Desai
> Cc: Christoph Hellwig; linux-ker...@vger.kernel.org; linux-
> s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> >>>>> "Kashyap" == Kashyap Desai  writes:
>
> Kashyap,
>
> Kashyap> I am just curious to know how badly we have to scrutinize each
> Kashyap> packet before sending to Fast Path as we are in IO path and
> Kashyap> recommend only important checks to be added.
>
> As Christoph pointed out, when the fast path is in use you assume the
role of
> the SCSI device. And therefore it is your responsibility to ensure that
the VD's
> capacity and other relevant constraints are being honored. Just like the
MR
> firmware and any attached disks would.

Martin -

Agree on this point. I am planning to study all possible sanity checks of
this kind in the driver for VDs, not just fix the one specific scenario
described here.
Do you think a fix in this area is good for kernel-stable as well, or
should it just stay in linux-next, as it is not so severe considering the
real-world exposure? I am trying to understand the priority and severity of
this issue.

>
> It is a feature that there is no sanity checking in the sg interface.
> The intent is to be able to pass through commands directly to a device
and
> have the device act upon them. Including fail them if they don't make
any
> sense.

Understood; sg_raw is not designed to do sanity checking.

>
> PS. I'm really no fan of the fast path. It's super messy to have the VD
layout
> handled in two different places.
>
> --
> Martin K. PetersenOracle Linux Engineering


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 9:37 PM
> To: Kashyap Desai
> Cc: Christoph Hellwig; linux-ker...@vger.kernel.org; linux-
> s...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, Mar 08, 2017 at 09:29:28PM +0530, Kashyap Desai wrote:
> > Thanks Chris. It is understood to have sanity in driver, but how
> > critical such checks where SG_IO type interface send pass-through
request.
> ?
> > Are you suggesting as good to have sanity or very important as there
> > may be a real-time exposure other than SG_IO interface ? I am confused
> > over must or good to have check.
> > Also one more fault I can generate using below sg_raw command -
>
> SCSI _devices_ need to sanity check any input and fail commands instead
of
> crashing or causing other problems.  Normal SCSI HBA drivers don't need
to
> do that as they don't interpret CDBs.  Megaraid (and a few other raid
drivers)
> are special in that they take on part of the device functionality and do
> interpret CDBs sometimes.  In that case you'll need to do all that
sanity
> checking and generate proper errors.
>
> It would be nice to have come common helpers for this shared between
> everyone interpreting SCSI CBD (e.g. the SCSI target code, the NVMe SCSI
> emulation and the various RAID drivers).

Thanks Chris.  I will  continue on this and will come back with changes.
Let me check with Broadcom internally and figure out all possible
scenarios for megaraid_sas.

Thanks, Kashyap


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, March 08, 2017 9:35 PM
> To: h...@infradead.org; kashyap.de...@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> On Wed, 2017-03-08 at 21:29 +0530, Kashyap Desai wrote:
> > Also one more fault I can generate using below sg_raw command -
> >
> > "sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"
> >
> > Provide more scsi data length compare to actual SG buffer. Do you
> > suggest such SG_IO interface vulnerability is good to be captured in
driver.
>
> That's not a vulnerability of the SG I/O interface. A SCSI device has to
set the
> residual count correctly if the SCSI data length does not match the size
of the
> data buffer.

Thanks Bart. I will pass this information to the Broadcom firmware
developers. Maybe the Tx/Rx (DMA) related code in MR (and also in the
Fusion IT HBA) cannot handle it because some sanity checks are not passed.

>
> Bart.


RE: out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
> -Original Message-
> From: Christoph Hellwig [mailto:h...@infradead.org]
> Sent: Wednesday, March 08, 2017 8:41 PM
> To: Kashyap Desai
> Cc: linux-ker...@vger.kernel.org; linux-scsi@vger.kernel.org
> Subject: Re: out of range LBA using sg_raw
>
> Hi Kashyap,
>
> for SG_IO passthrough requests we can't validate command validity for
> commands as the block layer treats them as opaque.  The SCSI device
> implementation needs to handle incorrect parameter to be robust.
>
> For your fast path bypass the megaraid driver assumes part of the SCSI
device
> implementation, so it will have to check for validity.

Thanks Chris. It is understood that the driver should have sanity checks,
but how critical are such checks when an SG_IO type interface sends the
pass-through request? Are you suggesting the checks are merely good to
have, or very important because there may be real-world exposure other than
the SG_IO interface? I am confused about whether the check is a must-have
or just good to have.
Also, one more fault I can generate using the below sg_raw command -

"sg_raw -r 32k /dev/sdx 28 00 01 4f ff ff 00 00 08 00"

This provides a larger SCSI data length compared to the actual SG buffer.
Do you suggest that such an SG_IO interface vulnerability is also worth
catching in the driver?

I am just curious to know how thoroughly we have to scrutinize each packet
before sending it to the fast path, as we are in the IO path; please
recommend only the important checks to be added.

Thanks, Kashyap


RE: [PATCH] megaraid_sas: enable intx only if msix request fails

2017-03-08 Thread Kashyap Desai
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c | 6 +-
> >  1 file changed, 1 insertion(+), 5 deletions(-)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index 7ac9a9e..82a8ec8 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -4990,6 +4990,7 @@ int megasas_set_crash_dump_params(struct
> megasas_instance *instance,
> > struct pci_dev *pdev;
> >
> > pdev = instance->pdev;
> > +   pci_intx(pdev, 1);
>
> Please use pci_alloc_irq_vectors with the PCI_IRQ_LEGACY flag here, I'd
like to
> phase out the pci_intx API.

I will resubmit v2 patch using pci_alloc_irq_vectors.
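
Presumably the v2 would end up along these lines, letting the PCI core own
the DisINTx handling (a sketch with a hypothetical helper name, not the
actual v2 patch):

#include <linux/pci.h>

static int megasas_setup_legacy_irq(struct megas_instance_placeholder *unused);

static int megasas_setup_legacy_irq_sketch(struct megasas_instance *instance)
{
	int ret;

	/* Request exactly one legacy INTx vector; the PCI core then
	 * manages the DisINTx bit, so no explicit pci_intx() call is
	 * needed in the driver. */
	ret = pci_alloc_irq_vectors(instance->pdev, 1, 1, PCI_IRQ_LEGACY);
	if (ret < 0)
		return ret;

	/* request_irq() keeps using pci_irq_vector(instance->pdev, 0). */
	return 0;
}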


RE: [PATCH] megaraid_sas: enable intx only if msix request fails

2017-03-08 Thread Kashyap Desai
Any feedback? We have a few more patches to be submitted, so we are looking
for review of this pending patch.

Thanks, Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Thursday, March 02, 2017 4:24 PM
> To: linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; the...@redhat.com;
> j...@linux.vnet.ibm.com; h...@suse.de; Kashyap Desai
> Subject: [PATCH] megaraid_sas: enable intx only if msix request fails
>
> Without this fix, driver will enable INTx Interrupt pin even  though
MSI-x
> vectors are enabled. See below lspci output. DisINTx is unset  for MSIx
setup.
>
> lspci -s 85:00.0 -vvv |grep INT |grep Control
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR+ FastB2B- DisINTx-
>
> After applying this fix, driver will enable INTx Interrupt pin only if
Legacy
> interrupt method is required.
> See below lspci output. DisINTx is unset for MSIx setup.
> lspci -s 85:00.0 -vvv |grep INT |grep Control
>   Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> ParErr- Stepping- SERR+ FastB2B- DisINTx+
>
> Signed-off-by: Kashyap Desai 
> ---
>  drivers/scsi/megaraid/megaraid_sas_base.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 7ac9a9e..82a8ec8 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -4990,6 +4990,7 @@ int megasas_set_crash_dump_params(struct
> megasas_instance *instance,
>   struct pci_dev *pdev;
>
>   pdev = instance->pdev;
> + pci_intx(pdev, 1);
>   instance->irq_context[0].instance = instance;
>   instance->irq_context[0].MSIxIndex = 0;
>   if (request_irq(pci_irq_vector(pdev, 0), @@ -5277,10 +5278,6 @@
> static int megasas_init_fw(struct megasas_instance *instance)
>   MPI2_REPLY_POST_HOST_INDEX_OFFSET);
>   }
>
> - i = pci_alloc_irq_vectors(instance->pdev, 1, 1, PCI_IRQ_LEGACY);
> - if (i < 0)
> - goto fail_setup_irqs;
> -
>   dev_info(&instance->pdev->dev,
>   "firmware supports msix\t: (%d)", fw_msix_count);
>   dev_info(&instance->pdev->dev,
> @@ -5494,7 +5491,6 @@ static int megasas_init_fw(struct
> megasas_instance *instance)
>   instance->instancet->disable_intr(instance);
>  fail_init_adapter:
>   megasas_destroy_irqs(instance);
> -fail_setup_irqs:
>   if (instance->msix_vectors)
>   pci_free_irq_vectors(instance->pdev);
>   instance->msix_vectors = 0;
> --
> 1.8.3.1


out of range LBA using sg_raw

2017-03-08 Thread Kashyap Desai
Hi -

Need help understanding whether the below is something we should consider
fixing in the megaraid_sas driver, or treat as an unrealistic exposure.

I have created a 10GB slice VD (RAID 1) using 2 drives. Each physical drive
is 256GB.

The last LBA of the VD and of the actual physical disk associated with that
VD are different; the physical disk has a larger LBA range than the VD.

Below is readcap detail of VD0

# sg_readcap /dev/sdu
Read Capacity results:
   Last logical block address=20971519 (0x13fffff), Number of
blocks=20971520
   Logical block length=512 bytes
Hence:
   Device size: 10737418240 bytes, 10240.0 MiB, 10.74 GB

Using the below sg_raw command, we should see "LBA out of range" sense. The
CDB 0x28 (READ(10)) passes an LBA beyond the last LBA of the VD (0x13fffff).

sg_raw -r 4k /dev/sdx 28 00 01 4f ff ff 00 00 08 00

This works if the VD created behind the MR controller does not support fast
path writes.
In the case of a fast path write, the driver converts the VD LBA to the
underlying physical disk LBA and sends the IO directly to the physical
disk. Since the physical disk has enough LBA range to respond, it will not
return "LBA out of range" sense.

The megaraid_sas driver never validates the LBA range for a VD, as it
assumes this is validated by the upper layers in the SCSI stack. Other sg
tools like sg_dd, sg_write, dd etc. check the LBA range, so the driver
never receives an out-of-range LBA from them.

What is the suggestion? Shall I add a check in the megaraid_sas driver, or
is this not a valid scenario, since the "sg_raw" tool can send any type of
command and that does not justify extra sanity checks in the driver?
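
If such a check were added, it would conceptually look like the sketch
below for READ/WRITE(10) and (16) CDBs (hypothetical helper; the real VD
capacity would come from the driver's RAID map, and on failure the driver
would complete the command with ILLEGAL REQUEST / LBA OUT OF RANGE sense
instead of sending it down the fast path):

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <asm/unaligned.h>

/* Return false if the CDB addresses blocks beyond the last LBA of the VD. */
static bool vd_lba_in_range(struct scsi_cmnd *scmd, u64 vd_last_lba)
{
	const u8 *cdb = scmd->cmnd;
	u64 lba;
	u32 num_blocks;

	switch (cdb[0]) {
	case READ_10:
	case WRITE_10:
		lba = get_unaligned_be32(&cdb[2]);
		num_blocks = get_unaligned_be16(&cdb[7]);
		break;
	case READ_16:
	case WRITE_16:
		lba = get_unaligned_be64(&cdb[2]);
		num_blocks = get_unaligned_be32(&cdb[10]);
		break;
	default:
		return true;	/* not a plain R/W CDB, leave it to FW */
	}

	return lba <= vd_last_lba && lba + num_blocks <= vd_last_lba + 1;
}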

Thanks, Kashyap


[PATCH] megaraid_sas: enable intx only if msix request fails

2017-03-02 Thread Kashyap Desai
Without this fix, driver will enable INTx Interrupt pin even
 though MSI-x vectors are enabled. See below lspci output. DisINTx is unset
 for MSIx setup.

lspci -s 85:00.0 -vvv |grep INT |grep Control
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx-

After applying this fix, driver will enable INTx Interrupt pin only if Legacy 
interrupt method is required.
See below lspci output. DisINTx is unset for MSIx setup.
lspci -s 85:00.0 -vvv |grep INT |grep Control
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR+ FastB2B- DisINTx+

Signed-off-by: Kashyap Desai 
---
 drivers/scsi/megaraid/megaraid_sas_base.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c 
b/drivers/scsi/megaraid/megaraid_sas_base.c
index 7ac9a9e..82a8ec8 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -4990,6 +4990,7 @@ int megasas_set_crash_dump_params(struct megasas_instance 
*instance,
struct pci_dev *pdev;
 
pdev = instance->pdev;
+   pci_intx(pdev, 1);
instance->irq_context[0].instance = instance;
instance->irq_context[0].MSIxIndex = 0;
if (request_irq(pci_irq_vector(pdev, 0),
@@ -5277,10 +5278,6 @@ static int megasas_init_fw(struct megasas_instance 
*instance)
MPI2_REPLY_POST_HOST_INDEX_OFFSET);
}
 
-   i = pci_alloc_irq_vectors(instance->pdev, 1, 1, PCI_IRQ_LEGACY);
-   if (i < 0)
-   goto fail_setup_irqs;
-
dev_info(&instance->pdev->dev,
"firmware supports msix\t: (%d)", fw_msix_count);
dev_info(&instance->pdev->dev,
@@ -5494,7 +5491,6 @@ static int megasas_init_fw(struct megasas_instance 
*instance)
instance->instancet->disable_intr(instance);
 fail_init_adapter:
megasas_destroy_irqs(instance);
-fail_setup_irqs:
if (instance->msix_vectors)
pci_free_irq_vectors(instance->pdev);
instance->msix_vectors = 0;
-- 
1.8.3.1



RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> > - Later we can explore if nr_hw_queue more than one really add benefit.
> > From current limited testing, I don't see major performance boost if
> > we have nr_hw_queue more than one.
> >
> Well, the _actual_ code to support mq is rather trivial, and really serves
> as a
> good testbed for scsi-mq.
> I would prefer to leave it in, and disable it via a module parameter.

I am thinking that adding extra code for more than one nr_hw_queue will add
maintenance and support overhead. The IO error handling code in particular
becomes complex in the nr_hw_queues > 1 case. If we really want to see the
performance boost, we should attempt it and bear the other side effects.

For the time being, my choice is to drop the nr_hw_queue > 1 support
entirely (not even behind a module parameter).

>
> But in either case, I can rebase the patches to leave any notions of
> 'nr_hw_queues' to patch 8 for implementing full mq support.

Thanks Hannes. It was just a heads-up... We are not sure when we can submit
the upcoming patch set from Broadcom. Maybe we can sync up with you offline
in case any rebase is required.

>
> And we need to discuss how to handle MPI2_FUNCTION_SCSI_IO_REQUEST;
> the current method doesn't work with blk-mq.
> I really would like to see that go, especially as sg/bsg supports the same
> functionality ...
>


RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-16 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, February 15, 2017 3:35 PM
> To: Kashyap Desai; Sreekanth Reddy
> Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux-
> s...@vger.kernel.org; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 02/15/2017 10:18 AM, Kashyap Desai wrote:
> >>
> >>
> >> Hannes,
> >>
> >> Result I have posted last time is with merge operation enabled in
> >> block layer. If I disable merge operation then I don't see much
> >> improvement with multiple hw request queues. Here is the result,
> >>
> >> fio results when nr_hw_queues=1,
> >> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> >> runt=150003msec
> >>
> >> fio results when nr_hw_queues=24,
> >> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> >> runt=150001msec
> >
> > Hannes -
> >
> >  I worked with Sreekanth and also understand pros/cons of Patch #10.
> > " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering"
> >
> > In above patch, can_queue of HBA is divided based on logic CPU, it
> > means we want to mimic as if mpt3sas HBA support multi queue
> > distributing actual resources which is single Submission H/W Queue.
> > This approach badly impact many performance areas.
> >
> > nr_hw_queues = 1 is what I observe as best performance approach since
> > it never throttle IO if sdev->queue_depth is set to HBA queue depth.
> > In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we
> > never allow more than "updated can_queue" in LLD.
> >
> True.
> And this was actually one of the things I wanted to demonstrate with this
> patchset :-) ATM blk-mq really works best when having a distinct tag space
> per port/device. As soon as the hardware provides a _shared_ tag space you
> end up with tag starvation issues as blk-mq only allows you to do a static
> split of the available tagspace.
> While this patchset demonstrates that the HBA itself _does_ benefit from
> using block-mq (especially on highly parallel loads), it also demonstrates
> that
> _block-mq_ has issues with singlethreaded loads on this HBA (or, rather,
> type of HBA, as I doubt this issue is affecting mpt3sas only).
>
> > Below code bring actual HBA can_queue very low ( Ea on 96 logical core
> > CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we
> > will see lots of IO throttling in scsi mid layer due to
> > shost->can_queue reach the limit very soon if you have  jobs with
> higher QD.
> >
> > if (ioc->shost->nr_hw_queues > 1) {
> > ioc->shost->nr_hw_queues = ioc->msix_vector_count;
> > ioc->shost->can_queue /= ioc->msix_vector_count;
> > }
> > I observe negative performance if I have 8 SSD drives attached to
> > Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K
> > IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly
> > ~850K IOPs. This is mainly because of host_busy stuck at very low ~169
> > on
> my setup.
> >
> Which actually might be an issue with the way scsi is hooked into blk-mq.
> The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the
> host is
> capable of accepting more commands.
> As we're limiting can_queue (to get the per-queue command depth
> correctly) we should be using the _overall_ command depth for the
> can_queue value itself to make the host_busy check work correctly.
>
> I've attached a patch for that; can you test if it makes a difference?
Hannes -
The attached patch works fine for me. FYI - we need to set the device queue
depth to can_queue, as we are currently not doing that in the mpt3sas
driver.

With the attached patch I see a ~2-3% improvement running multiple jobs;
the single-job profile shows no difference.

So it looks like we are good to reach full performance with a single
nr_hw_queue.

We have some patches to send, so we want to know how to rebase this patch
series, as a few patches are coming from Broadcom. Can we consider the
below as the plan?

- Patches 1-7 will be reposted. Also, Sreekanth will complete the review of
the existing patches 1-7.
- We need blk_tag support only for nr_hw_queue = 1.

With that said, many code changes/functions will no longer need the
"shost_use_blk_mq" check and can assume a driver with a single nr_hw_queue.

E.g. - the below function can be simplified to just take the tag from
scmd->request, without the shost_use_blk_mq + nr_hw_queue checks (see the
sketch after the quoted fragment below).

u16
mpt
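
Hypothetically, with blk-mq tags used unconditionally, such a helper could
reduce to something like this (a sketch assuming the SMID is simply the
zero-based block layer tag plus one; not the actual mpt3sas code):

static inline u16 scmd_to_smid(struct scsi_cmnd *scmd)
{
	/* blk-mq tags are zero-based, MPT SMIDs start at 1 */
	return scmd->request->tag + 1;
}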

RE: [PATCH 00/10] mpt3sas: full mq support

2017-02-15 Thread Kashyap Desai
>
>
> Hannes,
>
> Result I have posted last time is with merge operation enabled in block
> layer. If I disable merge operation then I don't see much improvement
> with
> multiple hw request queues. Here is the result,
>
> fio results when nr_hw_queues=1,
> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905,
> runt=150003msec
>
> fio results when nr_hw_queues=24,
> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393,
> runt=150001msec

Hannes -

 I worked with Sreekanth and also understand the pros/cons of Patch #10,
"[PATCH 10/10] mpt3sas: scsi-mq interrupt steering".

In the above patch, the can_queue of the HBA is divided based on the number
of logical CPUs; it means we want to mimic the mpt3sas HBA supporting
multiple queues while distributing the actual resources of what is a single
submission H/W queue. This approach badly impacts many performance areas.

nr_hw_queues = 1 is what I observe as the best performing approach, since
it never throttles IO if sdev->queue_depth is set to the HBA queue depth.
In the case of nr_hw_queues = "CPUs" we throttle IO at the SCSI level,
since we never allow more than the "updated can_queue" in the LLD.

The below code brings the actual HBA can_queue very low (e.g. on a 96
logical core CPU the new can_queue goes down to 42 (4096 / 96) if the HBA
queue depth is 4K). It means we will see lots of IO throttling in the SCSI
mid layer, because shost->can_queue reaches the limit very soon if you have
jobs with a higher QD.

if (ioc->shost->nr_hw_queues > 1) {
ioc->shost->nr_hw_queues = ioc->msix_vector_count;
ioc->shost->can_queue /= ioc->msix_vector_count;
}
I observe negative performance when I have 8 SSD drives attached to Ventura
(the latest IT controller). 16 fio jobs at QD=128 give ~1600K IOPs, and the
moment I switch to nr_hw_queues = "CPUs" it gives hardly ~850K IOPs. This
is mainly because host_busy is stuck at a very low ~169 on my setup.

Maybe, as Sreekanth mentioned, the performance improvement you have
observed is because nomerges=2 is not set, so the OS will attempt soft
back/front merges.

I debugged a live machine and understood that we never see the parallel
instances of "scsi_dispatch_cmd" that we expect, because can_queue is
small. If we really had a *very* large HBA QD, this patch #10 exposing
multiple SQs might be useful.

For now, we are looking for an updated version of the patch which will only
keep the IT HBA in SQ mode (like we are doing in the  driver) and add
an interface to use blk_tag in both the scsi-mq and !scsi-mq modes.
Sreekanth has already started working on it, but we may need a full
performance test run before posting the actual patch.
Maybe we can cherry-pick a few patches from this series and add blk_tag
support to improve the performance of  later; that will not allow the
user to choose nr_hw_queue as a tunable.

Thanks, Kashyap


>
> Thanks,
> Sreekanth


[PATCH] return valid data buffer length in scsi_bufflen() API using RQF_SPECIAL_PAYLOAD

2017-02-13 Thread Kashyap Desai
Regression due to commit f9d03f96b988002027d4b28ea1b7a24729a4c9b5
("block: improve handling of the magic discard payload"):

 and  HBA FW encounter a FW fault in a DMA operation
while creating a file system on SSDs.
The below CDB causes the FW fault.
CDB: Write same(16) 93 08 00 00 00 00 00 00 00 00 00 00 80 00 00 00

The root cause is a mismatch between the SCSI buffer length and the DMA
buffer length for the WRITE SAME command.

Fix - return the valid data buffer length from the scsi_bufflen() API using
RQF_SPECIAL_PAYLOAD.

Signed-off-by: Kashyap Desai 
---
diff --git a/include/scsi/scsi_cmnd.h b/include/scsi/scsi_cmnd.h
index 9fc1aec..1f796fc 100644
--- a/include/scsi/scsi_cmnd.h
+++ b/include/scsi/scsi_cmnd.h
@@ -180,7 +180,8 @@ static inline struct scatterlist *scsi_sglist(struct
scsi_cmnd *cmd)

 static inline unsigned scsi_bufflen(struct scsi_cmnd *cmd)
 {
-   return cmd->sdb.length;
+   return (cmd->request->rq_flags & RQF_SPECIAL_PAYLOAD) ?
+   cmd->request->special_vec.bv_len :
cmd->sdb.length;
 }

 static inline void scsi_set_resid(struct scsi_cmnd *cmd, int resid)


RE: [PATCH v2 21/39] megaraid_sas: big endian support changes

2017-02-09 Thread Kashyap Desai
> +static inline void set_num_sge(struct RAID_CONTEXT_G35 rctx_g35,
> +u16 sge_count)
> +{
> + rctx_g35.u.bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER);
> + rctx_g35.u.bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER)
> + &
> NUM_SGE_MASK_UPPER);
> +}

This function and get_num_sge() below need a fix. We are supposed to pass a
pointer to struct RAID_CONTEXT_G35 so that the setting is reflected in the
IO frame; otherwise it is only set in stack-local memory, which is not the
intent here. We will fix this patch and resend. Fixing and resending only
this patch works fine with the complete series (no hunk failures observed),
so we are going to push just that one patch with the below title.

[PATCH v2 21/39 RESEND] megaraid_sas: big endian support changes
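
For clarity, the corrected helper would presumably take a pointer so that
the update lands in the actual IO frame rather than in a stack copy (a
sketch of the intended fix, reusing the masks from the quoted code):

static inline void set_num_sge(struct RAID_CONTEXT_G35 *rctx_g35,
			       u16 sge_count)
{
	rctx_g35->u.bytes[0] = (u8)(sge_count & NUM_SGE_MASK_LOWER);
	rctx_g35->u.bytes[1] |= (u8)((sge_count >> NUM_SGE_SHIFT_UPPER)
				     & NUM_SGE_MASK_UPPER);
}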

> +
> +static inline u16 get_num_sge(struct RAID_CONTEXT_G35 rctx_g35) {
> + u16 sge_count;
> +
> + sge_count = (u16)(((rctx_g35.u.bytes[1] & NUM_SGE_MASK_UPPER)
> + << NUM_SGE_SHIFT_UPPER) | (rctx_g35.u.bytes[0]));
> + return sge_count;
> +}
> +
> +#define SET_STREAM_DETECTED(rctx_g35) \
> + (rctx_g35.u.bytes[1] |= STREAM_DETECT_MASK)
> +
> +#define CLEAR_STREAM_DETECTED(rctx_g35) \
> + (rctx_g35.u.bytes[1] &= ~(STREAM_DETECT_MASK))
> +
> +static inline bool is_stream_detected(struct RAID_CONTEXT_G35
> +*rctx_g35) {
> + return ((rctx_g35->u.bytes[1] & STREAM_DETECT_MASK)); }
> +
>  union RAID_CONTEXT_UNION {
>   struct RAID_CONTEXT raid_context;
>   struct RAID_CONTEXT_G35 raid_context_g35;
> --
> 2.8.3


RE: [PATCH v2 19/39] megaraid_sas: MR_TargetIdToLdGet u8 to u16 and avoid invalid raid-map access

2017-02-09 Thread Kashyap Desai
> Signed-off-by: Shivasharan S 
> Signed-off-by: Kashyap Desai 

In this patch series we are done with the review, but this particular patch
missed the Reviewed-by tag.

Kashyap


RE: [PATCH v2 03/39] megaraid_sas: raid 1 fast path code optimize

2017-02-08 Thread Kashyap Desai
> > +static inline void
> > +megasas_complete_r1_command(struct megasas_instance *instance,
> > +   struct megasas_cmd_fusion *cmd) {
> > +   u8 *sense, status, ex_status;
> > +   u32 data_length;
> > +   u16 peer_smid;
> > +   struct fusion_context *fusion;
> > +   struct megasas_cmd_fusion *r1_cmd = NULL;
> > +   struct scsi_cmnd *scmd_local = NULL;
> > +   struct RAID_CONTEXT_G35 *rctx_g35;
> > +
> > +   rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35;
> > +   fusion = instance->ctrl_context;
> > +   peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid);
> > +
> > +   r1_cmd = fusion->cmd_list[peer_smid - 1];
> > +   scmd_local = cmd->scmd;
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
> > +   data_length = cmd->io_request->DataLength;
> > +   sense = cmd->sense;
> > +
> > +   cmd->cmd_completed = true;
>
> Please help me understand how this works
> - there are two peer commands sent to the controller
> - both are completed and the later calls scsi_done and returns both
r1_cmd
> + cmd
> - if both commands can be completed at the same time, is it possible
that
> the
>   above line is executed at the same moment for both completions ?
> How is the code  protected against a double completion when both
> completed commands see the peer cmd_completed as set ?


Tomas, cmd and r1_cmd (part of the same RAID 1 fast path write) will always
be completed on the same reply queue by the firmware. That is one of the
key requirements here for the RAID 1 fast path.
What you ask would be possible only if the FW completed cmd and r1_cmd on
different reply queues. If you notice, when we clone r1_cmd we also clone
the MSI-x index from the parent command.
So eventually the FW is aware of the binding of both cmd and r1_cmd w.r.t.
the reply queue index.

` Kashyap

>
> > +


RE: [PATCH 13/39] megaraid_sas : set residual bytes count during IO compeltion

2017-02-07 Thread Kashyap Desai
> -Original Message-
> From: Martin K. Petersen [mailto:martin.peter...@oracle.com]
> Sent: Tuesday, February 07, 2017 5:22 AM
> To: Shivasharan S
> Cc: linux-scsi@vger.kernel.org; martin.peter...@oracle.com;
> the...@redhat.com; j...@linux.vnet.ibm.com;
> kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> Subject: Re: [PATCH 13/39] megaraid_sas : set residual bytes count
during IO
> compeltion
>
> > "Shivasharan" == Shivasharan S
>  writes:
>
> Shivasharan> Fixing issue of not setting residual bytes correctly.
>
> @@ -1464,6 +1465,15 @@ map_cmd_status(struct fusion_context *fusion,
>  SCSI_SENSE_BUFFERSIZE);
>   scmd->result |= DRIVER_SENSE << 24;
>   }
> +
> + /*
> +  * If the  IO request is partially completed, then MR FW
will
> +  * update "io_request->DataLength" field with actual
number
> of
> +  * bytes transferred.Driver will set residual bytes count
in
> +  * SCSI command structure.
> +  */
> + resid = (scsi_bufflen(scmd) - data_length);
> + scsi_set_resid(scmd, resid);
>
> Is data_length guaranteed to be a multiple of the logical block size?
> Otherwise you need to tweak the residual like we just did for mpt3sas.

Martin, the data length is always guaranteed to be a multiple of the
logical block size, unless we have a firmware defect.
In the past we have seen some partial/complete DMA data lengths returned
from firmware that were not aligned to the logical block size; those were
eventually root-caused and fixed in the firmware.

>
> --
> Martin K. PetersenOracle Linux Engineering


RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach

2017-02-07 Thread Kashyap Desai
> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Monday, February 06, 2017 10:48 PM
> To: 'Tomas Henzl'; Shivasharan Srikanteshwara;
'linux-scsi@vger.kernel.org'
> Cc: 'martin.peter...@oracle.com'; 'j...@linux.vnet.ibm.com'; Sumit
Saxena;
> 'h...@suse.com'
> Subject: RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
during
> controller shutdown/detach
>
> > -Original Message-
> > From: Tomas Henzl [mailto:the...@redhat.com]
> > Sent: Monday, February 06, 2017 9:35 PM
> > To: Shivasharan S; linux-scsi@vger.kernel.org
> > Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> > kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> > Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
> > during controller shutdown/detach
> >
> > On 6.2.2017 11:00, Shivasharan S wrote:
> > > Signed-off-by: Kashyap Desai 
> > > Signed-off-by: Shivasharan S
> > 
> > > ---
> > >  drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> > b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > index 04ef0a0..b29cfd3 100644
> > > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev,
> > pm_message_t state)
> > >   if (instance->ev != NULL) {
> > >   struct megasas_aen_event *ev = instance->ev;
> > >   cancel_delayed_work_sync(&ev->hotplug_work);
> > > + flush_scheduled_work();
> > >   instance->ev = NULL;
> > >   }
> > >
> > > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev
> > *pdev)
> > >   if (instance->ev != NULL) {
> > >   struct megasas_aen_event *ev = instance->ev;
> > >   cancel_delayed_work_sync(&ev->hotplug_work);
> > > + flush_scheduled_work();
> > >   instance->ev = NULL;
> > >   }
> > >
> >
> > Why is cancel_delayed_work_sync not good enough?
>
> Megaraid_sas driver use certain work on global work queue.
>
> Below are the listed one -
>
>   if (instance->ctrl_context) {
>   INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq);
>   INIT_WORK(&instance->crash_init,
> megasas_fusion_crash_dump_wq);
>   }
>   else
>   INIT_WORK(&instance->work_init,
> process_fw_state_change_wq)
>
> Cancel_delayed_work_sync() was mainly targeted for only hotplug AEN
work.
> Calling flush_scheduled_work() we want above listed work to be completed
> as well.

Tomas - here is one more update. I agree with your assessment. We don't
need this patch.

In our local repo the code was like the snippet below, and as part of the
sync-up activity I did not realize that upstream is already using
cancel_delayed_work_sync(), which internally does the same thing.

cancel_delayed_work(&ev->hotplug_work);
flush_scheduled_work();

Just for info - a similar patch was posted for mpt2sas a long time ago to
replace the above combination with cancel_delayed_work_sync():

https://lkml.org/lkml/2010/12/21/127

We will accommodate removal of this patch in V2 submission.



>
> >
> > tomash


RE: [PATCH 33/39] megaraid_sas: call flush_scheduled_work during controller shutdown/detach

2017-02-06 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Monday, February 06, 2017 9:35 PM
> To: Shivasharan S; linux-scsi@vger.kernel.org
> Cc: martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> kashyap.de...@broadcom.com; sumit.sax...@broadcom.com;
> h...@suse.com
> Subject: Re: [PATCH 33/39] megaraid_sas: call flush_scheduled_work
during
> controller shutdown/detach
>
> On 6.2.2017 11:00, Shivasharan S wrote:
> > Signed-off-by: Kashyap Desai 
> > Signed-off-by: Shivasharan S
> 
> > ---
> >  drivers/scsi/megaraid/megaraid_sas_base.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> > index 04ef0a0..b29cfd3 100644
> > --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> > +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> > @@ -6393,6 +6393,7 @@ megasas_suspend(struct pci_dev *pdev,
> pm_message_t state)
> > if (instance->ev != NULL) {
> > struct megasas_aen_event *ev = instance->ev;
> > cancel_delayed_work_sync(&ev->hotplug_work);
> > +   flush_scheduled_work();
> > instance->ev = NULL;
> > }
> >
> > @@ -6619,6 +6620,7 @@ static void megasas_detach_one(struct pci_dev
> *pdev)
> > if (instance->ev != NULL) {
> > struct megasas_aen_event *ev = instance->ev;
> > cancel_delayed_work_sync(&ev->hotplug_work);
> > +   flush_scheduled_work();
> > instance->ev = NULL;
> > }
> >
>
> Why is cancel_delayed_work_sync not good enough?

The megaraid_sas driver schedules certain work items on the global work
queue.

Below are the ones in question -

	if (instance->ctrl_context) {
		INIT_WORK(&instance->work_init, megasas_fusion_ocr_wq);
		INIT_WORK(&instance->crash_init,
			megasas_fusion_crash_dump_wq);
	}
	else
		INIT_WORK(&instance->work_init,
			process_fw_state_change_wq)

cancel_delayed_work_sync() was mainly targeted at only the hotplug AEN work.
By calling flush_scheduled_work() we want the work items listed above to be
completed as well.

>
> tomash


RE: [PATCH 03/39] megaraid_sas: raid 1 fast path code optimize

2017-02-06 Thread Kashyap Desai
> >
> >  /**
> > + * megasas_complete_r1_command -
> > + * completes R1 FP write commands which has valid peer smid
> > + * @instance:  Adapter soft state
> > + * @cmd_fusion:MPT command frame
> > + *
> > + */
> > +static inline void
> > +megasas_complete_r1_command(struct megasas_instance *instance,
> > +   struct megasas_cmd_fusion *cmd) {
> > +   u8 *sense, status, ex_status;
> > +   u32 data_length;
> > +   u16 peer_smid;
> > +   struct fusion_context *fusion;
> > +   struct megasas_cmd_fusion *r1_cmd = NULL;
> > +   struct scsi_cmnd *scmd_local = NULL;
> > +   struct RAID_CONTEXT_G35 *rctx_g35;
> > +
> > +   rctx_g35 = &cmd->io_request->RaidContext.raid_context_g35;
> > +   fusion = instance->ctrl_context;
> > +   peer_smid = le16_to_cpu(rctx_g35->smid.peer_smid);
> > +
> > +   r1_cmd = fusion->cmd_list[peer_smid - 1];
> > +   scmd_local = cmd->scmd;
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
> > +   data_length = cmd->io_request->DataLength;
> > +   sense = cmd->sense;
> > +
> > +   cmd->cmd_completed = true;
> > +
> > +   /* Check if peer command is completed or not*/
> > +   if (r1_cmd->cmd_completed) {
> > +   if (rctx_g35->status != MFI_STAT_OK) {
> > +   status = rctx_g35->status;
> > +   ex_status = rctx_g35->ex_status;
>
> Both status + ex_status were already set to the same value, why is it
> repeated here ?

Tomas, this needs a fix. The raid context should be switched to r1_cmd, but
that is not done here.
If the r1 cmd completed with a failure, we want to check the status and
extended status from r1_cmd and send that as the final status to the mid
layer.

We will fix this and resend the patch. It will be like this -

	if (r1_cmd->cmd_completed) {
		/* << - This line should be added: switch to the peer
		 * command's raid context before checking its status.
		 */
		rctx_g35 = &r1_cmd->io_request->RaidContext.raid_context_g35;
		if (rctx_g35->status != MFI_STAT_OK) {
			status = rctx_g35->status;
			ex_status = rctx_g35->ex_status;

Thanks, Kashyap

>
> Tomas
>


RE: [PATCH 00/10] mpt3sas: full mq support

2017-01-31 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Wednesday, February 01, 2017 12:21 PM
> To: Kashyap Desai; Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> Sathya
> Prakash Veerichetty; PDL-MPT-FUSIONLINUX; Sreekanth Reddy
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 01/31/2017 06:54 PM, Kashyap Desai wrote:
> >> -Original Message-
> >> From: Hannes Reinecke [mailto:h...@suse.de]
> >> Sent: Tuesday, January 31, 2017 4:47 PM
> >> To: Christoph Hellwig
> >> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
> > Sathya
> >> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
> >> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
> >>
> >> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
> >>> On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
> >>>> Hi all,
> >>>>
> >>>> this is a patchset to enable full multiqueue support for the
> >>>> mpt3sas
> >> driver.
> >>>> While the HBA only has a single mailbox register for submitting
> >>>> commands, it does have individual receive queues per MSI-X
> >>>> interrupt and as such does benefit from converting it to full
> >>>> multiqueue
> > support.
> >>>
> >>> Explanation and numbers on why this would be beneficial, please.
> >>> We should not need multiple submissions queues for a single register
> >>> to benefit from multiple completion queues.
> >>>
> >> Well, the actual throughput very strongly depends on the blk-mq-sched
> >> patches from Jens.
> >> As this is barely finished I didn't post any numbers yet.
> >>
> >> However:
> >> With multiqueue support:
> >> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt=
> 60021msec
> >> With scsi-mq on 1 queue:
> >> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec
> >> So yes, there _is_ a benefit.
> >>
> >> (Which is actually quite cool, as these tests were done on a SAS3
> >> HBA,
> > so
> >> we're getting close to the theoretical maximum of 1.2GB/s).
> >> (Unlike the single-queue case :-)
> >
> > Hannes -
> >
> > Can you share details about the setup? How many drives do you have, and
> > how are they connected (enclosure -> drives)?
> > To me it looks like the current mpt3sas driver might be taking a bigger
> > hit in spinlock operations (the penalty on a NUMA arch is higher compared
> > to a single-core server), unlike the megaraid_sas driver with its use of
> > a shared blk tag.
> >
> The tests were done with a single LSI SAS3008 connected to a NetApp E-
> series (2660), using 4 LUNs under MD-RAID0.
>
> Megaraid_sas is even worse here; due to the odd nature of the 'fusion'
> implementation we're ending up having _two_ sets of tags, making it really
> hard to use scsi-mq here.

The current megaraid_sas, with a single submission queue exposed to blk-mq,
will not encounter a similar performance issue.
We may not see a significant performance improvement if we attempt the same
for the megaraid_sas driver.
We had a similar discussion for megaraid_sas and hpsa:
http://www.spinics.net/lists/linux-scsi/msg101838.html

I see this patch series as a similar attempt for mpt3sas. Am I missing
anything?

The megaraid_sas driver just does indexing from the blk tag and fires the IO
quickly enough, unlike mpt3sas where we have lock contention at the driver
level as the bottleneck.
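
As a rough sketch of what I mean by tag-based indexing (illustrative only,
not the actual megaraid_sas code), the LLD can map the blk-mq tag straight
to its pre-allocated command frame without taking any driver lock:

	static struct megasas_cmd_fusion *
	get_cmd_from_blk_tag(struct fusion_context *fusion,
			     struct scsi_cmnd *scmd)
	{
		/* tag is assigned by blk-mq and unique across the host */
		u32 tag = scmd->request->tag;

		/* O(1), lock-free lookup of the command frame */
		return fusion->cmd_list[tag];
	}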

> (Not that I didn't try; but lacking a proper backend it's really hard to
> evaluate
> the benefit of those ... spinning HDDs simply don't cut it here)
>
> > I mean, the "[PATCH 08/10] mpt3sas: lockless command submission for
> > scsi-mq" patch improves performance by removing spinlock overhead and
> > getting the request using blk tags.
> > Are you seeing a performance improvement if you hard-code nr_hw_queues
> > = 1 in the code changes below, part of "[PATCH 10/10] mpt3sas: scsi-mq
> > interrupt steering"?
> >
> No. The numbers posted above are generated with exactly that patch; the
> first line is running with nr_hw_queues=32 and the second line with
> nr_hw_queues=1.

Thanks Hannes. That clarifies. Can you share the  script you have used?

If my understanding is correct, you will see the theoretical maximum of
1.2GB/s if you restrict your workload to a single NUMA node. This is just to
understand whether the  driver spinlocks are adding overhead. We have
seen such overhead on multi-socket server an

RE: [PATCH 00/10] mpt3sas: full mq support

2017-01-31 Thread Kashyap Desai
> -Original Message-
> From: Hannes Reinecke [mailto:h...@suse.de]
> Sent: Tuesday, January 31, 2017 4:47 PM
> To: Christoph Hellwig
> Cc: Martin K. Petersen; James Bottomley; linux-scsi@vger.kernel.org;
Sathya
> Prakash; Kashyap Desai; mpt-fusionlinux@broadcom.com
> Subject: Re: [PATCH 00/10] mpt3sas: full mq support
>
> On 01/31/2017 11:02 AM, Christoph Hellwig wrote:
> > On Tue, Jan 31, 2017 at 10:25:50AM +0100, Hannes Reinecke wrote:
> >> Hi all,
> >>
> >> this is a patchset to enable full multiqueue support for the mpt3sas
> driver.
> >> While the HBA only has a single mailbox register for submitting
> >> commands, it does have individual receive queues per MSI-X interrupt
> >> and as such does benefit from converting it to full multiqueue
support.
> >
> > Explanation and numbers on why this would be beneficial, please.
> > We should not need multiple submissions queues for a single register
> > to benefit from multiple completion queues.
> >
> Well, the actual throughput very strongly depends on the blk-mq-sched
> patches from Jens.
> As this is barely finished I didn't post any numbers yet.
>
> However:
> With multiqueue support:
> 4k seq read : io=60573MB, bw=1009.2MB/s, iops=258353, runt= 60021msec
> With scsi-mq on 1 queue:
> 4k seq read : io=17369MB, bw=296291KB/s, iops=74072, runt= 60028msec So
> yes, there _is_ a benefit.
>
> (Which is actually quite cool, as these tests were done on a SAS3 HBA,
so
> we're getting close to the theoretical maximum of 1.2GB/s).
> (Unlike the single-queue case :-)

Hannes -

Can you share details about the setup? How many drives do you have, and how
are they connected (enclosure -> drives)?
To me it looks like the current mpt3sas driver might be taking a bigger hit
in spinlock operations (the penalty on a NUMA arch is higher compared to a
single-core server), unlike the megaraid_sas driver with its use of a shared
blk tag.

I mean, the "[PATCH 08/10] mpt3sas: lockless command submission for scsi-mq"
patch improves performance by removing spinlock overhead and getting the
request using blk tags.
Are you seeing a performance improvement if you hard-code nr_hw_queues = 1
in the code changes below, part of "[PATCH 10/10] mpt3sas: scsi-mq interrupt
steering"?

@@ -9054,6 +9071,8 @@ static void sas_device_make_active(struct
MPT3SAS_ADAPTER *ioc,
shost->max_lun = max_lun;
shost->transportt = mpt3sas_transport_template;
shost->unique_id = ioc->id;
+   if (shost->use_blk_mq)
+   shost->nr_hw_queues = num_online_cpus();


Thanks, Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB
21284
> (AG Nürnberg)


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
> -Original Message-
> From: Jens Axboe [mailto:ax...@kernel.dk]
> Sent: Monday, January 30, 2017 10:03 PM
> To: Bart Van Assche; osan...@osandov.com; kashyap.de...@broadcom.com
> Cc: linux-scsi@vger.kernel.org; linux-ker...@vger.kernel.org;
> h...@infradead.org; linux-bl...@vger.kernel.org; paolo.vale...@linaro.org
> Subject: Re: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> On 01/30/2017 09:30 AM, Bart Van Assche wrote:
> > On Mon, 2017-01-30 at 19:22 +0530, Kashyap Desai wrote:
> >> -   if (atomic_inc_return(&instance->fw_outstanding) >
> >> -   instance->host->can_queue) {
> >> -   atomic_dec(&instance->fw_outstanding);
> >> -   return SCSI_MLQUEUE_HOST_BUSY;
> >> -   }
> >> +   if (atomic_inc_return(&instance->fw_outstanding) >
safe_can_queue) {
> >> +   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
> >> +   /* For rotational device wait for sometime to get fusion
> >> + command
> >> from pool.
> >> +* This is just to reduce proactive re-queue at mid layer
> >> + which is
> >> not
> >> +* sending sorted IO in SCSI.MQ mode.
> >> +*/
> >> +   if (!is_nonrot)
> >> +   udelay(100);
> >> +   }
> >
> > The SCSI core does not allow to sleep inside the queuecommand()
> > callback function.
>
> udelay() is a busy loop, so it's not sleeping. That said, it's obviously
NOT a
> great idea. We want to fix the reordering due to requeues, not introduce
> random busy delays to work around it.

Thanks for the feedback. I do realize that udelay() is going to be very odd
in the queuecommand callback; I will keep this in mind. The preferred
solution is the blk-mq scheduler patches.
>
> --
> Jens Axboe


RE: Device or HBA level QD throttling creates randomness in sequetial workload

2017-01-30 Thread Kashyap Desai
Hi Jens/Omar,

I used the git.kernel.dk/linux-block branch - blk-mq-sched (commit
0efe27068ecf37ece2728a99b863763286049ab5) and confirmed that the issue
reported in this thread is resolved.

Now I see that both MQ and SQ mode result in a sequential IO pattern while
IO is getting re-queued in the block layer.

To get similar performance without the blk-mq-sched feature, is it
reasonable to pause IO for a few usec in the LLD?
I mean, I want to avoid the driver asking the SML/block layer to re-queue
the IO (if it is sequential on rotational media).

Explaining w.r.t. the megaraid_sas driver: this driver exposes can_queue,
but it internally consumes commands for the RAID 1 fast path.
In the worst case, can_queue/2 will consume all firmware resources and the
driver will re-queue further IOs to the SML as below -

   if (atomic_inc_return(&instance->fw_outstanding) >
   instance->host->can_queue) {
   atomic_dec(&instance->fw_outstanding);
   return SCSI_MLQUEUE_HOST_BUSY;
   }

I want to avoid the SCSI_MLQUEUE_HOST_BUSY above.

I need your suggestions on the changes below -

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 9a9c84f..a683eb0 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -54,6 +54,7 @@
 #include 
 #include 
 #include 
+#include 

 #include "megaraid_sas_fusion.h"
 #include "megaraid_sas.h"
@@ -2572,7 +2573,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
u32 index;
-   struct fusion_context *fusion;
+   bool is_nonrot;
+   u32 safe_can_queue;
+   u32 num_cpus;
+   struct fusion_context *fusion;
+
+   fusion = instance->ctrl_context;
+
+   num_cpus = num_online_cpus();
+   safe_can_queue = instance->cur_can_queue - num_cpus;

fusion = instance->ctrl_context;

@@ -2584,11 +2593,15 @@ void megasas_prepare_secondRaid1_IO(struct
megasas_instance *instance,
return SCSI_MLQUEUE_DEVICE_BUSY;
}

-   if (atomic_inc_return(&instance->fw_outstanding) >
-   instance->host->can_queue) {
-   atomic_dec(&instance->fw_outstanding);
-   return SCSI_MLQUEUE_HOST_BUSY;
-   }
+   if (atomic_inc_return(&instance->fw_outstanding) > safe_can_queue) {
+   is_nonrot = blk_queue_nonrot(scmd->device->request_queue);
+   /* For rotational device wait for sometime to get fusion command
from pool.
+* This is just to reduce proactive re-queue at mid layer which is
not
+* sending sorted IO in SCSI.MQ mode.
+*/
+   if (!is_nonrot)
+   udelay(100);
+   }

cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);

` Kashyap

> -Original Message-
> From: Kashyap Desai [mailto:kashyap.de...@broadcom.com]
> Sent: Tuesday, November 01, 2016 11:11 AM
> To: 'Jens Axboe'; 'Omar Sandoval'
> Cc: 'linux-scsi@vger.kernel.org'; 'linux-ker...@vger.kernel.org'; 'linux-
> bl...@vger.kernel.org'; 'Christoph Hellwig'; 'paolo.vale...@linaro.org'
> Subject: RE: Device or HBA level QD throttling creates randomness in
> sequetial workload
>
> Jens- Replied inline.
>
>
> Omar -  I tested your WIP repo and figure out System hangs only if I pass
> "
> scsi_mod.use_blk_mq=Y". Without this, your WIP branch works fine, but I
> am looking for scsi_mod.use_blk_mq=Y.
>
> Also below is snippet of blktrace. In case of higher per device QD, I see
> Requeue request in blktrace.
>
> 65,128 10 6268 2.432404509 18594  P   N [fio]
>  65,128 10 6269 2.432405013 18594  U   N [fio] 1
>  65,128 10 6270 2.432405143 18594  I  WS 148800 + 8 [fio]
>  65,128 10 6271 2.432405740 18594  R  WS 148800 + 8 [0]
>  65,128 10 6272 2.432409794 18594  Q  WS 148808 + 8 [fio]
>  65,128 10 6273 2.432410234 18594  G  WS 148808 + 8 [fio]
>  65,128 10 6274 2.432410424 18594  S  WS 148808 + 8 [fio]
>  65,128 23 3626 2.432432595 16232  D  WS 148800 + 8
> [kworker/23:1H]
>  65,128 22 3279 2.432973482 0  C  WS 147432 + 8 [0]
>  65,128  7 6126 2.433032637 18594  P   N [fio]
>  65,128  7 6127 2.433033204 18594  U   N [fio] 1
>  65,128  7 6128 2.433033346 18594  I  WS 148808 + 8 [fio]
>  65,128  7 6129 2.433033871 18594  D  WS 148808 + 8 [fio]
>  65,128  7 6130 2.433034559 18594  R  WS 148808 + 8 [0]
>  65,128  7 6131 2.433039796 18594  Q  WS 148816 + 8 [fio]
>  65,128  7 6132 2.433040206 18594  G  WS 148816 + 8 [fio]
>  65,128  7 6133 2.433040351 18594  S  WS 148816 + 8 [fio]
>  65,128  9 6392 2.433133729 0  C  WS 147240 + 8 [0]
>  65,128  9 6393

RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-16 Thread Kashyap Desai
> Hi, Kashyap,
>
> I'm CC-ing Kent, seeing how this is his code.

Hi Jeff and Kent, See my reply inline.

>
> Kashyap Desai  writes:
>
> > Objective of this patch is -
> >
> > To move code used in bcache module in block layer which is used to
> > find IO stream.  Reference code @drivers/md/bcache/request.c
> > check_should_bypass().  This is a high level patch for review and
> > understand if it is worth to follow ?
> >
> > As of now bcache module use this logic, but good to have it in block
> > layer and expose function for external use.
> >
> > In this patch, I move logic of sequential IO search in block layer and
> > exposed function blk_queue_rq_seq_cutoff.  Low level driver just need
> > to call if they want stream detection per request queue.  For my
> > testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> > 4) megaraid_sas driver.
> >
> > In general, code of bcache module was referred and they are doing
> > almost same as what we want to do in megaraid_sas driver below patch -
> >
> > http://marc.info/?l=linux-scsi&m=148245616108288&w=2
> >
> > bcache implementation use search algorithm (hashed based on bio start
> > sector) and detects 128 streams.  wanted those implementation
> > to skip sequential IO to be placed on SSD and move it direct to the
> > HDD.
> >
> > Will it be good design to keep this algorithm open at block layer (as
> > proposed in patch.) ?
>
> It's almost always a good idea to avoid code duplication, but this patch
> definitely needs some work.

Jeff, I was not aware of the actual block layer module, so I created just a
working patch to explain my point.
Check the new patch. This patch contains driver-only changes, in the 
driver.

1. The MR driver patch below does similar things, but the code is an
array-based linear lookup.
 http://marc.info/?l=linux-scsi&m=148245616108288&w=2

2. I thought to improve on this using the appended patch. It is similar to
what  is doing. This patch has duplicate code, as  does the same.

>
> I haven't looked terribly closely at the bcache implementaiton, so do
let me
> know if I've misinterpreted something.
>
> We should track streams per io_context/queue pair.  We already have a
data
> structure for that, the io_cq.  Right now that structure is tailored for
use by the
> I/O schedulers, but I'm sure we could rework that.  That would also get
rid of the
> tremedous amount of bloat this patch adds to the request_queue.  It will
also
> allow us to remove the bcache-specific fields that were added to
task_struct.
> Overall, it should be a good simplification, unless I've completely
missed the
> point (which happens).

Your understanding of the requirement is correct. What we need is a tracker
of  in the block layer, checked for every request, to know whether
this is random or sequential IO.  As you explained, there is similar logic
in ; I searched the kernel code and found the code section below in
block/elevator.c:

/*
 * See if our hash lookup can find a potential backmerge.
 */
__rq = elv_rqhash_find(q, bio->bi_iter.bi_sector);


I am looking for logic similar to what elv_rqhash_find() does, applied to
all IOs, that provides information in the request about whether this
particular request is a potential back-merge candidate (via a new
req_flags_t, e.g. RQF_SEQ). That is OK even if the request was not actually
merged due to other checks in the IO path.

To be on the safe side (to avoid any performance issues), we could opt for
an API to be called by the low-level driver on a particular request
queue/sdev, if someone is interested in such help for that request queue.

I need help (some level of a patch to work on) or a pointer on whether this
path is good. I can drive this, but I need to understand the direction.
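
To make the direction concrete, here is a rough sketch of the opt-in shape I
have in mind (the flag bit and the cutoff unit are illustrative, not
existing block layer API):

	/* LLD opts in per request queue, e.g. from slave_configure() */
	void blk_queue_rq_seq_cutoff(struct request_queue *q, unsigned int cutoff)
	{
		/* unit is an assumption here (MB-style cutoff, as bcache uses) */
		q->sequential_cutoff = cutoff << 20;
	}

	/* Block layer marks requests it considers part of a sequential
	 * stream, based on an elv_rqhash_find()-style lookup.
	 */
	#define RQF_SEQ	((__force req_flags_t)(1 << 20))	/* illustrative bit */

	/* LLD side: check the hint when building the command */
	static bool scmd_is_sequential(struct scsi_cmnd *scmd)
	{
		return scmd->request->rq_flags & RQF_SEQ;
	}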

>
> I don't like that you put sequential I/O detection into bio_check_eod.
> Split it out into its own function.

Sorry for this. I thought of sending a patch to get a better understanding.
My first patch was very high level and did not comply with many design or
coding conventions.
For my learning - BTW, for such a post (if I have a high-level patch), what
should I do?

> You've added a member to struct bio that isn't referenced.  It would
have been
> nice of you to put enough work into this RFC so that we could at least
see how
> the common code was used by bcache and your driver.

See my second patch appended here. I can work on generic block layer changes
if there is another area (like the mentioned elevator/cfq code) already
doing the stuff I am looking for.

>
> EWMA (exponentially weighted moving average) is not an acronym I keep
handy
> in my head.  It would be nice to add documentation on the algorithm and
design
> choices.  More comments in the code would also be appreciated.  CFQ does
> some 

RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Kashyap Desai
> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Thursday, January 12, 2017 1:18 AM
> To: Kashyap Desai
> Cc: kbuild-...@01.org; linux-scsi@vger.kernel.org;
linux-bl...@vger.kernel.org;
> ax...@kernel.dk; martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> sumit.sax...@broadcom.com; Kashyap desai
> Subject: Re: [PATCH] preview - block layer help to detect sequential IO
>
> Hi Kashyap,
>
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to block/for-next linus/master linux/master next-20170111]
[if
> your patch is applied to the wrong git tree, please drop us a note to
help
> improve the system]
>
> url:
https://github.com/0day-ci/linux/commits/Kashyap-Desai/preview-block-
> layer-help-to-detect-sequential-IO/20170112-024228
> config: i386-randconfig-a0-201702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>block/blk-core.c: In function 'add_sequential':
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,


This error is fixable. For now, I just wanted to get a high-level review of
the idea.
The defines below are required to use sequential_io and sequential_io_avg. I
have enabled BCACHE in my .config for testing.

#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
	unsigned int	sequential_io;
	unsigned int	sequential_io_avg;
#endif

Looking for high-level review comments.

` Kashyap


>^
>block/blk-core.c:1893:10: note: in definition of macro 'blk_ewma_add'
> (ewma) *= (weight) - 1;
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1894:10: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>  ^~~~
> >> block/blk-core.c:1900:5: error: 'struct task_struct' has no member
named
> 'sequential_io'
>t->sequential_io, 8, 0);
> ^
>block/blk-core.c:1894:20: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor;
\
>^~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1895:10: note: in definition of macro 'blk_ewma_add'
> (ewma) /= (weight);
\
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1896:10: note: in definition of macro 'blk_ewma_add'
> (ewma) >> factor;
\
>  ^~~~
>block/blk-core.c:1902:3: error: 'struct task_struct' has no member
named
> 'sequential_io'
>  t->sequential_io = 0;
>   ^~
>block/blk-core.c: In function 'generic_make_request_checks':
>block/blk-core.c:2012:7: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   task->sequential_io  = i->sequential;
>   ^~
>In file included from block/blk-core.c:14:0:
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:2: note: in definition of macro '__max'
>  t1 max1 = (x); \
>  ^~
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member
named
> 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:13: note: in definition of macro '__max'
>  t1 max1 = (x); \
> ^
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2021:14: error: 'struct task_struct' has no member
named
> 'sequential_io_avg'
> 

[PATCH] preview - block layer help to detect sequential IO

2017-01-11 Thread Kashyap Desai
Objective of this patch is - 

To move code used in bcache module in block layer which is used to find IO 
stream. 
Reference code @drivers/md/bcache/request.c check_should_bypass().
This is a high level patch for review and understand if it is worth to follow ?

As of now bcache module use this logic, but good to have it in block layer and 
expose function for external use.

In this patch, I move logic of sequential IO search in block layer and exposed 
function blk_queue_rq_seq_cutoff.
Low level driver just need to call if they want stream detection per request 
queue. 
For my testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue, 
4) megaraid_sas driver.
 
In general, code of bcache module was referred and they are doing almost same 
as what we want to do in 
megaraid_sas driver below patch -

http://marc.info/?l=linux-scsi&m=148245616108288&w=2
 
bcache implementation use search algorithm (hashed based on bio start sector)
and detects 128 streams.  wanted those implementation to skip 
sequential IO 
to be placed on SSD and move it direct to the HDD. 

Will it be good design to keep this algorithm open at block layer (as proposed 
in patch.) ?

Signed-off-by: Kashyap desai 
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 14d7c07..2e93d14 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 {
struct request_queue *q;
int err;
+   struct seq_io_tracker *io;
 
q = kmem_cache_alloc_node(blk_requestq_cachep,
gfp_mask | __GFP_ZERO, node_id);
@@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
 
if (blkcg_init_queue(q))
goto fail_ref;
+   
+   q->sequential_cutoff = 0;
+   spin_lock_init(&q->io_lock);
+   INIT_LIST_HEAD(&q->io_lru);
+
+   for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
+   list_add(&io->lru, &q->io_lru);
+   hlist_add_head(&io->hash, q->io_hash + BLK_RECENT_IO);
+   }
 
return q;
 
@@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
return 0;
 }
 
+static void add_sequential(struct task_struct *t)
+{
+#define blk_ewma_add(ewma, val, weight, factor) \
+({  \
+(ewma) *= (weight) - 1; \
+(ewma) += (val) << factor;  \
+(ewma) /= (weight); \
+(ewma) >> factor;   \
+})
+
+   blk_ewma_add(t->sequential_io_avg,
+t->sequential_io, 8, 0);
+
+   t->sequential_io = 0;
+}
+static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
+{
+   return &q->io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -1884,6 +1914,7 @@ static inline int bio_check_eod(struct bio *bio, unsigned 
int nr_sectors)
int err = -EIO;
char b[BDEVNAME_SIZE];
struct hd_struct *part;
+   struct task_struct *task = current;
 
might_sleep();
 
@@ -1957,6 +1988,42 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
if (!blkcg_bio_issue_check(q, bio))
return false;
 
+   if (q->sequential_cutoff) {
+   struct seq_io_tracker *i;
+   unsigned sectors;
+
+   spin_lock(&q->io_lock);
+
+   hlist_for_each_entry(i, blk_iohash(q, bio->bi_iter.bi_sector), 
hash)
+   if (i->last == bio->bi_iter.bi_sector &&
+   time_before(jiffies, i->jiffies))
+   goto found;
+
+   i = list_first_entry(&q->io_lru, struct seq_io_tracker, lru);
+
+   add_sequential(task);
+   i->sequential = 0;
+found:
+   if (i->sequential + bio->bi_iter.bi_size > i->sequential)
+   i->sequential   += bio->bi_iter.bi_size;
+
+   i->last  = bio_end_sector(bio);
+   i->jiffies   = jiffies + msecs_to_jiffies(5000);
+   task->sequential_io  = i->sequential;
+
+   hlist_del(&i->hash);
+   hlist_add_head(&i->hash, blk_iohash(q, i->last));
+   list_move_tail(&i->lru, &q->io_lru);
+
+   spin_unlock(&q->io_lock);
+
+   sectors = max(task->sequential_io,
+ task->sequential_io_avg) >> 9;
+   

RE: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI commands to be retried

2016-12-14 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Hannes Reinecke
> Sent: Wednesday, December 14, 2016 9:07 PM
> To: Sumit Saxena; linux-scsi
> Subject: Re: SCSI: usage of DID_REQUEUE vs DID_RESET for returning SCSI
> commands to be retried
>
> On 12/13/2016 02:19 PM, Sumit Saxena wrote:
> > Hi all,
> >
> > I have query regarding usage of host_byte DID_REQUEUE vs DID_RESET
> > returned by LLD to SCSI mid layer.
> >
> > Let me give some background here.
> > I am using megaraid_sas controller. megaraid_sas driver returns all
> > outstanding SCSI commands back to SCSI layer with DID_RESET host_byte
> > before resetting controller.
> > The intent is- all these commands returned with DID_RESET before
> > controller reset should be retried by SCSI.
> >
> > In few distros, I have observed that if SYNCHRONIZE_CACHE
> > command(should be applicable for all non Read write commands) is
> > outstanding before resetting controller  and driver returns those
> > commands back with DID_RESET then SYNCHRONIZE_CACHE command not
> > retried but failed to upper layer but other READ/WRITE commands were
> > not failed but retried. I was running filesystem IOs and
> > SYNHRONIZE_CACHE returned with error cause filesystem to go in READ
> > only mode.
> >
> > Later I checked and realized below commit was missing in that distro
> > kernel and seems to cause the problem-
> >
> > a621bac scsi_lib: correctly retry failed zero length REQ_TYPE_FS
> > commands
> >
> > But distro kernel has below commit-
> >
> > 89fb4cd scsi: handle flush errors properly
> >
> > Issue does not hit on the kernels which don't have both commits but
> > hits when commit- "89fb4cd scsi: handle flush errors properly " is
> > there but commit-  "a621bac scsi_lib: correctly retry failed zero
> > length REQ_TYPE_FS commands" is missing.
> >
> > This issue is observed with mpt3sas driver as well and should be
> > applicable to all LLDs returning non read write commands with DID_RESET.
> >
> > Returning DID_REQUEUE instead of DID_RESET from driver solves the
> > problem irrespective of whether these above mentioned commits are
> > there or not in kernel.
> > So I am thinking to use DID_REQUEUE instead of DID_RESET in
> > megaraid_sas driver for all SCSI commands(not only limited to
> > SYNCHRONIZE_CACHE or non read write commands) before resetting
> > controller. As I mentioned intent is those outstanding commands
> > returned by driver before doing controller reset should be retried and
> > as soon as reset is complete, driver will be processing those
> > commands. There is no sense key associated with SCSI commands
> > returned.
> >
> > I browsed SCSI code and get to know DID_REQUEUE causes command to be
> > retried by calling- scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY).
> > This seems to be good enough for our requirement of commands to be
> > re-tried by SCSI layer.
> >
> > Please provide feedback if anyone forsee any issue with using
> > DID_REQUEUE for this use case.
> > I will be doing some testing with DID_REQUEUE in place of DID_RESET in
> > megaraid_sas driver.
> >
> > I have attached here a proposed patch for megaraid_sas driver.
> > If there are no concerns, I will send this patch to SCSI mailing list.
> >
> Hmm.
>
> DID_RESET is meant to be an indicator to the SCSI midlayer that the host /
> device was reset, and the command _should_ be retried.
> DID_REQUEUE OTOH is an indicator to the SCSI midlayer to retry the
> command.
>
> The problem with DID_RESET is that it's slightly underspecified; there is
> no
> indication if the command has been processed (and if so, up to which
> parts?)
> DID_REQUEUE, OTOH, will always cause a retry.
>
> So yeah, I guess DID_REQUEUE is a better choice here.

Thanks Hannes. We also thought the DID_REQUEUE functionality would be the
better choice, and we are planning to move to it with the proposed patch.

Thanks for your feedback.  We will be sending the final patch upstream.
For now, we are queuing this change for internal testing.
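
For reference, a minimal sketch of what the proposed change amounts to
(helper and field names here are illustrative, not the exact megaraid_sas
patch):

	/* Before resetting the controller, return every outstanding command
	 * with DID_REQUEUE so the midlayer requeues it via scsi_queue_insert()
	 * with SCSI_MLQUEUE_DEVICE_BUSY, instead of possibly failing non-R/W
	 * commands such as SYNCHRONIZE_CACHE as seen with DID_RESET.
	 */
	static void megasas_requeue_outstanding_sketch(struct megasas_instance *instance)
	{
		struct fusion_context *fusion = instance->ctrl_context;
		struct megasas_cmd_fusion *cmd;
		int i;

		for (i = 0; i < instance->max_fw_cmds; i++) {
			cmd = fusion->cmd_list[i];
			if (!cmd->scmd)
				continue;

			cmd->scmd->result = DID_REQUEUE << 16;
			cmd->scmd->scsi_done(cmd->scmd);
			cmd->scmd = NULL;
		}
	}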


>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke  Teamlead Storage & Networking
> h...@suse.de +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284
> (AG
> Nürnberg)


RE: [PATCH] Update 3ware driver email addresses

2016-12-07 Thread Kashyap Desai
> -Original Message-
> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> ow...@vger.kernel.org] On Behalf Of Martin K. Petersen
> Sent: Thursday, December 08, 2016 4:43 AM
> To: adam radford
> Cc: linux-scsi; Kashyap Desai
> Subject: Re: [PATCH] Update 3ware driver email addresses
>
> >>>>> "Adam" == adam radford  writes:
>
> Adam,
>
> Adam> This maintainers/email update patch didn't get picked up.  Do I
> Adam> need to fix it or re-send ?
>
> I still have it in the queue. Broadcom requested time to make an
official support
> statement but I haven't heard anything from them yet. Kashyap?


Martin -
The official support statement from Broadcom:
LSI/Broadcom has stopped supporting 3ware controllers. If Adam volunteers to
keep it alive, we would like to remove the www.lsi.com references and make
it purely his driver.

Adam - Can you resend the patch with "www.lsi.com" removed from the 3ware
drivers? Sumit or I will ack it and Martin can pick it up for the next
release.

>
> --
> Martin K. Petersen  Oracle Linux Engineering


RE: [PATCH 5/5] megaraid_sas: add mmio barrier after register writes

2016-11-29 Thread Kashyap Desai
> -Original Message-
> From: Tomas Henzl [mailto:the...@redhat.com]
> Sent: Monday, November 21, 2016 9:27 PM
> To: Kashyap Desai; Hannes Reinecke; Martin K. Petersen
> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> s...@vger.kernel.org; Hannes Reinecke
> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after register
> writes
>
> On 18.11.2016 17:48, Kashyap Desai wrote:
> >> -Original Message-
> >> From: linux-scsi-ow...@vger.kernel.org [mailto:linux-scsi-
> >> ow...@vger.kernel.org] On Behalf Of Tomas Henzl
> >> Sent: Friday, November 18, 2016 9:23 PM
> >> To: Hannes Reinecke; Martin K. Petersen
> >> Cc: Christoph Hellwig; James Bottomley; Sumit Saxena; linux-
> >> s...@vger.kernel.org; Hannes Reinecke
> >> Subject: Re: [PATCH 5/5] megaraid_sas: add mmio barrier after
> >> register
> > writes
> >> On 11.11.2016 10:44, Hannes Reinecke wrote:
> >>> The megaraid_sas HBA only has a single register for I/O submission,
> >>> which will be hit pretty hard with scsi-mq. To ensure that the PCI
> >>> writes have made it across we need to add a mmio barrier after each
> >>> write; otherwise I've been seeing spurious command completions and
> >>> I/O stalls.
> >> Why is it needed that the PCI write reaches the hw exactly at this
> > point?
> >> Is it possible that this is a hw deficiency like that the hw can't
> > handle
> >> communication without tiny pauses, and so possible to remove in next
> >> generation?
> >> Thanks,
> >> Tomas
> > I think this is good to have mmiowb as we are already doing  for
> > writel() version of megasas_return_cmd_fusion.
> > May be not for x86, but for some other CPU arch it is useful. I think
> > it become more evident while scs-mq support for more than one
> > submission queue patch of Hannes expose this issue.
> >
> > Probably this patch is good. Intermediate PCI device (PCI bridge ?)
> > may cache PCI packet and eventually out of order PCI packet to
> > MegaRAID HBA can cause this type of spurious completion.
>
> Usually drivers do not add a write barrier after every pci write, unless
> like here in
> megasas_fire_cmd_fusion in the 32bit part where are two paired writes and
> it
> must be ensured that the pair arrives without any other write in between.
>
> Why is it wrong when a pci write is overtaken by another write or when
> happens
> a bit later and if it is wrong - don't we need an additional locking too ?
> The execution of  megasas_fire_cmd_fusion might be interrupted and a delay
> can happen at any time.

Since Hannes mentioned that in his experiment the mq megaraid_sas patch,
which opens more submission queues to the SML, caused invalid/spurious
completions in his code, I am trying to understand whether an mmiowb() after
the writeq() call is really required. My understanding is that writeq() is
atomic and its two 32-bit PCI WRITEs reach the PCI end device in the same
sequence, assuming there is no PCI-level caching on the Intel x86 platform.
E.g. if we have two CPUs executing writeq(), the PCI writes will always
reach the end device in the same sequence.

Assume Tag-1, Tag-2, Tag-3 and Tag-4 is the expected sequence. If, because
of some system-level optimization, the device sees Tag-1, Tag-3, Tag-2,
Tag-4 arrive, then we may see the issue Hannes experienced. We have seen
very rare instances of dual/spurious SMID completions on the released
megaraid_sas driver, and they are not easy to reproduce. So, thinking along
those lines, is it easy to reproduce such an issue by opening more
submission queues to the SML (just to reproduce the spurious completion
effectively)? We will apply all the patches posted by Hannes *just* to
understand this particular spurious completion issue and under what
conditions it hits.

We will post whether this mmiowb() call after writeq() is good to have.
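
For reference, a simplified sketch of the submission path under discussion
(condensed from the megasas_fire_cmd_fusion logic; exact function, register
and lock names may differ from the driver source):

	static inline void
	fire_cmd_sketch(struct megasas_instance *instance, u64 req_data)
	{
	#if defined(writeq) && defined(CONFIG_64BIT)
		/* One atomic 64-bit MMIO write of the request descriptor.
		 * The open question in this thread: is an mmiowb() needed
		 * here to keep descriptors ordered across CPUs on non-x86?
		 */
		writeq(req_data, &instance->reg_set->inbound_low_queue_port);
		mmiowb();
	#else
		unsigned long flags;

		/* 32-bit path: the two halves must reach the HBA as a pair,
		 * so they are serialized under a lock and already followed
		 * by a barrier in the existing driver.
		 */
		spin_lock_irqsave(&instance->hba_lock, flags);
		writel(lower_32_bits(req_data),
		       &instance->reg_set->inbound_low_queue_port);
		writel(upper_32_bits(req_data),
		       &instance->reg_set->inbound_high_queue_port);
		mmiowb();
		spin_unlock_irqrestore(&instance->hba_lock, flags);
	#endif
	}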

~ Kashyap

>
> tomash
>
> >
> >>> Signed-off-by: Hannes Reinecke 
> >>> ---
> >>>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> index aba53c0..729a654 100644
> >>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> >>> @@ -196,6 +196,7 @@ inline void megasas_return_cmd_fusion(struct
> >> megasas_instance *instance,
> >>>   le32_to_cpu(req_desc->u.low));
> >>>
> >>>   writeq(req_data, &instance->re

RE: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of "outstanding"

2016-11-29 Thread Kashyap Desai
> -Original Message-
> From: Bart Van Assche [mailto:bart.vanass...@sandisk.com]
> Sent: Wednesday, November 30, 2016 12:50 AM
> To: Colin King; Kashyap Desai; Sumit Saxena; Shivasharan S; James E . J .
> Bottomley; Martin K . Petersen; megaraidlinux@broadcom.com; linux-
> s...@vger.kernel.org
> Cc: linux-ker...@vger.kernel.org
> Subject: Re: [PATCH][V2] scsi: megaraid-sas: fix spelling mistake of
> "outstanding"
>
> On 11/29/2016 11:13 AM, Colin King wrote:
> > Trivial fix to spelling mistake "oustanding" to "outstanding" in
> > dev_info and scmd_printk messages. Also join wrapped literal string in
> > the scmd_printk.
>
> Reviewed-by: Bart Van Assche 

Please hold this commit, as we have a patch set in progress for the MegaRaid
driver and it will conflict with this patch if it is committed.
We will be re-submitting this patch with the appropriate Signed-off and
Submitted-by tags.

Thanks, Kashyap

