Re: Should a raid-0 array immediately stop if a component disk is removed?

2018-04-27 Thread Wols Lists
On 27/04/18 22:49, Guilherme G. Piccoli wrote:
> Hello, we've noticed an interesting behavior when using a raid-0 md
> array. Suppose we have a 2-disk raid-0 array that has a mount point
> set - in our tests, we've used an ext4 filesystem. If we remove one of
> the component disks via sysfs[0], userspace is notified, but the mdadm tool
> fails to stop the array[1] (it cannot open the array device node with the
> O_EXCL flag, hence it fails to issue the STOP_ARRAY ioctl). Even if we
> circumvent the mdadm O_EXCL open, the md driver will fail to execute the
> ioctl given that the array is mounted.

Sounds like you're not using mdadm to remove the disk. So why do you
expect mdadm to stop the array immediately? It doesn't know anything is
wrong until it trips over the missing disk.
> 
> As a result, the array stays mounted and we can even read from and write to
> it, although it's possible to observe filesystem errors in dmesg[2].
> Eventually, after some _minutes_, the filesystem gets remounted as
> read-only.

Is your array linear or striped? If it's striped, I would expect it to
fall over in a heap very quickly. If it's linear, it depends whether you
remove drive 0 or drive 1. If you remove drive 0, it will fall over very
quickly. If you remove drive 1, the fuller your array the quicker it
will fall over (if your array isn't very full, drive 1 may well not be
used in which case the array might not fall over at all!)
> 
> During this weird window in which the array had a component disk removed
> but is still mounted/active (and accepting read/writes), we tried to
> perform reads, writes, and the sync command, which "succeed" (meaning the
> commands themselves didn't fail, although errors were observed in
> dmesg). When "dd" was executed with "oflag=direct", the writes failed
> immediately. This was observed with both nvme and scsi disks composing
> the raid-0 array.
> 
> We've started to pursue a solution to this, since it seems to be odd
> behavior. But it's worth checking with the CC'ed lists whether this is "by
> design" or whether it was already discussed in the past (maybe an idea was
> proposed). Tests were executed with v4.17-rc2 and the upstream mdadm tool.

Note that raid-0 is NOT redundant. Standard advice is "if a drive fails,
expect to lose your data". So the fact that your array limps on should
be the pleasant surprise, not that it blows up in ways you didn't expect.
> 
> Thanks in advance,
> 
> 
> Guilherme

Cheers,
Wol



Re: Should a raid-0 array immediately stop if a component disk is removed?

2018-04-27 Thread Guilherme G. Piccoli
Thanks for your quick reply Anthony!
Inline comments below:


On 27/04/2018 19:11, Wols Lists wrote:
> On 27/04/18 22:49, Guilherme G. Piccoli wrote:
> [...]
> Sounds like you're not using mdadm to remove the disk. So why do you
> expect mdadm to stop the array immediately? It doesn't know anything is
> wrong until it trips over the missing disk.

In fact, mdadm is aware something is wrong - it tries to stop the array
by running "mdadm -If ", but that fails because
the mount point prevents the array from being stopped.
And that is exactly the question: should the array be (successfully)
stopped? I think it should, since otherwise we can keep writing to the disks,
causing data corruption.


> [...]
> Is your array linear or striped? If it's striped, I would expect it to
> fall over in a heap very quickly. If it's linear, it depends whether you

It's striped. I was able to keep writing for some time (minutes).


> [...] 
> Note that raid-0 is NOT redundant. Standard advice is "if a drive fails,
> expect to lose your data". So the fact that your array limps on should
> be the pleasant surprise, not that it blows up in ways you didn't expect.

OK, I understand that. But imagine the following scenario: a regular
user has a component disk removed for some reason, and they don't look at
the logs before (or after) writing - the user can write data thinking
everything is fine, yet that data is corrupted. I'd expect userspace
writes to fail as soon as possible when one of the raid-0 components is gone.

Thanks,


Guilherme


Should a raid-0 array immediately stop if a component disk is removed?

2018-04-27 Thread Guilherme G. Piccoli
Hello, we've noticed an interesting behavior when using a raid-0 md
array. Suppose we have a 2-disk raid-0 array that has a mount point
set - in our tests, we've used an ext4 filesystem. If we remove one of
the component disks via sysfs[0], userspace is notified, but the mdadm tool
fails to stop the array[1] (it cannot open the array device node with the
O_EXCL flag, hence it fails to issue the STOP_ARRAY ioctl). Even if we
circumvent the mdadm O_EXCL open, the md driver will fail to execute the
ioctl given that the array is mounted.
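
For reference, what the stop attempt boils down to is roughly the following
(a minimal user-space sketch, assuming /dev/md0 as the array node; this is
illustrative only, not the actual mdadm code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/raid/md_u.h>	/* STOP_ARRAY */

	int main(void)
	{
		/*
		 * mdadm opens the array node with O_EXCL before issuing
		 * STOP_ARRAY; with the filesystem still mounted this open
		 * fails with EBUSY, and even without O_EXCL the md driver
		 * rejects the ioctl while the array is in use.
		 */
		int fd = open("/dev/md0", O_RDONLY | O_EXCL);

		if (fd < 0) {
			perror("open /dev/md0");
			return 1;
		}
		if (ioctl(fd, STOP_ARRAY, NULL) < 0)
			perror("ioctl STOP_ARRAY");
		return 0;
	}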

As a result, the array stays mounted and we can even read from and write to
it, although it's possible to observe filesystem errors in dmesg[2].
Eventually, after some _minutes_, the filesystem gets remounted as
read-only.

During this weird window in which the array had a component disk removed
but is still mounted/active (and accepting read/writes), we tried to
perform reads, writes, and the sync command, which "succeed" (meaning the
commands themselves didn't fail, although errors were observed in
dmesg). When "dd" was executed with "oflag=direct", the writes failed
immediately. This was observed with both nvme and scsi disks composing
the raid-0 array.

We've started to pursue a solution to this, since it seems to be odd
behavior. But it's worth checking with the CC'ed lists whether this is "by
design" or whether it was already discussed in the past (maybe an idea was
proposed). Tests were executed with v4.17-rc2 and the upstream mdadm tool.

Thanks in advance,


Guilherme


[0] To remove the array component disk, we've used:

a) For nvme: "echo 1 > /sys/block/nvme0n1/device/device/remove"
b) For scsi: "echo 1 > /sys/block/sda/device/delete"


[1] The mdadm tool tries to fail the array by executing: "mdadm -If "


[2] Example:
[...]
[88.719] Buffer I/O error on device md0, logical block 157704
[88.722] Buffer I/O error on device md0, logical block 157705
[88.725] EXT4-fs warning (device md0): ext4_end_bio:323: I/O error 10
writing to inode 14 (offset 0 size 8388608 starting block 79744)
[88.725] EXT4-fs warning (device md0): ext4_end_bio:323: I/O error 10
writing to inode 14 (offset 0 size 8388608 starting block 8)
[...]


[PATCH] scsi: esas2r: fix spelling mistake: "asynchromous" -> "asynchronous"

2018-04-27 Thread Colin King
From: Colin Ian King 

Trivial fix to spelling mistake in module description text

Signed-off-by: Colin Ian King 
---
 drivers/scsi/esas2r/esas2r_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/esas2r/esas2r_main.c 
b/drivers/scsi/esas2r/esas2r_main.c
index e07eac5be087..c07118617d89 100644
--- a/drivers/scsi/esas2r/esas2r_main.c
+++ b/drivers/scsi/esas2r/esas2r_main.c
@@ -283,7 +283,7 @@ MODULE_PARM_DESC(num_requests,
 int num_ae_requests = 4;
 module_param(num_ae_requests, int, 0);
 MODULE_PARM_DESC(num_ae_requests,
-"Number of VDA asynchromous event requests.  Default 4.");
+"Number of VDA asynchronous event requests.  Default 4.");
 
 int cmd_per_lun = ESAS2R_DEFAULT_CMD_PER_LUN;
 module_param(cmd_per_lun, int, 0);
-- 
2.17.0



Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Michal Hocko
On Fri 27-04-18 11:07:07, Christopher Lameter wrote:
> On Fri, 27 Apr 2018, Michal Hocko wrote:
> 
> > On Thu 26-04-18 22:35:56, Christoph Hellwig wrote:
> > > On Thu, Apr 26, 2018 at 09:54:06PM +, Luis R. Rodriguez wrote:
> > > > In practice if you don't have a floppy device on x86, you don't need 
> > > > ZONE_DMA,
> > >
> > > I call BS on that, and you actually explain later why it is BS due
> > > to some drivers using it more explicitly.  But even more importantly
> > > we have plenty of drivers using it through dma_alloc_* and a small DMA
> > > mask, and they are in use - we actually had a 4.16 regression due to
> > > them.
> >
> > Well, but do we need a zone for that purpose? The idea was to actually
> > replace the zone by a CMA pool (at least on x86). With the current
> > implementation of the CMA we would move the [0-16M] pfn range into
> > zone_movable so it can be used and we would get rid of all of the
> > overhead each zone brings (a bit in page flags, kmalloc caches and who
> > knows what else)
> 
> Well it looks like what we are using it for is to force allocation from
> low physical memory if we fail to obtain proper memory through a normal
> channel.  The use of ZONE_DMA is only there for emergency purposes.
> I think we could substitute ZONE_DMA32 on x86 without a problem.
> 
> Which means that ZONE_DMA has no purpose anymore.

We still need to make sure the low 16MB is available on request. And
that is what CMA can help with. We do not really seem to need the whole
zone infrastructure for that.

> Can we make ZONE_DMA on x86 refer to the low 32 bit physical addresses
> instead and remove ZONE_DMA32?

Why would that be an advantage? If anything, I would rename ZONE_DMA32 to
ZONE_ADDR32 or something like that.

> That would actually improve the fallback because you have more memory for
> the old devices.

I do not really understand how that is related to removing ZONE_DMA. We
are really talking about the lowest 16MB...

-- 
Michal Hocko
SUSE Labs


Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Christopher Lameter
On Fri, 27 Apr 2018, Matthew Wilcox wrote:

> Some devices have incredibly bogus hardware like 28 bit addressing
> or 39 bit addressing.  We don't have a good way to allocate memory by
> physical address other than saying "GFP_DMA for anything less than
> 32, GFP_DMA32 (or GFP_KERNEL on 32-bit) for anything less than 64 bit".
>
> Even CMA doesn't have a "cma_alloc_phys()".  Maybe that's the right place
> to put such an allocation API.

The other way out of this would be to require an IOMMU?



Important Message

2018-04-27 Thread Engr. Peter Jensen

Hi,

Compliment of the day. My name is Peter Jensen, I work with ExxonMobil 
as a project manager here in Leatherhead, UK. I have an important 
business transaction I will like to share with you. Do find it proper to 
contact me for more details.


Thank you.

Best Regards,
Engr. Peter Jensen


Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Matthew Wilcox
On Fri, Apr 27, 2018 at 04:14:56PM +, Luis R. Rodriguez wrote:
> > Not really.  We have unchecked_isa_dma to support about 4 drivers,
> 
> Ah very neat:
> 
>   * CONFIG_CHR_DEV_OSST - "SCSI OnStream SC-x0 tape support"

That's an upper level driver, like cdrom, disk and regular tapes.

>   * CONFIG_SCSI_ADVANSYS - "AdvanSys SCSI support"

If we ditch support for the ISA boards, this can go away.

>   * CONFIG_SCSI_AHA1542 - "Adaptec AHA1542 support"

Probably true.

>   * CONFIG_SCSI_ESAS2R - "ATTO Technology's ExpressSAS RAID adapter driver"

That's being set to 0.

You missed BusLogic.c and gdth.c



Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Jens Axboe
On 4/27/18 9:48 AM, Bart Van Assche wrote:
> On Fri, 2018-04-27 at 09:39 -0600, Jens Axboe wrote:
>> blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
>> 			&in_flight);
>> return in_flight.cnt + atomic_read(&shost->host_busy);
>>
>> The atomic read is basically free, once we get rid of the dirtying of that
>> variable on each IO.
> 
> Hello Jens,
> 
> What makes you think that " + atomic_read(&shost->host_busy)" is necessary?
> I am not aware of any code outside the SCSI core that modifies the host_busy
> member.

It's a (scalable) hack to count those as well. Going forward they should be
converted to just using reserved tags, of course.

-- 
Jens Axboe



Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Matthew Wilcox
On Fri, Apr 27, 2018 at 11:07:07AM -0500, Christopher Lameter wrote:
> Well it looks like what we are using it for is to force allocation from
> low physical memory if we fail to obtain proper memory through a normal
> channel.  The use of ZONE_DMA is only there for emergency purposes.
> I think we could substitute ZONE_DMA32 on x86 without a problem.
> 
> Which means that ZONE_DMA has no purpose anymore.
> 
> Can we make ZONE_DMA on x86 refer to the low 32 bit physical addresses
> instead and remove ZONE_DMA32?
> 
> That would actually improve the fallback because you have more memory for
> the old devices.

Some devices have incredibly bogus hardware like 28 bit addressing
or 39 bit addressing.  We don't have a good way to allocate memory by
physical address other than saying "GFP_DMA for anything less than
32, GFP_DMA32 (or GFP_KERNEL on 32-bit) for anything less than 64 bit".

Even CMA doesn't have a "cma_alloc_phys()".  Maybe that's the right place
to put such an allocation API.
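
For context, here is a sketch of what a driver with such a small addressing
limit ends up doing through the DMA API today (illustrative only;
alloc_low_buffer is a made-up helper, not code from any driver mentioned
here):

	#include <linux/dma-mapping.h>

	/*
	 * Sketch: a device that can only address 28 bits declares that via
	 * its DMA mask; the DMA layer then has to satisfy coherent
	 * allocations from low memory, falling back to GFP_DMA/GFP_DMA32
	 * internally as needed.
	 */
	static void *alloc_low_buffer(struct device *dev, size_t size,
				      dma_addr_t *handle)
	{
		if (dma_set_mask_and_coherent(dev, DMA_BIT_MASK(28)))
			return NULL;	/* platform cannot satisfy this mask */

		return dma_alloc_coherent(dev, size, handle, GFP_KERNEL);
	}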


Re: [PATCH 3/3] scsi: avoid to hold host-wide counter of host_busy for scsi_mq

2018-04-27 Thread Bart Van Assche
On Fri, 2018-04-20 at 14:57 +0800, Ming Lei wrote:
> +struct scsi_host_mq_in_flight {
> + int cnt;
> +};
> +
> +static void scsi_host_check_in_flight(struct request *rq, void *data,
> + bool reserved)
> +{
> + struct scsi_host_mq_in_flight *in_flight = data;
> +
> + if (blk_mq_request_started(rq))
> + in_flight->cnt++;
> +}
> +
>  /**
>   * scsi_host_busy - Return the host busy counter
>   * @shost:   Pointer to Scsi_Host to inc.
>   **/
>  int scsi_host_busy(struct Scsi_Host *shost)
>  {
> - return atomic_read(&shost->host_busy);
> + struct scsi_host_mq_in_flight in_flight = {
> + .cnt = 0,
> + };
> +
> + if (!shost->use_blk_mq)
> + return atomic_read(&shost->host_busy);
> +
> + blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
> + 		&in_flight);
> + return in_flight.cnt;
>  }
>  EXPORT_SYMBOL(scsi_host_busy);

This patch introduces a subtle behavior change that has not been explained
in the commit message. If a SCSI request gets requeued, the .host_busy counter
is decreased by scsi_device_unbusy() before the request is requeued and only
increased again once scsi_queue_rq() is called for that request. During that
time such requests have the state MQ_RQ_COMPLETE, hence blk_mq_request_started()
will return true and scsi_host_check_in_flight() will include them. Since that
behavior change has not been explained in the commit message, I doubt that this
change is correct.

Bart.





Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Christopher Lameter
On Fri, 27 Apr 2018, Michal Hocko wrote:

> On Thu 26-04-18 22:35:56, Christoph Hellwig wrote:
> > On Thu, Apr 26, 2018 at 09:54:06PM +, Luis R. Rodriguez wrote:
> > > In practice if you don't have a floppy device on x86, you don't need 
> > > ZONE_DMA,
> >
> > I call BS on that, and you actually explain later why it is BS due
> > to some drivers using it more explicitly.  But even more importantly
> > we have plenty of drivers using it through dma_alloc_* and a small DMA
> > mask, and they are in use - we actually had a 4.16 regression due to
> > them.
>
> Well, but do we need a zone for that purpose? The idea was to actually
> replace the zone by a CMA pool (at least on x86). With the current
> implementation of the CMA we would move the [0-16M] pfn range into
> zone_movable so it can be used and we would get rid of all of the
> overhead each zone brings (a bit in page flags, kmalloc caches and who
> knows what else)

Well it looks like what we are using it for is to force allocation from
low physical memory if we fail to obtain proper memory through a normal
channel.  The use of ZONE_DMA is only there for emergency purposes.
I think we could substitute ZONE_DMA32 on x86 without a problem.

Which means that ZONE_DMA has no purpose anymore.

Can we make ZONE_DMA on x86 refer to the low 32 bit physical addresses
instead and remove ZONE_DMA32?

That would actually improve the fallback because you have more memory for
the old devices.



Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Luis R. Rodriguez
On Thu, Apr 26, 2018 at 10:35:56PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 26, 2018 at 09:54:06PM +, Luis R. Rodriguez wrote:
> > In practice if you don't have a floppy device on x86, you don't need 
> > ZONE_DMA,
> 
> I call BS on that, 

I did not explain, though, that it was not me who made this claim.
The list displayed below is the result of trying to confirm or refute it,
of looking at what could be done, and of evaluating whether there is *any*
gain in doing something about it.

But out of curiosity, on a standard qemu x86_64 KVM guest, which of the
drivers listed do we know for certain *are* being used?

What about Xen guests, I wonder?

> and you actually explain later why it is BS due
> to some drivers using it more explicitly.

Or implicitly. The list I showed is the result of demonstrating that the use
of GFP_DMA on x86 is *much* more widespread than the above claim would
suggest.

I did not answer the above qemu x86_64 question either, but it
would be good to know. Note that I stated the claim was *in practice*.

> But even more importantly
> we have plenty driver using it through dma_alloc_* and a small DMA
> mask, and they are in use 

Do we have a list of x86 users with a small DMA mask?
Or, given that I'm not aware of a tool that can look
for this in an easy way, would it be good to find out which
x86 drivers do have a small mask?

> - we actually had a 4.16 regression due to them.

Ah, what commit was the culprit? Is that fixed already? If so, which
commit?

> > SCSI is *severely* affected:
> 
> Not really.  We have unchecked_isa_dma to support about 4 drivers,

Ah very neat:

  * CONFIG_CHR_DEV_OSST - "SCSI OnStream SC-x0 tape support"
  * CONFIG_SCSI_ADVANSYS - "AdvanSys SCSI support"
  * CONFIG_SCSI_AHA1542 - "Adaptec AHA1542 support"
  * CONFIG_SCSI_ESAS2R - "ATTO Technology's ExpressSAS RAID adapter driver"

> and less than a handful of drivers doing stupid things, which can
> be fixed easily, and just need a volunteer.

Care to list what needs to be done? Can an eager beaver student do it?

> > That's the end of the review of all current explicit callers on x86.
> > 
> > # dma_alloc_coherent_gfp_flags() and dma_generic_alloc_coherent()
> > 
> > dma_alloc_coherent_gfp_flags() and dma_generic_alloc_coherent() set
> > GFP_DMA if (dma_mask <= DMA_BIT_MASK(24))
> 
> All that code is long gone and replaced with dma-direct.  Which still
> uses GFP_DMA based on the dma mask, though - see above.

And that's mostly IOMMU code, on the alloc() dma_map_ops.

  Luis


Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Laurence Oberman
On Fri, 2018-04-27 at 15:48 +, Bart Van Assche wrote:
> On Fri, 2018-04-27 at 09:39 -0600, Jens Axboe wrote:
> > blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
> > 			&in_flight);
> > return in_flight.cnt + atomic_read(&shost->host_busy);
> > 
> > The atomic read is basically free, once we get rid of the dirtying of
> > that variable on each IO.
> 
> Hello Jens,
> 
> What makes you think that " + atomic_read(&shost->host_busy)" is
> necessary?
> I am not aware of any code outside the SCSI core that modifies the
> host_busy
> member.
> 
> Thanks,
> 
> Bart.
> 
> 
> 

As part of testing the latest upstream in both MQ and non-MQ mode, I intend
to test this patch series fully on actual hardware: 8G F/C to memory-backed
array LUNs and, of course, SRP/RDMA.
I have started working on this and will report back.
Thanks
Laurence


Re: [PATCH 2/3] scsi: read host_busy via scsi_host_busy()

2018-04-27 Thread Bart Van Assche
On Fri, 2018-04-20 at 14:57 +0800, Ming Lei wrote:
>  show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
>  {
>   struct Scsi_Host *shost = class_to_shost(dev);
> - return snprintf(buf, 20, "%d\n", atomic_read(&shost->host_busy));
> + return snprintf(buf, 20, "%d\n", scsi_host_busy(shost));
>  }
>  static DEVICE_ATTR(host_busy, S_IRUGO, show_host_busy, NULL);

The ", 20" part is cargo-cult programming. Since you have to touch this code,
please either use "sprintf(buf, ...)" or use "scnprintf(buf, PAGE_SIZE, ...)".
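I.e. something along these lines (an untested sketch of the scnprintf()
variant, not a patch):

	static ssize_t
	show_host_busy(struct device *dev, struct device_attribute *attr, char *buf)
	{
		struct Scsi_Host *shost = class_to_shost(dev);

		/* bound the output by the sysfs buffer size, not a magic 20 */
		return scnprintf(buf, PAGE_SIZE, "%d\n", scsi_host_busy(shost));
	}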

Thanks,

Bart.





Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Bart Van Assche
On Fri, 2018-04-27 at 09:39 -0600, Jens Axboe wrote:
> blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
> 			&in_flight);
> return in_flight.cnt + atomic_read(&shost->host_busy);
> 
> The atomic read is basically free, once we get rid of the dirtying of that
> variable on each IO.

Hello Jens,

What makes you think that " + atomic_read(&shost->host_busy)" is necessary?
I am not aware of any code outside the SCSI core that modifies the host_busy
member.

Thanks,

Bart.





Re: [PATCH 1/3] scsi: introduce scsi_host_busy()

2018-04-27 Thread Bart Van Assche
On Fri, 2018-04-20 at 14:57 +0800, Ming Lei wrote:
> This patch introduces the SCSI middle layer API scsi_host_busy() for
> drivers to read the host-wide counter scsi_host->host_busy.

This patch introduces a function that has no callers, so I think it should
be combined with patch 2/3 from this series into a single patch.

Bart.




Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Jens Axboe
On 4/27/18 9:31 AM, Bart Van Assche wrote:
> On Fri, 2018-04-20 at 14:57 +0800, Ming Lei wrote:
>> This patchset removes the expensive atomic operation on the host-wide counter
>> .host_busy for scsi-mq, and it is observed that IOPS can be increased by
>> 15% with this change in an IO test over scsi_debug.
>>
>>
>> Ming Lei (3):
>>   scsi: introduce scsi_host_busy()
>>   scsi: read host_busy via scsi_host_busy()
>>   scsi: avoid to hold host-wide counter of host_busy for scsi_mq
>>
>>  drivers/scsi/advansys.c   |  8 
>>  drivers/scsi/hosts.c  | 32 
>> +++
>>  drivers/scsi/libsas/sas_scsi_host.c   |  4 ++--
>>  drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
>>  drivers/scsi/mpt3sas/mpt3sas_base.c   |  4 ++--
>>  drivers/scsi/qlogicpti.c  |  2 +-
>>  drivers/scsi/scsi.c   |  2 +-
>>  drivers/scsi/scsi_error.c |  6 +++---
>>  drivers/scsi/scsi_lib.c   | 23 --
>>  drivers/scsi/scsi_sysfs.c |  2 +-
>>  include/scsi/scsi_host.h  |  1 +
>>  11 files changed, 65 insertions(+), 21 deletions(-)\
> 
> Hello Ming,
> 
> From the MAINTAINERS file:
> 
> SCSI SUBSYSTEM
> M:  "James E.J. Bottomley" 
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git
> M:  "Martin K. Petersen" 
> T:  git git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
> L:  linux-scsi@vger.kernel.org
> S:  Maintained
> F:  Documentation/devicetree/bindings/scsi/
> F:  drivers/scsi/
> F:  include/scsi/
> 
> Hence my surprise when I saw that you sent this patch series to Jens instead
> of James and Martin?

Martin and James are both on the CC as well. For what it's worth, the patch
seems like a good approach to me. To handle the case that Hannes was concerned
about (older drivers issuing internal commands), I would suggest that those
drivers get instrumented to include an inc/dec of the host busy count for
internal commands that bypass the normal tagging. That means the mq case needs
to be

blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
			&in_flight);
return in_flight.cnt + atomic_read(&shost->host_busy);

The atomic read is basically free, once we get rid of the dirtying of that
variable on each IO.
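
Put together with the helper from patch 3/3, that would look roughly like
this (a sketch of the suggestion, not a tested patch):

	int scsi_host_busy(struct Scsi_Host *shost)
	{
		struct scsi_host_mq_in_flight in_flight = {
			.cnt = 0,
		};

		if (!shost->use_blk_mq)
			return atomic_read(&shost->host_busy);

		blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
					&in_flight);
		/*
		 * Count tagged (blk-mq) requests plus any internal commands
		 * from legacy drivers that still bump host_busy directly.
		 */
		return in_flight.cnt + atomic_read(&shost->host_busy);
	}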

-- 
Jens Axboe



Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Bart Van Assche
On Fri, 2018-04-20 at 14:57 +0800, Ming Lei wrote:
> This patchset removes the expensive atomic operation on the host-wide counter
> .host_busy for scsi-mq, and it is observed that IOPS can be increased by
> 15% with this change in an IO test over scsi_debug.
> 
> 
> Ming Lei (3):
>   scsi: introduce scsi_host_busy()
>   scsi: read host_busy via scsi_host_busy()
>   scsi: avoid to hold host-wide counter of host_busy for scsi_mq
> 
>  drivers/scsi/advansys.c   |  8 
>  drivers/scsi/hosts.c  | 32 
> +++
>  drivers/scsi/libsas/sas_scsi_host.c   |  4 ++--
>  drivers/scsi/megaraid/megaraid_sas_base.c |  2 +-
>  drivers/scsi/mpt3sas/mpt3sas_base.c   |  4 ++--
>  drivers/scsi/qlogicpti.c  |  2 +-
>  drivers/scsi/scsi.c   |  2 +-
>  drivers/scsi/scsi_error.c |  6 +++---
>  drivers/scsi/scsi_lib.c   | 23 --
>  drivers/scsi/scsi_sysfs.c |  2 +-
>  include/scsi/scsi_host.h  |  1 +
>  11 files changed, 65 insertions(+), 21 deletions(-)\

Hello Ming,

From the MAINTAINERS file:

SCSI SUBSYSTEM
M:  "James E.J. Bottomley" 
T:  git git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git
M:  "Martin K. Petersen" 
T:  git git://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git
L:  linux-scsi@vger.kernel.org
S:  Maintained
F:  Documentation/devicetree/bindings/scsi/
F:  drivers/scsi/
F:  include/scsi/

Hence my surprise when I saw that you sent this patch series to Jens instead
of James and Martin?

Bart.

Re: [PATCH] usb-storage: stop using block layer bounce buffers

2018-04-27 Thread Alan Stern
On Fri, 27 Apr 2018, Christoph Hellwig wrote:

> On Sun, Apr 15, 2018 at 11:24:11AM -0400, Alan Stern wrote:
> > On Sun, 15 Apr 2018, Christoph Hellwig wrote:
> > 
> > > USB host controllers now must handle highmem, so we can get rid of bounce
> > > buffering highmem pages in the block layer.
> > 
> > Sorry, I don't quite understand what you are saying.  Do you mean that
> > all USB host controllers now magically _do_ handle highmem?  Or do you
> > mean that if they _don't_ handle highmem, we will not support them any
> > more?
> 
> USB controllers themselves never cared about highmem, drivers did.  For
> PIO based controllers they'd have to kmap any address not in the kernel
> direct mapping.
> 
> Nothing in drivers/usb/host or the other directories related to host
> drivers calls page_address (only used in a single gadget) or sg_virt
> (only used in a few upper level drivers), which makes me assume
> semi-confidently that every USB host driver is highmem aware these
> days.

sg_virt is called in drivers/usb/core/message.c.  (Maybe that's what
you meant by "upper level drivers".)  I'm not sure just how important
that usage is.

Alan Stern

> Greg, does this match your observation as the USB maintainer?



Re: [RFC PATCH] qedf: qedf_dcbx_no_wait can be static

2018-04-27 Thread Chad Dupuis

On Thu, 26 Apr 2018, 9:47pm, kbuild test robot wrote:

> 
> Fixes: d9867ecbae88 ("qedf: Add dcbx_not_wait module parameter so we won't 
> wait for DCBX convergence to start discovery.")
> Signed-off-by: Fengguang Wu 
> ---
>  qedf_main.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/qedf/qedf_main.c b/drivers/scsi/qedf/qedf_main.c
> index 8df151e..b96c928 100644
> --- a/drivers/scsi/qedf/qedf_main.c
> +++ b/drivers/scsi/qedf/qedf_main.c
> @@ -89,7 +89,7 @@ module_param_named(retry_delay, qedf_retry_delay, bool, 
> S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(retry_delay, " Enable/disable handling of FCP_RSP IU retry "
>   "delay handling (default off).");
>  
> -bool qedf_dcbx_no_wait;
> +static bool qedf_dcbx_no_wait;
>  module_param_named(dcbx_no_wait, qedf_dcbx_no_wait, bool, S_IRUGO | S_IWUSR);
>  MODULE_PARM_DESC(dcbx_no_wait, " Do not wait for DCBX convergence to start "
>   "sending FIP VLAN requests on link up (Default: off).");
> 

Acked-by: Chad Dupuis 


Re: [LSF/MM TOPIC NOTES] x86 ZONE_DMA love

2018-04-27 Thread Michal Hocko
On Thu 26-04-18 22:35:56, Christoph Hellwig wrote:
> On Thu, Apr 26, 2018 at 09:54:06PM +, Luis R. Rodriguez wrote:
> > In practice if you don't have a floppy device on x86, you don't need 
> > ZONE_DMA,
> 
> I call BS on that, and you actually explain later why it is BS due
> to some drivers using it more explicitly.  But even more importantly
> we have plenty of drivers using it through dma_alloc_* and a small DMA
> mask, and they are in use - we actually had a 4.16 regression due to
> them.

Well, but do we need a zone for that purpose? The idea was to actually
replace the zone by a CMA pool (at least on x86). With the current
implementation of the CMA we would move the [0-16M] pfn range into
zone_movable so it can be used and we would get rid of all of the
overhead each zone brings (a bit in page flags, kmalloc caches and who
knows what else)
-- 
Michal Hocko
SUSE Labs


RE: MegaCli fails to communicate with Raid-Controller

2018-04-27 Thread Kashyap Desai
> -Original Message-
> From: Volker Schwicking [mailto:volker.schwick...@godaddy.com]
> Sent: Thursday, April 26, 2018 8:22 PM
> To: Kashyap Desai
> Cc: Martin K. Petersen; linux-scsi@vger.kernel.org; Sumit Saxena;
> Shivasharan
> Srikanteshwara
> Subject: Re: MegaCli fails to communicate with Raid-Controller
>
> On 23. Apr 2018, at 11:03, Volker Schwicking
>  wrote:
> >
> > I will add the printk to dma_alloc_coherent() as well to see which
> > request actually fails. But I have to be a bit patient since it's a
> > production system and the customers aren't too happy about reboots.
>
> Alright, here are some results.
>
> Looking at my debug lines I can tell that requesting either 2048 or 4
> regularly fails. Other values don't ever show up as failed, but there are
> several, as you can see in the attached log.
>
> The failed requests:
> ###
> $ grep 'GD IOV-len FAILED' /var/log/kern.log | awk '{ print $9, $10 }' | sort | uniq -c
>  59 FAILED: 2048
>  64 FAILED: 4
> ###

Thanks! This helps to understand the problem. A few questions:

What is the frequency of this failure? Can you reproduce it on demand?
Do you see no failures at all on the 4.6 kernel?
What does your setup look like? Is the failure in a VM, or on the host
OS? Can you share full dmesg logs?

>
> I attached full debugging output from several executions of
> “megacli -ldpdinfo -a0” in 5 second intervals, successful and failed, and
> the content of /proc/buddyinfo again.
>
> Can you make any sense of that? Where should I go from here?

It may be better to find the call trace of dma_alloc_coherent() using ftrace.
Depending upon which DMA engine is configured, the failure may be related to
code changes in that DMA engine.
Can you get those ftrace logs as well? You may have to cherry-pick an ftrace
filter around dma_alloc_coherent().

I quickly grepped in arch/xen for something related to memory allocation and
found that pci_xen_swiotlb_detect() has some logic to enable/disable certain
features, and one of the key factors is whether the DMA range is 32-bit or
64-bit. Since the older controller requests DMA buffers below the 4GB region,
some code change in that area between 4.6 and 4.14.x might be a possible
reason for the frequent memory allocation failures. This is a wild guess,
based on the info that 4.6 is *not at all* exposed to memory failures at the
same frequency as 4.14.

Kashyap