Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Christoph Hellwig
On Fri, Jan 13, 2017 at 06:16:02AM +, Dexuan Cui wrote:
> IMO this means not only the SCSI UNMAP command is affected, but
> some other SCSI commands can be affected too?
> And it looks like bare metal can be affected too?

This affects all drivers that look at the sdb->length field for the
total I/O length - many drivers don't need it and just use the SGL, including
both drivers that I tested this change on - one virtualized and one
bare metal.

It also only affects commands where the data transfer length is different
from the length of the written blocks, so it only affects the WRITE SAME and
UNMAP commands used for discard or zeroing.

I'll submit a cleaned up version with a proper block layer helper today.
Thanks for reporting and debugging this issue!
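
For reference, one plausible shape for such a helper - the name
blk_rq_payload_bytes() and its placement are assumptions here, not something
stated in this thread:

static inline unsigned int blk_rq_payload_bytes(struct request *rq)
{
        /* discard/write-same requests carry their payload in special_vec */
        if (rq->rq_flags & RQF_SPECIAL_PAYLOAD)
                return rq->special_vec.bv_len;
        return blk_rq_bytes(rq);
}

With something like this in place, scsi_init_sgtable() could simply do
sdb->length = blk_rq_payload_bytes(req); instead of open-coding the check.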


RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Dexuan Cui
> From: Dexuan Cui
> Sent: Friday, January 13, 2017 11:05
> To: 'Christoph Hellwig' 
> Cc: linux-block@vger.kernel.org; KY Srinivasan ; Chris
> Valean (Cloudbase Solutions SRL) 
> Subject: RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve
> handling of the magic discard payload"
> 
> > From: Christoph Hellwig [mailto:h...@lst.de]
> > Sent: Friday, January 13, 2017 02:19
> > To: Dexuan Cui 
> > Cc: linux-block@vger.kernel.org; KY Srinivasan ; Chris
> > Valean (Cloudbase Solutions SRL) 
> > Subject: Re: [Regression] fstrim hangs on Hyper-V: caused by "block:
> improve
> > handling of the magic discard payload"
> >
> > Next try:  (I've also dropped most of the Cc list)
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > index c35b6de..2f358f7 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -1018,7 +1018,10 @@ static int scsi_init_sgtable(struct request *req,
> > struct scsi_data_buffer *sdb)
> > count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
> > BUG_ON(count > sdb->table.nents);
> > sdb->table.nents = count;
> > -   sdb->length = blk_rq_bytes(req);
> > +   if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
> > +   sdb->length = req->special_vec.bv_len;
> > +   else
> > +   sdb->length = blk_rq_bytes(req);
> > return BLKPREP_OK;
> >  }
> 
> Hi Christoph,
> The patch works like a charm!
> fstrim can work now.
> Chris may help to do more test.
> 
> FWIW:
> If (req->rq_flags & RQF_SPECIAL_PAYLOAD) is true,
> req->special_vec.bv_len is always 24 in my test.
> 
> Thanks really a lot for your quick patch! :-)
> 
> Can the patch make it into v4.10?
> IMO It's a really important fix.
> 
> Thanks,
> -- Dexuan

FYI:  I did more tests and the patch worked just great!

BTW, fstrim/mkfs are not the only affected tools: I put a WARN_ON
before the new code and found python3 is affected too (see the call trace below).

IMO this means not only the SCSI UNMAP command is affected, but
some other SCSI commands can be affected too?
And it looks like bare metal can be affected too?

Thanks,
-- Dexuan

//Dexuan: in this case req->special_vec.bv_len is 24 and
// blk_rq_bytes(req) is 4096.
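
A hypothetical reconstruction of that instrumentation (the exact debug diff is
not shown in this thread) - a WARN_ON comparing the two lengths right before
the new assignment in scsi_init_sgtable():

        /* debug only: flag any request where the special-payload length
         * and the request byte count disagree */
        WARN_ON(req->special_vec.bv_len != blk_rq_bytes(req));
        if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
                sdb->length = req->special_vec.bv_len;
        else
                sdb->length = blk_rq_bytes(req);

Written this way the warning also fires for ordinary reads and writes, where
special_vec is simply unused, which is consistent with Christoph's follow-up
that only WRITE SAME and UNMAP are really affected.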

[   17.862939] CPU: 2 PID: 1430 Comm: python3 Tainted: GW   
4.10.0-rc3+ #1
[   17.862940] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
Machine, BIOS 090006  05/23/2012
[   17.862941] Call Trace:
[   17.862947]  dump_stack+0x63/0x90
[   17.862952]  __warn+0xcb/0xf0
[   17.862954]  warn_slowpath_fmt+0x5f/0x80
[   17.862955]  scsi_init_sgtable+0x92/0xc0
[   17.862956]  scsi_init_io+0x4f/0x1e0
[   17.862959]  sd_init_command+0x55b/0xdb0
[   17.862963]  ? scsi_host_alloc_command+0x44/0xc0
[   17.862965]  scsi_setup_cmnd+0xf0/0x150
[   17.862966]  scsi_prep_fn+0xef/0x170
[   17.862968]  blk_peek_request+0x180/0x2b0
[   17.862970]  scsi_request_fn+0x3e/0x620
[   17.862973]  ? elv_rb_add+0x61/0x70
[   17.862977]  ? deadline_add_request+0x36/0x80
[   17.862978]  __blk_run_queue+0x33/0x40
[   17.862979]  blk_queue_bio+0x3c8/0x3e0
[   17.862980]  generic_make_request+0xf2/0x1d0
[   17.862981]  submit_bio+0x73/0x150
[   17.862985]  submit_bh_wbc+0x14c/0x180
[   17.862987]  ll_rw_block+0x78/0xb0
[   17.862988]  __block_write_begin_int+0x4d6/0x5c0
[   17.863002]  ? ext4_inode_attach_jinode.part.67+0xb0/0xb0
[   17.863004]  ? ext4_da_write_begin+0x122/0x400
[   17.863006]  __block_write_begin+0x11/0x20
[   17.863007]  ext4_da_write_begin+0x178/0x400
[   17.863012]  generic_perform_write+0xc9/0x1c0
[   17.863015]  ? file_update_time+0xc8/0x110
[   17.863017]  __generic_file_write_iter+0x1a6/0x1f0
[   17.863020]  ext4_file_write_iter+0x89/0x370
[   17.863023]  ? _copy_to_user+0x2e/0x40
[   17.863026]  ? cp_new_stat+0x153/0x180
[   17.863030]  __vfs_write+0xe3/0x160
[   17.863031]  vfs_write+0xb8/0x1b0
[   17.863032]  SyS_write+0x55/0xc0
[   17.863036]  entry_SYSCALL_64_fastpath+0x1e/0xad
[   17.863037] RIP: 0033:0x7f27314bf4bd



Re: blk_queue_bounce_limit() broken for mask=0xffffffff on 64bit archs

2017-01-12 Thread Nikita Yushchenko

>> There is a use cases when architecture is 64-bit but hardware supports
>> only DMA to lower 4G of address space. E.g. NVMe device on RCar PCIe host.
>>
>> For such cases, it looks proper to call blk_queue_bounce_limit() with
>> mask set to 0xffffffff - thus making block layer to use bounce buffers
>> for any addresses beyond 4G.  To support that, architecture provides
>> GFP_DMA zone that covers exactly low 4G on arm64.
>>
>> However setting this limit does not work:
>>
>>   if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
>>   dma = 1;
>>
>> When mask is 0xffffffff that condition is false
> 
> That should have been true in your case, since the b_pfn is smaller than
> 0xffffffff.

b_pfn is exactly 0xffffffffUL >> PAGE_SHIFT, thus the condition is false

>>   q->limits.bounce_pfn = max(max_low_pfn, b_pfn);
>>
>> this line is executed and replaces any limit with end of memory (on
>> 64bit arch all memory is low).
> 
> I don't understand why max() is used? And why not min()?
> 
> Looks the above line just disables bounce for 64bit arch, doesn't it?

Effectively yes. And I don't understand the logic behind this code.

Nikita
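
For context, a condensed sketch of the 64-bit branch of blk_queue_bounce_limit()
being discussed, reconstructed from the fragments quoted in this thread (not a
verbatim copy of block/blk-settings.c; the 32-bit branch is omitted):

void blk_queue_bounce_limit(struct request_queue *q, u64 max_addr)
{
        unsigned long b_pfn = max_addr >> PAGE_SHIFT;
        int dma = 0;

        q->bounce_gfp = GFP_NOIO;
#if BITS_PER_LONG == 64
        /* assume anything <= 4GB can be handled by an IOMMU */
        if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
                dma = 1;
        /* the line being questioned: on 64-bit, max_low_pfn covers all of
         * memory, so the caller's limit is effectively discarded here */
        q->limits.bounce_pfn = max(max_low_pfn, b_pfn);
#endif
        if (dma) {
                init_emergency_isa_pool();
                q->bounce_gfp = GFP_NOIO | GFP_DMA;
                q->limits.bounce_pfn = b_pfn;
        }
}

With max_addr = 0xffffffff, b_pfn equals 0xffffffffUL >> PAGE_SHIFT, so the
comparison is false, dma stays 0, and bounce_pfn ends up at max_low_pfn - i.e.
no bouncing at all on a 64-bit arch.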


Re: blk_queue_bounce_limit() broken for mask=0xffffffff on 64bit archs

2017-01-12 Thread Ming Lei
Hi,

On Tue, Jan 10, 2017 at 4:48 AM, Nikita Yushchenko
 wrote:
> Hi
>
> There is a use cases when architecture is 64-bit but hardware supports
> only DMA to lower 4G of address space. E.g. NVMe device on RCar PCIe host.
>
> For such cases, it looks proper to call blk_queue_bounce_limit() with
> mask set to 0xffffffff - thus making block layer to use bounce buffers
> for any addresses beyond 4G.  To support that, architecture provides
> GFP_DMA zone that covers exactly low 4G on arm64.
>
> However setting this limit does not work:
>
>   if (b_pfn < (min_t(u64, 0xffffffffUL, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
>   dma = 1;
>
> When mask is 0xffffffff that condition is false

That should have been true in your case, since the b_pfn is smaller than
0xffffffff.

>
>   q->limits.bounce_pfn = max(max_low_pfn, b_pfn);
>
> this line is executed and replaces any limit with end of memory (on
> 64bit arch all memory is low).

I don't understand why max() is used? And why not min()?

Looks the above line just disables bounce for 64bit arch, doesn't it?

Thanks,
Ming

>
>
> Not sure how to fix this properly. Any hints?



-- 
Ming Lei


Re: [Lsf-pc] [LSF/MM TOPIC] [LSF/MM ATTEND] md raid general discussion

2017-01-12 Thread Coly Li
On 2017/1/12 11:09 PM, Sagi Grimberg wrote:
> Hey Coly,
> 
>> Also I receive reports from users that raid1 performance is desired when
>> it is built on NVMe SSDs as a cache (maybe bcache or dm-cache). I am
>> working on some raid1 performance improvement (e.g. new raid1 I/O
>> barrier and lockless raid1 I/O submit), and have some more ideas to
>> discuss.
> 
> Do you have some performance measurements to share?
> 
> Mike used null devices to simulate very fast devices which
> led to nice performance enhancements in dm-multipath code.

I have some performance data for raid1 and raid0; this is still a work
in progress.

- md raid1
  Current md raid1 read performance is not ideal. On a raid1 built from
2 NVMe SSDs I only observe 2.6GB/s throughput for multi-threaded, high
queue-depth reads. Most of the time is spent on I/O barrier locking. I am
now working on a lockless I/O submit patch (the original idea is from
Hannes Reinecke), which improves read throughput to 4.7~5GB/s. When using
md raid1 as a cache device, the read performance improvement is critical.
  On my hardware the ideal read throughput of 2 NVMe SSDs is 6GB/s;
currently the number is 4.7~5GB/s, so there is still some room for
improvement.
- md raid0
  People report on the linux-raid mailing list that DISCARD/TRIM
performance on raid0 is slow. In my reproduction, on a raid0 built from
4x 3TB NVMe SSDs, formatting an XFS volume on top of it takes 306 seconds.
Most of that time is spent inside the md raid0 code issuing DISCARD/TRIM
requests in chunk-size ranges. I composed a POC patch that re-combines a
large DISCARD/TRIM command into per-device requests, which reduces the
formatting time to 15 seconds. I am now simplifying the patch based on
suggestions from the upstream maintainers.

For raid1, currently most of the feedback concerns read performance.

Coly


RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Dexuan Cui
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Friday, January 13, 2017 02:19
> To: Dexuan Cui 
> Cc: linux-block@vger.kernel.org; KY Srinivasan ; Chris
> Valean (Cloudbase Solutions SRL) 
> Subject: Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve
> handling of the magic discard payload"
> 
> Next try:  (I've also dropped most of the Cc list)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index c35b6de..2f358f7 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1018,7 +1018,10 @@ static int scsi_init_sgtable(struct request *req,
> struct scsi_data_buffer *sdb)
>   count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
>   BUG_ON(count > sdb->table.nents);
>   sdb->table.nents = count;
> - sdb->length = blk_rq_bytes(req);
> + if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
> + sdb->length = req->special_vec.bv_len;
> + else
> + sdb->length = blk_rq_bytes(req);
>   return BLKPREP_OK;
>  }

Hi Christoph,
The patch works like a charm!
fstrim can work now.
Chris may help to do more test.

FWIW:
If (req->rq_flags & RQF_SPECIAL_PAYLOAD) is true,
req->special_vec.bv_len is always 24 in my test.
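
As an aside, 24 presumably corresponds to the single-descriptor UNMAP parameter
list that sd builds as the discard payload - an illustrative layout, not a
struct copied from the kernel source:

struct unmap_parm_list {                /* illustrative only */
        __be16  data_length;            /* 22: bytes following this field */
        __be16  block_desc_length;      /* 16: one block descriptor */
        u8      reserved[4];
        __be64  lba;                    /* first LBA to unmap */
        __be32  num_blocks;             /* number of blocks to unmap */
        u8      reserved2[4];
} __packed;                             /* sizeof() == 24 bytes */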

Thanks really a lot for your quick patch! :-)

Can the patch make it into v4.10? 
IMO It's a really important fix.

Thanks,
-- Dexuan



Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Damien Le Moal


On 1/13/17 00:02, Jeff Moyer wrote:
> Christoph Hellwig  writes:
> 
>> On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
>>> (3) Any other idea ?
>>
>> Do nothing and ignore the problem.  This whole idea is so braindead that
>> the person coming up with the T10 language should be shot.  Either a device
>> has 512 byte logical sectors or 4k, but not this crazy mix.
>>
>> And make sure no one ships such a piece of crap because we are hell
>> not going to support it.
> 
> Agreed.  This is insane.

Christoph, Jeff,

Thank you for the feedback.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


Re: [PATCH 0/2] Rename blk_queue_zone_size and bdev_zone_size

2017-01-12 Thread Damien Le Moal

Jens,

On 1/13/17 05:49, Jens Axboe wrote:
> Just in case you missed it, I had to fold your two patches. Looking at
> it again, what is going on? You rename a function, and then patch #2
> renames the use of that function in a different spot? How did that ever
> pass your testing? For something intended for the current series, please
> be more careful than that, that's just sloppy.

I created two patches, one for each component/maintainer involved:
block/you and f2fs/Jaegeuk. That was indeed a very stupid idea as the 2
patches must always go together (and that is how I tested). A single
patch was the right way to do things. My apologies for this mistake and
thank you very much for fixing it. I will be more careful in the future.

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams
On Thu, Jan 12, 2017 at 3:14 PM, Jerome Glisse  wrote:
> On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
>> Back when we were first attempting to support DMA for DAX mappings of
>> persistent memory the plan was to forgo 'struct page' completely and
>> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
>> effort died in this thread:
>>
>> https://lkml.org/lkml/2015/8/14/3
>>
>> ...where we learned that the dependencies on struct page for dma
>> mapping are deeper than a PFN_PHYS() conversion for some
>> architectures. That was the moment we pivoted to ZONE_DEVICE and
>> arranged for a 'struct page' to be available for any persistent memory
>> range that needs to be the target of DMA. ZONE_DEVICE enables any
>> device-driver that can target "System RAM" to also be able to target
>> persistent memory through a DAX mapping.
>>
>> Since that time the "page-less" DAX path has continued to mature [1]
>> without growing new dependencies on struct page, but at the same time
>> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
>>
>> Peer-to-peer DMA appears to be evolving from a niche embedded use case
>> to something general purpose platforms will need to comprehend. The
>> "map_peer_resource" [2] approach looks to be headed to the same
>> destination as the pfn-to-scatterlist effort. It's difficult to avoid
>> 'struct page' for describing DMA operations without custom driver
>> code.
>>
>> With that background, a statement and a question to discuss at LSF/MM:
>>
>> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
>> requires pfn_to_page() support across the entire physical address
>> range mapped.
>
> Note that in my case it is even worse. The pfn of the page does not
> correspond to anything, so it needs to go through a special function
> to find out if a page can be mapped for another device and to provide a
> valid pfn at which the page can be accessed by the other device.

I still haven't quite wrapped my head about how these pfn ranges are
created. Would this be a use case for a new pfn_t flag? It doesn't
sound like something we'd want to risk describing with raw 'unsigned
long' pfns.
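
To make the pfn_t-flag idea concrete, a purely hypothetical sketch - PFN_PEER
and the helpers below do not exist; they only illustrate how such ranges could
be tagged instead of being described with raw unsigned long pfns (the real
flags such as PFN_DEV and PFN_MAP live in include/linux/pfn_t.h):

#include <linux/pfn_t.h>

/* hypothetical: a free high bit alongside PFN_SG_CHAIN/SG_LAST/DEV/MAP */
#define PFN_PEER        (1ULL << (BITS_PER_LONG_LONG - 5))

static inline pfn_t peer_to_pfn_t(unsigned long pfn)
{
        return __pfn_to_pfn_t(pfn, PFN_DEV | PFN_PEER);
}

static inline bool pfn_t_is_peer(pfn_t pfn)
{
        return (pfn.val & PFN_PEER) != 0;
}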


Re: [LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Jerome Glisse
On Thu, Jan 12, 2017 at 02:43:03PM -0800, Dan Williams wrote:
> Back when we were first attempting to support DMA for DAX mappings of
> persistent memory the plan was to forgo 'struct page' completely and
> develop a pfn-to-scatterlist capability for the dma-mapping-api. That
> effort died in this thread:
> 
> https://lkml.org/lkml/2015/8/14/3
> 
> ...where we learned that the dependencies on struct page for dma
> mapping are deeper than a PFN_PHYS() conversion for some
> architectures. That was the moment we pivoted to ZONE_DEVICE and
> arranged for a 'struct page' to be available for any persistent memory
> range that needs to be the target of DMA. ZONE_DEVICE enables any
> device-driver that can target "System RAM" to also be able to target
> persistent memory through a DAX mapping.
> 
> Since that time the "page-less" DAX path has continued to mature [1]
> without growing new dependencies on struct page, but at the same time
> continuing to rely on ZONE_DEVICE to satisfy get_user_pages().
> 
> Peer-to-peer DMA appears to be evolving from a niche embedded use case
> to something general purpose platforms will need to comprehend. The
> "map_peer_resource" [2] approach looks to be headed to the same
> destination as the pfn-to-scatterlist effort. It's difficult to avoid
> 'struct page' for describing DMA operations without custom driver
> code.
> 
> With that background, a statement and a question to discuss at LSF/MM:
> 
> General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
> requires pfn_to_page() support across the entire physical address
> range mapped.

Note that in my case it is even worse. The pfn of the page does not
correspond to anything, so it needs to go through a special function
to find out if a page can be mapped for another device and to provide a
valid pfn at which the page can be accessed by the other device.

Basically the PCIe BAR is like a window into the device memory that is
dynamically remapped to specific pages of the device memory. Not all device
memory can be exposed through the PCIe BAR because of PCIe issues.

> 
> Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
> collides with platform alignment assumptions [3], and if there's a
> wider effort to rework memory hotplug [4] it seems DMA support should
> be part of the discussion.

Obviously I would like to join this discussion :)

Cheers,
Jérôme


[LSF/MM TOPIC] Memory hotplug, ZONE_DEVICE, and the future of struct page

2017-01-12 Thread Dan Williams
Back when we were first attempting to support DMA for DAX mappings of
persistent memory the plan was to forgo 'struct page' completely and
develop a pfn-to-scatterlist capability for the dma-mapping-api. That
effort died in this thread:

https://lkml.org/lkml/2015/8/14/3

...where we learned that the dependencies on struct page for dma
mapping are deeper than a PFN_PHYS() conversion for some
architectures. That was the moment we pivoted to ZONE_DEVICE and
arranged for a 'struct page' to be available for any persistent memory
range that needs to be the target of DMA. ZONE_DEVICE enables any
device-driver that can target "System RAM" to also be able to target
persistent memory through a DAX mapping.

Since that time the "page-less" DAX path has continued to mature [1]
without growing new dependencies on struct page, but at the same time
continuing to rely on ZONE_DEVICE to satisfy get_user_pages().

Peer-to-peer DMA appears to be evolving from a niche embedded use case
to something general purpose platforms will need to comprehend. The
"map_peer_resource" [2] approach looks to be headed to the same
destination as the pfn-to-scatterlist effort. It's difficult to avoid
'struct page' for describing DMA operations without custom driver
code.

With that background, a statement and a question to discuss at LSF/MM:

General purpose DMA, i.e. any DMA setup through the dma-mapping-api,
requires pfn_to_page() support across the entire physical address
range mapped.

Is ZONE_DEVICE the proper vehicle for this? We've already seen that it
collides with platform alignment assumptions [3], and if there's a
wider effort to rework memory hotplug [4] it seems DMA support should
be part of the discussion.

---

This topic focuses on the mechanism to enable pfn_to_page() for an
arbitrary physical address range, and the proposed peer-to-peer DMA
topic [5] touches on the userspace presentation of this mechanism. It
might be good to combine these topics if there's interest? In any
event, I'm interested in both, as well as Michal's concern about memory
hotplug in general.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2016-November/007672.html
[2]: http://www.spinics.net/lists/linux-pci/msg44560.html
[3]: https://lkml.org/lkml/2016/12/1/740
[4]: http://www.spinics.net/lists/linux-mm/msg119369.html
[5]: http://marc.info/?l=linux-mm&m=148156541804940&w=2


Re: [PATCH 05/15] dm: remove incomple BLOCK_PC support

2017-01-12 Thread Mike Snitzer
On Thu, Jan 12 2017 at  3:00am -0500,
Christoph Hellwig  wrote:

> On Wed, Jan 11, 2017 at 08:09:37PM -0500, Mike Snitzer wrote:
> > I'm not following your reasoning.
> > 
> > dm_blk_ioctl calls __blkdev_driver_ioctl and will call scsi_cmd_ioctl
> > (sd_ioctl -> scsi_cmd_blk_ioctl -> scsi_cmd_ioctl) if DM's underlying
> > block device is a scsi device.
> 
> Yes, it it does.  But scsi_cmd_ioctl as called from sd_ioctl will
> operate entirely on the SCSI request_queue - dm-mpath will never see
> the BLOCK_PC request generated by it.

I lost sight of the fact that BLOCK_PC requests are sent down via the
normal request submission (and not the ioctl path).  So my previous
reply wasn't relevant.

What is "incomplete" about request-based DM's BLOCK_PC support?

This code goes back to when request-based DM multipath was first
introduced via commit cec47e3d4a -- but I've never used the BLOCK_PC
requests for SCSI pass through myself.  I don't know who is using
it.. are you aware of some upper layer filesystem or userspace
submission path for these BLOCK_PC requests that they'd be passing
through DM?

I'm also missing how you're saying the new blk-mq request-based DM will
work with your new model.  I appreciate that we get the request from the
underlying blk-mq request_queue and it'll be properly sized.  But
wouldn't we need to pass data back up for these SCSI pass-through
requests?  So wouldn't the top-level multipath request_queue need to
setup cmd_size?

Sorry for the naive questions (that clearly speak to me not
understanding how this aspect of the block and SCSI code work).. but I'd
like to understand where DM will be lacking going forward.


Re: [PATCH 06/10] blk-mq-tag: cleanup the normal/reserved tag allocation

2017-01-12 Thread Jens Axboe
On Thu, Jan 12 2017, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> > This is in preparation for having another tag set available. Cleanup
> > the parameters, and allow passing in of tags fo blk_mq_put_tag().
> 
> It seems like an 'r' is missing from the description ("tags fo")?

Indeed, good eye. Added.

-- 
Jens Axboe



Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-12 Thread Jens Axboe
On Thu, Jan 12 2017, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> > @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
> >  * processed directly without going through flush machinery.  Queue
> >  * for normal execution.
> >  */
> > -   if ((policy & REQ_FSEQ_DATA) &&
> > -   !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> > -   if (q->mq_ops) {
> > -   blk_mq_insert_request(rq, false, true, false);
> > -   } else
> > +   if (((policy & REQ_FSEQ_DATA) &&
> > +     !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> > +   if (q->mq_ops)
> > +   blk_mq_sched_insert_request(rq, false, true, false);
> > +   else
> > list_add_tail(&rq->queuelist, &q->queue_head);
> > return;
> > }
> 
> Not that it really matters, but this change adds a pair of parentheses --
> "if (e)" is changed into "if ((e))". Is this necessary?

I fixed that up earlier today, as I noticed the same. So that's gone in
the current -git tree.

> > +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> > +void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +   struct blk_mq_hw_ctx *hctx;
> > +   int i;
> > +
> > +   queue_for_each_hw_ctx(q, hctx, i) {
> > +   if (exit)
> > +   exit(hctx);
> > +   kfree(hctx->sched_data);
> > +   hctx->sched_data = NULL;
> > +   }
> > +}
> > +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> > +
> > +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> > +   int (*init)(struct blk_mq_hw_ctx *),
> > +   void (*exit)(struct blk_mq_hw_ctx *))
> > +{
> > +   struct blk_mq_hw_ctx *hctx;
> > +   int ret;
> > +   int i;
> > +
> > +   queue_for_each_hw_ctx(q, hctx, i) {
> > +   hctx->sched_data = kmalloc_node(size, GFP_KERNEL, 
> > hctx->numa_node);
> > +   if (!hctx->sched_data) {
> > +   ret = -ENOMEM;
> > +   goto error;
> > +   }
> > +
> > +   if (init) {
> > +   ret = init(hctx);
> > +   if (ret) {
> > +   /*
> > +* We don't want to give exit() a partially
> > +* initialized sched_data. init() must clean up
> > +* if it fails.
> > +*/
> > +   kfree(hctx->sched_data);
> > +   hctx->sched_data = NULL;
> > +   goto error;
> > +   }
> > +   }
> > +   }
> > +
> > +   return 0;
> > +error:
> > +   blk_mq_sched_free_hctx_data(q, exit);
> > +   return ret;
> > +}
> 
> If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
> blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
> init() has not been called. How about changing "if (exit)" into "if (exit &&
> hctx->sched_data)" such that exit() is only called for hctx's for which
> init() has been called?

Good point, I'll make that change to the exit function.

> > +struct request *blk_mq_sched_get_request(struct request_queue *q,
> > +struct bio *bio,
> > +unsigned int op,
> > +struct blk_mq_alloc_data *data)
> > +{
> > +   struct elevator_queue *e = q->elevator;
> > +   struct blk_mq_hw_ctx *hctx;
> > +   struct blk_mq_ctx *ctx;
> > +   struct request *rq;
> > +
> > +   blk_queue_enter_live(q);
> > +   ctx = blk_mq_get_ctx(q);
> > +   hctx = blk_mq_map_queue(q, ctx->cpu);
> > +
> > +   blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> > +
> > +   if (e) {
> > +   data->flags |= BLK_MQ_REQ_INTERNAL;
> > +   if (e->type->ops.mq.get_request)
> > +   rq = e->type->ops.mq.get_request(q, op, data);
> > +   else
> > +   rq = __blk_mq_alloc_request(data, op);
> > +   } else {
> > +   rq = __blk_mq_alloc_request(data, op);
> > +   if (rq) {
> > +   rq->tag = rq->internal_tag;
> > +   rq->internal_tag = -1;
> > +   }
> > +   }
> > +
> > +   if (rq) {
> > +   rq->elv.icq = NULL;
> > +   if (e && e->type->icq_cache)
> > +   blk_mq_sched_assign_ioc(q, rq, bio);
> > +   data->hctx->queued++;
> > +   return rq;
> > +   }
> > +
> > +   blk_queue_exit(q);
> > +   return NULL;
> > +}
> 
> The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
> here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
> that code into __blk_mq_alloc_request()?

Yes, it's in two locations. I wanted to keep it out of
__blk_mq_alloc_request(), so we can still use that for normal tag
allocations. But maybe it's better for __blk_mq_alloc_request

Re: [PATCH 07/10] blk-mq: abstract out helpers for allocating/freeing tag maps

2017-01-12 Thread Jens Axboe
On Thu, Jan 12 2017, Bart Van Assche wrote:
> On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> > @@ -2392,12 +2425,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
> > if (set->nr_hw_queues > nr_cpu_ids)
> > set->nr_hw_queues = nr_cpu_ids;
> >  
> > +   ret = -ENOMEM;
> > set->tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
> >  GFP_KERNEL, set->numa_node);
> > if (!set->tags)
> > return -ENOMEM;
> >  
> > -   ret = -ENOMEM;
> > set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
> > GFP_KERNEL, set->numa_node);
> > if (!set->mq_map)
> 
> Not that it matters to me, but this change probably isn't needed?

Huh oops no, I'll move that back where it belongs.

-- 
Jens Axboe



Re: [PATCH 10/10] blk-mq-sched: allow setting of default IO scheduler

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> Add Kconfig entries to manage what devices get assigned an MQ
> scheduler, and add a blk-mq flag for drivers to opt out of scheduling.
> The latter is useful for admin type queues that still allocate a blk-mq
> queue and tag set, but aren't use for normal IO.

Reviewed-by: Bart Van Assche


Re: [PATCH 09/10] mq-deadline: add blk-mq adaptation of the deadline IO scheduler

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> This is basically identical to deadline-iosched, except it registers
> as a MQ capable scheduler. This is still a single queue design.

Reviewed-by: Bart Van Assche


Re: [PATCH 08/10] blk-mq-sched: add framework for MQ capable IO schedulers

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> @@ -451,11 +456,11 @@ void blk_insert_flush(struct request *rq)
>* processed directly without going through flush machinery.  Queue
>* for normal execution.
>*/
> - if ((policy & REQ_FSEQ_DATA) &&
> - !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH))) {
> - if (q->mq_ops) {
> - blk_mq_insert_request(rq, false, true, false);
> - } else
> + if (((policy & REQ_FSEQ_DATA) &&
> +  !(policy & (REQ_FSEQ_PREFLUSH | REQ_FSEQ_POSTFLUSH)))) {
> + if (q->mq_ops)
> + blk_mq_sched_insert_request(rq, false, true, false);
> + else
>   list_add_tail(&rq->queuelist, &q->queue_head);
>   return;
>   }

Not that it really matters, but this change adds a pair of parentheses --
"if (e)" is changed into "if ((e))". Is this necessary?

> +void blk_mq_sched_free_hctx_data(struct request_queue *q,
> +  void (*exit)(struct blk_mq_hw_ctx *))
> +{
> + struct blk_mq_hw_ctx *hctx;
> + int i;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + if (exit)
> + exit(hctx);
> + kfree(hctx->sched_data);
> + hctx->sched_data = NULL;
> + }
> +}
> +EXPORT_SYMBOL_GPL(blk_mq_sched_free_hctx_data);
> +
> +int blk_mq_sched_init_hctx_data(struct request_queue *q, size_t size,
> + int (*init)(struct blk_mq_hw_ctx *),
> + void (*exit)(struct blk_mq_hw_ctx *))
> +{
> + struct blk_mq_hw_ctx *hctx;
> + int ret;
> + int i;
> +
> + queue_for_each_hw_ctx(q, hctx, i) {
> + hctx->sched_data = kmalloc_node(size, GFP_KERNEL, 
> hctx->numa_node);
> + if (!hctx->sched_data) {
> + ret = -ENOMEM;
> + goto error;
> + }
> +
> + if (init) {
> + ret = init(hctx);
> + if (ret) {
> + /*
> +  * We don't want to give exit() a partially
> +  * initialized sched_data. init() must clean up
> +  * if it fails.
> +  */
> + kfree(hctx->sched_data);
> + hctx->sched_data = NULL;
> + goto error;
> + }
> + }
> + }
> +
> + return 0;
> +error:
> + blk_mq_sched_free_hctx_data(q, exit);
> + return ret;
> +}

If one of the init() calls by blk_mq_sched_init_hctx_data() fails then
blk_mq_sched_free_hctx_data() will call exit() even for hctx's for which
init() has not been called. How about changing "if (exit)" into "if (exit &&
hctx->sched_data)" such that exit() is only called for hctx's for which
init() has been called?
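
For clarity, a sketch of the free function with that guard applied (based only
on the code quoted above):

void blk_mq_sched_free_hctx_data(struct request_queue *q,
                                 void (*exit)(struct blk_mq_hw_ctx *))
{
        struct blk_mq_hw_ctx *hctx;
        int i;

        queue_for_each_hw_ctx(q, hctx, i) {
                /* only call exit() for hctx's whose init() actually ran */
                if (exit && hctx->sched_data)
                        exit(hctx);
                kfree(hctx->sched_data);
                hctx->sched_data = NULL;
        }
}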

> +struct request *blk_mq_sched_get_request(struct request_queue *q,
> +  struct bio *bio,
> +  unsigned int op,
> +  struct blk_mq_alloc_data *data)
> +{
> + struct elevator_queue *e = q->elevator;
> + struct blk_mq_hw_ctx *hctx;
> + struct blk_mq_ctx *ctx;
> + struct request *rq;
> +
> + blk_queue_enter_live(q);
> + ctx = blk_mq_get_ctx(q);
> + hctx = blk_mq_map_queue(q, ctx->cpu);
> +
> + blk_mq_set_alloc_data(data, q, 0, ctx, hctx);
> +
> + if (e) {
> + data->flags |= BLK_MQ_REQ_INTERNAL;
> + if (e->type->ops.mq.get_request)
> + rq = e->type->ops.mq.get_request(q, op, data);
> + else
> + rq = __blk_mq_alloc_request(data, op);
> + } else {
> + rq = __blk_mq_alloc_request(data, op);
> + if (rq) {
> + rq->tag = rq->internal_tag;
> + rq->internal_tag = -1;
> + }
> + }
> +
> + if (rq) {
> + rq->elv.icq = NULL;
> + if (e && e->type->icq_cache)
> + blk_mq_sched_assign_ioc(q, rq, bio);
> + data->hctx->queued++;
> + return rq;
> + }
> +
> + blk_queue_exit(q);
> + return NULL;
> +}

The "rq->tag = rq->internal_tag; rq->internal_tag = -1;" occurs not only
here but also in blk_mq_alloc_request_hctx(). Has it been considered to move
that code into __blk_mq_alloc_request()?

> @@ -223,14 +225,17 @@ struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data,
>  
>   tag = blk_mq_get_tag(data);
>   if (tag != BLK_MQ_TAG_FAIL) {
> - rq = data->hctx->tags->rqs[tag];
> + struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
> +
> + rq = tags->rqs[tag];
>  
>   if (blk_mq_tag_busy(data->hctx)) {
>   rq->rq_flags = RQF_MQ_INFLIGHT;
>   

Re: [PATCH 07/10] blk-mq: abstract out helpers for allocating/freeing tag maps

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:40 -0700, Jens Axboe wrote:
> @@ -2392,12 +2425,12 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   if (set->nr_hw_queues > nr_cpu_ids)
>   set->nr_hw_queues = nr_cpu_ids;
>  
> + ret = -ENOMEM;
>   set->tags = kzalloc_node(nr_cpu_ids * sizeof(struct blk_mq_tags *),
>GFP_KERNEL, set->numa_node);
>   if (!set->tags)
>   return -ENOMEM;
>  
> - ret = -ENOMEM;
>   set->mq_map = kzalloc_node(sizeof(*set->mq_map) * nr_cpu_ids,
>   GFP_KERNEL, set->numa_node);
>   if (!set->mq_map)

Not that it matters to me, but this change probably isn't needed?

Anyway:

Reviewed-by: Bart Van Assche 


Re: [PATCH 06/10] blk-mq-tag: cleanup the normal/reserved tag allocation

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> This is in preparation for having another tag set available. Cleanup
> the parameters, and allow passing in of tags fo blk_mq_put_tag().

It seems like an 'r' is missing from the description ("tags fo")?

Anyway:

Reviewed-by: Bart Van Assche


Re: [PATCH 05/10] blk-mq: export some helpers we need to the scheduling framework

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> [ ... ]

Reviewed-by: Bart Van Assche


Re: [PATCH 04/10] blk-mq: un-export blk_mq_free_hctx_request()

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> It's only used in blk-mq, kill it from the main exported header
> and kill the symbol export as well.

Reviewed-by: Bart Van Assche


Re: [PATCH 03/10] block: move rq_ioc() to blk.h

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> We want to use it outside of blk-core.c.

Reviewed-by: Bart Van Assche


Re: [PATCH 01/10] block: move existing elevator ops to union

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> Prep patch for adding MQ ops as well, since doing anon unions with
> named initializers doesn't work on older compilers.

Reviewed-by: Bart Van Assche


Re: [PATCHSET v6] blk-mq scheduling framework

2017-01-12 Thread Bart Van Assche
On Wed, 2017-01-11 at 14:39 -0700, Jens Axboe wrote:
> I've reworked bits of this to get rid of the shadow requests, thanks
> to Bart for the inspiration. The missing piece, for me, was the fact
> that we have the tags->rqs[] indirection array already. I've done this
> somewhat differently, though, by having the internal scheduler tag
> map be allocated/torn down when an IO scheduler is attached or
> detached. This also means that when we run without a scheduler, we
> don't have to do double tag allocations, it'll work like before.

Hello Jens,

Thanks for having done the rework! This series looks great to me. I have a
few small comments though. I will post these as replies to the individual
patches.

Bart.


Re: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Jeff Moyer
Hi, Kashyap,

I'm CC-ing Kent, seeing how this is his code.

Kashyap Desai  writes:

> Objective of this patch is - 
>
> To move code used in bcache module in block layer which is used to
> find IO stream.  Reference code @drivers/md/bcache/request.c
> check_should_bypass().  This is a high level patch for review and
> understand if it is worth to follow ?
>
> As of now bcache module use this logic, but good to have it in block
> layer and expose function for external use.
>
> In this patch, I move logic of sequential IO search in block layer and
> exposed function blk_queue_rq_seq_cutoff.  Low level driver just need
> to call if they want stream detection per request queue.  For my
> testing I just added call blk_queue_rq_seq_cutoff(sdev->request_queue,
> 4) megaraid_sas driver.
>  
> In general, code of bcache module was referred and they are doing
> almost same as what we want to do in megaraid_sas driver below patch -
>
> http://marc.info/?l=linux-scsi&m=148245616108288&w=2
>  
> bcache implementation use search algorithm (hashed based on bio start
> sector) and detects 128 streams.  wanted those implementation
> to skip sequential IO to be placed on SSD and move it direct to the
> HDD.
>
> Will it be good design to keep this algorithm open at block layer (as
> proposed in patch.) ?

It's almost always a good idea to avoid code duplication, but this patch
definitely needs some work.

I haven't looked terribly closely at the bcache implementation, so do
let me know if I've misinterpreted something.

We should track streams per io_context/queue pair.  We already have a
data structure for that, the io_cq.  Right now that structure is
tailored for use by the I/O schedulers, but I'm sure we could rework
that.  That would also get rid of the tremendous amount of bloat this
patch adds to the request_queue.  It will also allow us to remove the
bcache-specific fields that were added to task_struct.  Overall, it
should be a good simplification, unless I've completely missed the point
(which happens).

I don't like that you put sequential I/O detection into bio_check_eod.
Split it out into its own function.

You've added a member to struct bio that isn't referenced.  It would
have been nice of you to put enough work into this RFC so that we could
at least see how the common code was used by bcache and your driver.

EWMA (exponentially weighted moving average) is not an acronym I keep
handy in my head.  It would be nice to add documentation on the
algorithm and design choices.  More comments in the code would also be
appreciated.  CFQ does some similar things (detecting sequential
vs. seeky I/O) in a much lighter-weight fashion.  Any change to the
algorithm, of course, would have to be verified to still meet bcache's
needs.
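
For reference, with factor = 0 (as the patch uses it), the blk_ewma_add() macro
in the quoted patch reduces to the textbook exponentially weighted moving
average; written out as a plain helper:

/*
 * new_avg = ((weight - 1) * old_avg + new_sample) / weight
 * e.g. with weight = 8, each new sample contributes 1/8 of the average.
 */
static unsigned int ewma_update(unsigned int avg, unsigned int sample,
                                unsigned int weight)
{
        return ((weight - 1) * avg + sample) / weight;
}

so the add_sequential() update amounts to
t->sequential_io_avg = ewma_update(t->sequential_io_avg, t->sequential_io, 8).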

A queue flag might be a better way for the driver to request this
functionality.
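
A hypothetical illustration of that alternative - QUEUE_FLAG_SEQ_DETECT does
not exist and the bit value is arbitrary; it only shows how a driver could opt
in with a flag instead of calling a new exported setup function:

#include <linux/blkdev.h>
#include <scsi/scsi_device.h>

/* hypothetical flag; bit value chosen only for illustration */
#define QUEUE_FLAG_SEQ_DETECT   29

static int my_slave_configure(struct scsi_device *sdev)
{
        /* opt this device's queue in to sequential I/O detection */
        queue_flag_set_unlocked(QUEUE_FLAG_SEQ_DETECT, sdev->request_queue);
        return 0;
}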

Coding style will definitely need fixing.

I hope that was helpful.

Cheers,
Jeff

>
> Signed-off-by: Kashyap desai 
> ---
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 14d7c07..2e93d14 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -693,6 +693,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>  {
>   struct request_queue *q;
>   int err;
> + struct seq_io_tracker *io;
>  
>   q = kmem_cache_alloc_node(blk_requestq_cachep,
>   gfp_mask | __GFP_ZERO, node_id);
> @@ -761,6 +762,15 @@ struct request_queue *blk_alloc_queue_node(gfp_t 
> gfp_mask, int node_id)
>  
>   if (blkcg_init_queue(q))
>   goto fail_ref;
> + 
> + q->sequential_cutoff = 0;
> + spin_lock_init(&q->io_lock);
> + INIT_LIST_HEAD(&q->io_lru);
> +
> + for (io = q->io; io < q->io + BLK_RECENT_IO; io++) {
> + list_add(&io->lru, &q->io_lru);
> + hlist_add_head(&io->hash, q->io_hash + BLK_RECENT_IO);
> + }
>  
>   return q;
>  
> @@ -1876,6 +1886,26 @@ static inline int bio_check_eod(struct bio *bio, 
> unsigned int nr_sectors)
>   return 0;
>  }
>  
> +static void add_sequential(struct task_struct *t)
> +{
> +#define blk_ewma_add(ewma, val, weight, factor) \
> +({  \
> +(ewma) *= (weight) - 1; \
> +(ewma) += (val) << factor;  \
> +(ewma) /= (weight); \
> +(ewma) >> factor;   \
> +})
> +
> + blk_ewma_add(t->sequential_io_avg,
> +  t->sequential_io, 8, 0);
> +
> + t->sequential_io = 0;
> +}
> +static struct hlist_head *blk_iohash(struct request_queue *q, uint64_t k)
> +{
> + return &q->io_hash[hash_64(k, BLK_RECENT_IO_BITS)];
> +}
> +
>  static noinline_for_stack bool
>  generic_make_request_checks(struct bio *bio)
>  {
> @@ -1884,6 +1914,7 @@ static inline int bio_check_eod(s

Re: [PATCH 0/2] Rename blk_queue_zone_size and bdev_zone_size

2017-01-12 Thread Jens Axboe
On 01/11/2017 09:38 PM, Jens Axboe wrote:
> On 01/11/2017 09:36 PM, Damien Le Moal wrote:
>> Jens,
>>
>> On 1/12/17 12:52, Jens Axboe wrote:
>>> On Thu, Jan 12 2017, Damien Le Moal wrote:
 All block device data fields and functions returning a number of 512B
 sectors are by convention named xxx_sectors while names in the form
 of xxx_size are generally used for a number of bytes. The 
 blk_queue_zone_size
 and bdev_zone_size functions were not following this convention so rename
 them.

 This is a style fix and no functional change is introduced by this patch.
>>>
>>> I agree, this cleans it up. Applied.
>>
>> Thank you. I saw that you applied to for-4.11/block. Could we get these
>> in applied to 4.10-rc so that the zoned block device API is cleaner from
>> the first stable release of that API?
> 
> Sure, I did consider that as well. Since I just pushed out the 4.11
> branch, I'll rebase and yank these into the 4.10 branch instead.

Just in case you missed it, I had to fold your two patches. Looking at
it again, what is going on? You rename a function, and then patch #2
renames the use of that function in a different spot? How did that ever
pass your testing? For something intended for the current series, please
be more careful than that, that's just sloppy.

-- 
Jens Axboe



Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Bart Van Assche
On Thu, 2017-01-12 at 10:41 +0200, Sagi Grimberg wrote:
> First, when the nvme device fires an interrupt, the driver consumes
> the completion(s) from the interrupt (usually there will be some more
> completions waiting in the cq by the time the host start processing it).
> With irq-poll, we disable further interrupts and schedule soft-irq for
> processing, which if at all, improve the completions per interrupt
> utilization (because it takes slightly longer before processing the cq).
> 
> Moreover, irq-poll is budgeting the completion queue processing which is
> important for a couple of reasons.
> 
> 1. it prevents hard-irq context abuse like we do today. if other cpu
> cores are pounding with more submissions on the same queue, we might
> get into a hard-lockup (which I've seen happening).
> 
> 2. irq-poll maintains fairness between devices by correctly budgeting
> the processing of different completions queues that share the same
> affinity. This can become crucial when working with multiple nvme
> devices, each has multiple io queues that share the same IRQ
> assignment.
> 
> 3. It reduces (or at least should reduce) the overall number of
> interrupts in the system because we only enable interrupts again
> when the completion queue is completely processed.
> 
> So overall, I think it's very useful for nvme and other modern HBAs,
> but unfortunately, other than solving (1), I wasn't able to see
> performance improvement but rather a slight regression, but I can't
> explain where its coming from...

Hello Sagi,

Thank you for the additional clarification. Although I am not sure whether
irq-poll is the ideal solution for the problems that has been described
above, I agree that it would help to discuss this topic further during
LSF/MM.

Bart.
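
For readers not familiar with the API, a minimal sketch of the irq-poll pattern
Sagi describes above - hypothetical driver code; my_queue, my_disable_irq(),
my_process_one_cqe() and my_enable_irq() are made-up names, only
irq_poll_init()/irq_poll_sched()/irq_poll_complete() are the real lib/irq_poll.c
entry points:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

struct my_queue {
        struct irq_poll iop;
        /* ... completion queue state ... */
};

static void my_disable_irq(struct my_queue *q);
static void my_enable_irq(struct my_queue *q);
static bool my_process_one_cqe(struct my_queue *q);    /* true if one consumed */

static irqreturn_t my_hardirq(int irq, void *data)
{
        struct my_queue *q = data;

        my_disable_irq(q);              /* no further hard interrupts */
        irq_poll_sched(&q->iop);        /* defer CQ processing to softirq */
        return IRQ_HANDLED;
}

static int my_poll(struct irq_poll *iop, int budget)
{
        struct my_queue *q = container_of(iop, struct my_queue, iop);
        int done = 0;

        while (done < budget && my_process_one_cqe(q))
                done++;

        if (done < budget) {
                /* queue drained: stop polling, re-enable the interrupt */
                irq_poll_complete(iop);
                my_enable_irq(q);
        }
        return done;
}

/* at setup time: irq_poll_init(&q->iop, 64, my_poll); */

The budget argument bounds per-invocation work (point 1), the shared per-CPU
softirq budget across instances provides the fairness of point 2, and the
interrupt is only re-enabled once the queue is drained (point 3).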


RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Chris Valean (Cloudbase Solutions SRL)
Hi Christoph,

Adding Nick and Alex to the thread.
We'll give it a try along with Dexuan and update you with the results.

Thank you!
Chris Valean

-Original Message-
From: Christoph Hellwig [mailto:h...@lst.de] 
Sent: Thursday, January 12, 2017 8:19 PM
To: Dexuan Cui 
Cc: linux-block@vger.kernel.org; KY Srinivasan ; Chris 
Valean (Cloudbase Solutions SRL) 
Subject: Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve 
handling of the magic discard payload"

Next try:  (I've also dropped most of the Cc list)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 
c35b6de..2f358f7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1018,7 +1018,10 @@ static int scsi_init_sgtable(struct request *req, struct 
scsi_data_buffer *sdb)
count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
BUG_ON(count > sdb->table.nents);
sdb->table.nents = count;
-   sdb->length = blk_rq_bytes(req);
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
+   sdb->length = req->special_vec.bv_len;
+   else
+   sdb->length = blk_rq_bytes(req);
return BLKPREP_OK;
 }



Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 04:41:00PM +0200, Sagi Grimberg wrote:
> 
> >>**Note: when I ran multiple threads on more cpus the performance
> >>degradation phenomenon disappeared, but I tested on a VM with
> >>qemu emulation backed by null_blk so I figured I had some other
> >>bottleneck somewhere (that's why I asked for some more testing).
> >
> >That could be because of the vmexits as every MMIO access in the guest
> >triggers a vmexit and if you poll with a low budget you do more MMIOs hence
> >you have more vmexits.
> >
> >Did you do testing only in qemu or with real H/W as well?
> 
> I tried once. IIRC, I saw the same phenomenons...

JFTR I tried my AHCI irq_poll patch on the Qemu emulation and the read
throughput dropped from ~1GB/s to ~350MB/s. But this can be related to
Qemu's I/O weirdness as well I think. I'll try on real hardware tomorrow.

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Christoph Hellwig
Next try:  (I've also dropped most of the Cc list)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c35b6de..2f358f7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1018,7 +1018,10 @@ static int scsi_init_sgtable(struct request *req, struct 
scsi_data_buffer *sdb)
count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
BUG_ON(count > sdb->table.nents);
sdb->table.nents = count;
-   sdb->length = blk_rq_bytes(req);
+   if (req->rq_flags & RQF_SPECIAL_PAYLOAD)
+   sdb->length = req->special_vec.bv_len;
+   else
+   sdb->length = blk_rq_bytes(req);
return BLKPREP_OK;
 }



Re: [dm-devel] [LSF/MM TOPIC][LSF/MM ATTEND] multipath redesign

2017-01-12 Thread Benjamin Marzinski
On Thu, Jan 12, 2017 at 09:27:40AM +0100, Hannes Reinecke wrote:
> On 01/11/2017 11:23 PM, Mike Snitzer wrote:
> >On Wed, Jan 11 2017 at  4:44am -0500,
> >Hannes Reinecke  wrote:
> >
> >>Hi all,
> >>
> >>I'd like to attend LSF/MM this year, and would like to discuss a
> >>redesign of the multipath handling.
> >>
> >>With recent kernels we've got quite some functionality required for
> >>multipathing already implemented, making some design decisions of the
> >>original multipath-tools implementation quite pointless.
> >>
> >>I'm working on a proof-of-concept implementation which just uses a
> >>simple configfs interface and doesn't require a daemon altogether.
> >>
> >>At LSF/MM I'd like to discuss how to move forward here, and whether we'd
> >>like to stay with the current device-mapper integration or move away
> >>from that towards a stand-alone implementation.
> >
> >I'd really like open exchange of the problems you're having with the
> >current multipath-tools and DM multipath _before LSF_.  Last LSF only
> >scratched the surface on people having disdain for the complexity that is
> >the multipath-tools userspace.  But considering how much of the
> >multipath-tools you've written I find it fairly comical that you're the
> >person advocating switching away from it.
> >
> Yeah, I know.
> 
> But I've stared long and hard at the code, and found some issues really hard
> to overcome. Even more so as most things it does are really pointless.
> 
> multipathd _insists_ on redoing the _entire_ device layout for basically any
> operation (except for path checking).
> As the data structures allow only for a single setup it uses a lock per
> multipath device to protect against concurrent changes.
> When lots of uevents are to be processed this lock is heavily contended,
> leading to a slow-down of uevent processing.
> (cf the patchseries from Tang Junhui and my earlier pathset for
> lock pushdown)
> 
> I've tried to move that lock down even further with distinct locks for
> device paths and multipath devices, but ultimately failed as it would amount
> to essentially a rewrite of the core engine.

The multipath user-space tools locking IS horrible and touches
everything.  I could never see a way around it that didn't involve
a ground-up redesign.
 
> >But if less userspace involvement is needed then fix userspace.  Fail to
> >see how configfs is any different than the established DM ioctl interface.
> >
> >As I just said in another email DM multipath could benefit from
> >factoring out the SCSI-specific bits so that they are nicely optimized
> >away if using new transports (e.g. NVMEoF).
> >
> >Could be lessons can be learned from your approach but I'd prefer we
> >provably exhaust the utility of the current DM multipath kernel
> >implementation.  DM multipath is one of the most actively maintained and
> >updated DM targets (aside from thinp and cache).  As you know DM
> >multipath has grown blk-mq support which yielded serious performance
> >improvement.  You also noted (in an earlier email) that I reintroduced
> >bio-based DM multipath.  On a data path level we have all possible block
> >core interfaces plumbed.  And yes, they all involve cloning due to the
> >underlying Device Mapper core.  Open to any ideas on optimization.  If
> >DM is imposing some inherent performance limitation then please report
> >it accordingly.
> >
> Ah. And I thought you disliked request-based multipathing ...
> 
> It's not _actually_ the DM interface which I'm objecting to, it's more the
> user-space implementation.
> The daemon is build around some design decisions which are simply not
> applicable anymore:
> - we now _do_ have reliable device identifications, so the the 'path_id'
> functionality is pointless.

This could be largely fixed in the existing code. The route that the
latest patches from Tang Junhui are taking still grabs the wwid if we got
it from the uevent, but it isn't necessary, as long as we're careful.
Currently rbd devices don't get their wwid from the uevent but all other
devices do. It would probably be possible to write an rbd device udev
rule to set a variable so that they can work through udev environment
variables too.

> - The 'alua' device handler also provides you with reliable priority
> information, so it should be possible to do away with the 'prio' setting,
> too.

But this isn't true for all devices. Also, Like I mentioned last year
when this got brought up, no matter how we group the paths, there end up
being users that have good reasons why they want them grouped
differently in their case.  The path priority/grouping seems like one
place where evidence has shown that we should give users the tools to
make policy decisions, instead of making them ourselves.

> - And for (most) SCSI devices the 'state' setting provides a reliable
> indicator if the device is useable.

This is also not true for all devices.

So, are you planning on creating a multipath implementation that only
handles some devices? Obvious

RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Dexuan Cui
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, January 12, 2017 23:53
> To: Dexuan Cui 
> Cc: Christoph Hellwig ; linux-block@vger.kernel.org; Jens Axboe
> ; Vitaly Kuznetsov ; linux-
> ker...@vger.kernel.org; KY Srinivasan ; Chris Valean
> (Cloudbase Solutions SRL) 
> Subject: Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve
> handling of the magic discard payload"
> 
> Can you check if this debug printk triggers for the discard commands?
> 
> ---
> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> index 888e16e..7ab7d08 100644
> --- a/drivers/scsi/storvsc_drv.c
> +++ b/drivers/scsi/storvsc_drv.c
> @@ -1031,6 +1031,10 @@ static void storvsc_command_completion(struct
> storvsc_cmd_request *cmd_request,
>   data_transfer_length = 0;
>   }
> 
> + if (cmd_request->payload->range.len != data_transfer_length)
> + printk_ratelimited("request len: %u, transfer len: %u\n",
> + cmd_request->payload->range.len,
> + data_transfer_length);
>   scsi_set_resid(scmnd,
>   cmd_request->payload->range.len - data_transfer_length);
> 

// I fixed the small build issue (data_transfer_length ==> vm_srb->data_transfer_length).

No, the printk doesn't trigger for fstrim.

It does trigger at the early boot phase, though.

# dmesg |grep len:
[0.00] log_buf_len: 134217728 bytes
[7.073423] request len: 255, transfer len: 12
[7.084937] request len: 255, transfer len: 52
[7.121728] request len: 64, transfer len: 12
[7.121915] request len: 64, transfer len: 12
[7.123180] request len: 64, transfer len: 12
[7.123367] request len: 64, transfer len: 12
[7.127193] request len: 64, transfer len: 12
[7.127350] request len: 64, transfer len: 12
[7.178930] request len: 255, transfer len: 12
[7.179045] request len: 255, transfer len: 52

-- Dexuan


Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Christoph Hellwig
Can you check if this debug printk triggers for the discard commands?

---
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 888e16e..7ab7d08 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1031,6 +1031,10 @@ static void storvsc_command_completion(struct 
storvsc_cmd_request *cmd_request,
data_transfer_length = 0;
}
 
+   if (cmd_request->payload->range.len != data_transfer_length)
+   printk_ratelimited("request len: %u, transfer len: %u\n",
+   cmd_request->payload->range.len,
+   data_transfer_length);
scsi_set_resid(scmnd,
cmd_request->payload->range.len - data_transfer_length);
 


RE: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Dexuan Cui
> From: Christoph Hellwig [mailto:h...@lst.de]
> Sent: Thursday, January 12, 2017 21:44
> To: Dexuan Cui 
> Cc: Christoph Hellwig ; linux-block@vger.kernel.org; Jens Axboe
> ; Vitaly Kuznetsov ; linux-
> ker...@vger.kernel.org; KY Srinivasan ; Chris Valean
> (Cloudbase Solutions SRL) 
> Subject: Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve
> handling of the magic discard payload"
> 
> Hi Dexuan,
> 
> sorry for dropping the ball on the previous private report, I hoped
> I could get my hands on a Hyper-V VM and reproduce it myself, but
> that has obviously not happened.
> 
> Can you send me the output of the provisioning_mode file for the
> scsi disk in question to get started?

Hi Christoph,
Thank you very much for the help! 

The file just shows "unmap":

root@decui-u1604:~# cd /sys/class/scsi_disk/2\:0\:0\:0
root@decui-u1604:/sys/class/scsi_disk/2:0:0:0# ls
allow_restart  cache_type      device  manage_start_stop           max_write_same_blocks  protection_mode  provisioning_mode  thin_provisioning
app_tag_own    deferred_probe  FUA     max_medium_access_timeouts  power                  protection_type  subsystem          uevent
root@decui-u1604:/sys/class/scsi_disk/2:0:0:0# cat provisioning_mode
unmap
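
(Side note, hedged: whether the block layer actually derived discard limits
from this mode can be checked with something like

cat /sys/block/sdb/queue/discard_granularity
cat /sys/block/sdb/queue/discard_max_bytes
lsblk -D /dev/sdb

assuming the disk in question is sdb.)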

I'm ready to provide any info you need. :-)

Thanks,
-- Dexuan


Re: [Lsf-pc] [LSF/MM TOPIC] [LSF/MM ATTEND] md raid general discussion

2017-01-12 Thread Sagi Grimberg

Hey Coly,


Also I receive reports from users that better raid1 performance is desired when
it is built on NVMe SSDs as a cache (maybe bcache or dm-cache). I am
working on some raid1 performance improvements (e.g. a new raid1 I/O
barrier and lockless raid1 I/O submit), and have some more ideas to discuss.


Do you have some performance measurements to share?

Mike used null devices to simulate very fast devices which
led to nice performance enhancements in dm-multipath code.
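
(For anyone wanting to reproduce that kind of setup, a minimal sketch using
the in-tree null_blk module parameters -- the exact values are just an
example:

modprobe null_blk queue_mode=2 irqmode=0 completion_nsec=0 nr_devices=4 submit_queues=4

queue_mode=2 selects blk-mq and completion_nsec=0 completes I/O immediately,
so the layers above the driver become the bottleneck rather than the
"hardware" itself.)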


Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Jeff Moyer
Christoph Hellwig  writes:

> On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
>> (3) Any other idea ?
>
> Do nothing and ignore the problem.  This whole idea is so braindead that
> the person coming up with the T10 language should be shot.  Either a device
> has 512 byte logical sectors or 4k but not this crazy mix.
>
> And make sure no one ships such a piece of crap because we are sure as hell
> not going to support it.

Agreed.  This is insane.

-Jeff


Re: [Lsf-pc] [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).


That could be because of the vmexits, as every MMIO access in the guest
triggers a vmexit, and if you poll with a low budget you do more MMIOs and
hence have more vmexits.

Did you do testing only in qemu or with real H/W as well?


I tried once. IIRC, I saw the same phenomena...


Re: [Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Christoph Hellwig
Hi Dexuan,

sorry for dropping the ball on the previous private report, I hoped
I could get my hands on a Hyper-V VM and reproduce it myself, but
that has obviously not happened.

Can you send me the output of the provisioning_mode file for the
scsi disk in question to get started?


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 01:44:05PM +0200, Sagi Grimberg wrote:
[...]
> Its pretty basic:
> --
> [global]
> group_reporting
> cpus_allowed=0
> cpus_allowed_policy=split
> rw=randrw
> bs=4k
> numjobs=4
> iodepth=32
> runtime=60
> time_based
> loops=1
> ioengine=libaio
> direct=1
> invalidate=1
> randrepeat=1
> norandommap
> exitall
> 
> [job]
> --
> 
> **Note: when I ran multiple threads on more cpus the performance
> degradation phenomenon disappeared, but I tested on a VM with
> qemu emulation backed by null_blk so I figured I had some other
> bottleneck somewhere (that's why I asked for some more testing).

That could be because of the vmexits, as every MMIO access in the guest
triggers a vmexit, and if you poll with a low budget you do more MMIOs and
hence have more vmexits.

Did you do testing only in qemu or with real H/W as well?

> 
> Note that I ran randrw because I was backed with null_blk; when testing
> with a real nvme device, you should either run randread or write, and
> if you do a write, you can't run it multi-threaded (well you can, but
> you'll get unpredictable performance...).

Noted, thanks.

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I agree with Jens that we'll need some analysis if we want the
discussion to be effective, and I can spend some time on this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices).


I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
life.

I also have converted AHCI to the irq_poll interface and will run some tests.
I do also have some hpsa devices on which I could run tests once the driver is
adopted.

But can we come to a common testing methodology so as not to compare apples with
oranges? Sagi, do you still have the fio job file from your last tests lying
somewhere, and if yes could you share it?


Its pretty basic:
--
[global]
group_reporting
cpus_allowed=0
cpus_allowed_policy=split
rw=randrw
bs=4k
numjobs=4
iodepth=32
runtime=60
time_based
loops=1
ioengine=libaio
direct=1
invalidate=1
randrepeat=1
norandommap
exitall

[job]
--

**Note: when I ran multiple threads on more cpus the performance
degradation phenomenon disappeared, but I tested on a VM with
qemu emulation backed by null_blk so I figured I had some other
bottleneck somewhere (that's why I asked for some more testing).

Note that I ran randrw because I was backed with null_blk; when testing
with a real nvme device, you should either run randread or write, and
if you do a write, you can't run it multi-threaded (well you can, but
you'll get unpredictable performance...).


Re: [Lsf-pc] [LFS/MM TOPIC][LFS/MM ATTEND]: - Storage Stack and Driver Testing methodology.

2017-01-12 Thread Sagi Grimberg



Hi Folks,

I would like to propose a general discussion on Storage stack and device driver 
testing.


I think it's very useful and needed.


Purpose:-
-
The main objective of this discussion is to address the need for
a Unified Test Automation Framework which can be used by different subsystems
in the kernel in order to improve the overall development and stability
of the storage stack.

For Example:-
From my previous experience, I worked on NVMe driver testing last year and we
developed a simple unit test framework
(https://github.com/linux-nvme/nvme-cli/tree/master/tests).
In the current implementation the upstream NVMe driver supports the following subsystems:-
1. PCI Host.
2. RDMA Target.
3. Fibre Channel Target (in progress).
Today, due to the lack of a centralized automated test framework, NVMe driver testing is
scattered and performed using a combination of various utilities like nvme-cli/tests,
nvmet-cli, shell scripts (git://git.infradead.org/nvme-fabrics.git nvmf-selftests) etc.

In order to improve overall driver stability with various subsystems, it will be beneficial
to have a Unified Test Automation Framework (UTAF) which will centralize overall
testing.

This topic will allow developers from various subsystems to engage in the discussion about
how to collaborate efficiently instead of having discussions on lengthy email threads.


While a unified test framework for all sounds great, I suspect that the
difference might be too large. So I think that for this framework to be
maintainable, it needs to be carefully designed such that we don't have
too much code churn.

For example we should start by classifying tests and then see where
sharing is feasible:

1. basic management - I think not a lot can be shared
2. spec compliance - again, not much sharing here
3. data-verification - probably everything can be shared
4. basic performance - probably a lot can be shared
5. vectored-io - probably everything can be shared
6. error handling - I can think of some sharing that can be used.

This repository can also store some useful tracing scripts (ebpf and
friends) that are useful for performance analysis.

So I think that for this to happen, we can start with the shared
tests under block/, then migrate proto-specific tests into
scsi/, nvme/, and then add transport-specific tests so
we can have something like:

├── block
├── lib
├── nvme
│   ├── fabrics
│   │   ├── loop
│   │   └── rdma
│   └── pci
└── scsi
    ├── fc
    └── iscsi

Thoughts?


[Regression] fstrim hangs on Hyper-V: caused by "block: improve handling of the magic discard payload"

2017-01-12 Thread Dexuan Cui
Hi,
Recently fstrim and mkfs always hang in a Linux VM running on Hyper-V 2012 R2
or 2016.
The VM uses the latest mainline kernel (v4.10-rc3).

git-bisect shows the patch 
"block: improve handling of the magic discard payload (f9d03f96)"
causes the issue. 
If I revert the patch, the issue will go away.

When the issue happens, any new shell command causing disk I/O will hang too,
and I can't even reboot the VM due to the pending I/O.

It seems blkdev_issue_discard() never returns, meaning the SCSI Unmap 
command(s) 
can't finish somehow, I think.

Any idea why the patch can cause this?

Thanks!
-- Dexuan

PS, this is the calltrace:

[ 1450.976205] INFO: task fstrim:1300 blocked for more than 120 seconds.
[ 1450.976264]   Not tainted 4.9.0+ #58
[ 1450.976291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1450.976342] fstrim  D0  1300   1280 0x
[ 1450.976382] Call Trace:
[ 1450.976412]  ? __schedule+0x232/0x700
[ 1450.976442]  ? try_to_grab_pending+0xb3/0x160
[ 1450.976476]  schedule+0x36/0x80
[ 1450.976501]  schedule_timeout+0x235/0x3f0
[ 1450.976532]  ? blk_run_queue_async+0x3c/0x40
[ 1450.976565]  io_schedule_timeout+0xa4/0x110
[ 1450.976596]  wait_for_completion_io+0xa5/0x110
[ 1450.976628]  ? wake_up_q+0x70/0x70
[ 1450.976654]  submit_bio_wait+0x59/0x70
[ 1450.976683]  blkdev_issue_discard+0x6a/0xb0
[ 1450.976783]  xfs_trim_extents+0x24c/0x410 [xfs]
[ 1450.976862]  xfs_ioc_trim+0x157/0x1c0 [xfs]
[ 1450.976938]  xfs_file_ioctl+0x8ee/0xb20 [xfs]
[ 1450.976972]  ? path_openat+0x3fb/0x13f0
[ 1450.977002]  ? page_add_file_rmap+0x58/0x140
[ 1450.977035]  ? alloc_set_pte+0x4ee/0x640
[ 1450.977065]  ? do_filp_open+0x92/0xe0
[ 1450.977093]  ? _copy_to_user+0x2e/0x40
[ 1450.977121]  ? cp_new_stat+0x141/0x160
[ 1450.977151]  do_vfs_ioctl+0x92/0x5a0
[ 1450.977178]  ? SYSC_newfstat+0x25/0x30
[ 1450.977206]  SyS_ioctl+0x79/0x90
[ 1450.977232]  entry_SYSCALL_64_fastpath+0x1e/0xad
[ 1450.977264] RIP: 0033:0x7f8cac393687
[ 1450.977290] RSP: 002b:7ffdce06fa38 EFLAGS: 0202 ORIG_RAX: 
0010
[ 1450.977340] RAX: ffda RBX: 00609330 RCX: 7f8cac393687
[ 1450.977386] RDX: 7ffdce06fa40 RSI: c0185879 RDI: 0003
[ 1450.977431] RBP: 7ffdce06fd18 R08:  R09: 
[ 1450.977476] R10: 053f R11: 0202 R12: 
[ 1450.977522] R13:  R14:  R15: 
[ 1450.977570] INFO: task ls:1304 blocked for more than 120 seconds.
[ 1450.977609]   Not tainted 4.9.0+ #58
[ 1450.977636] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1450.977685] ls  D0  1304   1219 0x
[ 1450.977723] Call Trace:
[ 1450.977745]  ? __schedule+0x232/0x700
[ 1450.94]  ? __blk_run_queue+0x33/0x40
[ 1450.977803]  ? queue_unplugged+0x2a/0xb0
[ 1450.977833]  schedule+0x36/0x80
[ 1450.977857]  schedule_timeout+0x235/0x3f0
[ 1450.977886]  ? blk_finish_plug+0x2c/0x40
[ 1450.977963]  ? _xfs_buf_ioapply+0x324/0x440 [xfs]
[ 1450.977998]  wait_for_completion+0xa5/0x110
[ 1450.978028]  ? wake_up_q+0x70/0x70
[ 1450.978107]  ? xfs_trans_read_buf_map+0xf5/0x330 [xfs]
[ 1450.979283]  ? _xfs_buf_read+0x23/0x30 [xfs]
[ 1450.980522]  xfs_buf_submit_wait+0x7f/0x210 [xfs]
[ 1450.981706]  ? xfs_trans_read_buf_map+0xf5/0x330 [xfs]
[ 1450.982863]  _xfs_buf_read+0x23/0x30 [xfs]
[ 1450.984420]  xfs_buf_read_map+0x108/0x180 [xfs]
[ 1450.985559]  xfs_trans_read_buf_map+0xf5/0x330 [xfs]
[ 1450.986672]  xfs_imap_to_bp+0x5f/0xc0 [xfs]
[ 1450.987761]  xfs_iread+0x79/0x320 [xfs]
[ 1450.988894]  xfs_iget+0x32a/0x840 [xfs]
[ 1450.990055]  xfs_lookup+0xc6/0xe0 [xfs]
[ 1450.991132]  xfs_vn_lookup+0x4f/0x90 [xfs]
[ 1450.992221]  lookup_slow+0x96/0x140
[ 1450.993254]  walk_component+0x1ca/0x2f0
[ 1450.994283]  ? path_init+0x1d9/0x330
[ 1450.995309]  ? mntput+0x24/0x40
[ 1450.996955]  path_lookupat+0x5d/0x110
[ 1450.997979]  filename_lookup+0x9e/0x150
[ 1450.999001]  ? kmem_cache_alloc+0xd7/0x1b0
[ 1451.000126]  ? getname_flags+0x56/0x1f0
[ 1451.001150]  ? getname_flags+0x72/0x1f0
[ 1451.002164]  user_path_at_empty+0x36/0x40
[ 1451.003173]  vfs_fstatat+0x53/0xa0
[ 1451.004223]  SYSC_newlstat+0x22/0x40
[ 1451.005232]  SyS_newlstat+0xe/0x10
[ 1451.006233]  entry_SYSCALL_64_fastpath+0x1e/0xad
[ 1451.007750] RIP: 0033:0x7ff2730993d5
[ 1451.008820] RSP: 002b:7ffc7c1650c8 EFLAGS: 0246 ORIG_RAX: 
0006
[ 1451.009880] RAX: ffda RBX: 7ff273366b78 RCX: 7ff2730993d5
[ 1451.010953] RDX: 019dfb20 RSI: 019dfb20 RDI: 7ffc7c1650d0
[ 1451.012078] RBP: 7ff273366b20 R08:  R09: 00c0
[ 1451.013175] R10: 019e4550 R11: 0246 R12: 8041
[ 1451.014260] R13: 7ff273366b78 R14: 270f R15: 7ff273366b78


Re: [PATCH 05/10] blk-mq: export some helpers we need to the scheduling framework

2017-01-12 Thread Johannes Thumshirn
On Wed, Jan 11, 2017 at 02:39:58PM -0700, Jens Axboe wrote:
> Signed-off-by: Jens Axboe 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 01/10] block: move existing elevator ops to union

2017-01-12 Thread Johannes Thumshirn
On Wed, Jan 11, 2017 at 02:39:54PM -0700, Jens Axboe wrote:
> Prep patch for adding MQ ops as well, since doing anon unions with
> named initializers doesn't work on older compilers.
> 
> Signed-off-by: Jens Axboe 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 02/10] blk-mq: make mq_ops a const pointer

2017-01-12 Thread Johannes Thumshirn
On Wed, Jan 11, 2017 at 02:39:55PM -0700, Jens Axboe wrote:
> We never change it, make that clear.
> 
> Signed-off-by: Jens Axboe 
> Reviewed-by: Bart Van Assche 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 03/10] block: move rq_ioc() to blk.h

2017-01-12 Thread Johannes Thumshirn
On Wed, Jan 11, 2017 at 02:39:56PM -0700, Jens Axboe wrote:
> We want to use it outside of blk-core.c.
> 
> Signed-off-by: Jens Axboe 
> ---
Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 04/10] blk-mq: un-export blk_mq_free_hctx_request()

2017-01-12 Thread Johannes Thumshirn
On Wed, Jan 11, 2017 at 02:39:57PM -0700, Jens Axboe wrote:
> It's only used in blk-mq, kill it from the main exported header
> and kill the symbol export as well.
> 
> Signed-off-by: Jens Axboe 
> ---

Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Johannes Thumshirn
On Thu, Jan 12, 2017 at 10:23:47AM +0200, Sagi Grimberg wrote:
> 
> >>>Hi all,
> >>>
> >>>I'd like to attend LSF/MM and would like to discuss polling for block 
> >>>drivers.
> >>>
> >>>Currently there is blk-iopoll but it is neither as widely used as NAPI in
> >>>the networking field and according to Sagi's findings in [1] performance with
> >>>polling is not on par with IRQ usage.
> >>>
> >>>On LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
> >>>more block drivers and how to overcome the currently seen performance
> >>>issues.
> >>
> >>It would be an interesting topic to discuss, as it is a shame that 
> >>blk-iopoll
> >>isn't used more widely.
> >
> >Forgot to mention - it should only be a topic, if experimentation has
> >been done and results gathered to pinpoint what the issues are, so we
> >have something concrete to discuss. I'm not at all interested in a hand
> >wavy discussion on the topic.
> >
> 
> Hey all,
> 
> Indeed I attempted to convert nvme to use irq-poll (let's use its
> new name) but experienced some unexplained performance degradations.
> 
> Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
> this sort of degradation is acceptable I guess because we do schedule
> a soft-irq before consuming the completion, but I noticed ~10% IOPs
> degradation for QD=32 which is not acceptable.
> 
> I agree with Jens that we'll need some analysis if we want the
> discussion to be effective, and I can spend some time on this if I
> can find volunteers with high-end nvme devices (I only have access
> to client nvme devices).

I have a P3700 but somehow burned the FW. Let me see if I can bring it back to
life.

I also have converted AHCI to the irq_poll interface and will run some tests.
I do also have some hpsa devices on which I could run tests once the driver is
adopted.

But can we come to a common testing methodology so as not to compare apples with
oranges? Sagi, do you still have the fio job file from your last tests lying
somewhere, and if yes could you share it?

Byte,
Johannes

-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread sagi grimberg
 A typical Ethernet network adapter delays the generation of an
 interrupt
 after it has received a packet. A typical block device or HBA does not
 delay
 the generation of an interrupt that reports an I/O completion.
>>>
>>> NVMe allows for configurable interrupt coalescing, as do a few modern
>>> SCSI HBAs.
>>
>> Essentially every modern SCSI HBA does interrupt coalescing; otherwise
>> the queuing interface won't work efficiently.
>
> Hello Hannes,
>
> The first e-mail in this e-mail thread referred to measurements against a
> block device for which interrupt coalescing was not enabled. I think that
> the measurements have to be repeated against a block device for which
> interrupt coalescing is enabled.

Hey Bart,

I see how interrupt coalescing can help, but even without it, I think it
should be better.

Moreover, I don't think that strict moderation is something that can
work. The only way interrupt moderation can be effective, is if it's
adaptive and adjusts itself to the workload. Note that this feature
is on by default in most of the modern Ethernet devices (adaptive-rx).

IMHO, irq-poll vs. interrupt polling should be compared without relying
on the underlying device capabilities.
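
(For reference, the NVMe coalescing knob mentioned above is Feature 08h,
Interrupt Coalescing; with nvme-cli it can be poked with something like

nvme set-feature /dev/nvme0 --feature-id=0x08 --value=0x0108

where the value packs an aggregation time of roughly 100us and a threshold of
a few queue entries -- treat the exact encoding as an example, not a
recommendation.)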


Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



I'd like to attend LSF/MM and would like to discuss polling for block
drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in
the networking field and according to Sagi's findings in [1] performance
with polling is not on par with IRQ usage.

On LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
more block drivers and how to overcome the currently seen performance
issues.

[1] http://lists.infradead.org/pipermail/linux-nvme/2016-October/006975.ht
ml


A typical Ethernet network adapter delays the generation of an interrupt
after it has received a packet. A typical block device or HBA does not delay
the generation of an interrupt that reports an I/O completion. I think that
is why polling is more effective for network adapters than for block
devices. I'm not sure whether it is possible to achieve benefits similar to
NAPI for block devices without implementing interrupt coalescing in the
block device firmware. Note: for block device implementations that use the
RDMA API, the RDMA API supports interrupt coalescing (see also
ib_modify_cq()).


Hey Bart,

I don't agree that interrupt coalescing is the reason why irq-poll is
not suitable for nvme or storage devices.

First, when the nvme device fires an interrupt, the driver consumes
the completion(s) from the interrupt (usually there will be some more
completions waiting in the cq by the time the host starts processing it).
With irq-poll, we disable further interrupts and schedule a soft-irq for
processing, which, if anything, improves the completions-per-interrupt
utilization (because it takes slightly longer before processing the cq).

Moreover, irq-poll is budgeting the completion queue processing which is
important for a couple of reasons.

1. it prevents hard-irq context abuse like we do today. if other cpu
   cores are pounding with more submissions on the same queue, we might
   get into a hard-lockup (which I've seen happening).

2. irq-poll maintains fairness between devices by correctly budgeting
   the processing of different completions queues that share the same
   affinity. This can become crucial when working with multiple nvme
   devices, each has multiple io queues that share the same IRQ
   assignment.

3. It reduces (or at least should reduce) the overall number of
   interrupts in the system because we only enable interrupts again
   when the completion queue is completely processed.

So overall, I think it's very useful for nvme and other modern HBAs,
but unfortunately, other than solving (1), I wasn't able to see a
performance improvement but rather a slight regression, and I can't
explain where it's coming from...
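
For readers not familiar with the API being discussed, a hedged sketch of the
pattern described above -- my_queue, my_disable_cq_irq(), my_enable_cq_irq()
and my_process_one_completion() are made-up names, only the irq_poll calls
themselves are the kernel's:

#include <linux/interrupt.h>
#include <linux/irq_poll.h>
#include <linux/kernel.h>

struct my_queue {
        struct irq_poll iop;
        /* ... completion queue state ... */
};

/* hard interrupt: stop further IRQs and defer consumption to softirq */
static irqreturn_t my_cq_interrupt(int irq, void *data)
{
        struct my_queue *q = data;

        my_disable_cq_irq(q);
        irq_poll_sched(&q->iop);
        return IRQ_HANDLED;
}

/* softirq: consume at most 'budget' completions, then yield or re-arm */
static int my_cq_poll(struct irq_poll *iop, int budget)
{
        struct my_queue *q = container_of(iop, struct my_queue, iop);
        int done = 0;

        while (done < budget && my_process_one_completion(q))
                done++;

        if (done < budget) {
                /* queue drained: tell irq_poll we're finished, re-enable IRQ */
                irq_poll_complete(iop);
                my_enable_cq_irq(q);
        }
        return done;
}

/* at queue setup time: irq_poll_init(&q->iop, 64, my_cq_poll); */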


RE: [PATCH] preview - block layer help to detect sequential IO

2017-01-12 Thread Kashyap Desai
> -Original Message-
> From: kbuild test robot [mailto:l...@intel.com]
> Sent: Thursday, January 12, 2017 1:18 AM
> To: Kashyap Desai
> Cc: kbuild-...@01.org; linux-s...@vger.kernel.org;
linux-block@vger.kernel.org;
> ax...@kernel.dk; martin.peter...@oracle.com; j...@linux.vnet.ibm.com;
> sumit.sax...@broadcom.com; Kashyap desai
> Subject: Re: [PATCH] preview - block layer help to detect sequential IO
>
> Hi Kashyap,
>
> [auto build test ERROR on v4.9-rc8]
> [cannot apply to block/for-next linus/master linux/master next-20170111]
> [if your patch is applied to the wrong git tree, please drop us a note to
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Kashyap-Desai/preview-block-layer-help-to-detect-sequential-IO/20170112-024228
> config: i386-randconfig-a0-201702 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
>block/blk-core.c: In function 'add_sequential':
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,


This error is fixable. For now, I just wanted to get a high level review of the
idea.
The defines below are required to use sequential_io and sequential_io_avg. I
have enabled BCACHE for my testing in .config.

#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)
unsigned intsequential_io;
unsigned intsequential_io_avg;
#endif
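
For context, a reconstruction (treat it as a sketch) of the blk_ewma_add()
helper that the build log below keeps referring to -- an exponentially
weighted moving average in the style of bcache's ewma_add():

#define blk_ewma_add(ewma, val, weight, factor)         \
({                                                      \
        (ewma) *= (weight) - 1;                         \
        (ewma) += (val) << (factor);                    \
        (ewma) /= (weight);                             \
        (ewma) >> (factor);                             \
})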

Looking for high level review comment.

` Kashyap


>^
>block/blk-core.c:1893:10: note: in definition of macro 'blk_ewma_add'
> (ewma) *= (weight) - 1; \
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1894:10: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor; \
>  ^~~~
> >> block/blk-core.c:1900:5: error: 'struct task_struct' has no member named 'sequential_io'
>t->sequential_io, 8, 0);
> ^
>block/blk-core.c:1894:20: note: in definition of macro 'blk_ewma_add'
> (ewma) += (val) << factor; \
>^~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1895:10: note: in definition of macro 'blk_ewma_add'
> (ewma) /= (weight); \
>  ^~~~
> >> block/blk-core.c:1899:16: error: 'struct task_struct' has no member named 'sequential_io_avg'
>  blk_ewma_add(t->sequential_io_avg,
>^
>block/blk-core.c:1896:10: note: in definition of macro 'blk_ewma_add'
> (ewma) >> factor; \
>  ^~~~
>block/blk-core.c:1902:3: error: 'struct task_struct' has no member named 'sequential_io'
>  t->sequential_io = 0;
>   ^~
>block/blk-core.c: In function 'generic_make_request_checks':
>block/blk-core.c:2012:7: error: 'struct task_struct' has no member named 'sequential_io'
>   task->sequential_io  = i->sequential;
>   ^~
>In file included from block/blk-core.c:14:0:
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member named 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:2: note: in definition of macro '__max'
>  t1 max1 = (x); \
>  ^~
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2020:21: error: 'struct task_struct' has no member named 'sequential_io'
>   sectors = max(task->sequential_io,
> ^
>include/linux/kernel.h:747:13: note: in definition of macro '__max'
>  t1 max1 = (x); \
> ^
>block/blk-core.c:2020:13: note: in expansion of macro 'max'
>   sectors = max(task->sequential_io,
> ^~~
>block/blk-core.c:2021:14: error: 'struct task_struct' has no member named 'sequential_io_avg'
> 

Re: [LSF/MM TOPIC][LSF/MM ATTEND] multipath redesign

2017-01-12 Thread Hannes Reinecke

On 01/11/2017 11:23 PM, Mike Snitzer wrote:

On Wed, Jan 11 2017 at  4:44am -0500,
Hannes Reinecke  wrote:


Hi all,

I'd like to attend LSF/MM this year, and would like to discuss a
redesign of the multipath handling.

With recent kernels we've got quite some functionality required for
multipathing already implemented, making some design decisions of the
original multipath-tools implementation quite pointless.

I'm working on a proof-of-concept implementation which just uses a
simple configfs interface and doesn't require a daemon altogether.

At LSF/MM I'd like to discuss how to move forward here, and whether we'd
like to stay with the current device-mapper integration or move away
from that towards a stand-alone implementation.


I'd really like open exchange of the problems you're having with the
current multipath-tools and DM multipath _before LSF_.  Last LSF only
scratched the surface on people having disdain for the complexity that is
the multipath-tools userspace.  But considering how much of the
multipath-tools you've written I find it fairly comical that you're the
person advocating switching away from it.


Yeah, I know.

But I've stared long and hard at the code, and found some issues really 
hard to overcome. Even more so as most things it does are really pointless.


multipathd _insists_ on redoing the _entire_ device layout for basically 
any operation (except for path checking).
As the data structures allow only for a single setup it uses a lock per 
multipath device to protect against concurrent changes.
When lots of uevents are to be processed this lock is heavily contended, 
leading to a slow-down of uevent processing.

(cf the patchseries from Tang Junhui and my earlier patchset for
lock pushdown)

I've tried to move that lock down even further with distinct locks for 
device paths and multipath devices, but ultimately failed as it would 
amount to essentially a rewrite of the core engine.



But if less userspace involvement is needed then fix userspace.  Fail to
see how configfs is any different than the established DM ioctl interface.

As I just said in another email DM multipath could benefit from
factoring out the SCSI-specific bits so that they are nicely optimized
away if using new transports (e.g. NVMEoF).

Could be lessons can be learned from your approach but I'd prefer we
provably exhaust the utility of the current DM multipath kernel
implementation.  DM multipath is one of the most actively maintained and
updated DM targets (aside from thinp and cache).  As you know DM
multipath has grown blk-mq support which yielded serious performance
improvement.  You also noted (in an earlier email) that I reintroduced
bio-based DM multipath.  On a data path level we have all possible block
core interfaces plumbed.  And yes, they all involve cloning due to the
underlying Device Mapper core.  Open to any ideas on optimization.  If
DM is imposing some inherent performance limitation then please report
it accordingly.


Ah. And I thought you disliked request-based multipathing ...

It's not _actually_ the DM interface which I'm objecting to, it's more 
the user-space implementation.
The daemon is built around some design decisions which are simply not
applicable anymore:
- we now _do_ have reliable device identifications, so the 'path_id'
functionality is pointless.
- The 'alua' device handler also provides you with reliable priority 
information, so it should be possible to do away with the 'prio' 
setting, too.
- And for (most) SCSI devices the 'state' setting provides a reliable 
indicator if the device is useable.


Hence I've implemented a notifier chain (hooked onto 'struct gendisk') 
which provides events for path up/path down etc.
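
A purely hypothetical sketch of what such a hook could look like -- the event
codes, the path_notifiers field and the function names are invented for
illustration, only the notifier-chain API itself is the kernel's:

#include <linux/notifier.h>
#include <linux/genhd.h>

#define GENDISK_PATH_UP         1       /* invented event codes */
#define GENDISK_PATH_DOWN       2

/* assumed addition to struct gendisk:
 *      struct blocking_notifier_head path_notifiers;
 */

static int mpath_path_event(struct notifier_block *nb,
                            unsigned long event, void *data)
{
        struct gendisk *disk = data;    /* the path device that changed state */

        pr_debug("%s: path event %lu\n", disk->disk_name, event);
        /* look up the multipath member backed by 'disk' and fail or
         * reinstate it depending on 'event' */
        return NOTIFY_OK;
}

static struct notifier_block mpath_path_nb = {
        .notifier_call = mpath_path_event,
};

/* consumer: blocking_notifier_chain_register(&disk->path_notifiers, &mpath_path_nb);
 * producer: blocking_notifier_call_chain(&disk->path_notifiers, GENDISK_PATH_DOWN, disk);
 */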

With that it's possible to automatically fail and reinstate paths.
However, what's missing is an automatic pathgroup switch once all paths 
in a group are down.
In the current implementation the device-mapper target doesn't have any 
inkling about path priorities; it just sees path groups as such.
As it stands it should be reasonably trivial to switch to the next available
pathgroup, but fallback will become ... interesting.
So we would need to update the interface here to allow for path group 
priorities and also for transmitting the fallback information.


Nothing insurmountable, agreed.
But once we do this most of the current functionality of the 
multipath-tools daemon will become obsolete.


Plus I wasn't quite sure about the direction device-mapper itself will 
be going, so I decided to implement a stand-alone version as a testbed.
I'm not trying to push that at all costs; I'm perfectly happy with 
updating device-mapper.

As long as no-one insists we're having to use the bio-based interface ...

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
h...@suse.de                   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton

Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers

2017-01-12 Thread Sagi Grimberg



Hi all,

I'd like to attend LSF/MM and would like to discuss polling for block drivers.

Currently there is blk-iopoll but it is neither as widely used as NAPI in the
networking field and according to Sagi's findings in [1] performance with
polling is not on par with IRQ usage.

On LSF/MM I'd like to discuss whether it is desirable to have NAPI-like polling in
more block drivers and how to overcome the currently seen performance issues.


It would be an interesting topic to discuss, as it is a shame that blk-iopoll
isn't used more widely.


Forgot to mention - it should only be a topic, if experimentation has
been done and results gathered to pinpoint what the issues are, so we
have something concrete to discuss. I'm not at all interested in a hand
wavy discussion on the topic.



Hey all,

Indeed I attempted to convert nvme to use irq-poll (let's use its
new name) but experienced some unexplained performance degradations.

Keith reported a 700ns degradation for QD=1 with his Xpoint devices,
this sort of degradation is acceptable I guess because we do schedule
a soft-irq before consuming the completion, but I noticed ~10% IOPs
degradation for QD=32 which is not acceptable.

I agree with Jens that we'll need some analysis if we want the
discussion to be effective, and I can spend some time on this if I
can find volunteers with high-end nvme devices (I only have access
to client nvme devices).

I can add debugfs statistics on the average number of completions I
consume per interrupt, and I can also trace the interrupt and the soft-irq
start/end. Any other interesting stats I can add?

I also tried a hybrid mode where the first 4 completions were handled
in the interrupt and the rest in soft-irq but that didn't make much
of a difference.

Any other thoughts?


Re: RFC: 512e ZBC host-managed disks

2017-01-12 Thread Christoph Hellwig
On Thu, Jan 12, 2017 at 05:13:52PM +0900, Damien Le Moal wrote:
> (3) Any other idea ?

Do nothing and ignore the problem.  This whole idea is so braindead that
the person coming up with the T10 language should be shot.  Either a device
has 512 byte logical sectors or 4k but not this crazy mix.

And make sure no one ships such a piece of crap because we are sure as hell
not going to support it.


RFC: 512e ZBC host-managed disks

2017-01-12 Thread Damien Le Moal

Regular block devices are always accessible in units of logical block
sizes, regardless of the actual physical block size that the device has.
For hard disks, the common cases are:

512n: 512B logical and physical blocks
512e: 512B logical blocks and 4096B physical blocks
4Kn: 4096B logical and physical blocks

and sd.c in the kernel checks a request's 512B "sector" position and
size alignment against the disk's declared logical block size. All is fine
with this, nothing new.

However, for host-managed zoned block devices (ZBC), the 512e case
breaks this model: the standard allows for 512B logical block reads,
*but* writes MUST be aligned on 4KB boundaries within sequential zones
(still using the 512B logical block size addressing). This is a problem
for users of the disk, e.g. an FS, who may wrongly believe that writing
512B units is possible (and so that it can use 512B FS block size).
Host-aware devices do not have this restriction. Nor does the
restriction apply to writes in conventional zones of host-managed devices.

Summary: for HM 512e block devices, reads are 512e compliant, but writes
in sequential zones are 4Kn compliant.

I would like an opinion on if we should do something about this. I see
the following possible options:

(1) Do nothing and let the disk user deal with the write alignment
problem. It already has to do so anyway as writes must be sequential.
But this would force in-kernel users to go and look at the device
physical block size, which is not something usually done by layers above
the block layer (FS, device mappers etc).

(2) For 512e host-managed devices, always report to the block layer
(device queue) a larger logical block size of 4096B to allow disk
users to seamlessly adjust to the disk type without having to deal with
the physical sector size (see the hedged sketch after the options). I do
not think that this would actually require changing the
scsi_disk->sector_size field to that incorrect value, so command
addressing would not break. But I wonder if this may break a lot of
things because of the difference introduced.

(3) Any other idea ?
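
For concreteness, a rough sketch of what option (2) could amount to in sd --
sd_set_queue_block_sizes() and sd_is_zoned_hm_512e() are made-up names and
this is not a proposal-ready patch, only blk_queue_logical_block_size() and
blk_queue_physical_block_size() are the existing block layer helpers:

/* fragment in the style of drivers/scsi/sd.c */
static void sd_set_queue_block_sizes(struct scsi_disk *sdkp)
{
        struct request_queue *q = sdkp->disk->queue;

        if (sd_is_zoned_hm_512e(sdkp)) {
                /* advertise 4096B logical blocks so upper layers never
                 * generate sub-4K writes into sequential zones */
                blk_queue_logical_block_size(q, 4096);
        } else {
                blk_queue_logical_block_size(q, sdkp->device->sector_size);
        }
        blk_queue_physical_block_size(q, sdkp->physical_block_size);
        /* sdkp->device->sector_size itself stays at 512 so that command
         * addressing (READ/WRITE LBAs) is still built correctly */
}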

Best regards.

-- 
Damien Le Moal, Ph.D.
Sr. Manager, System Software Research Group,
Western Digital Corporation
damien.lem...@wdc.com
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa,
Kanagawa, 252-0888 Japan
www.wdc.com, www.hgst.com


Re: [PATCH 05/15] dm: remove incomple BLOCK_PC support

2017-01-12 Thread Christoph Hellwig
On Wed, Jan 11, 2017 at 08:09:37PM -0500, Mike Snitzer wrote:
> I'm not following your reasoning.
> 
> dm_blk_ioctl calls __blkdev_driver_ioctl and will call scsi_cmd_ioctl
> (sd_ioctl -> scsi_cmd_blk_ioctl -> scsi_cmd_ioctl) if DM's underlying
> block device is a scsi device.

Yes, it does.  But scsi_cmd_ioctl as called from sd_ioctl will
operate entirely on the SCSI request_queue - dm-mpath will never see
the BLOCK_PC request generated by it.


Re: RFC: split scsi passthrough fields out of struct request

2017-01-12 Thread Christoph Hellwig
On Wed, Jan 11, 2017 at 05:41:42PM -0500, Mike Snitzer wrote:
> I removed blk-mq on request_fn paths support because it was one of the
> permutations that I felt least useful/stable (see commit c5248f79f3 "dm:
> remove support for stacking dm-mq on .request_fn device(s)")
> 
> As for all of the different IO paths.  I've always liked the idea of
> blk-mq ruling the world.  With Jens' blk-mq IO scheduling advances maybe
> we're closer!

That removed blk-mq on top of request_fn code would do the right thing
for this series if it entirely replaced the old request_fn code in dm-mpath.
(as would the new bio code for that matter)