Re: [dm-devel] [PATCH] dm-region-hash: fix strange usage of mempool_alloc.

2017-04-23 Thread NeilBrown
On Fri, Apr 21 2017, Mikulas Patocka wrote:

> On Mon, 10 Apr 2017, NeilBrown wrote:
>
>> mempool_alloc() should only be called with GFP_ATOMIC when
>> it is not safe to wait. Passing __GFP_NOFAIL to kmalloc()
>> says that it is safe to wait indefinitely.  So this code is
>> inconsistent.
>> 
>> Clearly it is OK to wait indefinitely in this code, and
>> mempool_alloc() is able to do that.  So just use
>> mempool_alloc, and allow it to sleep.  If no memory is
>> available it will wait for something to be returned to the
>> pool, and will retry a normal allocation regularly.
>
> The region hash code is buggy anyway, because it may allocate more entries 
> than the size of the pool and not give them back.
>
> That kmalloc was introduced in the commit c06aad854 to work around a 
> deadlock due to incorrect mempool usage.
>
> Your patch slightly increases the probability of the deadlock because 
> mempool_alloc does all allocations with __GFP_NOMEMALLOC, so it uses 
> higher limit than kmalloc(GFP_NOIO).
>

Thanks for the review.

I had a look at how the allocation 'dm_region' objects are used,
and it would take a bit of work to make it really safe.
My guess is __rh_find() should be allowed to fail, and the various
callers need to handle failure.
For example, dm_rh_inc_pending() would be given a second bio_list,
and would move any bios for which rh_inc() fails, onto that list.
Then do_writes() would merge that list back into ms->writes.
That way do_mirror() would not block indefinitely and forward progress
could be assured ... maybe.
It would take more work than I'm able to give at the moment, so
I'm happy to just drop this patch.

Thanks,
NeilBrown


signature.asc
Description: PGP signature
--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

Re: [dm-devel] [PATCH v2 for-4.12] block: fix blk_integrity_register to use template's interval_exp if not 0

2017-04-23 Thread Jens Axboe
On 04/22/2017 03:22 PM, Mike Snitzer wrote:
> When registering an integrity profile: if the template's interval_exp is
> not 0 use it, otherwise use the ilog2() of logical block size of the
> provided gendisk.
> 
> This fixes a long-standing DM linear target bug where it cannot pass
> integrity data to the underlying device if its logical block size
> conflicts with the underlying device's logical block size.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Mikulas Patocka 
> Signed-off-by: Mike Snitzer 
> ---
>  block/blk-integrity.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Jens, please feel free to pick this up for 4.12 and I'll just make the
> dm-integrity commit Depends-on it once you've staged it.

OK, I'll commit it. I'll send out my pull request as soon as the merge
window opens, so you should not have any delays.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2 for-4.12] block: fix blk_integrity_register to use template's interval_exp if not 0

2017-04-23 Thread Martin K. Petersen

Mike,

> When registering an integrity profile: if the template's interval_exp is
> not 0 use it, otherwise use the ilog2() of logical block size of the
> provided gendisk.
>
> This fixes a long-standing DM linear target bug where it cannot pass
> integrity data to the underlying device if its logical block size
> conflicts with the underlying device's logical block size.

Looks good!

Acked-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH v2 for-4.12] block: fix blk_integrity_register to use template's interval_exp if not 0

2017-04-23 Thread Mike Snitzer
When registering an integrity profile: if the template's interval_exp is
not 0 use it, otherwise use the ilog2() of logical block size of the
provided gendisk.

This fixes a long-standing DM linear target bug where it cannot pass
integrity data to the underlying device if its logical block size
conflicts with the underlying device's logical block size.

Cc: sta...@vger.kernel.org
Reported-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
---
 block/blk-integrity.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Jens, please feel free to pick this up for 4.12 and I'll just make the
dm-integrity commit Depends-on it once you've staged it.

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 9f0ff5b..d6e1b9a 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -412,7 +412,8 @@ void blk_integrity_register(struct gendisk *disk, struct 
blk_integrity *template
 
bi->flags = BLK_INTEGRITY_VERIFY | BLK_INTEGRITY_GENERATE |
template->flags;
-   bi->interval_exp = ilog2(queue_logical_block_size(disk->queue));
+   bi->interval_exp = template->interval_exp ? :
+   ilog2(queue_logical_block_size(disk->queue));
bi->profile = template->profile ? template->profile : _profile;
bi->tuple_size = template->tuple_size;
bi->tag_size = template->tag_size;
-- 
2.10.1

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [resend PATCH v2 11/33] dm: add dax_device and dax_operations support

2017-04-23 Thread Mike Snitzer
On Thu, Apr 20 2017 at 12:30pm -0400,
Dan Williams  wrote:

> On Mon, Apr 17, 2017 at 12:09 PM, Dan Williams  
> wrote:
> > Allocate a dax_device to represent the capacity of a device-mapper
> > instance. Provide a ->direct_access() method via the new dax_operations
> > indirection that mirrors the functionality of the current direct_access
> > support via block_device_operations.  Once fs/dax.c has been converted
> > to use dax_operations the old dm_blk_direct_access() will be removed.
> >
> > A new helper dm_dax_get_live_target() is introduced to separate some of
> > the dm-specifics from the direct_access implementation.
> >
> > This enabling is only for the top-level dm representation to upper
> > layers. Converting target direct_access implementations is deferred to a
> > separate patch.
> >
> > Cc: Toshi Kani 
> > Cc: Mike Snitzer 
> 
> Hi Mike,
> 
> Any concerns with these dax_device and dax_operations changes to
> device-mapper for the upcoming merge window?

Sorry for the delay.

I just reviewed them, overall looks good.  The enabling functions in the
DAX code, that are mixed in with the DM changes, could maybe be factored
out into separate earlier patches but I don't feel that strongly about
that.

Feel free to add this tag to the handful of relevant DM patches:

Reviewed-by: Mike Snitzer 

I haven't done a merge with the linux-dm.git 'dm-4.12' branch but it'd
be good to verify there aren't any merge conflicts.  If there are then
it'd be nice to know going in to the merge so that we can forecast as
much to Linus.

I really appreciate you doing this work!

Thanks,
Mike

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH for-4.12] block, dm: use __blk_integrity_register to properly setup integrity profile

2017-04-23 Thread Mike Snitzer
On Fri, Apr 21 2017 at  6:43pm -0400,
Martin K. Petersen  wrote:

> 
> Mike/Mikulas,
> 
> > Introduce __blk_integrity_register() to allow users more precise control
> > over the logical block size used to register the integrity profile.
> > blk_integrity_register() now calls __blk_integrity_register().
> 
> This seems a bit kludgy. It would be cleaner having the integrity
> interval be explicitly specified in the integrity template for all
> callers. Maybe with a fallback to qlbs if it's set to 0.

Fair enough, I'll revise and send v2 shortly.  Thanks.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] block: get rid of blk_integrity_revalidate()

2017-04-23 Thread Dan Williams
On Wed, Apr 19, 2017 at 7:04 PM, Martin K. Petersen
 wrote:
> Ilya Dryomov  writes:
>
> Ilya,
>
>> Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
>> introduced blk_integrity_revalidate(), which seems to assume ownership
>> of the stable pages flag and unilaterally clears it if no blk_integrity
>> profile is registered:
>>
>> if (bi->profile)
>> disk->queue->backing_dev_info->capabilities |=
>> BDI_CAP_STABLE_WRITES;
>> else
>> disk->queue->backing_dev_info->capabilities &=
>> ~BDI_CAP_STABLE_WRITES;
>>
>> It's called from revalidate_disk() and rescan_partitions(), making it
>> impossible to enable stable pages for drivers that support partitions
>> and don't use blk_integrity: while the call in revalidate_disk() can be
>> trivially worked around (see zram, which doesn't support partitions and
>> hence gets away with zram_revalidate_disk()), rescan_partitions() can
>> be triggered from userspace at any time.  This breaks rbd, where the
>> ceph messenger is responsible for generating/verifying CRCs.
>>
>> Since blk_integrity_{un,}register() "must" be used for (un)registering
>> the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
>> setting there.  This way drivers that call blk_integrity_register() and
>> use integrity infrastructure won't interfere with drivers that don't
>> but still want stable pages.
>
> I seem to recall that the reason for the revalidate hook was that either
> NVMe or nvdimm had to register an integrity profile prior to the actual
> format being known.
>
> So while I am OK with the change from a SCSI perspective, I think we
> need Keith and Dan to ack it.

Looks good to me,

Tested-by: Dan Williams 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [resend PATCH v2 00/33] dax: introduce dax_operations

2017-04-23 Thread Dan Williams
[ adding akpm, sfr, and jens ]

I applied this series and pushed it out for the nvdimm.git branch that
gets auto pulled into -next. The set is still awaiting acks from
device-mapper, ext4, xfs, and vfs (for the copy_from_iter_ops, patch
29/33). If those come next week perhaps this can be merged for 4.12,
but if not this will need to wait until 4.13.

There are some minor collisions with Al's copy_from_user rework, the
new dax tracepoints, and the removal of discard support from the brd
driver. A sample merge is available here:

https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=libnvdimm-for-4.12-merge

If it causes any other problems just drop and I'll retry for 4.13.

On Mon, Apr 17, 2017 at 12:08 PM, Dan Williams  wrote:
> [ resend to add dm-devel, linux-block, and fs-devel, apologies for the
> duplicates ]
>
> Changes since v1 [1] and the dax-fs RFC [2]:
> * rename struct dax_inode to struct dax_device (Christoph)
> * rewrite arch_memcpy_to_pmem() in C with inline asm
> * use QUEUE_FLAG_WC to gate dax cache management (Jeff)
> * add device-mapper plumbing for the ->copy_from_iter() and ->flush()
>   dax_operations
> * kill struct blk_dax_ctl and bdev_direct_access (Christoph)
> * cleanup the ->direct_access() calling convention to be page based
>   (Christoph)
> * introduce dax_get_by_host() and don't pollute struct super_block with
>   dax_device details (Christoph)
>
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008586.html
> [2]: https://lwn.net/Articles/713064/
>
> ---
> A few months back, in the course of reviewing the memcpy_nocache()
> proposal from Brian, Linus proposed that the pmem specific
> memcpy_to_pmem() routine be moved to be implemented at the driver level
> [3]:
>
>"Quite frankly, the whole 'memcpy_nocache()' idea or (ab-)using
> copy_user_nocache() just needs to die. It's idiotic.
>
> As you point out, it's also fundamentally buggy crap.
>
> Throw it away. There is no possible way this is ever valid or
> portable. We're not going to lie and claim that it is.
>
> If some driver ends up using 'movnt' by hand, that is up to that
> *driver*. But no way in hell should we care about this one whit in
> the sense of ."
>
> This feedback also dovetails with another fs/dax.c design wart of being
> hard coded to assume the backing device is pmem. We call the pmem
> specific copy, clear, and flush routines even if the backing device
> driver is one of the other 3 dax drivers (axonram, dccssblk, or brd).
> There is no reason to spend cpu cycles flushing the cache after writing
> to brd, for example, since it is using volatile memory for storage.
>
> Moreover, the pmem driver might be fronting a volatile memory range
> published by the ACPI NFIT, or the platform might have arranged to flush
> cpu caches on power fail. This latter capability is a feature that has
> appeared in embedded storage appliances (pre-ACPI-NFIT nvdimm
> platforms).
>
> So, this series:
>
> 1/ moves what was previously named "the pmem api" out of the global
>namespace and into drivers that need to be concerned with
>architecture specific persistent memory considerations.
>
> 2/ arranges for dax to stop abusing __copy_user_nocache() and implements
>a libnvdimm-local memcpy that uses 'movnt' on x86_64. This might be
>expanded in the future to use 'movntdqa' if the copy size is above
>some threshold, or expanded with support for other architectures [4].
>
> 3/ makes cache maintenance optional by arranging for dax to call driver
>specific copy and flush operations only if the driver publishes them.
>
> 4/ allows filesytem-dax cache management to be controlled by the block
>device write-cache queue flag. The pmem driver is updated to clear
>that flag by default when pmem is driving volatile memory.
>
> [3]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
> [4]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009478.html
>
> These patches have been through a round of build regression fixes
> notified by the 0day robot. All review welcome, but the patches that
> need extra attention are the device-mapper and uio changes
> (copy_from_iter_ops).
>
> This series is based on a merge of char-misc-next (for cdev api reworks)
> and libnvdimm-fixes (dax locking and __copy_user_nocache fixes).
>
> ---
>
> Dan Williams (33):
>   device-dax: rename 'dax_dev' to 'dev_dax'
>   dax: refactor dax-fs into a generic provider of 'struct dax_device' 
> instances
>   dax: add a facility to lookup a dax device by 'host' device name
>   dax: introduce dax_operations
>   pmem: add dax_operations support
>   axon_ram: add dax_operations support
>   brd: add dax_operations support
>   dcssblk: add dax_operations support
>   block: kill bdev_dax_capable()
>   dax: introduce dax_direct_access()
>   dm: add dax_device and dax_operations support
>   dm: 

Re: [dm-devel] [PATCH for-4.12] block, dm: use __blk_integrity_register to properly setup integrity profile

2017-04-23 Thread Martin K. Petersen

Mike/Mikulas,

> Introduce __blk_integrity_register() to allow users more precise control
> over the logical block size used to register the integrity profile.
> blk_integrity_register() now calls __blk_integrity_register().

This seems a bit kludgy. It would be cleaner having the integrity
interval be explicitly specified in the integrity template for all
callers. Maybe with a fallback to qlbs if it's set to 0.

-- 
Martin K. Petersen  Oracle Linux Engineering

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] dm-mpath: Requeue after a small delay if blk_get_request() fails

2017-04-23 Thread Bart Van Assche
On Fri, 2017-04-07 at 16:50 -0700, Bart Van Assche wrote:
> If blk_get_request() returns ENODEV then multipath_clone_and_map()
> causes a request to be requeued immediately. This can cause a
> kworker thread to spend 100% of the CPU time of a single core in
> __blk_mq_run_hw_queue() and also can cause device removal to
> never finish.
> 
> Avoid this by only requeuing after a delay if blk_get_request()
> fails. Additionally, reduce the requeue delay.

Hello Mike,

The v4.12 merge window will open soon. Have you already had the
time to review this patch?

Thanks,

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] block: get rid of blk_integrity_revalidate()

2017-04-23 Thread Jens Axboe
On 04/18/2017 10:43 AM, Ilya Dryomov wrote:
> Commit 25520d55cdb6 ("block: Inline blk_integrity in struct gendisk")
> introduced blk_integrity_revalidate(), which seems to assume ownership
> of the stable pages flag and unilaterally clears it if no blk_integrity
> profile is registered:
> 
> if (bi->profile)
> disk->queue->backing_dev_info->capabilities |=
> BDI_CAP_STABLE_WRITES;
> else
> disk->queue->backing_dev_info->capabilities &=
> ~BDI_CAP_STABLE_WRITES;
> 
> It's called from revalidate_disk() and rescan_partitions(), making it
> impossible to enable stable pages for drivers that support partitions
> and don't use blk_integrity: while the call in revalidate_disk() can be
> trivially worked around (see zram, which doesn't support partitions and
> hence gets away with zram_revalidate_disk()), rescan_partitions() can
> be triggered from userspace at any time.  This breaks rbd, where the
> ceph messenger is responsible for generating/verifying CRCs.
> 
> Since blk_integrity_{un,}register() "must" be used for (un)registering
> the integrity profile with the block layer, move BDI_CAP_STABLE_WRITES
> setting there.  This way drivers that call blk_integrity_register() and
> use integrity infrastructure won't interfere with drivers that don't
> but still want stable pages.

Applied, thanks.

-- 
Jens Axboe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [PATCH for-4.12] block, dm: use __blk_integrity_register to properly setup integrity profile

2017-04-23 Thread Mike Snitzer
From: Mikulas Patocka 

When registering integrity profile on a DM device, the logical block
size of the underlying device with integrity support must be used, not
the logical block size of the stacked device we are creating.

Introduce __blk_integrity_register() to allow users more precise control
over the logical block size used to register the integrity profile.
blk_integrity_register() now calls __blk_integrity_register().

DM core's use of __blk_integrity_register() fixes a long-standing DM
linear target bug where it cannot pass integrity data to the underlying
device if its logical block size conflicts with the underlying device's
logical block size.

Cc: sta...@vger.kernel.org
Signed-off-by: Mikulas Patocka 
Signed-off-by: Mike Snitzer 
---
 block/blk-integrity.c  | 14 +++---
 drivers/md/dm-table.c  |  6 --
 include/linux/blkdev.h |  2 ++
 3 files changed, 17 insertions(+), 5 deletions(-)

NOTE: Jens, I'd like to get this staged in linux-dm.git for 4.12
because the dm-inegrity target depends on it, doing so would avoid
some awkward coordination between block and DM trees.  SO if you're OK
with this patch, and with me sending it to Linus as part of the DM
pull, please provide your Ack. Thanks!

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 9f0ff5b..d2a36e3 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -396,9 +396,10 @@ static struct blk_integrity_profile nop_profile = {
 };
 
 /**
- * blk_integrity_register - Register a gendisk as being integrity-capable
+ * __blk_integrity_register - Register a gendisk as being integrity-capable
  * @disk:  struct gendisk pointer to make integrity-aware
  * @template:  block integrity profile to register
+ * @logical_block_size:block size - integrity tag exists for each block
  *
  * Description: When a device needs to advertise itself as being able to
  * send/receive integrity metadata it must use this function to register
@@ -406,19 +407,26 @@ static struct blk_integrity_profile nop_profile = {
  * struct with values appropriate for the underlying hardware. See
  * Documentation/block/data-integrity.txt.
  */
-void blk_integrity_register(struct gendisk *disk, struct blk_integrity 
*template)
+void __blk_integrity_register(struct gendisk *disk, struct blk_integrity 
*template,
+ unsigned logical_block_size)
 {
struct blk_integrity *bi = >queue->integrity;
 
bi->flags = BLK_INTEGRITY_VERIFY | BLK_INTEGRITY_GENERATE |
template->flags;
-   bi->interval_exp = ilog2(queue_logical_block_size(disk->queue));
+   bi->interval_exp = ilog2(logical_block_size);
bi->profile = template->profile ? template->profile : _profile;
bi->tuple_size = template->tuple_size;
bi->tag_size = template->tag_size;
 
blk_integrity_revalidate(disk);
 }
+EXPORT_SYMBOL(__blk_integrity_register);
+
+void blk_integrity_register(struct gendisk *disk, struct blk_integrity 
*template)
+{
+   __blk_integrity_register(disk, template, 
queue_logical_block_size(disk->queue));
+}
 EXPORT_SYMBOL(blk_integrity_register);
 
 /**
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index b060084..b7a5843 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1181,13 +1181,15 @@ static int dm_table_register_integrity(struct dm_table 
*t)
return 0;
 
if (!integrity_profile_exists(dm_disk(md))) {
+   struct blk_integrity *integrity = 
blk_get_integrity(template_disk);
+
t->integrity_supported = true;
/*
 * Register integrity profile during table load; we can do
 * this because the final profile must match during resume.
 */
-   blk_integrity_register(dm_disk(md),
-  blk_get_integrity(template_disk));
+   __blk_integrity_register(dm_disk(md), integrity,
+1U << integrity->interval_exp);
return 0;
}
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 796016e..353400c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1778,6 +1778,8 @@ struct blk_integrity_profile {
const char  *name;
 };
 
+extern void __blk_integrity_register(struct gendisk *, struct blk_integrity *,
+unsigned);
 extern void blk_integrity_register(struct gendisk *, struct blk_integrity *);
 extern void blk_integrity_unregister(struct gendisk *);
 extern int blk_integrity_compare(struct gendisk *, struct gendisk *);
-- 
2.10.1

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel