Re: [dm-devel] [RFC PATCH] dm-zoned: extend the way of exposing zoned block device

2020-01-07 Thread Damien Le Moal
On 2020/01/08 16:13, Bob Liu wrote:
> From: Bob Liu 
> 
> Motivation:
> Currently the dm-zoned device mapper target exposes a zoned block device
> (ZBC) as a regular block device by storing metadata and buffering random
> writes in conventional zones.
> This approach is not very flexible: there must be enough conventional
> zones, and performance may be constrained.
> By putting the metadata (and eventually the random write buffer) on a
> separate device, we gain flexibility and potential performance
> improvements, e.g. by storing the metadata on a faster device such as
> persistent memory.
> 
> This patch splits the dm-zoned metadata out to an extra block device
> instead of keeping it on the zoned block device itself.
> (Buffering random writes on the separate device is also on the todo
> list.)
> 
> The patch is at a very early stage; I just want to get some feedback
> about this extension.
> Another option would be to create a new md-zoned device with a separate
> metadata device, based on the md framework.

For metadata only, it should not be hard at all to move to another
conventional zone device. It will however be a little more tricky for
conventional zones used for data since dm-zoned assumes that this random
write space is also zoned. Moving this space to a conventional device
requires implementing a zone emulation (fake zones) for the regular
drive, using a zone size that matches the size of sequential zones.

Beyond this, dm-zoned also needs to be changed to accept partial drives
and the dm core code to accept mixing of regular and zoned disks (that
is forbidden now).

Another approach worth exploring is stacking dm-zoned as is on top of a
modified dm-linear with the ability to emulate conventional zones on top
of a regular block device (you only need report zones method
implemented). That is much easier to do. We actually hacked something
like that last month to see the performance change and saw a jump from
56 MB/s to 250 MB/s for write intensive workloads (file creation on
ext4). I am not so keen on this approach though, as it modifies dm-linear
and we want to keep it very lean and simple. Modifying dm-zoned to allow
support for a device pair is a better approach I think.
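
[Illustration of the zone emulation idea above -- hypothetical helper,
not code from the thread; the real dm report_zones hook has a different,
kernel-version-dependent signature. Every zone of the regular device is
reported as a conventional zone whose size matches the zoned drive's
sequential zone size:]

#include <linux/kernel.h>
#include <linux/blkdev.h>
#include <linux/blkzoned.h>

static int emu_fill_conv_zones(sector_t dev_capacity, sector_t zone_sectors,
			       sector_t start, struct blk_zone *zones,
			       unsigned int *nr_zones)
{
	unsigned int i = 0;

	/* Align down to a zone boundary (zone size must be a power of
	 * two, as required for zoned block devices). */
	start -= start & (zone_sectors - 1);

	while (i < *nr_zones && start < dev_capacity) {
		zones[i].start = start;
		zones[i].len = min_t(sector_t, zone_sectors,
				     dev_capacity - start);
		/* Conventional zones have no meaningful write pointer. */
		zones[i].wp = start + zones[i].len;
		zones[i].type = BLK_ZONE_TYPE_CONVENTIONAL;
		zones[i].cond = BLK_ZONE_COND_NOT_WP;
		start += zone_sectors;
		i++;
	}
	*nr_zones = i;	/* number of fake zones reported */
	return 0;
}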

> 
> Signed-off-by: Bob Liu 
> ---
>  drivers/md/dm-zoned-metadata.c |  6 +++---
>  drivers/md/dm-zoned-target.c   | 15 +--
>  drivers/md/dm-zoned.h  |  1 +
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
> index 22b3cb0..89cd554 100644
> --- a/drivers/md/dm-zoned-metadata.c
> +++ b/drivers/md/dm-zoned-metadata.c
> @@ -439,7 +439,7 @@ static struct dmz_mblock *dmz_get_mblock_slow(struct dmz_metadata *zmd,
>  
>   /* Submit read BIO */
>   bio->bi_iter.bi_sector = dmz_blk2sect(block);
> - bio_set_dev(bio, zmd->dev->bdev);
> + bio_set_dev(bio, zmd->dev->meta_bdev);
>   bio->bi_private = mblk;
>   bio->bi_end_io = dmz_mblock_bio_end_io;
>   bio_set_op_attrs(bio, REQ_OP_READ, REQ_META | REQ_PRIO);
> @@ -593,7 +593,7 @@ static int dmz_write_mblock(struct dmz_metadata *zmd, struct dmz_mblock *mblk,
>   set_bit(DMZ_META_WRITING, &mblk->state);
>  
>   bio->bi_iter.bi_sector = dmz_blk2sect(block);
> - bio_set_dev(bio, zmd->dev->bdev);
> + bio_set_dev(bio, zmd->dev->meta_bdev);
>   bio->bi_private = mblk;
>   bio->bi_end_io = dmz_mblock_bio_end_io;
>   bio_set_op_attrs(bio, REQ_OP_WRITE, REQ_META | REQ_PRIO);
> @@ -620,7 +620,7 @@ static int dmz_rdwr_block(struct dmz_metadata *zmd, int op, sector_t block,
>   return -ENOMEM;
>  
>   bio->bi_iter.bi_sector = dmz_blk2sect(block);
> - bio_set_dev(bio, zmd->dev->bdev);
> + bio_set_dev(bio, zmd->dev->meta_bdev);
>   bio_set_op_attrs(bio, op, REQ_SYNC | REQ_META | REQ_PRIO);
>   bio_add_page(bio, page, DMZ_BLOCK_SIZE, 0);
>   ret = submit_bio_wait(bio);
> diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
> index 70a1063..55c64fe 100644
> --- a/drivers/md/dm-zoned-target.c
> +++ b/drivers/md/dm-zoned-target.c
> @@ -39,6 +39,7 @@ struct dm_chunk_work {
>   */
>  struct dmz_target {
>   struct dm_dev   *ddev;
> + struct dm_dev   *metadata_dev;

This naming would be confusing as it implies metadata only. If you also
want to move the random write zones to a regular device, then I would
suggest names like:

ddev -> zoned_bdev (the zoned device, e.g. SMR disk)
metadata_dev -> reg_bdev (regular block device, e.g. an SSD)

The metadata+random write (fake) zones space can use the regular block
device, and all sequential zones are assumed to be on the zoned device.
Care must be taken to handle the case of a zoned device that has
conventional zones: these could be used as is, not needing any reclaim,
so potentially contributing to further optimization.
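
[Sketch of the suggested two-device layout -- names per the suggestion
above, purely illustrative, not the posted patch:]

struct dmz_target {
	/* Zoned device (e.g. SMR disk): all sequential zones. */
	struct dm_dev		*zoned_bdev;

	/* Regular device (e.g. SSD): metadata plus emulated conventional
	 * zones used to buffer random writes. */
	struct dm_dev		*reg_bdev;

	unsigned long		flags;
	/* ... */
};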

>  
>   unsigned long   flags;
>  
> @@ -701,6 +702,7 @@ static int dmz_get_zoned_device(struct dm_target *ti, char *path)
>   goto err;
>   }
>  
> +

Re: [dm-devel] [lvm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178

2020-01-07 Thread Eric Wheeler
On Tue, 7 Jan 2020, Joe Thornber wrote:

> On Tue, Jan 07, 2020 at 10:46:27AM +, Joe Thornber wrote:
> > I'll get a patch to you later today.
> 
> Eric,
> 
> Patch below.  I've run it through a bunch of tests in the dm test suite.  But
> obviously I have never hit your issue.  Will do more testing today.


Thank you Joe, I'll apply the patch and pull out the spinlock.  

I'm not familiar with how sync IO prevents a spinlock.  Can you give a
brief explanation or point me at documentation?
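
[Illustration of the constraint being asked about -- hypothetical lock
and function, not code from the thread. sm_ll_find_free_block() can
issue synchronous metadata reads that sleep in submit_bio_wait(), and
sleeping is forbidden in the atomic context a spinlock creates:]

static DEFINE_SPINLOCK(alloc_lock);		/* hypothetical */

static int broken_alloc(struct sm_disk *smd, dm_block_t *b)
{
	int r;

	spin_lock(&alloc_lock);			/* atomic context begins */
	r = sm_ll_find_free_block(&smd->old_ll, smd->begin,
				  smd->old_ll.nr_blocks, b);
						/* may sleep in
						 * submit_bio_wait():
						 * "BUG: scheduling while
						 * atomic", possible
						 * deadlock */
	spin_unlock(&alloc_lock);
	return r;
}

[This is why dm_pool_alloc_data_block() takes a (sleepable) rw semaphore
instead, as Joe notes below.]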

--
Eric Wheeler


Re: [dm-devel] Reply: [PATCH v3] dm verity: don't prefetch hash blocks for already-verified data

2020-01-07 Thread Mike Snitzer
Please fix your mail client so that when you reply to a message you
don't start a new thread.





Re: [dm-devel] [lvm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178

2020-01-07 Thread Joe Thornber
On Tue, Jan 07, 2020 at 10:46:27AM +, Joe Thornber wrote:
> I'll get a patch to you later today.

Eric,

Patch below.  I've run it through a bunch of tests in the dm test suite.  But
obviously I have never hit your issue.  Will do more testing today.

- Joe



Author: Joe Thornber 
Date:   Tue Jan 7 11:58:42 2020 +

[dm-thin, dm-cache] Fix bug in space-maps.

The space-maps track the reference counts for disk blocks.  There are
variants for tracking metadata blocks, and data blocks.

We implement transactionality by never touching blocks from the previous
transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has
reference count of 0) in both the current and previous transaction.
Prior to this patch we were doing this by searching for a free block in
the previous transaction, and relying on a 'begin' counter to track
where the last allocation in the current transaction was.  This 'begin'
field was not being updated in all code paths (eg, increment of a data
block reference count due to breaking sharing of a neighbour block in
the same btree leaf).

This patch keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead we search the current transaction for a free block,
and then double check it's free in the old transaction.  Much simpler.

diff --git a/drivers/md/persistent-data/dm-space-map-common.c b/drivers/md/persistent-data/dm-space-map-common.c
index bd68f6fef694..b4983e4022e6 100644
--- a/drivers/md/persistent-data/dm-space-map-common.c
+++ b/drivers/md/persistent-data/dm-space-map-common.c
@@ -380,6 +380,34 @@ int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
return -ENOSPC;
 }
 
+int sm_ll_find_common_free_block(struct ll_disk *old_ll, struct ll_disk *new_ll,
+                                dm_block_t begin, dm_block_t end, dm_block_t *b)
+{
+   int r;
+   uint32_t count;
+
+   do {
+   r = sm_ll_find_free_block(new_ll, begin, new_ll->nr_blocks, b);
+   if (r)
+   break;
+
+   /* double check this block wasn't used in the old transaction */
+   if (*b >= old_ll->nr_blocks)
+   count = 0;
+
+   else {
+   r = sm_ll_lookup(old_ll, *b, &count);
+   if (r)
+   break;
+
+   if (count)
+   begin = *b + 1;
+   }
+   } while (count);
+
+   return r;
+}
+
 static int sm_ll_mutate(struct ll_disk *ll, dm_block_t b,
int (*mutator)(void *context, uint32_t old, uint32_t *new),
void *context, enum allocation_event *ev)
diff --git a/drivers/md/persistent-data/dm-space-map-common.h b/drivers/md/persistent-data/dm-space-map-common.h
index b3078d5eda0c..8de63ce39bdd 100644
--- a/drivers/md/persistent-data/dm-space-map-common.h
+++ b/drivers/md/persistent-data/dm-space-map-common.h
@@ -109,6 +109,8 @@ int sm_ll_lookup_bitmap(struct ll_disk *ll, dm_block_t b, uint32_t *result);
 int sm_ll_lookup(struct ll_disk *ll, dm_block_t b, uint32_t *result);
 int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
  dm_block_t end, dm_block_t *result);
+int sm_ll_find_common_free_block(struct ll_disk *old_ll, struct ll_disk *new_ll,
+                                dm_block_t begin, dm_block_t end, dm_block_t *result);
 int sm_ll_insert(struct ll_disk *ll, dm_block_t b, uint32_t ref_count, enum allocation_event *ev);
 int sm_ll_inc(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
 int sm_ll_dec(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
diff --git a/drivers/md/persistent-data/dm-space-map-disk.c b/drivers/md/persistent-data/dm-space-map-disk.c
index 32adf6b4a9c7..bf4c5e2ccb6f 100644
--- a/drivers/md/persistent-data/dm-space-map-disk.c
+++ b/drivers/md/persistent-data/dm-space-map-disk.c
@@ -167,8 +167,10 @@ static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
enum allocation_event ev;
struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
 
-   /* FIXME: we should loop round a couple of times */
-   r = sm_ll_find_free_block(&smd->old_ll, smd->begin, smd->old_ll.nr_blocks, b);
+   /*
+    * Any block we allocate has to be free in both the old and current ll.
+    */
+   r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll, smd->begin, smd->ll.nr_blocks, b);
if (r)
return r;
 
diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c
index 25328582cc48..9e3c64ec2026 100644
--- a/drivers/md/persistent-data/dm-space-map-metadata.c
+++ b/drivers/md/persistent-data/dm-space-map-metadata.c
@@ -448,7 +448,10 @@ static int 

Re: [dm-devel] [lvm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178

2020-01-07 Thread Joe Thornber
On Tue, Jan 07, 2020 at 10:35:46AM +, Joe Thornber wrote:
> On Sat, Dec 28, 2019 at 02:13:07AM +, Eric Wheeler wrote:
> > On Fri, 27 Dec 2019, Eric Wheeler wrote:
> 
> > > Just hit the bug again without mq-scsi (scsi_mod.use_blk_mq=n), all other 
> > > times MQ has been turned on. 
> > > 
> > > I'm trying the -ENOSPC hack which will flag the pool as being out of
> > > space so I can recover more gracefully than a BUG_ON. Here's a
> > > first-draft patch, maybe the spinlock will even prevent the issue.
> > > patch, maybe the spinlock will even prevent the issue.
> > > 
> > > Compile tested, I'll try on a real system tomorrow.
> > > 
> > > Comments welcome:
> 
> Both sm_ll_find_free_block() and sm_ll_inc() can trigger synchronous IO.
> So you absolutely cannot use a spin lock.
> 
> dm_pool_alloc_data_block() holds a big rw semaphore which should
> prevent anything else trying to allocate at the same time.

I suspect the problem is to do with the way we search for the new block
in the space map for the previous transaction (sm->old_ll), and then
increment in the current transaction (sm->ll).

We keep old_ll around so we can ensure we never (re)allocate a block
that's used in the previous transaction.  This gives us our crash
resistance, since if anything goes wrong we effectively rollback to the
previous transaction.

What I think we should be doing is running find_free on the old_ll,
then double checking it's not actually used in the current transaction.
ATM we're relying on smd->begin being set properly everywhere, and I
suspect this isn't the case.  A quick look shows sm_disk_inc_block()
doesn't adjust it.  sm_disk_inc_block() can be called when breaking
sharing of a neighbouring entry in a leaf btree node ... and we know
you use snapshots very heavily.
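
[Condensed illustration of the suspected failure -- paraphrasing the
analysis above, not the actual kernel code:]

static int old_alloc_path(struct sm_disk *smd, dm_block_t *b)
{
	/*
	 * Search only the PREVIOUS transaction's space map, trusting
	 * smd->begin to step over everything already allocated in the
	 * CURRENT transaction...
	 */
	return sm_ll_find_free_block(&smd->old_ll, smd->begin,
				     smd->old_ll.nr_blocks, b);
	/*
	 * ...but sm_disk_inc_block() can allocate in the current
	 * transaction without advancing smd->begin, so *b may already
	 * be in use -- a double allocation, which is what later trips
	 * the BUG_ON in dm-space-map-disk.c:178.
	 */
}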

I'll get a patch to you later today.

- Joe




Re: [dm-devel] dm mpath: Add timeout mechanism for queue_if_no_path

2020-01-07 Thread Khazhismel Kumykov
On Mon, Jan 6, 2020 at 11:28 AM Mike Snitzer  wrote:
>
> On Thu, Jan 02 2020 at  5:45pm -0500,
> Gabriel Krisman Bertazi  wrote:
>
> > From: Anatol Pomazau 
> >
> > Add a configurable timeout mechanism to disable queue_if_no_path without
> > assistance from multipathd.  In reality, this reimplements the
> > no_path_retry mechanism from multipathd in kernel space, which is
> > interesting for cases where we cannot rely on a daemon being present all
> > the time, in case of failure or to reduce the guest footprint of cloud
> > services.
> >
> > Despite replicating the policy configuration in kernel space, this is
> > an important safeguard that prevents IOs from hanging forever while
> > waiting for userspace to behave correctly.
> >
> > Co-developed-by: Frank Mayhar 
> > Signed-off-by: Frank Mayhar 
> > Co-developed-by: Bharath Ravi 
> > Signed-off-by: Bharath Ravi 
> > Co-developed-by: Khazhismel Kumykov 
> > Signed-off-by: Khazhismel Kumykov 
> > Signed-off-by: Anatol Pomazau 
> > Co-developed-by: Gabriel Krisman Bertazi 
> > Signed-off-by: Gabriel Krisman Bertazi 
>
> This seems like a slippery slope.
>
> I've heard this line of dm-multipath concern from Google before
> (e.g. doesn't want to rely on availability of userspace component).
>
> Thing is, to properly use dm-multipath (e.g. to reinstate failed paths)
> multipathd really is needed no?
Yeah, in order to reinstate the failed paths, or any full recovery, we
do need the user space component to be running, and this doesn't aim
to change the responsibilities here. Rather, we're aiming to avoid
in-kernel hangs/processes lingering indefinitely in unkillable sleep
due to a buggy/unavailable/down userspace multipath daemon.
>
> If not, how is it that the proposed patch is all that is missing when
> multipathd isn't running?  I find that hard to appreciate.
>
> So I'm inclined to not accept this type of change.  But especially not
> until more comprehensive context is given for how Google is using
> dm-multipath without multipathd.

This patch isn't aimed at enabling dm-multipath without a userspace
multipath daemon, rather to avoid processes hanging indefinitely in
the event the daemon is unable to proceed (for whatever reason).
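
[General shape of such a kernel-side timeout -- hypothetical structure,
field, and helper names, not the posted patch:]

#include <linux/timer.h>
#include <linux/jiffies.h>

struct mpath_sketch {
	struct timer_list nopath_timer;		/* set up with timer_setup() */
	unsigned long nopath_timeout_secs;	/* table parameter, 0 = off */
};

static void disable_queueing(struct mpath_sketch *m);	/* placeholder */

/* Fires if no usable path has come back before the timeout expires. */
static void nopath_timeout_fn(struct timer_list *t)
{
	struct mpath_sketch *m = from_timer(m, t, nopath_timer);

	/* The equivalent of switching off queue_if_no_path: stop queueing
	 * and fail the held-back IOs, so processes stop sleeping
	 * unkillably waiting for a daemon that may never act. */
	disable_queueing(m);
}

/* Armed whenever the last usable path is lost while queue_if_no_path
 * is set. */
static void arm_nopath_timeout(struct mpath_sketch *m)
{
	if (m->nopath_timeout_secs)
		mod_timer(&m->nopath_timer,
			  jiffies + m->nopath_timeout_secs * HZ);
}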

>
> Thanks,
> Mike
>



[dm-devel] Reply: [PATCH v3] dm verity: don't prefetch hash blocks for already-verified data

2020-01-07 Thread 周先荣
Hey Eric:

Thank you very much. I am quite happy with your improvements and that
you sent this patch.

Subject: [PATCH v3] dm verity: don't prefetch hash blocks for already-verified data

From: "xianrong.zhou" 

Try to skip prefetching hash blocks that won't be needed due to the 
"check_at_most_once" option being enabled and the corresponding data blocks 
already having been verified.

Since prefetching operates on a range of data blocks, do this by just trimming 
the two ends of the range.  This doesn't skip every unneeded hash block, since 
data blocks in the middle of the range could also be unneeded, and hash blocks 
are still prefetched in large clusters as controlled by 
dm_verity_prefetch_cluster.  But it can still help a lot.

In a test on Android Q launching 91 apps every 15s repeated 21 times, 
prefetching was only done for 447177/4776629 = 9.36% of data blocks.

Tested-by: ruxian.feng 
Co-developed-by: yuanjiong.gao 
Signed-off-by: yuanjiong.gao 
Signed-off-by: xianrong.zhou 
[EB: simplified the 'while' loops and improved the commit message]
Signed-off-by: Eric Biggers 
---
 drivers/md/dm-verity-target.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-verity-target.c b/drivers/md/dm-verity-target.c 
index 4fb33e7562c5..0d61e9c67986 100644
--- a/drivers/md/dm-verity-target.c
+++ b/drivers/md/dm-verity-target.c
@@ -611,8 +611,22 @@ static void verity_prefetch_io(struct work_struct *work)
 
 static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io)
 {
+   sector_t block = io->block;
+   unsigned int n_blocks = io->n_blocks;
struct dm_verity_prefetch_work *pw;
 
+   if (v->validated_blocks) {
+   while (n_blocks && test_bit(block, v->validated_blocks)) {
+   block++;
+   n_blocks--;
+   }
+   while (n_blocks && test_bit(block + n_blocks - 1,
+   v->validated_blocks))
+   n_blocks--;
+   if (!n_blocks)
+   return;
+   }
+
pw = kmalloc(sizeof(struct dm_verity_prefetch_work),
GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
@@ -621,8 +635,8 @@ static void verity_submit_prefetch(struct dm_verity *v, struct dm_verity_io *io)
 
INIT_WORK(&pw->work, verity_prefetch_io);
pw->v = v;
-   pw->block = io->block;
-   pw->n_blocks = io->n_blocks;
+   pw->block = block;
+   pw->n_blocks = n_blocks;
queue_work(v->verify_wq, &pw->work);
 }
 
--
2.24.1

