Re: [dm-devel] [PATCH RFC] dm thin: Add support for online trim to dm-thinpool

2023-10-09 Thread Joe Thornber
On Sat, Oct 7, 2023 at 2:33 AM Sarthak Kukreti  wrote:

> Currently, dm-thinpool only supports offline trim: there is
> a userspace tool called `thin_trim` (part of `thin-provisioning-tools`),
> that will look up all the unmapped regions within the thinpool
> and issue discards to these regions. However, this can take up a lot of
> time, especially on larger storage devices.
>

I think this will work.  Give me a couple of days to do some testing, then
I'll get back to you.
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Joe Thornber
On Tue, May 30, 2023 at 3:02 PM Mike Snitzer  wrote:

>
> Also Joe, for your proposed dm-thinp design where you distinguish
> between "provision" and "reserve": Would it make sense for REQ_META
> (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> more freedom to just reserve the length of blocks? (e.g. for XFS
> delalloc where LBA range is unknown, but dm-thinp can be asked to
> reserve space to accommodate it).
>

My proposal only involves 'reserve'.  Provisioning will be done as part of
the usual io path.

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-30 Thread Joe Thornber
On Sat, May 27, 2023 at 12:45 AM Dave Chinner  wrote:

> On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
>
> > 1) We have an api (ioctl, bio flag, whatever) that lets you
> > reserve/guarantee a region:
> >
> >   int reserve_region(dev, sector_t begin, sector_t end);
>
> A C-based interface is not sufficient because the layer that must do
> provisioning is not guaranteed to be directly under the filesystem.
> We must be able to propagate the request down to the layers that
> need to provision storage, and that includes hardware devices.
>
> e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
> allocates in its backing device to guarantee that the provisioned
> LBA range it allocates is also fully provisioned by the storage
> below it.
>

Fine, bio flag it is.


>
> >   This api should be used minimally, eg, critical FS metadata only.
>
>
>
> Plan for having to support tens of GBs of provisioned space in
> filesystems, not tens of MBs.
>

Also fine.


> I think there's 2-3 solid days of coding to fully implement
> REQ_PROVISION support in XFS, including userspace tool support.
> Maybe a couple of weeks more to flush the bugs out before it's
> largely ready to go.
>
> So if there's buy in from the block layer and DM people for
> REQ_PROVISION as described, then I'll definitely have XFS support
> ready for you to test whenever dm-thinp is ready to go.
>

Great, this is what I wanted to hear.  I guess we need an ack from the
block guys and then I'll get started.

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v7 0/5] Introduce provisioning primitives

2023-05-26 Thread Joe Thornber
Here's my take:

I don't see why the filesystem cares if thinp is doing a reservation or
provisioning under the hood.  All that matters is that a future write
to that region will be honoured (barring device failure etc.).

I agree that the reservation/force mapped status needs to be inherited
by snapshots.


One of the few strengths of thinp is the performance of taking a snapshot.
Most snapshots created are never activated.  Many other snapshots are
only alive for a brief period, and used read-only.  eg, blk-archive
(https://github.com/jthornber/blk-archive) uses snapshots to do very
fast incremental backups.  As such I'm strongly against any scheme that
requires provisioning as part of the snapshot operation.

Hank and I are in the middle of the range tree work, which requires a
metadata change.  So now is a convenient time to piggyback other metadata
changes to support reservations.


Given the above this is what I suggest:

1) We have an api (ioctl, bio flag, whatever) that lets you
reserve/guarantee a region:

  int reserve_region(dev, sector_t begin, sector_t end);

  This api should be used minimally, eg, critical FS metadata only.

2) Each thinp device will gain two extra fields in the metadata:

  - Total reserved (TR)
  - reserved actually provisioned (RAP)

  The difference between these two is the amount of space we need to
  guarantee for this device.

3) Each individual mapping for a device will gain a flag indicating if it's
   'reserved'.  We will need to support 'reserved but unmapped' mappings.
   There are two provisioning events:

  - unmapped but reserved -> mapped reserved.  Initial provision, RAP
    incremented.
  - mapped and reserved -> ie. break sharing after snapshot.  Again RAP
    incremented.

4) When a snapshot is taken:
  - Reset RAP to zero for the origin
  - New snap has Total reserved set to that of the origin, RAP <- 0
  - All mappings are shared, so the reserved flag for each mapping is
    preserved
  - Only allow the snapshot if there is enough free space for the new
    reserve pool.


5) Reserve for the whole pool is the sum of (TR - RAP) for all devices.
   This pool can only be touched if we're provisioning for a reserved
   mapping.

One drawback with this scheme is the double accounting due to RAP being
reset to zero for the origin device.  This comes about because after the
snapshot operation the two devices are equal; no sense of which device
is the origin is preserved or needed.  So a write to a given block
will trigger provisioning on whichever device it is written to first.
A later write to the other device will not trigger a provision.  I think
this problem can be solved without too much complexity.
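To make the accounting in (2), (4) and (5) concrete, here is a minimal C
sketch.  It is illustrative only: the struct and function names are made up
for this example; none of this is actual dm-thin code.

/* Illustrative only -- models the TR/RAP accounting described above. */
#include <stdint.h>

struct thin_dev {
	uint64_t tr;	/* Total reserved */
	uint64_t rap;	/* Reserved actually provisioned */
};

/* Space the pool must hold back for one device. */
static uint64_t dev_reserve(const struct thin_dev *td)
{
	return td->tr - td->rap;
}

/* Point 5: the pool-wide reserve is the sum of (TR - RAP). */
static uint64_t pool_reserve(const struct thin_dev *devs, unsigned int nr)
{
	uint64_t sum = 0;

	for (unsigned int i = 0; i < nr; i++)
		sum += dev_reserve(&devs[i]);
	return sum;
}

/* Point 4: a snapshot copies TR from the origin and zeroes RAP on both
 * devices, so every reserved mapping counts as needing space again. */
static void take_snapshot(struct thin_dev *origin, struct thin_dev *snap)
{
	snap->tr = origin->tr;
	snap->rap = 0;
	origin->rap = 0;
}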


Now this is a lot of work.  As well as the kernel changes we'll need to
update the userland tools: thin_check, thin_ls, thin_metadata_unpack,
thin_rmap, thin_delta, thin_metadata_pack, thin_repair, thin_trim,
thin_dump, thin_metadata_size, thin_restore.  Are we confident that we
have buy in from the FS teams that this will be widely adopted?  Are users
asking for this?  I really don't want to do 6 months of work for nothing.

- Joe

On Fri, May 26, 2023 at 10:37 AM Dave Chinner  wrote:

> On Thu, May 25, 2023 at 12:19:47PM -0400, Brian Foster wrote:
> > On Wed, May 24, 2023 at 10:40:34AM +1000, Dave Chinner wrote:
> > > On Tue, May 23, 2023 at 11:26:18AM -0400, Mike Snitzer wrote:
> > > > On Tue, May 23 2023 at 10:05P -0400, Brian Foster <
> bfos...@redhat.com> wrote:
> > > > > On Mon, May 22, 2023 at 02:27:57PM -0400, Mike Snitzer wrote:
> > > > > ... since I also happen to think there is a potentially interesting
> > > > > development path to make this sort of reserve pool configurable in
> terms
> > > > > of size and active/inactive state, which would allow the fs to use
> an
> > > > > emergency pool scheme for managing metadata provisioning and not
> have to
> > > > > track and provision individual metadata buffers at all (dealing
> with
> > > > > user data is much easier to provision explicitly). So the space
> > > > > inefficiency thing is potentially just a tradeoff for simplicity,
> and
> > > > > filesystems that want more granularity for better behavior could
> achieve
> > > > > that with more work. Filesystems that don't would be free to rely
> on the
> > > > > simple/basic mechanism provided by dm-thin and still have basic
> -ENOSPC
> > > > > protection with very minimal changes.
> > > > >
> > > > > That's getting too far into the weeds on the future bits, though.
> This
> > > > > is essentially 99% a dm-thin approach, so I'm mainly curious if
> there's
> > > > > sufficient interest in this sort of "reserve mode" approach to try
> and
> > > > > clean it up further and have dm guys look at it, or if you guys
> see any
> > > > > obvious issues in what it does that makes it potentially
> problematic, or
> > > > > if you would just prefer to go down the path described above...
> > > >
> > > > The model that Dave detailed, which builds on REQ_PROVISION and is
> > > > sticky (by 

Re: [dm-devel] [PATCH v3 2/3] dm: Add support for block provisioning

2023-04-14 Thread Joe Thornber
On Fri, Apr 14, 2023 at 7:52 AM Sarthak Kukreti  wrote:

> Add support to dm devices for REQ_OP_PROVISION. The default mode
> is to passthrough the request to the underlying device, if it
> supports it. dm-thinpool uses the provision request to provision
> blocks for a dm-thin device. dm-thinpool currently does not
> pass through REQ_OP_PROVISION to underlying devices.
>
> For shared blocks, provision requests will break sharing and copy the
> contents of the entire block.
>

I see two issues with this patch:

i) You use get_bio_block_range() to see which blocks the provision bio
covers.  But this function only returns complete blocks that are covered
(it was designed for discard).  Unlike discard, provision is not a hint,
so those partial blocks will need to be provisioned too.

ii) You are setting off multiple dm_thin_new_mapping operations in flight
at once.  Each of these receives the same virt_cell and frees it when it
completes.  So I think we have multiple frees occurring?  In addition you
also release the cell yourself in process_provision_cell().  Fixing this
is not trivial: you'll need to reference count the cells, and aggregate
the mapping operation results.

I think it would be far easier to restrict the size of the provision bio
to be no bigger than one thinp block (as we do for normal io).  This way
dm core can split the bios and chain the child bios rather than having to
reference count mapping ops and aggregate the results.
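To illustrate issue (i), here is a small sketch of the two roundings
involved.  The helper names are made up for this example; they are not
dm-thin's actual functions.

/* A discard (a hint) may consider only the thinp blocks a bio covers
 * completely; a provision (not a hint) must also take in the partial
 * blocks at either end of the range. */
#include <stdint.h>

typedef uint64_t sector_t;

/* Blocks fully covered by [begin, end): round begin up, end down. */
static void fully_covered_blocks(sector_t begin, sector_t end,
				 sector_t block_sectors,
				 uint64_t *b, uint64_t *e)
{
	*b = (begin + block_sectors - 1) / block_sectors;
	*e = end / block_sectors;
}

/* Blocks touched by [begin, end): round begin down, end up. */
static void touched_blocks(sector_t begin, sector_t end,
			   sector_t block_sectors,
			   uint64_t *b, uint64_t *e)
{
	*b = begin / block_sectors;
	*e = (end + block_sectors - 1) / block_sectors;
}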

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [dm-6.4 PATCH 1/8] dm: split discards further if target sets max_discard_granularity

2023-03-23 Thread Joe Thornber
Also nacking this patch since you refused to test it.  I said I'd test
since you haven't, but you couldn't wait.

- Joe

On Wed, Mar 22, 2023 at 6:19 PM Mike Snitzer  wrote:

> The block core (bio_split_discard) will already split discards based
> on the 'discard_granularity' and 'max_discard_sectors' queue_limits.
> But the DM thin target also needs to ensure that it doesn't receive a
> discard that spans a 'max_discard_sectors' boundary.
>
> Introduce a dm_target 'max_discard_granularity' flag that if set will
> cause DM core to split discard bios relative to 'max_discard_sectors'.
> This treats 'discard_granularity' as a "min_discard_granularity" and
> 'max_discard_sectors' as a "max_discard_granularity".
>
> Requested-by: Joe Thornber 
> Signed-off-by: Mike Snitzer 
> ---
>  drivers/md/dm.c   | 54 +++
>  include/linux/device-mapper.h |  6 
>  include/uapi/linux/dm-ioctl.h |  4 +--
>  3 files changed, 37 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index b6ace995b9ca..eeb58f89369e 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1162,7 +1162,8 @@ static inline sector_t max_io_len_target_boundary(struct dm_target *ti,
> return ti->len - target_offset;
>  }
>
> -static sector_t max_io_len(struct dm_target *ti, sector_t sector)
> +static sector_t __max_io_len(struct dm_target *ti, sector_t sector,
> +unsigned int max_granularity)
>  {
> sector_t target_offset = dm_target_offset(ti, sector);
> sector_t len = max_io_len_target_boundary(ti, target_offset);
> @@ -1173,11 +1174,16 @@ static sector_t max_io_len(struct dm_target *ti, sector_t sector)
>  *   explains why stacked chunk_sectors based splitting via
>  *   bio_split_to_limits() isn't possible here.
>  */
> -   if (!ti->max_io_len)
> +   if (!max_granularity)
> return len;
> return min_t(sector_t, len,
> min(queue_max_sectors(ti->table->md->queue),
> -   blk_chunk_sectors_left(target_offset, ti->max_io_len)));
> +   blk_chunk_sectors_left(target_offset, max_granularity)));
> +}
> +
> +static inline sector_t max_io_len(struct dm_target *ti, sector_t sector)
> +{
> +   return __max_io_len(ti, sector, ti->max_io_len);
>  }
>
>  int dm_set_target_max_io_len(struct dm_target *ti, sector_t len)
> @@ -1561,26 +1567,6 @@ static void __send_empty_flush(struct clone_info *ci)
> bio_uninit(ci->bio);
>  }
>
> -static void __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
> -   unsigned int num_bios)
> -{
> -   unsigned int len, bios;
> -
> -   len = min_t(sector_t, ci->sector_count,
> -   max_io_len_target_boundary(ti, dm_target_offset(ti, ci->sector)));
> -
> -   atomic_add(num_bios, &ci->io->io_count);
> -   bios = __send_duplicate_bios(ci, ti, num_bios, &len);
> -   /*
> -* alloc_io() takes one extra reference for submission, so the
> -* reference won't reach 0 without the following (+1) subtraction
> -*/
> -   atomic_sub(num_bios - bios + 1, &ci->io->io_count);
> -
> -   ci->sector += len;
> -   ci->sector_count -= len;
> -}
> -
>  static bool is_abnormal_io(struct bio *bio)
>  {
> enum req_op op = bio_op(bio);
> @@ -1602,11 +1588,16 @@ static bool is_abnormal_io(struct bio *bio)
>  static blk_status_t __process_abnormal_io(struct clone_info *ci,
>   struct dm_target *ti)
>  {
> -   unsigned int num_bios = 0;
> +   unsigned int bios, num_bios = 0;
> +   unsigned int len, max_granularity = 0;
>
> switch (bio_op(ci->bio)) {
> case REQ_OP_DISCARD:
> num_bios = ti->num_discard_bios;
> +   if (ti->max_discard_granularity) {
> +   struct queue_limits *limits = dm_get_queue_limits(ti->table->md);
> +   max_granularity = limits->max_discard_sectors;
> +   }
> break;
> case REQ_OP_SECURE_ERASE:
> num_bios = ti->num_secure_erase_bios;
> @@ -1627,7 +1618,20 @@ static blk_status_t __process_abnormal_io(struct clone_info *ci,
> if (unlikely(!num_bios))
> return BLK_STS_NOTSUPP;
>
> -   __send_changing_extent_only(ci, ti, num_bios);
> +   len = min_t(sector_t, ci->sector_count,
> +   __max_io_len(ti, ci->sector,

Re: [dm-devel] Thin pool CoW latency

2023-03-06 Thread Joe Thornber
On Sun, Mar 5, 2023 at 8:40 PM Demi Marie Obenour <d...@invisiblethingslab.com> wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> Like Eric, I am very concerned about CoW latency and throughput.  I
> am almost certain that allocating new blocks and snapshot copy-on-write
> are _the_ hot paths in Qubes OS.  In particular, I suspect that
> workloads such as building an image in a throwaway VM or installing
> packages onto a root volume that had just been shapshotted are dominated
> by metadata operations, rather than by in-place updates.  I suspect that
> frequently-snapshotted volumes will observe similar behavior in general.
>
>
Yes, provisioning and breaking sharing are relatively slow operations.  As
discussed with Eric, I'm not intending to change how either of these
operations is implemented.  If the performance profile is not suitable for
your application, your company can either do some work to improve it
yourselves, or select a different solution.

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [announce] thin-provisioning-tools v1.0.0-rc1

2023-03-06 Thread Joe Thornber
Hi Eric,

On Fri, Mar 3, 2023 at 9:21 PM Eric Wheeler  wrote:

>
> It would be nice to get people testing the new improvements:
>
> Do you think it can make it for the 6.3 merge window that is open?
>

Doubtful.  The bulk of the changes are in dm-bufio, which is used by a lot
of targets.  So I want to do a lot more testing before pushing upstream.



> Did thinp v2 get dropped, or just turn into the patchset above?
>
>
I've shelved thinp v2 in favour of the userland approach I outlined.

> > I think io_uring, and ublk have shown us that this is viable.  That way
> > a snapshot copy-on-write, or dm-cache data migration, which are very
> > slow operations can be done with ordinary userland code.
>
> Would be nice to minimize CoW latency somehow if going to userspace
> increases that a notable amount.  CoW for spinning disks is definitely
> slow, but NVMe's are pretty quick to copy a 64k chunk.
>

Yes, minimising latency would be good.  I don't mind performing the CoW
within the kernel, but I don't want to handle the metadata updates in the
kernel.


> > For the fast paths, layering will be removed by having userland give
> > the kernel instruction to execute for specific regions of the virtual
> > device (ie. remap to here).
>
> Maybe you just answered my question of latency?
>

Yep.


>
> > The kernel driver will have nothing specific to thin/cache etc. I'm not
> > sure how many of the current dm-targets would fit into this model, but
> > I'm sure thin provisioning, caching, linear, and stripe can.
>
> To be clear, linear and stripe would stay in the kernel?


Linear and stripe would not need a call out to userland to service.  And
very little of thin/cache io would either.

- Joe
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [announce] thin-provisioning-tools v1.0.0-rc1

2023-03-02 Thread Joe Thornber
Hi Eric,

On Wed, Mar 1, 2023 at 10:26 PM Eric Wheeler  wrote:

>
> Hurrah! I've been looking forward to this for a long time...
>
>
> ...So if you have any commentary on the future of dm-thin with respect
> to metadata range support, or dm-thin performance in general, that I would
> be very curious about your roadmap and your plans.
>

The plan over the next few months is roughly:

- Get people using the new Rust tools.  They are _so_ much faster than the
  old C++ ones. [available now]
- Push upstream a set of patches I've been working on to boost thin
  concurrency performance.  These are nearing completion and are available
  here for those who are interested:
  https://github.com/jthornber/linux/tree/2023-02-28-thin-concurrency-7.
  These are making a huge difference to performance in my testing; eg, fio
  with 16 jobs running concurrently gets several times the throughput.
  [Upstream in the next month hopefully]
- Change thinp metadata to store ranges rather than individual mappings.
  This will reduce the amount of space the metadata consumes, and have the
  knock-on effect of boosting performance slightly (less metadata means
  faster lookups).  However I consider this a half-way house, in that I'm
  only going to change the metadata and not start using ranges within the
  core target (I'm not moving away from fixed block sizes); a sketch of
  what a range entry buys follows this list.  [Next 3 months]
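To illustrate the range idea in that last point, here are two purely
hypothetical structures; this is not the actual thinp metadata format.

#include <stdint.h>

/* Today: one btree entry per mapped block. */
struct block_mapping {
	uint64_t vblock;	/* virtual block within the thin device */
	uint64_t dblock;	/* data block within the pool */
};

/* Range-based: one entry per contiguous run.  A sequentially written
 * run of 1000 blocks needs 1000 block_mapping entries, but only one
 * range_mapping entry. */
struct range_mapping {
	uint64_t vbegin;	/* first virtual block of the run */
	uint64_t dbegin;	/* first data block of the run */
	uint64_t len;		/* number of contiguous blocks */
};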

I don't envisage significant changes to dm-thin or dm-cache after this.


Longer term I think we're nearing a crunch point where we drastically
change how we do things.  Since I wrote device-mapper in 2001 the speed of
devices has increased so much that I think dm is no longer doing a good job:

- The layering approach introduces inefficiencies with each layer.  Sure,
  it may only be a 5% hit to add another linear mapping into the stack.
  But those 5%'s add up.
- dm targets only see individual bios rather than the whole request queue.
  This prevents a lot of really useful optimisations.  Think how much
  smarter dm-cache and dm-thin could be if they could look at the whole
  queue.
- The targets are getting too complicated.  I think dm-thin is around 8k
  lines of code, though it shares most of that with dm-cache.  I
  understand the dedup target from the vdo guys weighs in at 64k lines.
  Kernel development is fantastically expensive (or slow, depending how
  you want to look at it).  I did a lot of development work on thinp v2,
  and it was looking a lot like a filesystem shoe-horned into the block
  layer.  I can see why bcache turned into bcache-fs.
- Code within the block layer is memory constrained.  We can't make
  arbitrary sized allocations within targets; instead we have to use
  mempools of fixed size objects (frowned upon these days), or declare up
  front how much memory we need to service a bio (forcing us to assume
  the worst case).  This stuff isn't hard, just tedious, and it makes
  coding sophisticated targets pretty joyless.

So my plan going forwards is to keep the fast path of these targets in
kernel (eg, a write to a provisioned, unsnapshotted region), but take the
slow paths out to userland.  I think io_uring and ublk have shown us that
this is viable.  That way a snapshot copy-on-write, or a dm-cache data
migration, which are very slow operations, can be done with ordinary
userland code.  For the fast paths, layering will be removed by having
userland give the kernel instructions to execute for specific regions of
the virtual device (ie. remap to here).  The kernel driver will have
nothing specific to thin/cache etc.  I'm not sure how many of the current
dm-targets would fit into this model, but I'm sure thin provisioning,
caching, linear, and stripe can.
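As a purely hypothetical sketch of what such per-region instructions might
look like -- none of these types or names exist in dm:

#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

enum remap_op {
	REMAP_LINEAR,	/* fast path: redirect io to dev at offset */
	REMAP_ZEROES,	/* fast path: unprovisioned read, return zeroes */
	REMAP_DEFER,	/* slow path: queue the bio and wake userland */
};

/* One instruction, installed by userland, covering [begin, end). */
struct remap_instruction {
	sector_t begin, end;
	enum remap_op op;
	uint32_t dev;		/* destination device (REMAP_LINEAR only) */
	sector_t offset;	/* destination offset (REMAP_LINEAR only) */
};

/* The kernel fast path is then just a lookup (a linear scan here; a
 * real table would be an rbtree or similar). */
static const struct remap_instruction *
lookup_instruction(const struct remap_instruction *table, size_t nr,
		   sector_t sector)
{
	for (size_t i = 0; i < nr; i++)
		if (sector >= table[i].begin && sector < table[i].end)
			return &table[i];
	return NULL;	/* no instruction: defer to userland */
}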

- Joe

> Thanks again for all your great work on this.
>
> -Eric
>
> > [note: _data_ sharing was always maintained, this is purely about
> > metadata space usage]
> >
> > # thin_metadata_pack/unpack
> >
> > These are a couple of new tools that are used for support.  They compress
> > thin metadata, typically to a tenth of the size (much better than you'd
> > get with generic compressors).  This makes it easier to pass damaged
> > metadata around for inspection.
> >
> > # blk-archive
> >
> > The blk-archive tools were initially part of this thin-provisioning-tools
> > package.  But have now been split off to their own project:
> >
> > https://github.com/jthornber/blk-archive
> >
> > They allow efficient archiving of thin devices (data deduplication
> > and compression).  Which will be of interest to those of you who are
> > holding large numbers of snapshots in thin pools as a poor man's backup.
> >
> > In particular:
> >
> > - Thin snapshots can be used to archive live data.
> > - it avoids reading unprovisioned areas of thin devices.
> > - it can calculate deltas between thin devices to minimise
> >   how much data is read and deduped (incremental backups).
> > - restoring to a thin device tries to maximise data sharing
> >   within the thin pool (a big win if you're restoring 

Re: [dm-devel] [PATCH 2/2] dm-thin: Allow specifying an offset

2023-02-07 Thread Joe Thornber
Nack.  I'm not building a linear target into every other target.  Layering
targets is simple.

On Tue, Feb 7, 2023 at 7:56 AM Demi Marie Obenour <d...@invisiblethingslab.com> wrote:

> This allows exposing only part of a thin volume without having to layer
> dm-linear.  One use-case is a hypervisor replacing a partition table.
>
> Signed-off-by: Demi Marie Obenour 
> ---
>  drivers/md/dm-thin.c | 32 ++--
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index d85fdbd782ae5426003c99a4b4bf53818cc85efa..87f14933375b050a950a5f58e98c13b4d28f6af0 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -357,6 +357,7 @@ struct thin_c {
>  */
> refcount_t refcount;
> struct completion can_destroy;
> +   u64 offset;
>  };
>
>  /**/
> @@ -1180,9 +1181,9 @@ static void process_prepared_discard_passdown_pt1(struct dm_thin_new_mapping *m)
> discard_parent = bio_alloc(NULL, 1, 0, GFP_NOIO);
> discard_parent->bi_end_io = passdown_endio;
> discard_parent->bi_private = m;
> -   if (m->maybe_shared)
> -   passdown_double_checking_shared_status(m, discard_parent);
> -   else {
> +   if (m->maybe_shared)
> +   passdown_double_checking_shared_status(m, discard_parent);
> +   else {
> struct discard_op op;
>
> begin_discard(&op, tc, discard_parent);
> @@ -4149,7 +4150,7 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
>
> mutex_lock(_thin_pool_table.mutex);
>
> -   if (argc != 2 && argc != 3) {
> +   if (argc < 2 || argc > 4) {
> ti->error = "Invalid argument count";
> r = -EINVAL;
> goto out_unlock;
> @@ -4168,7 +4169,8 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> bio_list_init(>retry_on_resume_list);
> tc->sort_bio_list = RB_ROOT;
>
> -   if (argc == 3) {
> +   /* Use "/" to indicate "no origin device" while providing an offset */
> +   if (argc >= 3 && strcmp(argv[2], "/")) {
> if (!strcmp(argv[0], argv[2])) {
> ti->error = "Error setting origin device";
> r = -EINVAL;
> @@ -4196,6 +4198,23 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
> goto bad_common;
> }
>
> +   tc->offset = 0;
> +   if (argc > 3) {
> +   sector_t sector_offset;
> +
> +   if (kstrtoull(argv[3], 10, &tc->offset)) {
> +   ti->error = "Invalid offset";
> +   r = -EINVAL;
> +   goto bad_common;
> +   }
> +
> +   if (check_add_overflow(tc->offset, ti->len, &sector_offset)) {
> +   ti->error = "Offset + len overflows sector_t";
> +   r = -EINVAL;
> +   goto bad_common;
> +   }
> +   }
> +
> pool_md = dm_get_md(tc->pool_dev->bdev->bd_dev);
> if (!pool_md) {
> ti->error = "Couldn't get pool mapped device";
> @@ -4285,8 +4304,9 @@ static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
>
>  static int thin_map(struct dm_target *ti, struct bio *bio)
>  {
> -   bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector);
> +   struct thin_c *tc = ti->private;
>
> +   bio->bi_iter.bi_sector = dm_target_offset(ti, bio->bi_iter.bi_sector) + tc->offset;
> return thin_bio_map(ti, bio);
>  }
>
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)
> Invisible Things Lab
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
>
>
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 1/2] Fail I/O to thin pool devices

2023-02-07 Thread Joe Thornber
Nack.

I don't see the security issue; how is this any different from running the
thin tools on any incorrect device?  Or even the data device that the pool
is mirroring.  In general the thin tools don't modify the metadata they're
running on.  If you know of a security issue with the thin tools please let
me know.


On Tue, Feb 7, 2023 at 7:56 AM Demi Marie Obenour <d...@invisiblethingslab.com> wrote:

> A thin pool device currently just passes all I/O to its origin device,
> but this is a footgun: the user might not realize that tools that
> operate on thin pool metadata must operate on the metadata volume.  This
> could have security implications.
>
> Fix this by failing all I/O to thin pool devices.
>
> Signed-off-by: Demi Marie Obenour 
> ---
>  drivers/md/dm-thin.c | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index 64cfcf46881dc5d87d5dfdb5650ba9babd32cd31..d85fdbd782ae5426003c99a4b4bf53818cc85efa 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -3405,19 +3405,14 @@ static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
>
>  static int pool_map(struct dm_target *ti, struct bio *bio)
>  {
> -   int r;
> -   struct pool_c *pt = ti->private;
> -   struct pool *pool = pt->pool;
> -
> /*
> -* As this is a singleton target, ti->begin is always zero.
> +* Previously, access to the pool was passed down to the origin device.
> +* However, this turns out to be error-prone: if the user runs any of
> +* the thin tools on the pool device, the tools could wind up parsing
> +* potentially attacker-controlled data.  This mistake has actually
> +* happened in practice.  Therefore, fail all I/O on the pool device.
>  */
> -   spin_lock_irq(&pool->lock);
> -   bio_set_dev(bio, pt->data_dev->bdev);
> -   r = DM_MAPIO_REMAPPED;
> -   spin_unlock_irq(&pool->lock);
> -
> -   return r;
> +   return -EIO;
>  }
>
>  static int maybe_resize_data_dev(struct dm_target *ti, bool *need_commit)
> --
> Sincerely,
> Demi Marie Obenour (she/her/hers)
> Invisible Things Lab
>
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://listman.redhat.com/mailman/listinfo/dm-devel
>
>
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


[dm-devel] [announce] thin-provisioning-tools v1.0.0-rc1

2022-12-12 Thread Joe Thornber
We're pleased to announce the first release candidate of v1.0.0 of the
thin-provisioning-tools (which also contains tools for dm-cache and
dm-era).

Please try it out on your test systems and give us feedback.  In
particular regarding documentation, build and install process.

https://github.com/jthornber/thin-provisioning-tools


# Rust

This is a complete rewrite of the tools in the Rust language.  This switch
was made for three primary reasons:

- Memory safety.
- The type system makes it easier to reason about multithreaded code and
  we need to use multiple threads to boost performance.
- Rust is a more expressive language than C++ (eg, proper algebraic data
  types).

# IO engines

The old C++ tools used libaio for all IO.  The Rust version by default
uses traditional synchronous IO.  This sounds like a step backwards,
but the use of multiple threads and IO scheduling means there's a big
leap in performance.

In addition there's a compile time option to include asynchronous
IO support via io_uring.  This engine is slightly faster, but not all
distributions support io_uring at the moment.  In addition we've seen
recent (summer 2022) kernel versions that lose io notifications, causing
us to feel that io_uring itself isn't quite ready.

# Performance

We've focussed on thin_check and cache_check performance most of all
since these regularly get run on system startup.  But all tools should
have made significant performance improvements.

Over the years we've collected some gnarly dm-thin metadata examples from
users, eg, using hundreds of thousands of snapshots, and completely
filling the maximum metadata size of 16G.  These are great for
benchmarking, for example running thin_check on my system with one of these
files:

thin_check (v0.9):  6m
thin_check (v1.0, sync engine): 28s  (12.9 times faster)
thin_check (v1.0, io_uring engine): 23s  (15.6 times faster)

# thin_dump/restore retains snapshot sharing

Another issue with previous versions of the tools is dumping and restoring
thin metadata would have the effect of losing sharing of the metadata
btrees for snapshots.  Meaning restored metadata often took up more space,
and in some cases would exceed the 16G metadata limit.  This is no longer
the case.

[note: _data_ sharing was always maintained, this is purely about metadata
space usage]

# thin_metadata_pack/unpack

These are a couple of new tools that are used for support.  They compress
thin metadata, typically to a tenth of the size (much better than you'd
get with generic compressors).  This makes it easier to pass damaged
metadata around for inspection.

# blk-archive

The blk-archive tools were initially part of this thin-provisioning-tools
package.  But have now been split off to their own project:

https://github.com/jthornber/blk-archive

They allow efficient archiving of thin devices (data deduplication
and compression).  Which will be of interest to those of you who are
holding large numbers of snapshots in thin pools as a poor man's backup.

In particular:

- Thin snapshots can be used to archive live data.
- it avoids reading unprovisioned areas of thin devices.
- it can calculate deltas between thin devices to minimise
  how much data is read and deduped (incremental backups).
- restoring to a thin device tries to maximise data sharing
  within the thin pool (a big win if you're restoring snapshots).
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH -next] dm thin: Use last transaction's pmd->root when commit failed

2022-12-08 Thread Joe Thornber
Acked-by: Joe Thornber 

On Thu, Dec 8, 2022 at 2:07 PM Zhihao Cheng  wrote:

> Recently we found a softlockup problem in dm thin pool looking up a btree
> caused by corrupted metadata:
>  Kernel panic - not syncing: softlockup: hung tasks
>  CPU: 7 PID: 2669225 Comm: kworker/u16:3
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
>  Workqueue: dm-thin do_worker [dm_thin_pool]
>  Call Trace:
>
>dump_stack+0x9c/0xd3
>panic+0x35d/0x6b9
>watchdog_timer_fn.cold+0x16/0x25
>__run_hrtimer+0xa2/0x2d0
>
>RIP: 0010:__relink_lru+0x102/0x220 [dm_bufio]
>__bufio_new+0x11f/0x4f0 [dm_bufio]
>new_read+0xa3/0x1e0 [dm_bufio]
>dm_bm_read_lock+0x33/0xd0 [dm_persistent_data]
>ro_step+0x63/0x100 [dm_persistent_data]
>btree_lookup_raw.constprop.0+0x44/0x220 [dm_persistent_data]
>dm_btree_lookup+0x16f/0x210 [dm_persistent_data]
>dm_thin_find_block+0x12c/0x210 [dm_thin_pool]
>__process_bio_read_only+0xc5/0x400 [dm_thin_pool]
>process_thin_deferred_bios+0x1a4/0x4a0 [dm_thin_pool]
>process_one_work+0x3c5/0x730
>
> The following process may generate a broken btree mixed with fresh and stale
> nodes, which could let dm thin get trapped in an infinite loop while looking
> up a data block:
>  Transaction 1: pmd->root = A, A->B->C   // One path in btree
> pmd->root = X, X->Y->Z   // Copy-up
>  Transaction 2: X and Z are updated on disk, but writing Y fails.
> // Commit failed, dm thin becomes read-only.
> process_bio_read_only
>  dm_thin_find_block
>   __find_block
>dm_btree_lookup(pmd->root)
> The pmd->root points to a broken btree; Y may contain a stale node pointing
> to any block, for example X, which lets dm thin get trapped in a dead loop
> while looking up Z.
>
> Fix it by setting pmd->root in __open_metadata(), so that dm thin will use
> last transaction's pmd->root if commit failed.
>
> Fetch a reproducer in [Link].
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=216790
> Fixes: 991d9fa02da0 ("dm: add thin provisioning target")
> Signed-off-by: Zhihao Cheng 
> ---
>  drivers/md/dm-thin-metadata.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
> index 1a62226ac34e..26c42ee183ed 100644
> --- a/drivers/md/dm-thin-metadata.c
> +++ b/drivers/md/dm-thin-metadata.c
> @@ -724,6 +724,15 @@ static int __open_metadata(struct dm_pool_metadata
> *pmd)
> goto bad_cleanup_data_sm;
> }
>
> +   /*
> +* For pool metadata opening process, root setting is redundant
> +* because it will be set again in __begin_transaction(). But dm
> +* pool aborting process really needs to get last transaction's
> +* root in case accessing broken btree.
> +*/
> +   pmd->root = le64_to_cpu(disk_super->data_mapping_root);
> +   pmd->details_root = le64_to_cpu(disk_super->device_details_root);
> +
> __setup_btree_details(pmd);
> dm_bm_unlock(sblock);
>
> --
> 2.31.1
>
>
--
dm-devel mailing list
dm-devel@redhat.com
https://listman.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH 4/4 v2] persistent-data: reduce lock contention while walking the btree

2022-10-13 Thread Joe Thornber
fio results (many jobs, large queue depth, 64 core machine, preallocated
thin):

upstream:
READ: bw=303MiB/s (318MB/s), 15.7MiB/s-24.9MiB/s (16.5MB/s-26.1MB/s),
io=9093MiB (9535MB), run=30004-30007msec
WRITE: bw=304MiB/s (318MB/s), 15.7MiB/s-24.8MiB/s (16.5MB/s-26.0MB/s),
io=9108MiB (9550MB), run=30004-30007msec

Mikulas' patches:
READ: bw=524MiB/s (549MB/s), 32.4MiB/s-32.9MiB/s (34.0MB/s-34.5MB/s),
io=15.3GiB (16.5GB), run=30001-30002msec
WRITE: bw=524MiB/s (550MB/s), 32.5MiB/s-32.9MiB/s (34.1MB/s-34.5MB/s),
io=15.4GiB (16.5GB), run=30001-30002msec

My patches:
READ: bw=538MiB/s (564MB/s), 33.4MiB/s-33.7MiB/s (35.1MB/s-35.4MB/s),
io=15.8GiB (16.9GB), run=30001-30002msec
WRITE: bw=539MiB/s (565MB/s), 33.4MiB/s-33.8MiB/s (35.0MB/s-35.4MB/s),
io=15.8GiB (16.9GB), run=30001-30002msec

Both our approaches give similar results.  Mine is slightly faster because
the changes are all internal to dm-bufio, so everything benefits from the
added concurrency, not just the btrees (eg, space maps).

As mentioned on irc I don't like the new interface to dm-bufio.  Having
different methods of getting a buffer is ugly: you have to fall back to the
other method if it fails, and remember how you eventually got it for
unlocking.  Plus we need to change any code to take advantage of it.
Currently dm-bufio is used by dm-thin, dm-cache, dm-era, dm-verity,
dm-ebs-target, dm-integrity, and dm-snap-persistent.  We're not going to
audit all these and see which can benefit.

On the other hand I don't like the size of my patch (~1200 line diff).
I'll post it when it's complete and we can continue the discussion then.

- Joe


On Wed, Oct 12, 2022 at 7:31 AM Joe Thornber  wrote:

> Thanks Mikulas,
>
> I'll test this morning.
>
> - Joe
>
>
> On Tue, Oct 11, 2022 at 8:10 PM Mikulas Patocka  wrote:
>
>> Hi
>>
>> Here I'm sending updated patch 4 that fixes hang on discard. We must not
>> do the optimization in dm_btree_lookup_next.
>>
>> Mikulas
>>
>>
>> From: Mikulas Patocka 
>>
>> This patch reduces lock contention in btree walks. We modify the
>> functions init_ro_spine, exit_ro_spine and ro_step so that they use
>> dm_bufio_lock_read/dm_bufio_get_unlocked/dm_bufio_unlock_read. If
>> dm_bm_fast_get_block fails, we fall back to normal locking.
>>
>> When doing tests on pmem and a fully provisioned thin volume, it improves
>> throughput from 286MiB/s to 442MiB/s.
>>
>> fio --ioengine=psync --iodepth=1 --rw=randrw --bs=4k --direct=1
>> --numjobs=12 --time_based --runtime=10 --group_reporting --name=/dev/vg/thin
>> before:
>> READ: bw=286MiB/s
>> WRITE: bw=286MiB/s
>> after:
>> READ: bw=442MiB/s
>> WRITE: bw=442MiB/s
>>
>> Signed-off-by: Mikulas Patocka 
>>
>> ---
>>  drivers/md/persistent-data/dm-block-manager.c   |   32 +++
>>  drivers/md/persistent-data/dm-block-manager.h   |6 ++
>>  drivers/md/persistent-data/dm-btree-internal.h  |4 +
>>  drivers/md/persistent-data/dm-btree-spine.c |   41 ++--
>>  drivers/md/persistent-data/dm-btree.c   |6 +-
>>  drivers/md/persistent-data/dm-transaction-manager.c |   17 
>>  drivers/md/persistent-data/dm-transaction-manager.h |6 ++
>>  7 files changed, 104 insertions(+), 8 deletions(-)
>>
>> Index: linux-2.6/drivers/md/persistent-data/dm-block-manager.c
>> ===
>> --- linux-2.6.orig/drivers/md/persistent-data/dm-block-manager.c
>> +++ linux-2.6/drivers/md/persistent-data/dm-block-manager.c
>> @@ -601,6 +601,38 @@ void dm_bm_unlock(struct dm_block *b)
>>  }
>>  EXPORT_SYMBOL_GPL(dm_bm_unlock);
>>
>> +void dm_bm_fast_lock(struct dm_block_manager *bm)
>> +{
>> +   dm_bufio_lock_read(bm->bufio);
>> +}
>> +
>> +void dm_bm_fast_unlock(struct dm_block_manager *bm)
>> +{
>> +   dm_bufio_unlock_read(bm->bufio);
>> +}
>> +
>> +int dm_bm_fast_get_block(struct dm_block_manager *bm,
>> +dm_block_t b, struct dm_block_validator *v,
>> +struct dm_block **result)
>> +{
>> +   int r;
>> +   struct buffer_aux *aux;
>> +   void *p;
>> +
>> +   p = dm_bufio_get_unlocked(bm->bufio, b, (struct dm_buffer **) result);
>> +   if (IS_ERR(p))
>> +   return PTR_ERR(p);
>> +   if (unlikely(!p))
>> +   return -EWOULDBLOCK;
>> +
>> +   aux = dm_bufio_get_aux_data(to_buffer(*result));
>> +   r = dm_bm_validate_buffer(bm, to_buffer(*result), aux, v);
>> + 

Re: [dm-devel] [PATCH 4/4 v2] persistent-data: reduce lock contention while walking the btree

2022-10-12 Thread Joe Thornber
Thanks Mikulas,

I'll test this morning.

- Joe


On Tue, Oct 11, 2022 at 8:10 PM Mikulas Patocka  wrote:

> Hi
>
> Here I'm sending updated patch 4 that fixes hang on discard. We must not
> do the optimization in dm_btree_lookup_next.
>
> Mikulas
>
>
> From: Mikulas Patocka 
>
> This patch reduces lock contention in btree walks. We modify the
> functions init_ro_spine, exit_ro_spine and ro_step so that they use
> dm_bufio_lock_read/dm_bufio_get_unlocked/dm_bufio_unlock_read. If
> dm_bm_fast_get_block fails, we fall back to normal locking.
>
> When doing tests on pmem and a fully provisioned thin volume, it improves
> throughput from 286MiB/s to 442MiB/s.
>
> fio --ioengine=psync --iodepth=1 --rw=randrw --bs=4k --direct=1
> --numjobs=12 --time_based --runtime=10 --group_reporting --name=/dev/vg/thin
> before:
> READ: bw=286MiB/s
> WRITE: bw=286MiB/s
> after:
> READ: bw=442MiB/s
> WRITE: bw=442MiB/s
>
> Signed-off-by: Mikulas Patocka 
>
> ---
>  drivers/md/persistent-data/dm-block-manager.c   |   32 +++
>  drivers/md/persistent-data/dm-block-manager.h   |6 ++
>  drivers/md/persistent-data/dm-btree-internal.h  |4 +
>  drivers/md/persistent-data/dm-btree-spine.c |   41 ++--
>  drivers/md/persistent-data/dm-btree.c   |6 +-
>  drivers/md/persistent-data/dm-transaction-manager.c |   17 
>  drivers/md/persistent-data/dm-transaction-manager.h |6 ++
>  7 files changed, 104 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/drivers/md/persistent-data/dm-block-manager.c
> ===
> --- linux-2.6.orig/drivers/md/persistent-data/dm-block-manager.c
> +++ linux-2.6/drivers/md/persistent-data/dm-block-manager.c
> @@ -601,6 +601,38 @@ void dm_bm_unlock(struct dm_block *b)
>  }
>  EXPORT_SYMBOL_GPL(dm_bm_unlock);
>
> +void dm_bm_fast_lock(struct dm_block_manager *bm)
> +{
> +   dm_bufio_lock_read(bm->bufio);
> +}
> +
> +void dm_bm_fast_unlock(struct dm_block_manager *bm)
> +{
> +   dm_bufio_unlock_read(bm->bufio);
> +}
> +
> +int dm_bm_fast_get_block(struct dm_block_manager *bm,
> +dm_block_t b, struct dm_block_validator *v,
> +struct dm_block **result)
> +{
> +   int r;
> +   struct buffer_aux *aux;
> +   void *p;
> +
> +   p = dm_bufio_get_unlocked(bm->bufio, b, (struct dm_buffer **) result);
> +   if (IS_ERR(p))
> +   return PTR_ERR(p);
> +   if (unlikely(!p))
> +   return -EWOULDBLOCK;
> +
> +   aux = dm_bufio_get_aux_data(to_buffer(*result));
> +   r = dm_bm_validate_buffer(bm, to_buffer(*result), aux, v);
> +   if (unlikely(r))
> +   return r;
> +
> +   return 0;
> +}
> +
>  int dm_bm_flush(struct dm_block_manager *bm)
>  {
> if (dm_bm_is_read_only(bm))
> Index: linux-2.6/drivers/md/persistent-data/dm-block-manager.h
> ===
> --- linux-2.6.orig/drivers/md/persistent-data/dm-block-manager.h
> +++ linux-2.6/drivers/md/persistent-data/dm-block-manager.h
> @@ -96,6 +96,12 @@ int dm_bm_write_lock_zero(struct dm_bloc
>
>  void dm_bm_unlock(struct dm_block *b);
>
> +void dm_bm_fast_lock(struct dm_block_manager *bm);
> +void dm_bm_fast_unlock(struct dm_block_manager *bm);
> +int dm_bm_fast_get_block(struct dm_block_manager *bm,
> +dm_block_t b, struct dm_block_validator *v,
> +struct dm_block **result);
> +
>  /*
>   * It's a common idiom to have a superblock that should be committed last.
>   *
> Index: linux-2.6/drivers/md/persistent-data/dm-btree-internal.h
> ===
> --- linux-2.6.orig/drivers/md/persistent-data/dm-btree-internal.h
> +++ linux-2.6/drivers/md/persistent-data/dm-btree-internal.h
> @@ -64,9 +64,11 @@ struct ro_spine {
> struct dm_btree_info *info;
>
> struct dm_block *node;
> +   bool fast_locked;
> +   bool fast_lock_failed;
>  };
>
> -void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info);
> +void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info, bool disable_fast_access);
>  void exit_ro_spine(struct ro_spine *s);
>  int ro_step(struct ro_spine *s, dm_block_t new_child);
>  struct btree_node *ro_node(struct ro_spine *s);
> Index: linux-2.6/drivers/md/persistent-data/dm-btree-spine.c
> ===
> --- linux-2.6.orig/drivers/md/persistent-data/dm-btree-spine.c
> +++ linux-2.6/drivers/md/persistent-data/dm-btree-spine.c
> @@ -118,27 +118,60 @@ void unlock_block(struct dm_btree_info *
> dm_tm_unlock(info->tm, b);
>  }
>
> +static void bn_fast_lock(struct dm_btree_info *info)
> +{
> +   dm_tm_fast_lock(info->tm);
> +}
> +
> +static void bn_fast_unlock(struct dm_btree_info *info)
> +{
> +   dm_tm_fast_unlock(info->tm);
> +}
> +
> +static int 

Re: [dm-devel] [lvm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178

2020-01-07 Thread Joe Thornber
On Tue, Jan 07, 2020 at 10:46:27AM +, Joe Thornber wrote:
> I'll get a patch to you later today.

Eric,

Patch below.  I've run it through a bunch of tests in the dm test suite.  But
obviously I have never hit your issue.  Will do more testing today.

- Joe



Author: Joe Thornber 
Date:   Tue Jan 7 11:58:42 2020 +

[dm-thin, dm-cache] Fix bug in space-maps.

The space-maps track the reference counts for disk blocks.  There are
variants for tracking metadata blocks, and data blocks.

We implement transactionality by never touching blocks from the previous
transaction, so we can rollback in the event of a crash.

When allocating a new block we need to ensure the block is free (has a
reference count of 0) in both the current and previous transaction.  Prior
to this patch we were doing this by searching for a free block in the
previous transaction, and relying on a 'begin' counter to track where the
last allocation in the current transaction was.  This 'begin' field was
not being updated in all code paths (eg, increment of a data block
reference count due to breaking sharing of a neighbour block in the same
btree leaf).

This patch keeps the 'begin' field, but now it's just a hint to speed up
the search.  Instead we search the current transaction for a free block,
and then double check it's free in the old transaction.  Much simpler.
diff --git a/drivers/md/persistent-data/dm-space-map-common.c b/drivers/md/persistent-data/dm-space-map-common.c
index bd68f6fef694..b4983e4022e6 100644
--- a/drivers/md/persistent-data/dm-space-map-common.c
+++ b/drivers/md/persistent-data/dm-space-map-common.c
@@ -380,6 +380,34 @@ int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
return -ENOSPC;
 }
 
+int sm_ll_find_common_free_block(struct ll_disk *old_ll, struct ll_disk *new_ll,
+dm_block_t begin, dm_block_t end, dm_block_t *b)
+{
+   int r;
+   uint32_t count;
+
+   do {
+   r = sm_ll_find_free_block(new_ll, begin, new_ll->nr_blocks, b);
+   if (r)
+   break;
+
+   /* double check this block wasn't used in the old transaction */
+   if (*b >= old_ll->nr_blocks)
+   count = 0;
+
+   else {
+   r = sm_ll_lookup(old_ll, *b, &count);
+   if (r)
+   break;
+
+   if (count)
+   begin = *b + 1;
+   }
+   } while (count);
+
+   return r;
+}
+
 static int sm_ll_mutate(struct ll_disk *ll, dm_block_t b,
int (*mutator)(void *context, uint32_t old, uint32_t 
*new),
void *context, enum allocation_event *ev)
diff --git a/drivers/md/persistent-data/dm-space-map-common.h b/drivers/md/persistent-data/dm-space-map-common.h
index b3078d5eda0c..8de63ce39bdd 100644
--- a/drivers/md/persistent-data/dm-space-map-common.h
+++ b/drivers/md/persistent-data/dm-space-map-common.h
@@ -109,6 +109,8 @@ int sm_ll_lookup_bitmap(struct ll_disk *ll, dm_block_t b, uint32_t *result);
 int sm_ll_lookup(struct ll_disk *ll, dm_block_t b, uint32_t *result);
 int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
  dm_block_t end, dm_block_t *result);
+int sm_ll_find_common_free_block(struct ll_disk *old_ll, struct ll_disk *new_ll,
+dm_block_t begin, dm_block_t end, dm_block_t *result);
 int sm_ll_insert(struct ll_disk *ll, dm_block_t b, uint32_t ref_count, enum allocation_event *ev);
 int sm_ll_inc(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
 int sm_ll_dec(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
diff --git a/drivers/md/persistent-data/dm-space-map-disk.c b/drivers/md/persistent-data/dm-space-map-disk.c
index 32adf6b4a9c7..bf4c5e2ccb6f 100644
--- a/drivers/md/persistent-data/dm-space-map-disk.c
+++ b/drivers/md/persistent-data/dm-space-map-disk.c
@@ -167,8 +167,10 @@ static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
enum allocation_event ev;
struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
 
-   /* FIXME: we should loop round a couple of times */
-   r = sm_ll_find_free_block(&smd->old_ll, smd->begin, smd->old_ll.nr_blocks, b);
+   /*
+* Any block we allocate has to be free in both the old and current ll.
+*/
+   r = sm_ll_find_common_free_block(&smd->old_ll, &smd->ll, smd->begin, smd->ll.nr_blocks, b);
if (r)
return r;
 
diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c
index 25328582cc48..9e3c64ec2026 100644
--- a/drivers/md/persistent-data/dm-space-map-metadata.c
+++ b/drivers/md/persistent-data/dm-space-map-metadata.

Re: [dm-devel] [lvm-devel] kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178

2020-01-07 Thread Joe Thornber
On Tue, Jan 07, 2020 at 10:35:46AM +, Joe Thornber wrote:
> On Sat, Dec 28, 2019 at 02:13:07AM +, Eric Wheeler wrote:
> > On Fri, 27 Dec 2019, Eric Wheeler wrote:
> 
> > > Just hit the bug again without mq-scsi (scsi_mod.use_blk_mq=n), all other 
> > > times MQ has been turned on. 
> > > 
> > > I'm trying the -ENOSPC hack which will flag the pool as being out of 
> > > space 
> > > so I can recover more gracefully than a BUG_ON. Here's a first-draft 
> > > patch, maybe the spinlock will even prevent the issue.
> > > 
> > > Compile tested, I'll try on a real system tomorrow.
> > > 
> > > Comments welcome:
> 
> Both sm_ll_find_free_block() and sm_ll_inc() can trigger synchronous IO.
> So you absolutely cannot use a spin lock.
> 
> dm_pool_alloc_data_block() holds a big rw semaphore which should prevent
> anything else trying to allocate at the same time.

I suspect the problem is to do with the way we search for the new block in
the space map for the previous transaction (sm->old_ll), and then increment
in the current transaction (sm->ll).

We keep old_ll around so we can ensure we never (re)allocate a block that's
used in the previous transaction.  This gives us our crash resistance,
since if anything goes wrong we effectively rollback to the previous
transaction.

What I think we should be doing is running find_free on the old_ll, then
double checking it's not actually used in the current transaction.  ATM
we're relying on smd->begin being set properly everywhere, and I suspect
this isn't the case.  A quick look shows sm_disk_inc_block() doesn't adjust
it.  sm_disk_inc_block() can be called when breaking sharing of a
neighbouring entry in a leaf btree node ... and we know you use snapshots
very heavily.

I'll get a patch to you later today.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-thin: Several Questions on dm-thin performance.

2019-12-18 Thread Joe Thornber
On Sun, Dec 15, 2019 at 09:44:49PM +, Eric Wheeler wrote:
> I was looking through the dm-bio-prison-v2 commit for dm-cache (b29d4986d) 
> and it is huge, ~5k lines.  Do you still have a git branch with these 
> commits in smaller pieces (not squashed) so we can find the bits that 
> might be informative for converting lv-thin to use dm-bio-prison-v2?
> 
> For example, I think that, at least, the policy changes and 
> btracker code is dm-cache specific and just a distraction when trying to 
> understand the dm-bio-prison-v2 conversion.

To be honest I would hold off for a couple of months.  I've been working
on the design for thinp 2 and have got to the point where I need to write
a userland proof of concept implementation.  In particular I've focussed
on packing more into btree nodes, and separating transactions so IO to
different thins has no locking contention.  The proof of concept will tell
me just how small I can get the metadata.  If the level of metadata
compression is ~1/10th we'll plug the new btrees into the existing design
and switch to bio prison v2.  If it's greater, say 1/50th, then I'll
rewrite the whole target to use write-ahead logging for transactionality
and ditch all metadata sharing altogether.  When the metadata is that
small we can copy entire btrees to implement snapshots.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 2/2] dm thin: Flush data device before committing metadata

2019-12-04 Thread Joe Thornber
On Wed, Dec 04, 2019 at 04:07:42PM +0200, Nikos Tsironis wrote:
> The thin provisioning target maintains per thin device mappings that map
> virtual blocks to data blocks in the data device.


Ack.  But I think we're issuing the FLUSH twice with your patch, since the
original bio is still remapped and issued at the end of process_deferred_bios?

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-thin: Several Questions on dm-thin performance.

2019-12-04 Thread Joe Thornber
(These notes are for my own benefit as much as anything, I haven't
worked on this for a couple of years and will forget it all completely
if I don't write it down somewhere).

Let's start by writing some pseudocode for what the remap function for
thin provisioning actually does.

--
-- Metadata

newtype ThinId = ThinId Int
data Bio = Bio
newtype VBlock = VBlock Integer -- virtual block
newtype DBlock = DBlock Integer -- data block

data LookupResult =
    Unprovisioned |
    Provisioned { lrDBlock :: DBlock,
                  lrShared :: Bool
                }

metadataLookup :: ThinId -> VBlock -> Task LookupResult
metadataLookup = undefined

metadataInsert :: ThinId -> VBlock -> DBlock -> Task ()
metadataInsert = undefined

metadataRemove :: ThinId -> VBlock -> Task ()
metadataRemove = undefined

-- Blocks all other metadata operations while running
metadataCommit :: Task ()
metadataCommit = undefined

--
-- Tasks

-- Many steps of servicing a bio can block.  eg, taking a lock,
-- reading metadata, updating metadata, zeroing data, copying
-- data ...
-- So we completely side step the issue in this pseudocode by
-- running everything in a magic light weight thread.
spawn :: Task () -> IO ()
spawn = undefined

--
-- Locking

-- These 'with' primitives acquire a lock (can block of course), perform
-- an action, and then automatically release the lock.
 
-- Shared lock can be upgraded, so we need to pass the lock into
-- the action body.
withSharedLock :: ThinId -> VBlock -> (Lock -> Task ()) -> Task ()
withSharedLock thinId vblock actionFn = undefined

withExclusiveLock :: ThinId -> VBlock -> Task () -> Task ()
withExclusiveLock thinId vblock action = undefined

-- This promotes a shared lock to exclusive.
withUpgradedLock :: Lock -> Task () -> Task ()
withUpgradedLock lock action = undefined

-- Data locks are always exclusive
withDataLock :: DBlock -> Task () -> Task ()
withDataLock dblock action = undefined

--

-- Top level remap function.  Kicks off a green thread for each bio.
-- How we handle a bio depends on whether it's a read, write, discard
-- or flush bio.  Whether the block is already provisioned, and if so
-- whether it is shared between snapshots.
remap :: ThinId -> Bio -> IO ()
remap thinId bio = spawn $ remapFn thinId bio vblock
    where
        vblock = virtBlock bio
        remapFn = case classifyBio bio of
            ReadBio -> remapRead
            WriteBio -> remapWrite
            DiscardBio -> remapDiscard
            FlushBio -> remapFlush

--

remapRead :: ThinId -> Bio -> VBlock -> Task ()
remapRead thinId bio vblock = do
    withSharedLock thinId vblock $ \_ -> do
        lr <- metadataLookup thinId vblock
        case lr of
            -- Read, Unprovisioned, Shared/!Shared
            Unprovisioned -> do
                fillWithZeroes bio
                complete bio Success

            -- Read, Provisioned, Shared/!Shared
            (Provisioned dblock _) ->
                remapAndWait bio dblock

--

remapWrite :: ThinId -> Bio -> VBlock -> Task ()
remapWrite thinId bio vblock = do
    withSharedLock thinId vblock $ \lock -> do
        lr <- metadataLookup thinId vblock
        case lr of
            -- Write, Unprovisioned
            Unprovisioned ->
                withUpgradedLock lock $
                    provision thinId bio vblock

            -- Write, Provisioned, !Shared
            (Provisioned dblock False) ->
                remapAndWait bio dblock

            -- Write, Provisioned, Shared
            (Provisioned dblock True) ->
                withUpgradedLock lock $
                    breakSharing thinId bio vblock dblock

breakSharing :: ThinId -> Bio -> VBlock -> DBlock -> Task ()
breakSharing thinId bio vblock dblockOld = do
    ab <- allocateBlock
    case ab of
        NoDataSpace ->
            complete bio Failure

        (Allocated dblockNew) ->

Re: [dm-devel] [PATCH] dm btree: increase rebalance threshold in __rebalance2()

2019-12-03 Thread Joe Thornber
Ack.  Thank you.

On Tue, Dec 03, 2019 at 07:42:58PM +0800, Hou Tao wrote:
> We got the following warnings from thin_check during thin-pool setup:
> 
>   $ thin_check /dev/vdb
>   examining superblock
>   examining devices tree
> missing devices: [1, 84]
>   too few entries in btree_node: 41, expected at least 42 (block 138, max_entries = 126)
>   examining mapping tree
> 
> The phenomenon is the number of entries in one node of details_info tree is
> less than (max_entries / 3). And it can be easily reproduced by the following
> procedures:
> 
>   $ new a thin pool
>   $ presume the max entries of details_info tree is 126
>   $ new 127 thin devices (e.g. 1~127) to make the root node being full
> and then split
>   $ remove the first 43 (e.g. 1~43) thin devices to make the children
> rebalance repeatedly
>   $ stop the thin pool
>   $ thin_check
> 
> The root cause is that the B-tree removal procedure in __rebalance2()
> doesn't guarantee the invariant: the minimal number of entries in a
> non-root node should be >= (max_entries / 3).
> 
> Simply fix the problem by increasing the rebalance threshold to
> make sure the number of entries in each child will be greater
> than or equal to (max_entries / 3 + 1), so no matter which
> child is used for removal, the number will still be valid.
> 
> Signed-off-by: Hou Tao 
> ---
>  drivers/md/persistent-data/dm-btree-remove.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/persistent-data/dm-btree-remove.c 
> b/drivers/md/persistent-data/dm-btree-remove.c
> index 21ea537bd55e..eff04fa23dfa 100644
> --- a/drivers/md/persistent-data/dm-btree-remove.c
> +++ b/drivers/md/persistent-data/dm-btree-remove.c
> @@ -203,7 +203,13 @@ static void __rebalance2(struct dm_btree_info *info, 
> struct btree_node *parent,
>   struct btree_node *right = r->n;
>   uint32_t nr_left = le32_to_cpu(left->header.nr_entries);
>   uint32_t nr_right = le32_to_cpu(right->header.nr_entries);
> - unsigned threshold = 2 * merge_threshold(left) + 1;
> + /*
> +  * Ensure the number of entries in each child will be greater
> +  * than or equal to (max_entries / 3 + 1), so no matter which
> +  * child is used for removal, the number will still be not
> +  * less than (max_entries / 3).
> +  */
> + unsigned int threshold = 2 * (merge_threshold(left) + 1);
>  
>   if (nr_left + nr_right < threshold) {
>   /*
> -- 
> 2.22.0
> 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-thin: Several Questions on dm-thin performance.

2019-12-03 Thread Joe Thornber
On Mon, Dec 02, 2019 at 10:26:00PM +, Eric Wheeler wrote:

> Hi Joe,
> 
> I'm not sure if I will have the time but thought I would start the 
> research and ask a few questions. I looked at the v1/v2 .h files and some 
> of the functions just change suffix to _v2 and maybe calling 
> convention/structure field changes.
> 
> However, there appear to be some design changes, too:

Yes, the interface is different, and it's really not trivial to switch dm-thin
over to use it (otherwise I'd have done it already).  dm-cache already uses
the new interface, which could be used as a guide, especially if you look
at the patches that made the switch.

I'm going to write up some notes over the next couple of days, which I'll
post on this thread.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] dm-thin: Several Questions on dm-thin performance.

2019-11-22 Thread Joe Thornber
On Fri, Nov 22, 2019 at 11:14:15AM +0800, JeffleXu wrote:

> The first question is what's the purpose of data cell? In thin_bio_map(),
> normal bio will be packed as a virtual cell and data cell. I can understand
> that virtual cell is used to prevent discard bio and non-discard bio
> targeting the same block from being processed at the same time. I find it
> was added in commit     e8088073c9610af017fd47fddd104a2c3afb32e8 (dm thin:
> fix race between simultaneous io and discards to same block), but I'm still
> confused about the use of data cell.

As you are aware there are two address spaces for the locks.  The
'virtual' one refers to cells in the logical address space of the thin
devices, and the 'data' one refers to the underlying data device.  There
are certain conditions where we unfortunately need to hold both of these
(eg, to prevent a data block being reprovisioned before an io to it has
completed).

> The second question is the impact of virtual cell and data cell on IO
> performance. If $data_block_size is large for example 1G, in multithread fio
> test, most bio will be buffered in cell->bios list and then be processed by
> worker thread asynchronously, even when there's no discard bio. Thus the
> original parallel IO is processed by worker thread serially now. As the
> number of fio test threads increase, the single worker thread can easily get
> CPU 100%, and thus become the bottleneck of the performance since dm-thin
> workqueue is ordered unbound.

Yep, this is a big issue.  Take a look at dm-bio-prison-v2.h; this is the
new interface that we need to move dm-thin across to use (dm-cache
already uses it).  It allows concurrent holders of a cell (ie, read
locks), so we'll be able to remap much more io without handing it off to
a worker thread.  Once this is done I want to add an extra field to
cells that will cache the mapping; that way, if you acquire a cell that
is already held, you can avoid the expensive btree lookup.  Together
these changes should make a huge difference to the performance.

If you've got some spare coding cycles I'd love some help with this ;)

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: kernel BUG at drivers/md/persistent-data/dm-space-map-disk.c:178 with scsi_mod.use_blk_mq=y

2019-09-27 Thread Joe Thornber
Hi Eric,

On Thu, Sep 26, 2019 at 06:27:09PM +, Eric Wheeler wrote:
> I pvmoved the tmeta to an SSD logical volume (dm-linear) on a non-bcache 
> volume and we got the same trace this morning, so while the tdata still 
> passes through bcache, all meta operations are direct to an SSD. This is 
> still using multi-queue scsi, but dm_mod.use_blk_mq=N.
> 
> Since bcache is no longer involved with metadata operations, and since 
> this appears to be a metadata issue, are there any other reasons to 
> suspect bcache?

Did you recreate the pool, or are you just using the existing pool but with
a different IO path?  If it's the latter then there could still be something
wrong with the metadata, introduced while bcache was in the stack.

Would it be possible to send me a copy of the metadata device please so
I can double check the space maps (I presume you've run thin_check on it)?

[Assuming you're using the existing pool] Another useful experiment would
be to thin_dump and then thin_restore the metadata, which will create
totally fresh metadata; then see if you can still reproduce the issue.
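
Something like this (illustrative device names; restore onto a spare
metadata LV rather than over the original):

  $ thin_dump /dev/mapper/vg-pool_tmeta -o metadata.xml
  $ thin_restore -i metadata.xml -o /dev/mapper/vg-pool_tmeta_fresh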

Thanks,

- Joe


Re: [dm-devel] Why does dm-thin pool metadata space map use 4K page to carry index ?

2019-09-05 Thread Joe Thornber
On Thu, Sep 05, 2019 at 02:43:28PM +0800, jianchao wang wrote:
> But why does it use this 4K page instead of btree as the disk sm ?
> 
> The brb mechanism seem be able to avoid the nested block allocation
> when do COW on the metadata sm btree.
> 
> Would anyone please help to tell why does it use this 4K page instead of a
> btree ?

It's a long time since I wrote this, so I can't remember the order in
which things were written.  It may well be that the brb mechanism for
avoiding recursive allocations came after the on-disk formats were
defined.  Irrespective of that, the single page pointing to index pages
should perform better.
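
For what it's worth, the 16G figure falls straight out of this layout
(back of an envelope, ignoring block headers): a 4k bitmap block tracks
4096 * 4 = 16384 metadata blocks at 2 bits each, the single 4k index
page holds at most 255 bitmap entries, and 255 * 16384 * 4k comes to
just under 16G.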

Is the 16G limit to the metadata device causing you issues?

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH v2] dm thin: Fix bug wrt FUA request completion

2019-02-15 Thread Joe Thornber
Ack.

Thanks for this.  I was under the mistaken impression that FUA requests
got split by core dm into separate payload and PREFLUSH requests.

I've audited dm-cache and that looks ok.

How did you test this patch?  That missing bio_list_init() in V1 must
have caused memory corruption?

- Joe


On Fri, Feb 15, 2019 at 01:21:38AM +0200, Nikos Tsironis wrote:
> When provisioning a new data block for a virtual block, either because
> the block was previously unallocated or because we are breaking sharing,
> if the whole block of data is being overwritten the bio that triggered
> the provisioning is issued immediately, skipping copying or zeroing of
> the data block.
> 
> When this bio completes the new mapping is inserted in to the pool's
> metadata by process_prepared_mapping(), where the bio completion is
> signaled to the upper layers.
> 
> This completion is signaled without first committing the metadata. If
> the bio in question has the REQ_FUA flag set and the system crashes
> right after its completion and before the next metadata commit, then the
> write is lost despite the REQ_FUA flag requiring that I/O completion for
> this request is only signaled after the data has been committed to
> non-volatile storage.
> 
> Fix this by deferring the completion of overwrite bios, with the REQ_FUA
> flag set, after the metadata has been committed.
> 
> Signed-off-by: Nikos Tsironis 
> ---
>  drivers/md/dm-thin.c | 55 
> +++-
>  1 file changed, 50 insertions(+), 5 deletions(-)
> 
> Changes in v2:
>   - Add missing bio_list_init() in pool_create()
> 
> v1: https://www.redhat.com/archives/dm-devel/2019-February/msg00064.html
> 
> diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
> index ca8af21bf644..e83b63608262 100644
> --- a/drivers/md/dm-thin.c
> +++ b/drivers/md/dm-thin.c
> @@ -257,6 +257,7 @@ struct pool {
>  
>   spinlock_t lock;
>   struct bio_list deferred_flush_bios;
> + struct bio_list deferred_flush_completions;
>   struct list_head prepared_mappings;
>   struct list_head prepared_discards;
>   struct list_head prepared_discards_pt2;
> @@ -956,6 +957,39 @@ static void process_prepared_mapping_fail(struct 
> dm_thin_new_mapping *m)
>   mempool_free(m, &m->tc->pool->mapping_pool);
>  }
>  
> +static void complete_overwrite_bio(struct thin_c *tc, struct bio *bio)
> +{
> + struct pool *pool = tc->pool;
> + unsigned long flags;
> +
> + /*
> +  * If the bio has the REQ_FUA flag set we must commit the metadata
> +  * before signaling its completion.
> +  */
> + if (!bio_triggers_commit(tc, bio)) {
> + bio_endio(bio);
> + return;
> + }
> +
> + /*
> +  * Complete bio with an error if earlier I/O caused changes to the
> +  * metadata that can't be committed, e.g, due to I/O errors on the
> +  * metadata device.
> +  */
> + if (dm_thin_aborted_changes(tc->td)) {
> + bio_io_error(bio);
> + return;
> + }
> +
> + /*
> +  * Batch together any bios that trigger commits and then issue a
> +  * single commit for them in process_deferred_bios().
> +  */
> + spin_lock_irqsave(&pool->lock, flags);
> + bio_list_add(&pool->deferred_flush_completions, bio);
> + spin_unlock_irqrestore(&pool->lock, flags);
> +}
> +
>  static void process_prepared_mapping(struct dm_thin_new_mapping *m)
>  {
>   struct thin_c *tc = m->tc;
> @@ -988,7 +1022,7 @@ static void process_prepared_mapping(struct 
> dm_thin_new_mapping *m)
>*/
>   if (bio) {
>   inc_remap_and_issue_cell(tc, m->cell, m->data_block);
> - bio_endio(bio);
> + complete_overwrite_bio(tc, bio);
>   } else {
>   inc_all_io_entry(tc->pool, m->cell->holder);
>   remap_and_issue(tc, m->cell->holder, m->data_block);
> @@ -2317,7 +2351,7 @@ static void process_deferred_bios(struct pool *pool)
>  {
>   unsigned long flags;
>   struct bio *bio;
> - struct bio_list bios;
> + struct bio_list bios, bio_completions;
>   struct thin_c *tc;
>  
>   tc = get_first_thin(pool);
> @@ -2328,26 +2362,36 @@ static void process_deferred_bios(struct pool *pool)
>   }
>  
>   /*
> -  * If there are any deferred flush bios, we must commit
> -  * the metadata before issuing them.
> +  * If there are any deferred flush bios, we must commit the metadata
> +  * before issuing them or signaling their completion.
>*/
>   bio_list_init(&bios);
> + bio_list_init(&bio_completions);
> +
>   spin_lock_irqsave(&pool->lock, flags);
>   bio_list_merge(&bios, &pool->deferred_flush_bios);
>   bio_list_init(&pool->deferred_flush_bios);
> +
> + bio_list_merge(&bio_completions, &pool->deferred_flush_completions);
> + bio_list_init(&pool->deferred_flush_completions);
>   spin_unlock_irqrestore(&pool->lock, flags);
>  
> - if (bio_list_empty(&bios) &&
> + if (bio_list_empty(&bios) && bio_list_empty(&bio_completions) &&
>   

Re: [dm-devel] extracting thin mappings in real time

2018-10-04 Thread Joe Thornber
On Wed, Oct 03, 2018 at 04:47:41PM +0100, Thanos Makatos wrote:
> pool metadata object can return this information? I've started looking
> at thin_bio_map(), is this the best place to start?

See dm-thin-metadata.h
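
In particular, the lookup you want is roughly this (from memory of
drivers/md/dm-thin-metadata.h; check your tree for the exact form):

    int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
                           int can_issue_io,
                           struct dm_thin_lookup_result *result);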

- Joe


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] extracting thin mappings in real time

2018-10-03 Thread Joe Thornber
On Wed, Oct 03, 2018 at 03:13:36PM +0100, Thanos Makatos wrote:
> > Could you say more about why you want to do this?
> >
> 
> So that I can directly read the data block without having to pass through
> dm-thin, e.g. there might be a more direct datapath to the physical block
> device.
> 
> Mappings don't change if you're not using snapshots.  If you are, then any
> > write
> > to a shared block will trigger a copy on write operation, and subsequently
> > a new
> > mapping.
> 
> So for backups etc. I made the metadata snapshot available, which gives you
> > a
> > view of all the metadata frozen at a point in time.  If you're using the
> > metadata
> > snap you still need to guarantee the mappings for the devices that you're
> > interested
> > in are not changing.  eg, instead of backing up a live thin, take a
> > snapshot of the thin
> > and back this up instead.
> >
> 
> For now I can guarantee that these mappings don't change (no snapshots,
> backups, etc), so it's safe to extract the mapping(s) without having
> acquired the meradata snapshot, correct? If this is the case, then how can
> I extract individual mappins? (Either from kernel space or from user
> space.) Are there specific dm-thin kernel functions I should start looking
> at?

Why is this thin-specific?  What happens if there are two thin-pools under
your code?  Do you want the final physical location?  The whole dm stack?

Efficiently reading the metadata requires a cache of metadata blocks.
Thinp uses dm-bufio to provide this.  It seems very wasteful for you to
duplicate this.  So your options are:


i) querying the pool metadata object directly.  Currently there is no way
   to get hold of the metadata object; you would need to make any queries
   off the io path, since they can block.

ii) or examining the completed bios; you'd probably need to add some per
    bio data to give context.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] extracting thin mappings in real time

2018-10-03 Thread Joe Thornber
On Wed, Oct 03, 2018 at 01:40:22PM +0100, Thanos Makatos wrote:
> I have a kernel module that sits on top of a thin device mapper target that
> receives block I/O requests and re-submits then to the thin target. I would
> like to implement the following functionality: whenever I receive a write
> completion from the thin target (assuming that it's the first time a block
> written to) I would like to extract the newly-established mapping of that
> virtual block.
> 
> I know that I can do this using thin_dump, however this involves:
> (1) spawning a process
> (2) reserving/releasing a metadata snapshot, and
> (3) dumping _all_ the mappings.
> 
> In other words, it's far to heavyweight for my performance requirements.
> 
> Ideally I would like to be able to obtain the mapping in kernel space. I
> had a look at thin_dump and from what I understand it directly reads the
> B-tree from the disk? Is there some kernel function that already does this?
> E.g. given a thin LBA return the physical block address.
> 
> Also, regarding having to have reserved a metadata snapshot, is this
> necessary for obtaining mappings? Aren't mappings immutable once established?

Could you say more about why you want to do this?

Mappings don't change if you're not using snapshots.  If you are, then any write
to a shared block will trigger a copy on write operation, and subsequently a new
mapping.

So for backups etc. I made the metadata snapshot available, which gives
you a view of all the metadata frozen at a point in time.  If you're
using the metadata snap you still need to guarantee that the mappings for
the devices you're interested in are not changing.  eg, instead of
backing up a live thin, take a snapshot of the thin and back this up
instead.
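
Roughly, with illustrative names and device ids (see
Documentation/device-mapper/thin-provisioning.txt for the real details):

  $ dmsetup suspend /dev/mapper/thin                      # quiesce the origin
  $ dmsetup message /dev/mapper/pool 0 "create_snap 1 0"  # snap id 1 of origin id 0
  $ dmsetup resume /dev/mapper/thin
  $ dmsetup create thin-snap --table "0 <size> thin /dev/mapper/pool 1"

then back up /dev/mapper/thin-snap rather than the live thin.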

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm thin: data block's ref count is not zero but not belong to any device.

2018-09-27 Thread Joe Thornber
On Tue, Sep 25, 2018 at 11:13:17PM -0400, monty wrote:
> 
> Hi! I met a problem of dm-thin: a thin-pool has no volumes but its
> nr_free_blocks_data is not zero. I guess the scene of this problem like:
> a. create a thin volume thin01, size is 10GB;
> b. write 10GB to thin01;
> c. create a snapshot thin01-snap, orig is thin01;
> d. write 10GB to thin01-snap;
> e. for a data block of thin01-snap, which will be written by user, will
> experience following steps:
>   1. alloc a data block m firstly;
>   2. wirte data to data block m;
>   3. bio callback and assign data block m to thin01-snap.
> If power loss happens between step 2 and step 3, data block will not
> belong to any device but its ref count is not zero.
> Is it a problem? How to solve this problem?

Yes, there are a couple of ways this can happen; a dump and restore of
the metadata will repair it.

- Joe



> 
> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm thin: superblock write may succeed before other metadata blocks because of writing metadata in async mode.

2018-06-20 Thread Joe Thornber
On Wed, Jun 20, 2018 at 01:03:57PM -0400, monty wrote:
> Hi, Mike and Joe. Thanks for your reply. I read __commit_transaction
> many times and didn't find any problem of 2-phase commit. I use
> md-raid1(PCIe nvme and md-raid5) in write-behind mode to store dm-thin
> metadata.
> Test case:
> 1. I do copy-diff test on thin device and then reboot my machine.
> 2. Rebuild our device stack and exec "vgchang -ay".
> The thin-pool can not be established(details_root become a bitmap node
> and metadata's bitmap_root become a btree_node).

As you simplify your setup, does the problem go away?  eg, turn off 
write-behind, use just the nvme dev etc.

The only effect of your change is to call flush twice rather than once.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm thin: superblock write may succeed before other metadata blocks because of writing metadata in async mode.

2018-06-19 Thread Joe Thornber
On Tue, Jun 19, 2018 at 09:11:06AM -0400, Mike Snitzer wrote:
> On Mon, May 21 2018 at  8:53pm -0400,
> Monty Pavel  wrote:
> 
> > 
> > If dm_bufio_write_dirty_buffers func is called by __commit_transaction
> > func and power loss happens during executing it, coincidencely
> > superblock wrote correctly but some metadata blocks didn't. The reason
> > is we write all metadata in async mode. We can guarantee that we send
> > superblock after other blocks but we cannot guarantee that superblock
> > write completely early than other blocks.
> > So, We need to commit other metadata blocks before change superblock.
> > 
> > Signed-off-by: Monty Pavel 
> > ---
> >  drivers/md/dm-thin-metadata.c |8 
> >  1 files changed, 8 insertions(+), 0 deletions(-)
> > 
> > diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
> > index 36ef284..897d7d6 100644
> > --- a/drivers/md/dm-thin-metadata.c
> > +++ b/drivers/md/dm-thin-metadata.c
> > @@ -813,6 +813,14 @@ static int __commit_transaction(struct 
> > dm_pool_metadata *pmd)
> > if (r)
> > return r;
> >  
> > +   r = dm_tm_commit(pmd->tm, sblock);
> > +   if (r)
> > +   return r;
> > +
> > +   r = superblock_lock(pmd, &sblock);
> > +   if (r)
> > +   return r;
> > +
> > disk_super = dm_block_data(sblock);
> > disk_super->time = cpu_to_le32(pmd->time);
> > disk_super->data_mapping_root = cpu_to_le64(pmd->root);

I don't believe you've tested this; sblock is passed to dm_tm_commit()
uninitialised, and you didn't even bother to remove the later (and correct)
call to dm_tm_commit().  See dm-transaction-manager.h for an explanation of
how the 2-phase commit works.
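
Condensed, the shape __commit_transaction() is aiming for is:

    r = dm_tm_pre_commit(pmd->tm);     /* phase 1: flush everything
                                          except the superblock */
    if (r < 0)
        return r;

    r = superblock_lock(pmd, &sblock); /* only now take the superblock lock */
    if (r)
        return r;

    disk_super = dm_block_data(sblock);
    /* ... fill in the superblock fields ... */

    return dm_tm_commit(pmd->tm, sblock); /* phase 2: write + flush the
                                             superblock, releasing it */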

What is the issue that started you looking in this area?

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm-thin: Why is DATA_DEV_BLOCK_SIZE_MIN_SECTORS set to 64k?

2018-06-12 Thread Joe Thornber
On Sat, Jun 09, 2018 at 07:31:54PM +, Eric Wheeler wrote:
> I understand the choice.  What I am asking is this: would it be safe to 
> let others make their own choice about block size provided they are warned 
> about the metadata-chunk-size/pool-size limit tradeoff?
> 
> If it is safe, can we relax the restriction?  For example, 16k chunks 
> still enables ~4TB pools, but with 1/4th of the CoW IO overhead on heavily 
> snapshotted environments.

Yes, it would be safe.  There are downsides though; all io gets split
on block size boundaries, so dropping to 16k or smaller could
seriously increase the cpu usage.  Smaller blocks also mean more
mappings, more metadata, more kernel memory consumption.
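
To put numbers on that last point: a fully provisioned 4T thin needs
about 64 million mappings at 64k blocks, but about 256 million at 16k,
so a 4x multiplier on the mapping btree, on the metadata size and on
the in-core structures.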

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


[dm-devel] Patch [1/1] Fix bug in btree_split_beneath()

2017-12-20 Thread Joe Thornber
[dm-thin] Fix bug in btree_split_beneath()

When inserting a new key/value pair into a btree we walk down the spine of
btree nodes performing the following 2 operations:

  i) making space for a new entry
  ii) adjusting the first key entry if the new key is lower than any in
      the node.

If the _root_ node is full, the function btree_split_beneath() allocates 2 new
nodes, and redistributes the root node's entries between them.  The root node is
left with 2 entries corresponding to the 2 new nodes.

btree_split_beneath() then adjusts the spine to point to one of the two new
children.  This means the first key is never adjusted if the new key was lower,
ie. operation (ii) gets missed out.  This can result in the new key being
'lost' for a period, until another low-valued key is inserted that
uncovers it.  A serious bug, and quite hard to trigger in normal use.

Test case here:


https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

This patch fixes the issue by changing btree_split_beneath() so it no longer
adjusts the spine.  Instead it unlocks both the new nodes, and lets the main
loop in btree_insert_raw() relock the appropriate one and make any necessary
adjustments.


This bug was identified by Monty Pavel.

diff --git a/drivers/md/persistent-data/dm-btree.c 
b/drivers/md/persistent-data/dm-btree.c
index 416060c2570..ee2c01685f5 100644
--- a/drivers/md/persistent-data/dm-btree.c
+++ b/drivers/md/persistent-data/dm-btree.c
@@ -572,23 +572,8 @@ static int btree_split_beneath(struct shadow_spine *s, 
uint64_t key)
pn->keys[1] = rn->keys[0];
memcpy_disk(value_ptr(pn, 1), &val, sizeof(__le64));
 
-   /*
-* rejig the spine.  This is ugly, since it knows too
-* much about the spine
-*/
-   if (s->nodes[0] != new_parent) {
-   unlock_block(s->info, s->nodes[0]);
-   s->nodes[0] = new_parent;
-   }
-   if (key < le64_to_cpu(rn->keys[0])) {
-   unlock_block(s->info, right);
-   s->nodes[1] = left;
-   } else {
-   unlock_block(s->info, left);
-   s->nodes[1] = right;
-   }
-   s->count = 2;
-
+   unlock_block(s->info, left);
+   unlock_block(s->info, right);
return 0;
 }
 

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] persistent-data: fix bug about btree of updating internal node's minima key in btree_split_beneath.

2017-12-19 Thread Joe Thornber
On Mon, Dec 18, 2017 at 05:34:09PM +, Joe Thornber wrote:
> Patch below.  This is completely untested.  I'll test tomorrow and update.

The patch appears to work.  I'm using this test to reproduce the problem:

   
https://github.com/jthornber/thin-provisioning-tools/blob/master/functional-tests/device-mapper/dm-tests.scm#L593

I need to do some more regression testing before I send it upstream.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] persistent-data: fix bug about btree of updating internal node's minima key in btree_split_beneath.

2017-12-18 Thread Joe Thornber
On Mon, Dec 18, 2017 at 05:13:08PM +, Joe Thornber wrote:
> Hi Monty,
> 
> On Mon, Dec 18, 2017 at 04:27:58PM -0500, monty wrote:
> > Subject: [PATCH] persistent-data: fix bug about btree of updating internal 
> > node's minima
> >  key in btree_split_beneath.
> > 
> >  fix bug about btree_split_beneath func, this bug may cause a key had
> >  been inserted to btree, but dm_btree_lookup can not find the key in
> >  btree later.
> 
> I think you've spotted a real issue, but I don't like where you've
> fixed up the btree_split_beneath() function.  I'll post a patch in a
> bit.

Patch below.  This is completely untested.  I'll test tomorrow and update.

Instead of fiddling with the spine we unlock both the new children and
let the current spine entry (which now has just two entries) continue.
That way the bound on the lowest key is adjusted within the main loop
of btree_insert_raw().

- Joe



diff --git a/drivers/md/persistent-data/dm-btree.c 
b/drivers/md/persistent-data/dm-btree.c
index 416060c2570..ee2c01685f5 100644
--- a/drivers/md/persistent-data/dm-btree.c
+++ b/drivers/md/persistent-data/dm-btree.c
@@ -572,23 +572,8 @@ static int btree_split_beneath(struct shadow_spine *s, 
uint64_t key)
pn->keys[1] = rn->keys[0];
memcpy_disk(value_ptr(pn, 1), &val, sizeof(__le64));

-   /*
-* rejig the spine.  This is ugly, since it knows too
-* much about the spine
-*/
-   if (s->nodes[0] != new_parent) {
-   unlock_block(s->info, s->nodes[0]);
-   s->nodes[0] = new_parent;
-   }
-   if (key < le64_to_cpu(rn->keys[0])) {
-   unlock_block(s->info, right);
-   s->nodes[1] = left;
-   } else {
-   unlock_block(s->info, left);
-   s->nodes[1] = right;
-   }
-   s->count = 2;
-
+   unlock_block(s->info, left);
+   unlock_block(s->info, right);
return 0;
}



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] persistent-data: fix bug about btree of updating internal node's minima key in btree_split_beneath.

2017-12-18 Thread Joe Thornber
Hi Monty,

On Mon, Dec 18, 2017 at 04:27:58PM -0500, monty wrote:
> Subject: [PATCH] persistent-data: fix bug about btree of updating internal 
> node's minima
>  key in btree_split_beneath.
> 
>  fix bug about btree_split_beneath func, this bug may cause a key had
>  been inserted to btree, but dm_btree_lookup can not find the key in
>  btree later.

I think you've spotted a real issue, but I don't like where you've
fixed up the btree_split_beneath() function.  I'll post a patch in a
bit.

How did you recreate this bug?

Thanks,

- Joe


> 
> Signed-off-by: monty 
> ---
>  drivers/md/persistent-data/dm-btree.c |2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/md/persistent-data/dm-btree.c 
> b/drivers/md/persistent-data/dm-btree.c
> index f21ce6a..25ce5ec 100644
> --- a/drivers/md/persistent-data/dm-btree.c
> +++ b/drivers/md/persistent-data/dm-btree.c
> @@ -694,6 +694,8 @@ static int btree_split_beneath(struct shadow_spine *s, 
> uint64_t key)
>   if (key < le64_to_cpu(rn->keys[0])) {
>   unlock_block(s->info, right);
>   s->nodes[1] = left;
> + if (key < le64_to_cpu(ln->keys[0]))
> + ln->keys[0] = cpu_to_le64(key);
>   } else {
>   unlock_block(s->info, left);
>   s->nodes[1] = right;
> -- 
> 1.7.1
> 

> --
> dm-devel mailing list
> dm-devel@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Significantly dropped dm-cache performance in 4.13 compared to 4.11

2017-11-14 Thread Joe Thornber
On Mon, Nov 13, 2017 at 02:01:11PM -0500, Mike Snitzer wrote:
> On Mon, Nov 13 2017 at 12:31pm -0500,
> Stefan Ring <stefan...@gmail.com> wrote:
> 
> > On Thu, Nov 9, 2017 at 4:15 PM, Stefan Ring <stefan...@gmail.com> wrote:
> > > On Tue, Nov 7, 2017 at 3:41 PM, Joe Thornber <thorn...@redhat.com> wrote:
> > >> On Fri, Nov 03, 2017 at 07:50:23PM +0100, Stefan Ring wrote:
> > >>> It strikes me as odd that the amount read from the spinning disk is
> > >>> actually more than what comes out of the combined device in the end.
> > >>
> > >> This suggests dm-cache is trying to promote way too much.
> > >> I'll try and reproduce the issue, your setup sounds pretty straight 
> > >> forward.
> > >
> > > I think it's actually the most straight-forward you can get ;).
> > >
> > > I've also tested kernel 4.12 in the meantime, which behaves just like
> > > 4.13. So the difference in behavior seems to have been introduced
> > > somewhere between 4.11 and 4.12.
> > >
> > > I've also done plain dd from the dm-cache disk to /dev/null a few
> > > times, which wrote enormous amounts of data to the SDD. My poor SSD
> > > has received the same amount of writes during the last week that it
> > > has had to endure during the entire previous year.
> > 
> > Do you think it would make a difference if I removed and recreated the 
> > cache?
> > 
> > I don't want to fry my SSD any longer. I've just copied several large
> > files into the dm-cached zfs dataset, and while reading them back
> > immediately afterwards, the SSD started writing crazy amounts again.
> > In my understanding, linear reads should rarely end up on the cache
> > device, but that is absolutely not what I'm experiencing.
> 
> Joe tried to reproduce your reported issue today and couldn't.

I'm not sure what's going on here.  Would you mind sending me the
metadata please?  Either a cache_dump of it, or a copy of the metadata
dev?

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Significantly dropped dm-cache performance in 4.13 compared to 4.11

2017-11-07 Thread Joe Thornber
On Fri, Nov 03, 2017 at 07:50:23PM +0100, Stefan Ring wrote:
> It strikes me as odd that the amount read from the spinning disk is
> actually more than what comes out of the combined device in the end.

This suggests dm-cache is trying to promote way too much.
I'll try and reproduce the issue, your setup sounds pretty straight forward.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm-cache coherence issue

2017-06-27 Thread Joe Thornber
On Mon, Jun 26, 2017 at 10:36:23PM +0200, Johannes Bauer wrote:
> On 26.06.2017 21:56, Mike Snitzer wrote:
> 
> >> Interesting, I did *not* change to writethrough. However, there
> >> shouldn't have been any I/O on the device (it was not accessed by
> >> anything after I switched to the cleaner policy).
> [...]
> >> Anyways, I'll try to replicate my scenario again because I'm actually
> >> quite sure that I did everything correctly (I did it a few times).
> > 
> > Except you didn't first switch to writethrough -- which is _not_
> > correct.
> 
> Absolutely, very good to know. So even without any I/O being requested,
> dm-cache is allowed to "hold back" pages as long as the dm-cache device
> is in writeback mode? Would this also explain why the "dmsetup wait"
> hung indefinitely?

Some things to try to see if it makes a difference:

- unmount the cache before checksumming it so we know there's no IO from
  the page cache going in.

- deactivate the cache before checksumming the origin.

- Stop using encryption on top of cache and see if that makes a
  difference.

- Use 'dmsetup wait' properly, as Mike mentioned.  See the following code
  from dmtest:


https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/cache_utils.rb#L28

https://github.com/jthornber/device-mapper-test-suite/blob/master/lib/dmtest/device-mapper/event_tracker.rb

- use md5sum rather than your checksum program.  Humour me.



- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm-cache coherence issue

2017-06-26 Thread Joe Thornber
On Mon, Jun 26, 2017 at 12:33:42PM +0100, Joe Thornber wrote:
> On Sat, Jun 24, 2017 at 03:56:54PM +0200, Johannes Bauer wrote:
> > So I seem to have a very basic misunderstanding of what the cleaner
> > policy/dirty pages mean. Is there a way to force the cache to flush
> > entirely? Apparently, "dmsetup wait" and/or "sync" don't do the job.
> 
> I'll try and reproduce your scenario and get back to you.

Here's a similar scenario that I've added to the dm test suite:

https://github.com/jthornber/device-mapper-test-suite/commit/457e889b0c4d510609c0d7464af07f2ebee20768

It goes through all the steps you need to use to decommission a cache
with dirty blocks.  Namely:

- switch to writethrough mode (so new io can't create new dirty blocks)
- switch to the cleaner policy
- wait for clean
- remove the cache device before accessing the origin directly
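
In dmsetup terms that's roughly (a sketch; assumes a cache device named
'cache' and elides the real table fields):

  $ dmsetup reload cache --table \
      "0 <len> cache <metadata> <ssd> <origin> <blocksize> 1 writethrough cleaner 0"
  $ dmsetup suspend cache && dmsetup resume cache   # swap the new table in
  $ # poll 'dmsetup status cache' until the dirty count reaches 0
  $ dmsetup remove cache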

I've run it with a recent kernel, and one from about a year ago and it
passes fine.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] dm-cache coherence issue

2017-06-26 Thread Joe Thornber
On Sat, Jun 24, 2017 at 03:56:54PM +0200, Johannes Bauer wrote:
> So I seem to have a very basic misunderstanding of what the cleaner
> policy/dirty pages mean. Is there a way to force the cache to flush
> entirely? Apparently, "dmsetup wait" and/or "sync" don't do the job.

Your understanding is correct.  There is a tool in the latest thinp
tools release called cache_writeback that does offline decommissioning
of a cache.

I'll try and reproduce your scenario and get back to you.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [RFC] dm-thin: Heuristic early chunk copy before COW

2017-03-09 Thread Joe Thornber
Hi Eric,

On Wed, Mar 08, 2017 at 10:17:51AM -0800, Eric Wheeler wrote:
> Hello all,
> 
> For dm-thin volumes that are snapshotted often, there is a performance 
> penalty for writes because of COW overhead since the modified chunk needs 
> to be copied into a freshly allocated chunk.
> 
> What if we were to implement some sort of LRU for COW operations on 
> chunks? We could then queue chunks that are commonly COWed within the 
> inter-snapshot interval to be background copied immediately after the next 
> snapshot. This would hide the latency and increase effective throughput 
> when the thin device is written by its user since only the meta data would 
> need an update because the chunk has already been copied.
> 
> I can imagine a simple algorithm where the COW increments the chunk LRU by 
> 2, and decrements the LRU by 1 for all stored LRUs when the volume is 
> snapshotted. After the snapshot, any LRU>0 would be queued for early copy.
> 
> The LRU would be in memory only, probably stored in a red/black tree. 
> Pre-copied chunks would not update on-disk meta data unless a write occurs 
> to that chunk. The allocator would need to be updated to ignore chunks 
> that are in the LRU list which have been pre-copied (perhaps except in the 
> case of pool free space exhaustion).
> 
> Does this sound viable?

Yes, I can see that it would benefit some people, and presumably we'd
only turn it on for those people.  Random thoughts:

- I'm doing a lot of background work in the latest version of dm-cache
  in idle periods and it certainly pays off.

- There can be a *lot* of chunks, so holding a counter for all chunks in
  memory is not on.  (See the hassle I had squeezing stuff into memory
  of dm-cache).

- Commonly cloned blocks can be gleaned from the metadata.  eg, by
  walking the metadata for two snapshots and taking the common ones.
  It might be possible to come up with a 'commonly used set' once, and
  then keep using it for all future snaps.

- Doing speculative work like this makes it harder to predict
  performance.  At the moment any expense (ie. copy) is incurred
  immediately as the triggering write comes in.

- Could this be done from userland?  Metadata snapshots let userland see
  the mappings; alternatively dm-era lets userland track where io has
  gone.  A simple read then write of a block would trigger the sharing
  to be broken.
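
  For example, something like this (sketch; block size, names and $N
  are illustrative):

    $ dd if=/dev/mapper/thin of=blk bs=512k count=1 skip=$N iflag=direct
    $ dd if=blk of=/dev/mapper/thin bs=512k count=1 seek=$N oflag=direct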


- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] thin_dump: added --device-id, --skip-mappings, and new output --format's

2016-03-19 Thread Joe Thornber
On Tue, Mar 15, 2016 at 10:59:15AM +, Thanos Makatos wrote:
> On 15 March 2016 at 01:45, Eric Wheeler  wrote:
> > Hi Joe,
> >
> > Please review the patch below when you have a moment.  I am interested in
> > your feedback, and also interested in having this functionality merged
> > upstream.  This was written against thin-provisioning-tools.git tag
> > v0.5.6.
> >
> > We use thin_dump on live dm-thin metadata snapshots all the time.  In our
> > case, we want to dump the XML for new (snapshot) volumes instead of
> > dumping the entire 16gb metadata device (37.8% used) which takes ~20-30
> > minutes instead of ~5 seconds for a single volume with --device-id.
> 
> I'm interested in extracting the mappings of a particular device, too.
> In fact, I've implemented this (along with extracting the mappings in
> binary format) and have only recently opened a PR:
> https://github.com/jthornber/thin-provisioning-tools/pull/49. It seems
> that we've replicated some work here, I'm not sure whether we're
> supposed to send patches to this list or open PR on github.

I've not forgotten your patches, just haven't got round to them yet.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] thin_dump: added --device-id, --skip-mappings, and new output --format's

2016-03-19 Thread Joe Thornber
If you're skipping the mappings, does the new thin_ls provide enough
information for you?

- Joe

On Tue, Mar 15, 2016 at 01:45:15AM +, Eric Wheeler wrote:
> Hi Joe,
> 
> Please review the patch below when you have a moment.  I am interested in 
> your feedback, and also interested in having this functionality merged 
> upstream.  This was written against thin-provisioning-tools.git tag 
> v0.5.6.
> 
> We use thin_dump on live dm-thin metadata snapshots all the time.  In our 
> case, we want to dump the XML for new (snapshot) volumes instead of 
> dumping the entire 16gb metadata device (37.8% used) which takes ~20-30 
> minutes instead of ~5 seconds for a single volume with --device-id.
> 
> I started by adding --device-id to pass the `block_address dev_id` and 
> skip the call to emit_mappings() within mapping_tree_emitter::visit() 
> unless the dev_id matches --device-id as passed to thin_dump.  
> 
> It is sometimes nice to list all of the device supers without waiting for 
> emit_mappings() so I added the --skip-mappings option.  This allows you to 
> get the dev_id without reading the mappings:
>   thin_dump --skip-mappings -m /dev/mapper/vg-p_tmeta
> 
> Like `bool repair`, I added skip_mappings and dev_id as arguments to 
> thin_provisioning::metadata_dump() and passed them down to 
> mapping_tree_emitter with new members set_skip_mappings() and set_dev_id() 
> to set private attributes (default values are assigned in 
> metadata_dumper.h in case of backward compatible calls).
> 
> We work with the device metadata in a simplified format, and without 
> superblock information:
> 
>   origin_offset:length:tdata_offset (hence, "o:L:d" or "old") 
> 
> Therefore, the output --format 'old' was added.  I added the format 'null' 
> as well for benchmarking purposes so the ostream needn't write large 
> volumes of text to /dev/null.  
> 
> Benchmarks of the various formats on our sample metadata snapshot on an 
> idle machine:
> 
> for i in xml human_readable old null; do
>   time thin_dump -m -f $i /dev/mapper/vg-p_tmeta -o /dev/null
> done
> 
>             real      user      sys
> xml     29:01.27  25:35.87  01:54.49
> human   27:39.81  24:12.90  01:52.39
> old     27:07.71  23:40.64  01:52.97
> null    23:39.38  21:17.22  00:49.04
> 
> I have this as a branch in bitbucket if you prefer to see it online:
>   
> https://bitbucket.org/ewheelerinc/thin-provisioning-tools/branch/v0.5.6-device-id
>   or
> git pull https://bitbucket.org/ewheelerinc/thin-provisioning-tools.git 
> v0.5.6-device-id
> 
> 
> --
> Eric Wheeler
> 
> 
> diff --git a/thin-provisioning/thin_dump.cc b/thin-provisioning/thin_dump.cc
> index 191acb5..7df5d85 100644
> --- a/thin-provisioning/thin_dump.cc
> +++ b/thin-provisioning/thin_dump.cc
> @@ -22,6 +22,8 @@
>  #include 
>  
>  #include "human_readable_format.h"
> +#include "old_format.h"
> +#include "null_format.h"
>  #include "metadata_dumper.h"
>  #include "metadata.h"
>  #include "xml_format.h"
> @@ -36,6 +38,8 @@ using namespace thin_provisioning;
>  struct flags {
>   bool find_metadata_snap;
>   bool repair;
> + block_address dev_id;
> + bool skip_mappings;
>  };
>  
>  namespace {
> @@ -49,12 +53,16 @@ namespace {
>   e = create_xml_emitter(out);
>   else if (format == "human_readable")
>   e = create_human_readable_emitter(out);
> + else if (format == "old")
> + e = create_old_emitter(out);
> + else if (format == "null")
> + e = create_null_emitter(out);
>   else {
>   cerr << "unknown format '" << format << "'" << 
> endl;
>   exit(1);
>   }
>  
> - metadata_dump(md, e, flags.repair);
> + metadata_dump(md, e, flags.repair, flags.skip_mappings, 
> flags.dev_id);
>  
>   } catch (std::exception &e) {
>   cerr << e.what() << endl;
> @@ -77,10 +85,12 @@ namespace {
>   out << "Usage: " << cmd << " [options] {device|file}" << endl
>   << "Options:" << endl
>   << "  {-h|--help}" << endl
> - << "  {-f|--format} {xml|human_readable}" << endl
> + << "  {-f|--format} {xml|human_readable|old|null}" << endl
>   << "  {-r|--repair}" << endl
>   << "  {-m|--metadata-snap} [block#]" << endl
>   << "  {-o }" << endl
> + << "  {-d device_id}" << endl
> + << "  {-s|--skip-mappings}" << endl
>   << "  {-V|--version}" << endl;
>   }
>  }
> @@ -89,17 +99,20 @@ int thin_dump_main(int argc, char **argv)
>  {
>   int c;
>   char const *output = NULL;
> - const char shortopts[] = "hm::o:f:rV";
> + const 

Re: [dm-devel] dm-cache: blocks don't get cached on 3.18.21-17.el6.x86_64

2016-03-14 Thread Joe Thornber
On Mon, Mar 14, 2016 at 09:54:06AM +, Thanos Makatos wrote:
> (I've already reported this issue to centos and centos-devel. and
> waited long enough but didn't get any reply.)
> 
> I'm evaluating dm-cache on CentOS 6 kernels  3.18.21-17.el6.x86_64 (Xen 4) and
> 2.6.32-573.7.1.el6.x86_64 (KVM). The test I do is a simple sequential
> read using dd(1). read_promote_adjustment and sequential_threshold
> have been set to 1 and 0, respectively. For the 2.6.32 kernel, all
> seems to be working fine, "#usedcacheblocks" correctly reflects the
> number of cached blocks based on what I'm reading with dd(1),
> performance is pretty much native SSD performance. However, the same
> test on the 3.18.21-17.el6.x86_64 kernel results in "#usedcacheblocks"
> being stuck to "2" and no performance improvement is observed.
> 
> Any ideas what could be wrong? How can I further debug this?

There may not be anything wrong; dm-cache is very cagey these days
about promoting blocks to the ssd without evidence that they're
hotspots.  Hitting blocks once with a dd is not enough.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [PATCH] [dm-cache] Make the mq policy an alias for smq

2016-02-11 Thread Joe Thornber
On Wed, Feb 10, 2016 at 12:06:00PM -0500, John Stoffel wrote:
> 
> Can you add in some documentation on how you tell which dm_cache
> policy is actually being used, and how to measure it, etc?  It's a
> black box and some info would be nice.

You can get some stats on the cache performance via the status ioctl
(eg, dmsetup status ...).  This will tell you about
promotions/demotions to the cache, write hits, read hits etc.
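
From memory the status line packs the fields in this order (check
Documentation/device-mapper/cache.txt for the authoritative list):

  <metadata block size> <#used>/<#total metadata blocks>
  <cache block size> <#used>/<#total cache blocks>
  <#read hits> <#read misses> <#write hits> <#write misses>
  <#demotions> <#promotions> <#dirty> ...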

There is documentation on cache policies in 
Documentation/device-mapper/cache-policies.txt

https://github.com/jthornber/linux-2.6/blob/2016-02-10-thin-dev-on-4.4/Documentation/device-mapper/cache-policies.txt

As for knowing which policy is running: I'm not sure what to say.
It'll be the one you ask for.  If the above patch goes in, then
there'll be a kernel version where mq becomes the same as smq.  I'll
bump the policy version number to make it clear that mq has undergone
a big change.

- Joe

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] I/O block when removing thin device on the same pool

2016-01-29 Thread Joe Thornber
On Fri, Jan 29, 2016 at 03:50:31PM +0100, Lars Ellenberg wrote:
> On Fri, Jan 22, 2016 at 04:43:46PM +0000, Joe Thornber wrote:
> > On Fri, Jan 22, 2016 at 02:38:28PM +0100, Lars Ellenberg wrote:
> > > We have seen lvremove of thin snapshots sometimes minutes,
> > > even ~20 minutes before.
> > 
> > I did some work on speeding up thin removal in autumn '14, in
> > particular agressively prefetching metadata pages sped up the tree
> > traversal hugely.  Could you confirm you're seeing pauses of this
> > duration with currently kernels please?
> 
> There is 
> https://bugzilla.redhat.com/show_bug.cgi?id=990583
> Bug 990583 - lvremove of thin snapshots takes 5 to 20 minutes (single
> core cpu bound?) 
> 
> From August 2013, closed by you in October 2015,
> as "not a bug", also pointing to meta data prefetch.
> 
> Now, you tell me, how prefetching meta data (doing disk IO
> more efficiently) helps with something that is clearly CPU bound
> (eating 100% single core CPU traversing whatever)...
> 
> Reason I mention this bug again here is:
> there should be a lvm thin meta data dump in there,
> which you could use for benchmarking improvements yourself.

There is no metadata dump attached to that bug.  I do benchmark stuff
myself, and I found prefetching to make a big difference (obviously
I'm not cpu bound like you).  We all have different hardware, which is
why I ask people with more real world scenarios to test stuff separately.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel