Re: [linux-lvm] exposing snapshot block device
On Tue, 22 Oct 2019, Gionatan Danti wrote:

> The main thing that somewhat scares me is that (if things have not
> changed) thinvol uses a single root btree node: losing it means losing
> *all* thin volumes of a specific thin pool. Coupled with the fact that
> metadata dumps are not as handy as with the old LVM code (no
> vgcfgrestore), it worries me.

If you can find all the leaf nodes belonging to the root (in my btree database they are marked with the root id and can be found by a sequential scan of the volume), then reconstructing the btree data is straightforward - even in place.

I remember realizing this was the only way to recover a major customer's data - and had the utility written, tested, and applied in a 36-hour programming marathon (which I hope never to repeat).

If this hasn't occurred to the thin pool programmers, I am happy to flesh out the procedure. Having such a utility available as a last resort would ratchet up the reliability of thin pools.

--
Stuart D. Gathman "Confutatis maledictis, flamis acribus addictis" - background song for a Microsoft sponsored "Where do you want to go from here?" commercial.

___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
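The leaf-scan rescue described above can be sketched as a small simulation. The node layout here is invented for illustration (real dm-thin metadata has its own on-disk btree format and would need the thin-provisioning-tools to parse); the point is that if every leaf carries its root id, losing the root and interior nodes loses no mappings:

```python
# Minimal simulation of rebuilding a btree's content from its leaf nodes.
# Hypothetical layout: each node records its root id, whether it is a leaf,
# and (for leaves) its key/value entries. Interior nodes (including the
# root) can be lost entirely: a sequential scan recovers every leaf.

def rescue(volume, root_id):
    """Collect all leaf nodes tagged with root_id and rebuild the mapping."""
    recovered = {}
    for node in volume:                      # sequential scan of the volume
        if node["root"] == root_id and node["leaf"]:
            recovered.update(node["entries"])
    return dict(sorted(recovered.items()))

# A toy "volume": two trees interleaved; interior nodes assumed unreadable.
volume = [
    {"root": 1, "leaf": True,  "entries": {0: "a", 1: "b"}},
    {"root": 2, "leaf": True,  "entries": {0: "x"}},
    {"root": 1, "leaf": False, "entries": None},   # lost interior node, ignored
    {"root": 1, "leaf": True,  "entries": {2: "c"}},
]

print(rescue(volume, 1))   # all of tree 1's mappings survive
```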
Re: [linux-lvm] exposing snapshot block device
Hi,

Il 22-10-2019 18:15 Stuart D. Gathman ha scritto:
> "Old" snapshots are exactly as efficient as thin when there is exactly
> one. They only get inefficient with multiple snapshots. On the other
> hand, thin volumes are as inefficient as an old LV with one snapshot.
> An old LV is as efficient, and as anti-fragile, as a partition. Thin
> volumes are much more flexible, but depend on much more fragile
> database-like metadata.

This is both true and false: while in the single-snapshot case performance remains acceptable even with fat snapshots, the btree representation (and more modern code) of the "new" (7+ years old now) thin snapshots guarantees significantly higher performance, at least in my tests.

Note #1: I know that the old snapshot code uses 4K chunks by default, versus the 64K chunks of thinsnap. That said, I recorded higher thinsnap performance even when using a 64K chunk size for old fat snapshots.

Note #2: I generally disable thinpool zeroing (as I use a filesystem layer on top of thin volumes).

I 100% agree that the old LVM code, with its plain-text metadata and continuous plain-text backups, is extremely reliable and easy to fix/correct. For this reason, I always prefer "old" LVs when the functionality of thin LVs is not actually needed. I can even manually recover from trashed metadata by editing it, as it is human-readable text. My main use of fat logical volumes is for boot and root filesystems, while thin vols (and zfs datasets, but this is another story...) are used for data partitions.

The main thing that somewhat scares me is that (if things have not changed) thinvol uses a single root btree node: losing it means losing *all* thin volumes of a specific thin pool. Coupled with the fact that metadata dumps are not as handy as with the old LVM code (no vgcfgrestore), it worries me.

> The "rollforward" must be applied to the backup image of the snapshot.
> If the admin gets it paired with the wrong backup, massive corruption
> ensues. This could be automated. E.g. the full image backup and
> external cow would have unique matching names. Or the full image
> backup could compute an md5 in parallel, which would be stored with
> the cow. But none of those tools currently exist.

This is the reason why I have not used thin_delta in production: an error on my part in recovering the volume (i.e. applying the wrong delta) would cause massive data corruption.

My current setup for instant recovery *and* added resilience is somewhat similar to that: RAID -> DRBD -> THINPOOL -> THINVOL w/periodic snapshots (with the DRBD layer replicating to a sibling machine).

Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8
Re: [linux-lvm] exposing snapshot block device
On Tue, 22 Oct 2019, Zdenek Kabelac wrote:

> Dne 22. 10. 19 v 17:29 Dalebjörk, Tomas napsal(a):
>> But, it would be better if the cow device could be recreated in a
>> faster way, mentioning that all blocks are present on an external
>> device, so that the LV volume can be restored much quicker using the
>> "lvconvert --merge" command.
>
> I do not want to break your imagination here, but that is exactly the
> thing you can do with thin provisioning and the thin_delta tool.

lvconvert --merge does a "rollback" to the point at which the snapshot was taken. The master LV already has current data. What Tomas wants is to be able to do a "rollforward" from the point at which the snapshot was taken. He also wants to be able to put the cow volume on an external/remote medium, and add a snapshot using an already existing cow. This way, restoring means copying the full volume from backup, creating a snapshot using the existing external cow, then "lvconvert --merge" instantly (logically) applies the cow changes while updating the master LV.

Pros:

"Old" snapshots are exactly as efficient as thin when there is exactly one. They only get inefficient with multiple snapshots. On the other hand, thin volumes are as inefficient as an old LV with one snapshot. An old LV is as efficient, and as anti-fragile, as a partition. Thin volumes are much more flexible, but depend on much more fragile database-like metadata. For this reason, I always prefer "old" LVs when the functionality of thin LVs is not actually needed. I can even manually recover from trashed metadata by editing it, as it is human-readable text.

Updates to the external cow can be pipelined (but then properly handling reads becomes non-trivial - there are mature remote block device implementations for Linux that will do the job).

Cons:

For the external cow to be useful, updates to it must be *strictly* serialized. This is doable, but not as obvious or trivial as it might seem at first glance. (Remote block device software will take care of this as well.)
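The strict-serialization requirement can be illustrated with a toy receiver that tags each cow update with a sequence number (an illustrative scheme of mine, not anything LVM or DRBD actually implements): applying updates out of order is detected and refused instead of silently corrupting the cow.

```python
# Sketch of strictly serialized external-cow updates via sequence numbers.
# Hypothetical protocol for illustration only: the receiver applies updates
# strictly in order; any gap or reordering raises instead of corrupting.

class CowReceiver:
    def __init__(self):
        self.next_seq = 0
        self.chunks = {}          # chunk number -> preserved data

    def apply(self, seq, chunk, data):
        if seq != self.next_seq:  # out-of-order or lost update: refuse
            raise RuntimeError(
                f"out-of-order update: got {seq}, expected {self.next_seq}")
        self.chunks[chunk] = data
        self.next_seq += 1

rx = CowReceiver()
rx.apply(0, 7, b"old-data-of-chunk-7")
rx.apply(1, 9, b"old-data-of-chunk-9")
try:
    rx.apply(3, 2, b"update 2 was lost")   # gap detected, nothing applied
except RuntimeError as e:
    print(e)
```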
The "rollforward" must be applied to the backup image of the snapshot. If the admin gets it paired with the wrong backup, massive corruption ensues. This could be automated. E.g. the full image backup and external cow would have unique matching names. Or the full image backup could compute an md5 in parallel, which would be stored with the cow. But none of those tools currently exist.

--
Stuart D. Gathman "Confutatis maledictis, flamis acribus addictis" - background song for a Microsoft sponsored "Where do you want to go from here?" commercial.
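The md5-pairing safeguard suggested above could look like the sketch below. Function names and the storage layout are hypothetical (as the text says, no such tool currently exists); the idea is simply to compute the digest while the image is streamed and to refuse a merge when the image and the cow's stored digest disagree.

```python
# Compute an md5 of the full image while it is streamed to backup, store the
# digest alongside the external cow, and verify the pair before any
# "rollforward". All names here are hypothetical illustrations.

import hashlib

def stream_backup(image_chunks):
    """Back up the image, computing md5 in parallel with the copy."""
    md5 = hashlib.md5()
    backup = []
    for chunk in image_chunks:
        md5.update(chunk)
        backup.append(chunk)      # stand-in for writing to backup media
    return backup, md5.hexdigest()

def check_pair(backup, cow_stored_digest):
    """Refuse to roll forward a cow against the wrong backup image."""
    md5 = hashlib.md5()
    for chunk in backup:
        md5.update(chunk)
    return md5.hexdigest() == cow_stored_digest

image = [b"block0", b"block1"]
backup, digest = stream_backup(image)       # digest stored with the cow
assert check_pair(backup, digest)           # right image: merge may proceed
assert not check_pair([b"other"], digest)   # wrong image: corruption avoided
```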
Re: [linux-lvm] exposing snapshot block device
Thanks for the feedback,

I know that thick LV snapshots are outdated, and that one should use thin LV snapshots. But my understanding is that the -cow and -origin dm devices are still present and available in thin too?

Example of a scenario:
1. Create a snapshot of LV testlv with the name snaplv
2. Perform a full copy of snaplv, using for example dd, to a block device
3. Delete the snapshot

Now I would like to re-attach this external block device as a snapshot again. After all, it is just a dm and LVM config, right? So for example:
1. Create a snapshot of testlv with the name snaplv
2. Recreate the -cow metadata device: ... Recreate this -cow metadata device by telling the origin that all data has been changed and is in the cow device (the raw device)
3. If the above were possible to perform, then it would be possible to instantly get a copy of the LV data using the lvconvert --merge command

I have already invented a way to perform "block level incremental forever" using the -cow device, and a possibility to reverse the blocks, to copy back only changed content from external devices. But it would be better if the cow device could be recreated in a faster way, mentioning that all blocks are present on an external device, so that the LV volume can be restored much quicker using the "lvconvert --merge" command.

That would be super cool! Imagine backing up multi-terabyte sized volumes in minutes to external destinations, and restoring the data in seconds using instant recovery, by re-creating or emulating the cow device and associating all blocks to an external device.

Regards Tomas

Den 2019-10-22 kl. 15:57, skrev Zdenek Kabelac:
> Dne 22. 10. 19 v 12:47 Dalebjörk, Tomas napsal(a):
>> Hi
>>
>> When you create a snapshot of a logical volume, a new virtual dm
>> device will be created with the content of the changes from the
>> origin. This cow device can then be used to read changed contents
>> etc.
>>
>> In case of an incident, this cow device can be used to read back the
>> changed content to its origin using the "lvmerge" command.
>>
>> The question I have is if there is a way to couple an external cow
>> device to an empty, equally sized logical volume, so that the empty
>> logical volume is aware that all changed content is placed on this
>> attached cow device?
>>
>> If that is possible, then it will help with making instant recovery
>> of LV volumes from an external source using the native lvmerge
>> command, from for example a backup server.
>
> For more info on how the old snapshot for so-called 'thick' LVs works
> - check these papers: http://people.redhat.com/agk/talks/
>
> lvconvert --merge is in fact an 'instant' operation - when it happens
> you can immediately access the 'already merged' content while the
> merge is happening in the background (you can look for the copies
> percentage in the lvs command).
>
> However 'thick' LVs with old snapshots are rather 'dated' technology -
> you should probably check out the usage of thinly provisioned LVs.
>
> Regards
>
> Zdenek
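The copy-on-write mechanics underlying this request can be modeled in a few lines. The chunk-level layout below is invented for illustration (real snapshot-cow devices store an exception table plus chunk data): the cow preserves the *original* content of every chunk overwritten on the origin after snapshot time, so merging the cow back rolls the origin back to the snapshot - and if that cow lived on an external device, the same merge would work after re-attaching it.

```python
# Toy model of snapshot cow semantics and of what "merge" does with it.
# The cow stores the original content of each chunk overwritten on the
# origin since the snapshot; merging copies those chunks back.

def write_origin(origin, cow, chunk, data):
    if chunk not in cow:              # first write since the snapshot:
        cow[chunk] = origin[chunk]    # preserve the old content in the cow
    origin[chunk] = data

def merge(origin, cow):
    origin.update(cow)                # copy preserved chunks back
    cow.clear()

origin = {0: b"A", 1: b"B", 2: b"C"}
cow = {}                              # empty at snapshot time
write_origin(origin, cow, 1, b"B'")
write_origin(origin, cow, 2, b"C'")
merge(origin, cow)                    # analogue of "lvconvert --merge"
print(origin)                         # back to snapshot-time contents
```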
Re: [linux-lvm] exposing snapshot block device
Dne 22. 10. 19 v 12:47 Dalebjörk, Tomas napsal(a):
> Hi
>
> When you create a snapshot of a logical volume, a new virtual dm
> device will be created with the content of the changes from the
> origin. This cow device can then be used to read changed contents etc.
>
> In case of an incident, this cow device can be used to read back the
> changed content to its origin using the "lvmerge" command.
>
> The question I have is if there is a way to couple an external cow
> device to an empty, equally sized logical volume, so that the empty
> logical volume is aware that all changed content is placed on this
> attached cow device?
>
> If that is possible, then it will help with making instant recovery of
> LV volumes from an external source using the native lvmerge command,
> from for example a backup server.

For more info on how the old snapshot for so-called 'thick' LVs works - check these papers: http://people.redhat.com/agk/talks/

lvconvert --merge is in fact an 'instant' operation - when it happens you can immediately access the 'already merged' content while the merge is happening in the background (you can look for the copies percentage in the lvs command).

However 'thick' LVs with old snapshots are rather 'dated' technology - you should probably check out the usage of thinly provisioned LVs.

Regards

Zdenek
[linux-lvm] exposing snapshot block device
Hi

When you create a snapshot of a logical volume, a new virtual dm device will be created with the content of the changes from the origin. This cow device can then be used to read changed contents etc.

In case of an incident, this cow device can be used to read back the changed content to its origin using the "lvmerge" command.

The question I have is if there is a way to couple an external cow device to an empty, equally sized logical volume, so that the empty logical volume is aware that all changed content is placed on this attached cow device?

If that is possible, then it will help with making instant recovery of LV volumes from an external source using the native lvmerge command, from for example a backup server.

[EMPTY LOGICAL VOLUME]
          ^
          | lvmerge
          |
[ATTACHED COW DEVICE]

Regards Tomas
[linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error
Hello List & David,

This patch is responsible for the legacy mail: [linux-lvm] pvresize will cause a meta-data corruption with error message "Error writing device at 4096 length 512"

I had sent it to our customer, and the code ran as expected. I think this code is enough to fix this issue.

Thanks
zhm

--(patch for branch stable-2.02) --

From d0d77d0bdad6136c792c966d73dd47b809cb Mon Sep 17 00:00:00 2001
From: Zhao Heming
Date: Tue, 22 Oct 2019 17:22:17 +0800
Subject: [PATCH] bcache may mistakenly write data to another disk when writes
 error

When a bcache write errors, the errored fd and its data are saved in
cache->errored, then this fd is closed. Later lvm will reuse this closed
fd for newly opened devs, but the fd-related data is still in
cache->errored and flagged with BF_DIRTY. This makes it possible for the
data to be mistakenly written to another disk.

Signed-off-by: Zhao Heming
---
 .gitignore                    |  2 +-
 lib/cache/lvmcache.c          |  2 +-
 lib/device/bcache.c           | 44 ++-
 lib/format_text/format-text.c | 11 ---
 lib/label/label.c             | 29 ++--
 lib/metadata/mirror.c         |  1 -
 6 files changed, 52 insertions(+), 37 deletions(-)

diff --git a/.gitignore b/.gitignore
index f51bb67fca..57bb007005 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,7 +28,7 @@ make.tmpl
 /config.log
 /config.status
 /configure.scan
-/cscope.out
+/cscope.*
 /tags
 /tmp/

diff --git a/lib/cache/lvmcache.c b/lib/cache/lvmcache.c
index 9890325d2e..9c6e8032d6 100644
--- a/lib/cache/lvmcache.c
+++ b/lib/cache/lvmcache.c
@@ -1429,7 +1429,7 @@ int lvmcache_label_rescan_vg(struct cmd_context *cmd, const char *vgname, const
  * incorrectly placed PVs should have been moved from the orphan vginfo
  * onto their correct vginfo's, and the orphan vginfo should (in theory)
  * represent only real orphan PVs. (Note: if lvmcache_label_scan is run
- * after vg_read udpates to lvmcache state, then the lvmcache will be
+ * after vg_read updates to lvmcache state, then the lvmcache will be
  * incorrect again, so do not run lvmcache_label_scan during the
  * processing phase.)
  *

diff --git a/lib/device/bcache.c b/lib/device/bcache.c
index d487ca2a77..f0fe07f921 100644
--- a/lib/device/bcache.c
+++ b/lib/device/bcache.c
@@ -293,6 +293,10 @@ static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
 	if (r < 0) {
 		_cb_free(e->cbs, cb);
+		((struct block *)context)->error = r;
+		log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+			 (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+			 (long long unsigned)nbytes, r, strerror(-r));
 		return false;
 	}

@@ -869,7 +873,7 @@ static void _complete_io(void *context, int err)
 	if (b->error) {
 		dm_list_add(&cache->errored, &b->list);
-
+		log_warn("_complete_io fd: %d error: %d", b->fd, err);
 	} else {
 		_clear_flags(b, BF_DIRTY);
 		_link_block(b);

@@ -896,8 +900,7 @@ static void _issue_low_level(struct block *b, enum dir d)
 	dm_list_move(&cache->io_pending, &b->list);

 	if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-		/* FIXME: if io_submit() set an errno, return that instead of EIO? */
-		_complete_io(b, -EIO);
+		_complete_io(b, b->error);
 		return;
 	}
 }

@@ -921,16 +924,20 @@ static bool _wait_io(struct bcache *cache)
  * High level IO handling
  *--*/

-static void _wait_all(struct bcache *cache)
+static bool _wait_all(struct bcache *cache)
 {
+	bool ret = true;
 	while (!dm_list_empty(&cache->io_pending))
-		_wait_io(cache);
+		ret = _wait_io(cache);
+	return ret;
 }

-static void _wait_specific(struct block *b)
+static bool _wait_specific(struct block *b)
 {
+	bool ret = true;
 	while (_test_flags(b, BF_IO_PENDING))
-		_wait_io(b->cache);
+		ret = _wait_io(b->cache);
+	return ret;
 }

 static unsigned _writeback(struct bcache *cache, unsigned count)

@@ -1290,10 +1297,7 @@ void bcache_put(struct block *b)

 bool bcache_flush(struct bcache *cache)
 {
-	// Only dirty data is on the errored list, since bad read blocks get
-	// recycled straight away. So we put these back on the dirty list, and
-	// try and rewrite everything.
-	dm_list_splice(&cache->dirty, &cache->errored);
+	bool write_ret = true, wait_ret = true;

 	while (!dm_list_empty(&cache->dirty)) {
 		struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);

@@ -1303,11 +1307,16 @@ bool bcache_flush(struct bcache *cache)
 	}

 	_issue_write(b);
+	if (b->error)
+		write_ret = false;
 	}
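The failure mode the patch describes (not the bcache code itself) can be demonstrated with a toy cache keyed by fd: an errored dirty block survives the close of its fd, and once the OS hands the same fd number to a different device, a later flush sends the stale block to the wrong disk. The `purge` flag models the fix of dropping stale blocks when the fd is closed; all names here are illustrative.

```python
# Toy model of the bug class: a buffer cache keyed by fd keeps an errored
# dirty block after its fd is closed; when the fd number is reused for a
# different device, flushing writes the stale block to the wrong disk.

class ToyCache:
    def __init__(self):
        self.errored = []                       # pending (fd, data) blocks

    def write_failed(self, fd, data):
        self.errored.append((fd, data))         # block stays keyed by fd

    def close_fd(self, fd, purge=False):
        if purge:                               # the fix: drop stale blocks
            self.errored = [e for e in self.errored if e[0] != fd]

    def flush(self, fd_to_device):
        writes = [(fd_to_device[fd], data) for fd, data in self.errored]
        self.errored = []
        return writes                           # (device, data) actually written

cache = ToyCache()
cache.write_failed(5, b"meta")                  # write via fd 5 errors
cache.close_fd(5)                               # fd 5 closed, stale block kept
writes = cache.flush({5: "/dev/sdb"})           # fd 5 now reused for /dev/sdb
print(writes)                                   # stale data hits the wrong disk

cache.write_failed(5, b"meta")
cache.close_fd(5, purge=True)                   # with the fix applied
print(cache.flush({5: "/dev/sdb"}))             # nothing stale left to write
```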
[linux-lvm] pvmove fails on VG, managed by PCS resource agent in HA-LVM mode(Active-Passive), with tagging enabled.
Hi,

pvmove seems to fail on a VG which is managed by the PCS resource agent with 'exclusive' activation enabled.

The volume group (VG) is created on a shared disk, with '--addtag test' added. The relevant content of my lvm.conf is:

#lvmconfig activation/volume_list
volume_list=["@test"]

I am able to create a logical volume over it; vgextend, vgremove - everything works fine. When I tried to do pvmove, it failed with an error that lvm cannot activate vg0/pvmove0.

On probing a little further, I found that when I create the LVM PCS resource agent with 'exclusive=true', it strips off the original tag 'test' and adds its own 'pacemaker' tag. Since the VG was stripped of its original tag, I think that is the reason pvmove is failing.

I am out of ideas on how to debug this further, and need some expert advice/solution to handle this situation. Also, how can the lvmconfig utility be used to modify volume_list, without the need to manually update the lvm.conf file?

-Udai