Re: [linux-lvm] exposing snapshot block device

2019-10-22 Thread Stuart D. Gathman

On Tue, 22 Oct 2019, Gionatan Danti wrote:

The main thing that somewhat scares me is that (if things have not changed)
thinvol uses a single root btree node: losing it means losing *all* thin
volumes of a specific thin pool. Coupled with the fact that metadata dumps
are not as handy as with the old LVM code (no vgcfgrestore), it worries me.


If you can find all the leaf nodes belonging to the root (in my btree
database they are marked with the root id and can be found by a sequential
scan of the volume), then reconstructing the btree data is
straightforward - even in place.

I remember realizing this was the only way to recover a major customer's
data - and had the utility written, tested, and applied in a 36-hour
programming marathon (which I hope to never repeat).  If this hasn't
occurred to the thin pool programmers, I am happy to flesh out the
procedure.  Having such a utility available as a last resort would
ratchet up the reliability of thin pools.
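
For reference, the closest thing that exists today is the
thin-provisioning-tools / lvconvert --repair path.  A rough sketch only
(the VG/pool names are made up, and the _meta0 naming is from my reading
of lvmthin(7), so verify before relying on it):

  lvchange -an vg/pool             # the pool must be inactive
  lvconvert --repair vg/pool       # rebuild metadata onto the spare metadata LV
  # the damaged metadata is kept aside (as vg/pool_meta0) and can be
  # inspected with the thin tools, e.g.:
  lvchange -ay vg/pool_meta0
  thin_check /dev/vg/pool_meta0
  thin_dump  /dev/vg/pool_meta0 > pool_meta0.xml

None of this scans the data area for orphaned leaf nodes, though - it
only works from whatever metadata is still readable, which is why a
scan-and-rebuild utility would still be a step up.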

--
  Stuart D. Gathman 
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.

___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



Re: [linux-lvm] exposing snapshot block device

2019-10-22 Thread Gionatan Danti

Hi,

On 22-10-2019 18:15, Stuart D. Gathman wrote:

"Old" snapshots are exactly as efficient as thin when there is exactly
one.  They only get inefficient with multiple snapshots.  On the other
hand, thin volumes are as inefficient as an old LV with one snapshot.
An old LV is as efficient, and as anti-fragile, as a partition.  Thin
volumes are much more flexible, but depend on much more fragile 
database

like meta-data.


This is both true and false: while in the single-snapshot case
performance remains acceptable even with fat snapshots, the btree
representation (and more modern code) of the "new" (7+ years old by now)
thin snapshots guarantees significantly higher performance, at least in
my tests.


Note #1: I know that the old snapshot code uses 4K chunks by default, 
versus the 64K chunks of thinsnap. That said, I recorded higher thinsnap 
performance even when using a 64K chunk size for old fat snapshots.
Note #2: I generally disable thinpool zeroing (as I use a filesystem 
layer on top of thin volumes).
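
For completeness, the kind of setup those notes refer to looks roughly
like this - VG/LV names and sizes are only illustrative:

  # thin pool with 64K chunks and zeroing disabled, plus a thin volume
  lvcreate --type thin-pool -L 100G --chunksize 64k --zero n -n pool vg
  lvcreate -V 50G --thinpool vg/pool -n thinvol

  # old-style fat snapshot with a matching 64K chunk size, for comparison
  lvcreate -s -L 10G -c 64k -n testsnap vg/testlv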


I 100% agree that the old LVM code, with its plain-text metadata and
continuous plain-text backups, is extremely reliable and easy to
fix/correct.
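
The whole round-trip is a one-liner each way (the VG name is
hypothetical), on top of the automatic archives LVM already keeps under
/etc/lvm/archive and /etc/lvm/backup:

  vgcfgbackup  -f /root/vg-meta.conf vg   # explicit plain-text metadata dump
  vgcfgrestore -f /root/vg-meta.conf vg   # put it back, e.g. after hand-editing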



For this reason, I always prefer "old" LVs when the functionality of
thin LVs is not actually needed.  I can even manually recover from
trashed metadata by editing it, as it is human-readable text.


My main use of fat logical volumes is for boot and root filesystems, 
while thin vols (and zfs datasets, but this is another story...) are 
used for data partitions.


The main thing that somewhat scares me is that (if things have not
changed) thinvol uses a single root btree node: losing it means losing
*all* thin volumes of a specific thin pool. Coupled with the fact that
metadata dumps are not as handy as with the old LVM code (no
vgcfgrestore), it worries me.



The "rollforward" must be applied to the backup image of the snapshot.
If the admin gets it paired with the wrong backup, massive corruption
ensues.  This could be automated.  E.g. the full image backup and
external cow would have unique matching names.  Or the full image 
backup

could compute an md5 in parallel, which would be store with the cow.
But none of those tools currently exist.


This is the reason why I have not used thin_delta in production: an
error on my part in recovering the volume (i.e. applying the wrong
delta) would cause massive data corruption. My current setup for instant
recovery *and* added resilience is somewhat similar to that: RAID ->
DRBD -> THINPOOL -> THINVOL w/periodic snapshots (with the DRBD layer
replicating to a sibling machine).
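
For the record, the delta extraction itself looks roughly like this - the
pool name and thin device IDs are hypothetical, and the pairing step
described above is exactly the part that stays manual:

  # freeze a metadata snapshot so thin_delta sees a consistent view
  dmsetup message vg-pool-tpool 0 reserve_metadata_snap
  thin_delta --metadata-snap --thin1 1 --thin2 2 \
      /dev/mapper/vg-pool_tmeta > delta.xml
  dmsetup message vg-pool-tpool 0 release_metadata_snap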


Regards.

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.da...@assyoma.it - i...@assyoma.it
GPG public key ID: FF5F32A8

___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



Re: [linux-lvm] exposing snapshot block device

2019-10-22 Thread Stuart D. Gathman

On Tue, 22 Oct 2019, Zdenek Kabelac wrote:


On 22. 10. 19 at 17:29, Dalebjörk, Tomas wrote:
But, it would be better if the cow device could be recreated in a faster
way, indicating that all blocks are present on an external device, so
that the LV volume can be restored much more quickly using the
"lvconvert --merge" command.


I do not want to dampen your enthusiasm here, but that is exactly what
you can already do with thin provisioning and the thin_delta tool.


lvconvert --merge does a "rollback" to the point at which the snapshot
was taken.  The master LV already has current data.  What Tomas wants is
to be able to do a "rollforward" from the point at which the snapshot
was taken.  He also wants to be able to put the cow volume on an
external/remote medium, and add a snapshot using an already existing cow.

This way, restoring means copying the full volume from backup, creating
a snapshot using the existing external cow, then lvconvert --merge
instantly (logically) applies the cow changes while updating the master
LV.
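
At the device-mapper level, the "add a snapshot using an already existing
cow" step is conceptually just a snapshot table pointing at that cow.  A
sketch only - the names, sizes and 4K chunk are made up, and this
bypasses LVM, so lvconvert --merge would not know about such a device
without further metadata surgery:

  SECTORS=$(blockdev --getsz /dev/vg/restored_lv)
  dmsetup create restored-snap --table \
      "0 $SECTORS snapshot /dev/vg/restored_lv /dev/external/cow P 8"
  # table: start len snapshot <origin> <cow dev> <persistent?> <chunk sectors>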

Pros:

"Old" snapshots are exactly as efficient as thin when there is exactly
one.  They only get inefficient with multiple snapshots.  On the other
hand, thin volumes are as inefficient as an old LV with one snapshot.
An old LV is as efficient, and as anti-fragile, as a partition.  Thin
volumes are much more flexible, but depend on much more fragile database
like meta-data.

For this reason, I always prefer "old" LVs when the functionality of
thin LVs is not actually needed.  I can even manually recover from
trashed metadata by editing it, as it is human-readable text.

Updates to the external cow can be pipelined (but then properly
handling reads becomes non-trivial - there are mature remote block
device implementations for Linux that will do the job).

Cons:

For the external cow to be useful, updates to it must be *strictly*
serialized.  This is doable, but not as obvious or trivial as it might
seem at first glance.  (Remote block device software will take care
of this as well.)

The "rollforward" must be applied to the backup image of the snapshot.
If the admin gets it paired with the wrong backup, massive corruption
ensues.  This could be automated.  E.g. the full image backup and
external cow would have unique matching names.  Or the full image backup
could compute an md5 in parallel, which would be store with the cow.
But none of those tools currently exist.
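
A sketch of the "md5 in parallel" idea with stock tools - the paths are
hypothetical, and this is not an existing utility, just what it might
look like:

  dd if=/dev/vg/snaplv bs=1M status=progress \
    | tee >(md5sum > /backup/snaplv.md5) \
    > /backup/snaplv.img
  # the same .md5 would be stored alongside the external cow, so the
  # pair can be verified before any rollforward is attempted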

--
  Stuart D. Gathman 
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.
___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

Re: [linux-lvm] exposing snapshot block device

2019-10-22 Thread Dalebjörk, Tomas

Thanks for feedback,


I know that thick LV snapshots are outdated, and that one should use
thin LV snapshots.


But my understanding is that the dm -cow and dm -origin devices are
still present and available with thin too?



Example of a scenario:

1. Create a snapshot of LV testlv with the name snaplv
2. Perform a full copy of the snaplv using for example dd to a block device
3. Delete the snapshot
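
Spelled out with plain commands (device names are only examples), those
three steps would be roughly:

  lvcreate -s -L 10G -n snaplv vg/testlv                   # 1. snapshot
  dd if=/dev/vg/snaplv of=/dev/external/backup bs=1M \
     conv=fsync status=progress                            # 2. full copy
  lvremove -y vg/snaplv                                     # 3. drop snapshot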

Now I would like to re-attach this external block device as a snapshot 
again.


After all, it is just a dm and LVM config, right? So for example:

1. create a snapshot of testlv with the name snaplv
2. re-create the -cow metadata device:
...
    Re-create this -cow metadata device by telling the origin that all
data has been changed and is in the cow device (the raw device)
3. If the above were possible to perform, then it would be possible to
instantly get a copy of the LV data using the lvconvert --merge command


I have already invented a way to perform "block-level incremental
forever" backups using the -cow device.


And a way to reverse the blocks, copying back only changed content
from external devices.


But, it would be better if the cow device could be recreated in a faster
way, indicating that all blocks are present on an external device, so
that the LV volume can be restored much more quickly using the
"lvconvert --merge" command.


That would be super cool!

Imagine backing up multi-terabyte volumes in minutes to external
destinations, and restoring the data in seconds using instant recovery -
by re-creating or emulating the cow device and associating all blocks
with an external device.


Regards Tomas


On 2019-10-22 at 15:57, Zdenek Kabelac wrote:


On 22. 10. 19 at 12:47, Dalebjörk, Tomas wrote:

Hi

When you create a snapshot of a logical volume, a new virtual dm -cow
device will be created with the content of the changes from the origin.


This cow device can then be used to read the changed contents, etc.


In case of an incident, this cow device can be used to read back the
changed content to its origin using the "lvconvert --merge" command.



The question I have is whether there is a way to couple an external cow
device to an empty, equally sized logical volume, so that the empty
logical volume is aware that all changed content is placed on this
attached cow device?


If that is possible, then it would enable instant recovery of LV volumes
from an external source, using the native lvconvert --merge command - for
example from a backup server.


For more info on how the old snapshots for so-called 'thick' LVs work,
check these papers: http://people.redhat.com/agk/talks/


lvconvert --merge

is in fact an 'instant' operation - when it happens, you can immediately
access the 'already merged' content while the merge is happening in the
background (you can watch the copy percentage in the lvs command).


However, 'thick' LVs with old snapshots are a rather 'dated' technology;
you should probably check out thinly provisioned LVs.

Regards

Zdenek




___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



Re: [linux-lvm] exposing snapshot block device

2019-10-22 Thread Zdenek Kabelac

On 22. 10. 19 at 12:47, Dalebjörk, Tomas wrote:

Hi

When you create a snapshot of a logical volume, a new virtual dm -cow
device will be created with the content of the changes from the origin.


This cow device can then be used to read the changed contents, etc.


In case of an incident, this cow device can be used to read back the
changed content to its origin using the "lvconvert --merge" command.



The question I have is whether there is a way to couple an external cow
device to an empty, equally sized logical volume, so that the empty
logical volume is aware that all changed content is placed on this
attached cow device?


If that is possible, then it would enable instant recovery of LV volumes
from an external source, using the native lvconvert --merge command - for
example from a backup server.


For more info on how the old snapshots for so-called 'thick' LVs work,
check these papers: http://people.redhat.com/agk/talks/


lvconvert --merge

is in fact an 'instant' operation - when it happens, you can immediately
access the 'already merged' content while the merge is happening in the
background (you can watch the copy percentage in the lvs command).
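
In command form (LV names are hypothetical) that is simply:

  lvconvert --merge vg/snaplv             # origin is usable right away
  lvs -a -o lv_name,copy_percent vg       # watch the background merge progress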


However, 'thick' LVs with old snapshots are a rather 'dated' technology;
you should probably check out thinly provisioned LVs.

Regards

Zdenek


___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/



[linux-lvm] exposing snapshot block device

2019-10-22 Thread Dalebjörk, Tomas

Hi

When you create a snapshot of a logical volume, a new virtual dm -cow
device will be created with the content of the changes from the origin.


This cow device can then be used to read the changed contents, etc.


In case of an incident, this cow device can be used to read back the
changed content to its origin using the "lvconvert --merge" command.



The question I have is whether there is a way to couple an external cow
device to an empty, equally sized logical volume, so that the empty
logical volume is aware that all changed content is placed on this
attached cow device?


If that is possible, then it would enable instant recovery of LV volumes
from an external source, using the native lvconvert --merge command - for
example from a backup server.



[EMPTY LOGICAL VOLUME]
    ^
    |
     lvmerge
    |

[ATTACHED COW DEVICE]

Regards Tomas

___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/

[linux-lvm] resend patch - bcache may mistakenly write data to another disk when writes error

2019-10-22 Thread Heming Zhao
Hello List & David,

This patch follows up on the earlier thread:
[linux-lvm] pvresize will cause a meta-data corruption with error message
"Error writing device at 4096 length 512"

I have sent it to our customer and the code ran as expected. I think this
patch is enough to fix the issue.

Thanks
zhm

--(patch for branch stable-2.02) --
 From d0d77d0bdad6136c792c966d73dd47b809cb Mon Sep 17 00:00:00 2001
From: Zhao Heming 
Date: Tue, 22 Oct 2019 17:22:17 +0800
Subject: [PATCH] bcache may mistakenly write data to another disk when writes
  error

When a bcache write fails, the errored fd and its data are saved in
cache->errored, then this fd is closed. Later lvm will reuse the
closed fd for newly opened devs, but the data related to the old fd is
still in cache->errored and flagged BF_DIRTY. This means the data may
mistakenly be written to another disk.

Signed-off-by: Zhao Heming 
---
  .gitignore|  2 +-
  lib/cache/lvmcache.c  |  2 +-
  lib/device/bcache.c   | 44 ++-
  lib/format_text/format-text.c | 11 ---
  lib/label/label.c | 29 ++--
  lib/metadata/mirror.c |  1 -
  6 files changed, 52 insertions(+), 37 deletions(-)

diff --git a/.gitignore b/.gitignore
index f51bb67fca..57bb007005 100644
--- a/.gitignore
+++ b/.gitignore
@@ -28,7 +28,7 @@ make.tmpl
  /config.log
  /config.status
  /configure.scan
-/cscope.out
+/cscope.*
  /tags
  /tmp/
  
diff --git a/lib/cache/lvmcache.c b/lib/cache/lvmcache.c
index 9890325d2e..9c6e8032d6 100644
--- a/lib/cache/lvmcache.c
+++ b/lib/cache/lvmcache.c
@@ -1429,7 +1429,7 @@ int lvmcache_label_rescan_vg(struct cmd_context *cmd, const char *vgname, const
   * incorrectly placed PVs should have been moved from the orphan vginfo
   * onto their correct vginfo's, and the orphan vginfo should (in theory)
   * represent only real orphan PVs.  (Note: if lvmcache_label_scan is run
- * after vg_read udpates to lvmcache state, then the lvmcache will be
+ * after vg_read updates to lvmcache state, then the lvmcache will be
   * incorrect again, so do not run lvmcache_label_scan during the
   * processing phase.)
   *
diff --git a/lib/device/bcache.c b/lib/device/bcache.c
index d487ca2a77..f0fe07f921 100644
--- a/lib/device/bcache.c
+++ b/lib/device/bcache.c
@@ -293,6 +293,10 @@ static bool _async_issue(struct io_engine *ioe, enum dir d, int fd,
  
if (r < 0) {
_cb_free(e->cbs, cb);
+   ((struct block *)context)->error = r;
+   log_warn("io_submit <%c> off %llu bytes %llu return %d:%s",
+   (d == DIR_READ) ? 'R' : 'W', (long long unsigned)offset,
+   (long long unsigned)nbytes, r, strerror(-r));
return false;
}
  
@@ -869,7 +873,7 @@ static void _complete_io(void *context, int err)
  
if (b->error) {
		dm_list_add(&cache->errored, &b->list);
-
+   log_warn("_complete_io fd: %d error: %d", b->fd, err);
} else {
_clear_flags(b, BF_DIRTY);
_link_block(b);
@@ -896,8 +900,7 @@ static void _issue_low_level(struct block *b, enum dir d)
	dm_list_move(&cache->io_pending, &b->list);
  
	if (!cache->engine->issue(cache->engine, d, b->fd, sb, se, b->data, b)) {
-   /* FIXME: if io_submit() set an errno, return that instead of EIO? */
-   _complete_io(b, -EIO);
+   _complete_io(b, b->error);
return;
}
  }
@@ -921,16 +924,20 @@ static bool _wait_io(struct bcache *cache)
   * High level IO handling
   *--*/
  
-static void _wait_all(struct bcache *cache)
+static bool _wait_all(struct bcache *cache)
  {
+   bool ret = true;
	while (!dm_list_empty(&cache->io_pending))
-   _wait_io(cache);
+   ret = _wait_io(cache);
+   return ret;
  }
  
-static void _wait_specific(struct block *b)
+static bool _wait_specific(struct block *b)
  {
+   bool ret = true;
while (_test_flags(b, BF_IO_PENDING))
-   _wait_io(b->cache);
+   ret = _wait_io(b->cache);
+   return ret;
  }
  
  static unsigned _writeback(struct bcache *cache, unsigned count)
@@ -1290,10 +1297,7 @@ void bcache_put(struct block *b)
  
  bool bcache_flush(struct bcache *cache)
  {
-   // Only dirty data is on the errored list, since bad read blocks get
-   // recycled straight away.  So we put these back on the dirty list, and
-   // try and rewrite everything.
-   dm_list_splice(&cache->dirty, &cache->errored);
+   bool write_ret = true, wait_ret = true;
  
	while (!dm_list_empty(&cache->dirty)) {
		struct block *b = dm_list_item(_list_pop(&cache->dirty), struct block);
@@ -1303,11 +1307,16 @@ bool bcache_flush(struct bcache *cache)
}
  
_issue_write(b);
+   if (b->error) write_ret = false;
}
  

[linux-lvm] pvmove fails on VG, managed by PCS resource agent in HA-LVM mode(Active-Passive), with tagging enabled.

2019-10-22 Thread Udai Sharma
Hi,
pvmove seems to fail on a VG that is managed by the PCS resource agent
with 'exclusive' activation enabled.

The volume group (VG) is created on a shared disk, with '--addtag test'
added.
Content of my lvm.conf is
#lvmconfig activation/volume_list
volume_list=["@test"]

I am able to create logical volumes on it; vgextend, vgremove - everything
works fine.
When I try to do a pvmove, it fails with an error that lvm cannot activate
vg0/pvmove0.

On probing a little further, I found that when I create the LVM PCS
resource agent with 'exclusive=true', it strips off the original tag
'test' and adds its own 'pacemaker' tag.
Since the VG was stripped of its original tag, I think that is the reason
pvmove is failing.

I am out of ideas to debug this further and need some expert
advice/solution to handle this situation.

Also, how can the lvmconfig utility be used to modify volume_list without
manually updating the lvm.conf file?
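
Just to make the setup concrete - vg0 and the tag names are the ones from
above, while the PV path and the --config override are only a sketch of a
possible workaround, not something verified here:

  vgs -o vg_name,vg_tags vg0     # show which tags the VG carries right now
  pvmove --config 'activation { volume_list=["@test","@pacemaker"] }' /dev/sdX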

-Udai
___
linux-lvm mailing list
linux-lvm@redhat.com
https://www.redhat.com/mailman/listinfo/linux-lvm
read the LVM HOW-TO at http://tldp.org/HOWTO/LVM-HOWTO/
