Re: [Discuss] lvm snapshot cloning
On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
> So, correct me if I'm wrong, is there a utility to create an active 1:1
> copy of a snapshot in LVM2?

dd

Really. LVM volumes are block devices, and the simplest way to make a 1:1 copy of a block device is with dd.

--Rich P.

___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
>> So, correct me if I'm wrong, is there a utility to create an active 1:1
>> copy of a snapshot in LVM2?
>
> dd
>
> Really. LVM volumes are block devices, and the simplest way to make a 1:1
> copy of a block device is with dd.

That may be the "easiest way," but it is certainly not the most efficient way.

Suppose you have a 1TB logical volume. You create a point-in-time snapshot for testing or backup. At a later point you want to create a copy of that snapshot to do some work on the data. A snapshot does not contain the data; it only contains the old data from a "copy on write" change in the real device. Since the snapshot was created, the disk has changed very little.

Your way would take up a whole additional 1TB of space. OMG. If you can read/write 30MB/s, it would take you a little over 9 1/2 hours to copy.

The real solution, and I have code to do it, is this:

create a snapshot of device A, call it foo.
(arbitrary length of time passes)
create a second snapshot of A, call it bar.

Both these snapshots are now "tracking" the real volume. Now, you use the COW data from the first snapshot and apply it to the second snapshot. This will create two identical snapshots, both sparsely populated with only the differences between themselves and the real volume. The only drawback is that the two snapshots will contain duplicate COW data.

> --Rich P.
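The sequence above can be sketched as the commands a wrapper script would run. Only the two `lvcreate` calls are real LVM commands; `cowcopy` stands in for the poster's unreleased exception-table-copying tool and is purely hypothetical, as is the `-cow` device naming.

```python
def clone_snapshot_cmds(vg, origin, first="foo", second="bar", size="2G"):
    """Return, in order, the shell commands for the two-snapshot clone.

    `cowcopy` is a hypothetical helper that replays the first snapshot's
    COW exceptions onto the second snapshot; it is NOT a standard tool.
    """
    dev = f"/dev/{vg}/{origin}"
    return [
        f"lvcreate -s -n {first} -L{size} {dev}",   # snapshot at time T0
        # ... arbitrary time passes; the origin keeps changing ...
        f"lvcreate -s -n {second} -L{size} {dev}",  # snapshot at time T1
        # replay foo's exceptions onto bar so both reflect T0
        f"cowcopy /dev/mapper/{vg}-{first}-cow /dev/{vg}/{second}",
    ]

for cmd in clone_snapshot_cmds("lvmvolumes", "test"):
    print(cmd)
```

A real script would run these via subprocess with root privileges; the list form just makes the ordering explicit.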
Re: [Discuss] lvm snapshot cloning
On Oct 23, 2011, at 8:31 PM, ma...@mohawksoft.com wrote:
> Both these snapshots are now "tracking" the real volume. Now, you use the
> COW data from the first snapshot and apply it to the second snapshot. This
> will create two identical snapshots. Both sparsely populated with only the
> differences between themselves and the real volume. The only drawback is
> that the two snapshots will contain duplicate COW data.

The "only" drawback is that you're treating LVM as if it were a file system. Recall what I wrote yesterday about LVM biting off your face. That's what's happening: LVM is gnawing on your nose right now.

An LVM volume is a dynamic disk partition, and LVM snapshots are transaction logs of the low-level block changes. These logs are presented to the OS as more block devices. If you want to clone a snapshot then you have to treat it as the block device that it is. What you really want is file system snapshots or a versioning file system. LVM can accomplish neither because it's just a block device driver.

Unfortunately, neither of these exists for any of the mainline file systems available in the Linux kernel. ZFS isn't happening and Btrfs isn't ready for production. Both are under Oracle's thumb, and the other options are patches against the mainline kernel source that you have to apply and build yourself.

--Rich P.
Re: [Discuss] lvm snapshot cloning
On Sun, Oct 23, 2011 at 8:31 PM, wrote:
>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
>
> The real solution, and I have code to do it, is this:
>
> create a snapshot of device A, call it foo.
> (arbitrary length of time passes)
> create a second snapshot of A, call it bar.

If I understand what you are asking for, at the end of this sequence you would have two block-level snapshots of device A which would have logically the same contents, and the underlying implementation would have essentially the same physical contents (and space requirements) as if you had done both of the snapshots with no time delay between them. Given that devices can have more than one active snapshot at a time, it would seem theoretically possible to implement such a feature. I would call this "cloning a snapshot". Another possibility that might meet your requirements would be the ability to do a "snapshot of a snapshot". Unfortunately, I've never seen any suggestion that LVM can do either of these operations.

What do you intend to do with those snapshots? I'm guessing it would involve using them in read/write mode. Perhaps some kind of multi-branch tree of versions of a device?

In any case, I was curious and found the following web pages which seem relevant:

http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots (feature request page from 2009)

which says that neither snapshot cloning nor taking a snapshot of a snapshot is currently possible, as well as:

http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html

which gives several hacky ways to do snapshot cloning. I would suggest more investigation/careful testing before depending on this, though.

Good Luck,
Bill Bogstad
Re: [Discuss] lvm snapshot cloning
> On Sun, Oct 23, 2011 at 8:31 PM, wrote:
>>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
>>
>> The real solution, and I have code to do it, is this:
>>
>> create a snapshot of device A, call it foo.
>> (arbitrary length of time passes)
>> create a second snapshot of A, call it bar.
>
> If I understand what you are asking for, at the end of this sequence you
> would have two block-level snapshots of device A which would have
> logically the same contents, and the underlying implementation would have
> essentially the same physical contents (and space requirements) as if you
> had done both of the snapshots with no time delay between them. Given
> that devices can have more than one active snapshot at a time, it would
> seem theoretically possible to implement such a feature. I would call
> this "cloning a snapshot". Another possibility that might meet your
> requirements would be the ability to do a "snapshot of a snapshot".
> Unfortunately, I've never seen any suggestion that LVM can do either of
> these operations.
>
> What do you intend to do with those snapshots? I'm guessing it would
> involve using them in read/write mode. Perhaps some kind of multi-branch
> tree of versions of a device?
>
> In any case, I was curious and found the following web pages which seem
> relevant:
>
> http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots (feature
> request page from 2009)
>
> which says that neither snapshot cloning nor taking a snapshot of a
> snapshot is currently possible, as well as:
>
> http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html
>
> which gives several hacky ways to do snapshot cloning. I would suggest
> more investigation/careful testing before depending on this, though.

I wrote a program to copy the exception table from one COW device to a snapshot device. At the end, both snapshots, in lvdisplay, use the same amount of space and get the same md5sum. Which is cool.
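A program like the one described has to start by reading the small header at the front of the snapshot's COW device. The sketch below is based on my understanding of the kernel's persistent exception store format (`dm-snap-persistent.c`); the field layout and magic value are assumptions that should be verified against the running kernel before use.

```python
import struct

SNAP_MAGIC = 0x70416e53  # believed to match the kernel's SNAP_MAGIC

def parse_cow_header(buf):
    """Parse the on-disk header of a dm-snapshot persistent COW device.

    Assumed layout: four little-endian u32s -- magic, valid, version,
    chunk_size -- where chunk_size is in 512-byte sectors.  A sketch,
    not a definitive description of the format.
    """
    magic, valid, version, chunk_size = struct.unpack_from("<4I", buf, 0)
    if magic != SNAP_MAGIC:
        raise ValueError("not a dm-snapshot COW device")
    return {"valid": valid, "version": version,
            "chunk_bytes": chunk_size * 512}

# Example: a synthetic header for a snapshot with 8-sector (4KiB) chunks
hdr = struct.pack("<4I", SNAP_MAGIC, 1, 1, 8)
print(parse_cow_header(hdr))
```

After the header come the exception areas: pairs of chunk numbers mapping an origin chunk to its saved copy, which is exactly the table such a program would replay onto the second snapshot.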
My only question is: if it were this easy, why am I seeing people lament the lack of this feature? Obviously, I need to do some testing, but it looks viable.

> Good Luck,
> Bill Bogstad
Re: [Discuss] lvm snapshot cloning
> On Oct 23, 2011, at 8:31 PM, ma...@mohawksoft.com wrote:
>> Both these snapshots are now "tracking" the real volume. Now, you use the
>> COW data from the first snapshot and apply it to the second snapshot.
>> This will create two identical snapshots. Both sparsely populated with
>> only the differences between themselves and the real volume. The only
>> drawback is that the two snapshots will contain duplicate COW data.
>
> The "only" drawback is that you're treating LVM as if it were a file
> system. Recall what I wrote yesterday about LVM biting off your face.
> That's what's happening: LVM is gnawing on your nose right now.

Why are you saying this? LVM is designed for much of this. The strategy described is perfectly valid within the context of read/write snapshots.

> An LVM volume is a dynamic disk partition, and LVM snapshots are
> transaction logs of the low level block changes.

Not really true. LVM snapshots are copy-on-write devices. Yes, there is an exception log behind them to map the COW data.

> These logs are presented to the OS as more block devices. If you want to
> clone a snapshot then you have to treat it as the block device that it is.

Why? Snapshots are read/write and work fine.

> What you really want is file system snapshots or a versioning file
> system. LVM can accomplish neither because it's just a block device
> driver.

It's a bit more than a block driver. Yes, it is presented to the higher levels of the OS as a block driver, but it does remapping of blocks, it has modules for RAID, etc.

> Unfortunately, neither of these exist for any of the main line file
> systems available in the Linux kernel. ZFS isn't happening and Btrfs
> isn't ready for production. Both are under Oracle's thumb, and the other
> options are patches against the main kernel source that you have to apply
> and build yourself.

Which is why I'm exploring LVM.

> --Rich P.
Re: [Discuss] lvm snapshot cloning
> On Sun, Oct 23, 2011 at 8:31 PM, wrote:
>>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
> [snip]
> In any case, I was curious and found the following web pages which seem
> relevant:
>
> http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots (feature
> request page from 2009)

I saw this paper, which started me thinking.

> which says that neither snapshot cloning nor taking a snapshot of a
> snapshot is currently possible as well as:
>
> http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html

I saw this post, and there were so many issues with it, I wouldn't even call it wrong. :-)

> which gives several hacky ways to do snapshot cloning. I would suggest
> more investigation/careful testing before depending on this though.

Do you use LVM? Want to test some utilities I am writing?
Re: [Discuss] lvm snapshot cloning
On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
> Why are you saying this? LVM is designed for much of this. The strategy
> described is perfectly valid within the context of read/write snapshots.

That's where I say that you are mistaken, because it isn't, not really. It does a fair job of pretending, but once you get deep into it you'll find it there grinning, waiting to bite your face off.

> Not really true, LVM snapshots are copy-on-write devices. Yes, there is
> an exception log behind them to map the COW data.

You're quibbling over implementation. In practice, LVM snapshots work like ever-expanding transaction logs that are presented to the VFS layer as block devices.

> Why? Snapshots are read/write and work fine.

They don't. See the previous discussion about LVM snapshot performance degradation.

> It's a bit more than a block driver. Yes, it is presented to the higher
> levels of the OS as a block driver, but it does remapping of blocks, it
> has modules for RAID, etc.

Okay, it's a *fancy* block device driver. Because at the end of the day, when it comes to doing any real work on it, the kernel presents LVM volumes just like it does raw disk partitions, which are (pause) block devices. This is a lot of why the features that you want will never be implemented.

I'm not saying that this is a failing. In fact, this is central to LVM's versatility. You can put practically any file system on an LVM volume specifically because it is a block device. For example, I could put a FreeBSD domU on a Linux dom0 and give that domU an entire LVM volume to itself. FreeBSD doesn't care because the volume is just a block device. I can stick DRBD underneath that volume and replicate it block-for-block to my redundant dom0 because it is a block device. Linux and Xen don't care because (pause) the volume is just a block device.
I could never do something like this with AdvFS, and I expect that doing it with zpools would be more work than lvcreate, add device to config file, boot installer image.

This is what I'm on about. If you treat LVM volumes as block devices then you can do some amazing things with them. If you treat them as something else then you're going to have problems.

--Rich P.
Re: [Discuss] lvm snapshot cloning
> On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
>> Why are you saying this? LVM is designed for much of this. The strategy
>> described is perfectly valid within the context of read/write snapshots.
>
> That's where I say that you are mistaken, because it isn't, not really.
> It does a fair job of pretending but once you get deep into it you'll
> find it there grinning, waiting to bite your face off.

Well, I've spent a lot of my limited spare time the last few weeks really getting down low, and I think you are mistaken. There are two issues with LVM, well actually one, and one with snapshots.

With LVM, it does not transparently detect and reroute around bad blocks. This would be a cool feature, but since RAID is supported, it is an issue that can be worked around.

With snapshots, it's a bad design in that all snapshots receive all COW blocks. So if you have two snapshots with the same basic content, there will be duplicate data. This means that you have to allocate enough space for the changes N times, depending on the number of snapshots. Other systems chain snapshots so that at any time only one snapshot is receiving COW data, and the subsequent snapshots only forward reference. Anyway, LVM snapshots are workable for the most part.

>> Not really true, LVM snapshots are copy-on-write devices. Yes, there is
>> an exception log behind them to map the COW data.
>
> You're quibbling over implementation. In practice, LVM snapshots work
> like ever-expanding transaction logs that are presented to the VFS layer
> as block devices.

Yes.

>> Why? Snapshots are read/write and work fine.
>
> They don't. See the previous discussion about LVM snapshot performance
> degradation.

The performance degradation is transient. It is only incurred at the time of COW; after that it is mostly trivial. The problem with many of the benchmarks is that they basically measure the worst-case COW performance, which is a known issue.
While not guaranteed, most applications do not create a large number of COW pages all at once. There is a momentary hit periodically, but the performance decrease, on average, is minimal.

>> It's a bit more than a block driver. Yes, it is presented to the higher
>> levels of the OS as a block driver, but it does remapping of blocks, it
>> has modules for RAID, etc.
>
> Okay, it's a *fancy* block device driver. Because at the end of the day,
> when it comes to doing any real work on it, the kernel presents LVM
> volumes just like it does raw disk partitions which are (pause) block
> devices. This is a lot of why the features that you want will never be
> implemented.

I don't understand this viewpoint. It is far, far easier to implement the sort of features needed at the block device level. A block device has fewer constraints and more ordered access than a higher-level file system would.

> I'm not saying that this is a failing. In fact, this is central to LVM's
> versatility. You can put practically any file system on an LVM volume
> specifically because it is a block device. For example, I could put a
> FreeBSD domU on a Linux dom0 and give that domU an entire LVM volume to
> itself. FreeBSD doesn't care because the volume is just a block device.
> I can stick DRBD underneath that volume and replicate it block-for-block
> to my redundant dom0 because it is a block device. Linux and Xen don't
> care because (pause) the volume is just a block device. I could never do
> something like this with AdvFS, and I expect that doing it with zpools
> would be more work than lvcreate, add device to config file, boot
> installer image.
>
> This is what I'm on about. If you treat LVM volumes as block devices then
> you can do some amazing things with them. If you treat them as something
> else then you're going to have problems.

They are being treated as a block device, but the capabilities as advertised can be used.

> --Rich P.
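The space argument about fan-out versus chained snapshots can be made concrete with a toy model. This is illustrative only, and it assumes the simplest case where every snapshot was taken before any of the writes happened:

```python
def cow_space(n_snapshots, changed_chunks, chunk_bytes, chained=False):
    """Model COW space used by N snapshots of one origin.

    LVM's fan-out design (chained=False) copies every rewritten origin
    chunk into *each* active snapshot; a chained design stores each
    rewritten chunk once, in the newest snapshot only.
    """
    copies = 1 if chained else n_snapshots
    return copies * changed_chunks * chunk_bytes

# 3 snapshots, 1000 rewritten 4KiB chunks:
print(cow_space(3, 1000, 4096))                # fan-out: 12288000 bytes
print(cow_space(3, 1000, 4096, chained=True))  # chained:  4096000 bytes
```

Under this model, LVM's per-snapshot duplication multiplies the COW space linearly with the snapshot count, which is the complaint being made above.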
Re: [Discuss] lvm snapshot cloning
> On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
>> Why? Snapshots are read/write and work fine.
>
> They don't. See the previous discussion about LVM snapshot performance
> degradation.

As was said, the performance degradation is transient; it only happens when a COW page is created, and after that it is minimal.

--- Create a 100G volume ---
root@dewy:~# lvcreate -L100G -i 2 -n test lvmvolumes
  Using default stripesize 64.00 KiB
  Logical volume "test" created
root@dewy:~# sync

--- Write to the disk ---
root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.0213 s, 174 MB/s

real    0m6.023s
user    0m0.000s
sys     0m1.130s

Good performance, 160+ MB/s.

--- Create a snapshot ---
root@dewy:~# lvcreate -s -n testsnap -L2G /dev/lvmvolumes/test
  Logical volume "testsnap" created

--- Write to the disk ---
root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 63.3635 s, 16.5 MB/s

real    1m3.365s
user    0m0.000s
sys     0m1.270s

Worse performance, ~16.5 MB/s.

--- Write to the disk again ---
root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.91994 s, 177 MB/s

real    0m5.922s
user    0m0.000s
sys     0m1.180s

Good performance, 160+ MB/s.

Once the COW data is allocated, the performance is fine.
Re: [Discuss] lvm snapshot cloning
On Oct 24, 2011, at 8:41 PM, ma...@mohawksoft.com wrote:
> With snapshots, it's a bad design in that all snapshots receive all COW
> blocks. So if you have two snapshots with the same basic content, there
> will be duplicate data. This means that you have to allocate enough
> space for the changes N times depending on the number of snapshots.

This appears to be a bad design to you because you are thinking in terms of high-level file I/O. And yes, I agree: it is very inefficient from the perspective of the high-level file system. So don't do that. Treat LVM volumes as simple block devices like I described. If you won't do that then I can't help you.

--Rich P.
Re: [Discuss] lvm snapshot cloning
> On Oct 24, 2011, at 8:41 PM, ma...@mohawksoft.com wrote:
>> With snapshots, it's a bad design in that all snapshots receive all COW
>> blocks. So if you have two snapshots with the same basic content, there
>> will be duplicate data. This means that you have to allocate enough
>> space for the changes N times depending on the number of snapshots.
>
> This appears to be a bad design to you because you are thinking in terms
> of high-level file I/O. And yes, I agree: it is very inefficient from the
> perspective of the high-level file system. So don't do that. Treat LVM
> volumes as simple block devices like I described.

I can't understand your perspective here. We have various RAID drivers, we have linear drivers, sparse volumes, encrypted volumes, and so on, all implemented at the block device level. How are snapshots any more or less complex or problematic than a RAID5 or encrypted block device?

I've posted proof that any performance degradation on LVM volumes with a snapshot is only transient. You'll mostly never notice it in the real world because it's very unusual for large areas to undergo change in a short period of time. Obviously, this is not an absolute, but it is generally true. For instance, writing a whole 1TB volume at 160MB/s (best case in October 2011, SATA/SAS) will take about 2 hours. It could happen in some rare circumstance, but it's not typical.

I have not been able to find a good alternative. ZFS is stalled with license issues and Btrfs is not stable. Both are under Oracle. LVM2 works now and, with a little finesse, seems to do what is needed. To address your concerns about it being a block device, I see this as making the system more capable: you can put any file system you want on it.

Isn't it better to improve what is stable and working rather than wait for Oracle?
Re: [Discuss] lvm snapshot cloning
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
>
> Other systems chain snapshots so that at any time, only one snapshot is
> receiving COW data, and the subsequent snapshots only forward reference.

Also, other systems don't make you pre-allocate disk space for snapshot usage, and other systems don't automatically destroy your snapshot if you run out of pre-allocated space. LVM snapshot is crap compared to other systems - but if you want to make a single snapshot so you can dump or tar your ext3/4 filesystem in a consistent state or something, then it's good for that purpose. And not much else.
Re: [Discuss] lvm snapshot cloning
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
>
> I have not been able to find a good alternative. ZFS is stalled with
> license issues and Btrfs is not stable.

What application are you intending to use it for? Does it need to be Linux? In many situations, even if you're using Linux, it might be OK to run either user-mode ZFS or ZFS on a separate system connected via NFS or iSCSI or whatever.

Saying btrfs is unstable ... it's a matter of perspective, and I'll disagree. I would say it's lacking some important features still, but I would not call it unstable.
Re: [Discuss] lvm snapshot cloning
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>> bounces+blu=nedharvey@blu.org] On Behalf Of
>>
>> Other systems chain snapshots so that at any time, only one snapshot is
>> receiving COW data, and the subsequent snapshots only forward reference.
>
> Also, other systems don't make you pre-allocate disk space for snapshot
> usage, and other systems don't automatically destroy your snapshot if you
> run out of pre-allocated space. LVM snapshot is crap compared to other
> systems - but if you want to make a single snapshot so you can dump or tar
> your ext3/4 filesystem in a consistent state or something, then it's good
> for that purpose. And not much else.

As I have said before, most of the time copying the WHOLE volume is not a practical option. Even at a very, very fast rate of 160MB/s, copying a 1TB volume takes HOURS!

As for the pre-allocated space issue, yes, that is bad, but using alerts and monitoring to increase the allocated space will prevent the snapshot from being nuked.

I'm not implying that LVM2 is great, but I do assert it is quite functional.
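The alerts-and-monitoring mitigation can be sketched as a tiny policy function a cron job might call after parsing `lvs` output. The 80% threshold and 50% growth factor are invented for illustration, not defaults of any LVM tool:

```python
def extend_amount(size_gb, used_pct, threshold=80.0, grow_pct=50):
    """Decide how many GB to grow a snapshot volume by.

    Once the COW area passes `threshold` percent full, grow it by
    `grow_pct` percent of its current size; otherwise do nothing.
    Purely illustrative thresholds.
    """
    if used_pct < threshold:
        return 0
    return max(1, size_gb * grow_pct // 100)

# A monitor might run `lvs --noheadings -o lv_name,lv_size,snap_percent`
# and, whenever extend_amount() returns a non-zero N, invoke:
#   lvextend -L+<N>G /dev/<vg>/<snapshot>
print(extend_amount(2, 85.0))   # grow a 2G snapshot that is 85% full
print(extend_amount(2, 40.0))   # still under threshold: do nothing
```

The polling interval is the weak point, as noted later in the thread: if the snapshot fills faster than the monitor runs, it is reaped anyway.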
Re: [Discuss] lvm snapshot cloning
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>> bounces+blu=nedharvey@blu.org] On Behalf Of
>>
>> I have not been able to find a good alternative. ZFS is stalled with
>> license issues and Btrfs is not stable.
>
> What application are you intending to use it for? Does it need to be
> Linux?

Well, not technically, but from a cost/benefit sort of analysis, I don't know of a better platform.

> In many situations, even if you're using Linux, it might be OK to run
> either user-mode ZFS or ZFS on a separate system connected via NFS or
> iSCSI or whatever.

Still concerned over support and Oracle. When that is ironed out, I'll be able to reconsider it.

> Saying btrfs is unstable ... it's a matter of perspective, and I'll
> disagree. I would say it's lacking some important features still, but I
> would not call it unstable.

Please note, I did not call it "unstable," I said it was not "stable." The difference is subtle but important. The Btrfs branch is not considered "stable" yet.
Re: [Discuss] lvm snapshot cloning
On Oct 24, 2011, at 10:42 PM, ma...@mohawksoft.com wrote:
> I can't understand your perspective here. We have various RAID drivers, we
> have linear drivers, sparse volumes, encrypted volumes, and so on. All
> implemented at the block device level.

Then let me paraphrase it: LVM is a logical partition manager.

> How are snapshots any more or less complex or problematic than a RAID5 or
> encrypted block device?

Practical example: create your "master" volume (partition) with 1TB. Create a snapshot of the master, call it "a" and give it 100GB. Create another snapshot of the master, call it "b" and give it 1GB. The snapshots are created such that a does COW against master and b is mostly pointers to volume a's blocks so that you don't have the duplicated blocks.

Now, delete volume a. Or copy 100GB+1 byte to volume a. This will trigger LVM's reaper, which prunes the snapshot to ensure that there is no data loss on master.

--Rich P.
Re: [Discuss] lvm snapshot cloning
> On Oct 24, 2011, at 10:42 PM, ma...@mohawksoft.com wrote:
>> I can't understand your perspective here. We have various RAID drivers,
>> we have linear drivers, sparse volumes, encrypted volumes, and so on.
>> All implemented at the block device level.
>
> Then let me paraphrase it: LVM is a logical partition manager.

No, it's a "Logical Volume Manager." Come on, let's be constructive.

LVM snapshots are, in essence, sparsely allocated block devices with a "master" device acting as the background fill. And yes, there is COW when blocks get written in the master or snapshot, but that's about it.

>> How are snapshots any more or less complex or problematic than a RAID5
>> or encrypted block device?
>
> Practical example: create your "master" volume (partition) with 1TB.
> Create a snapshot of the master, call it "a" and give it 100GB. Create
> another snapshot of the master, call it "b" and give it 1GB. The
> snapshots are created such that a does COW against master and b is mostly
> pointers to volume a's blocks so that you don't have the duplicated
> blocks.

I wish this were true, but from what I've been able to learn, two different snapshots will each have their own copies of the COW data.

On other systems, yes, it's a chain:

master <- b <- a

where 'a' is older than 'b'.

On LVM it's more like this:

master
  ^  ^
  a  b

where they equally point to the master. Each snapshot stores its data in its own COW image. I don't see any provision for multiple snapshots to share COW data.

> Now, delete volume a. Or copy 100GB+1 byte to volume a. This will trigger
> LVM's reaper which prunes the snapshot to ensure that there is no data
> loss on master.

The snapshot has no effect on the master, and yes, we've already said, and we already know, that it is a weakness in LVM that if you don't extend your snapshots you lose them. This can be mitigated by monitoring and automatic volume extension.
BTW: Thanks. I have been looking at LVM, and these discussions have forced me to really look closely at it. I'm still not decided, but if I decide it is the way we want to go, I will owe a lot to the debate and practice on BLU!
Re: [Discuss] lvm snapshot cloning
I should discuss more what I want to do with LVM, and maybe it will make more sense.

First, the statement of the problem: hard disks are big and fundamentally slow.

Suppose you have a 1TB hard disk. How on earth do you back that up? If you use a 1TB USB hard disk (USB 2.0), at best you'll get about 30MB/s. (You won't get that fast, but it is a good round number for discussion.) 1TB is 1,099,511,627,776 bytes. At 30MB/s, that's a little over 9 1/2 hours to back up.

If you use a dedicated high-speed SATA or SAS drive, you may get a whopping 160MB/s. That's 1 hour 49 minutes. Think now if you want to pipe that up to the cloud for off-site backup.

With snapshots, you can back up a consistent state, as if your 2-hour backup happened instantaneously, but you are still writing a lot of data. However, there is a better optimization here. The snapshot device "knows" what is different. You don't really have to back up 1TB every time; you only have to back up the changes to it since the last full backup.

I can't give too much away, but I think you get my drift. If the data only changes 5% a day, and you can track that 5% a day, you can make an effective backup of 1TB of data in 29 minutes to the USB drive. Think about it.

LVM is more than capable of doing this.
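The "only back up what changed" idea boils down to turning a snapshot's COW exception list into a set of byte ranges to read from the origin. This is a sketch of that step only; how the exception list is extracted from the COW device is left out, and the chunk numbers here are supplied by hand:

```python
def changed_ranges(old_chunks, chunk_bytes):
    """Turn rewritten origin chunk numbers into byte ranges to back up.

    Each COW exception records one origin chunk that has changed since
    the snapshot was taken.  Adjacent chunks are merged so the backup
    can do large sequential reads.  A sketch of the idea, not the
    poster's actual tool.
    """
    ranges = []
    for c in sorted(set(old_chunks)):
        start = c * chunk_bytes
        if ranges and ranges[-1][1] == start:
            ranges[-1][1] = start + chunk_bytes  # extend the current run
        else:
            ranges.append([start, start + chunk_bytes])
    return [tuple(r) for r in ranges]

# 4KiB chunks; chunks 2 and 3 merge into one run, chunk 7 stands alone
print(changed_ranges([3, 2, 7], 4096))
# -> [(8192, 16384), (28672, 32768)]
```

With 5% daily churn, only those ranges need to cross the USB bus, which is where the 29-minute figure above comes from.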
Re: [Discuss] lvm snapshot cloning
> From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com]
>
>>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>>> bounces+blu=nedharvey@blu.org] On Behalf Of
>>>
>>> Other systems chain snapshots so that at any time, only one snapshot is
>>> receiving COW data, and the subsequent snapshots only forward reference.
>>
>> Also, other systems don't make you pre-allocate disk space for snapshot
>> usage, and other systems don't automatically destroy your snapshot if you
>> run out of pre-allocated space. LVM snapshot is crap compared to other
>> systems - but if you want to make a single snapshot so you can dump or tar
>> your ext3/4 filesystem in a consistent state or something, then it's good
>> for that purpose. And not much else.
>
> As I have said before, most of the time copying the WHOLE volume is not a
> practical option.

I don't know why you would mention that. How does it relate to anything that was said above?

>> What application are you intending to use it for? Does it need to be
>> Linux?
>
> Well, not technically, but from a cost/benefit sort of analysis, I don't
> know of a better platform.

You could use OpenIndiana, or Nexenta, or FreeBSD. These are stable free (and commercial) platforms that do ZFS.

>> In many situations, even if you're using Linux, it might be OK to run
>> either user-mode ZFS or ZFS on a separate system connected via NFS or
>> iSCSI or whatever.
>
> Still concerned over support and Oracle. When that is ironed out, I'll be
> able to reconsider it.

What specifically is the concern? Cost of Oracle support? Quality of support? (Compared to Red Hat support, which is ... on par with Oracle support.) What specifically is the concern?
Re: [Discuss] lvm snapshot cloning
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
>
> Suppose you have a 1TB hard disk. How on earth do you back that up? If you
> use a 1TB USB hard disk (USB 2.0), at best you'll get about 30MB/s.
> (You won't get that fast, but it is a good round number for discussion.)

Yes, what you need is something that does incremental snapshots. Presently ZFS, soon btrfs, and I heard you say it can be done with LVM, although I've never seen that done. Also, I think there's probably a solution based on VSS. And of course WAFL and other enterprise products.

Additionally, there are lots of "potential" solutions. For example, use FUSE to mount a filesystem, and because it's running through FUSE, you'll be able to keep track of which blocks change and then you can perform optimal incrementals, etc. But I'm not aware of anything that is actually implemented this way.

> 1TB is 1,099,511,627,776 bytes. At 30MB/s, that's a little over 9 1/2
> hours to back up.
>
> If you use a dedicated high-speed SATA or SAS drive, you may get a
> whopping 160MB/s. That's 1 hour 49 minutes.

Take it for granted, you won't be stuck at USB2 speeds. USB has universally (to the extent I care about) been replaced by USB3, since about 2-3 years ago. The bottleneck is the speed of the external disk, which I typically measure at 1Gbit/sec per disk. (You can sometimes parallelize using multiple disks depending on your circumstances.) It's random I/O that really hurts you.

> With snapshots, you can back up a consistent state, as if your 2-hour
> backup happened instantaneously, but you are still writing a lot of data.

Right. Presently at work we have a bunch of ZFS and NetApp systems that replicate globally every hour and use various forms of WAN acceleration, compression, and dedup. Even though it is optimized, it is a very significant, noticeable strain on the WAN.
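For comparison, the ZFS version of "ship only the deltas" is a one-liner built around `zfs send -i`, which sends just the blocks that changed between two snapshots. The host and dataset names below are invented for illustration:

```python
def zfs_incremental_cmd(dataset, prev_snap, new_snap, target):
    """Build a `zfs send -i` incremental replication pipeline.

    `zfs send -i old new` emits only the changes between the two
    snapshots; piping it to `zfs receive` on another host replicates
    them.  Names are hypothetical placeholders.
    """
    return (f"zfs send -i {dataset}@{prev_snap} {dataset}@{new_snap}"
            f" | ssh {target} zfs receive -F tank/backup")

print(zfs_incremental_cmd("tank/data", "mon", "tue", "backuphost"))
```

This is the mechanism behind the hourly global replication described above; the LVM discussion in this thread is essentially about recreating the same capability at the block-device level.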
Re: [Discuss] lvm snapshot cloning
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss- >> bounces+blu=nedharvey@blu.org] On Behalf Of >> If you use a dedicated high speed SATA or SAS drive, you may get a >> whopping 160MB/s. That's 1 hour 49 minutes. > > Take it for granted, you won't be stuck at USB2 speeds. USB has > universally > (to the extent I care about) been replaced by USB3, since about 2-3 years > ago. The bottleneck is the speed of the external disk, which I typically > measure at 1Gbit/sec per disk. (You can sometimes parallelize using > multiple disks depending on your circumstances.) It's random IO that > really > hurts you. If you can get more than 160MB/s (sustained) on anything other than exotic hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not possible with currently available COTS hardware. Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if you can manage to read/write to disk cache, you can get cool performance, but if you are doing backups, you will blow out cache quite quickly. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
On Oct 25, 2011, at 7:51 PM, ma...@mohawksoft.com wrote: > > The snapshot has no effect on the master, and yes, we've already said and > we already know it is a weakness in LVM that if you don't extend your > snapshots you lose them. This can be mitigated by monitoring and automatic > volume extension. You missed it. This isn't about what happens to master. It's what happens to b when a disappears. If master<-a<-b and a disappears due to reaping then b becomes useless. Or b is reaped, too. Either way you're dealing with data loss. This is why LVM will not do what you originally asked about. Monitoring has problems. If the volume fills up faster than the monitor polls capacity then you lose your data. If the volume fills up faster than it can be extended then you lose your data. If the volume cannot be extended because the volume group has no more extents available then you lose your data. Like I wrote at the start: LVM will quite happily bite your face off. Now, to address your most recent question: How do I back up a 1TB disk? Think about this: how do you intend to do a restore from this backup? The most important part of a backup system is being able to restore from backup in a timely fashion. I have in production a compute server with two 8TB file systems and a 9TB file system, all sitting on LVM volumes. I have an automated backup that runs every night on this server. It's an incremental file system backup so I'm only backing up the changes every night. This is, as you might expect, considerably faster than trying to do full backups of 25TB every night -- which I can't because it would take three days to do it. On smaller capacity volumes, in the several hundred GB range, I use rsnapshot to do incremental file snapshots to a storage server. Again, I don't back up the raw disk partitions every time. I only back up the changed files. In both cases -- and in fact with all my backups -- they are file level backups. 
The reason being that if I need to restore a single file or directory then I don't have to rebuild the entire volume to do so. I can restore as little or as much as I need to recover from a mistake or a disaster. Suppose the case of a live volume that needs to be in a frozen state for doing a backup. Database servers are prime examples of this. Here, I would freeze the database, make a snapshot of the underlying volume, and then thaw the database. Now I can do my backup of the read-only snapshot volume without interfering with the running system. I would delete the snapshot when the backup is complete. If I were using plain LVM and ext3 for my users' home directories then I would do something similar with read-only snapshots. There would be no freeze step, and I would keep several days worth of snapshots on the file server to make error recovery faster than going to tape or network storage. As it is, I use OpenAFS which has file system snapshots so I don't need to do any of this and users can go back in time just by looking in .clone in their home directories. I still have nightly backups to tape for long-term archives. Now, time to poke holes in your proposal. I have a physics graduate student doing his thesis research project on a shared compute server along with a dozen others. They collectively have 7.5TB of data on there. This is a real-world case on the aforementioned compute server. Said student accidentally wipes out his entire thesis project, 200GB worth of files. It's 9:30 PM and he needs his files by 8am or he fails his thesis defense, doesn't graduate and I'm looking for a new job. With my file level backup system I can have his files restored within a couple of hours at the outside without affecting anyone else's work. With your volume level backup system I would spend the night on Monster looking for a new job. The problem with it is that I can't restore individual files because it isn't individual files that are backed up. It's the disk blocks. 
I can't just drop those backed-up blocks onto the volume. Here: master->changes->changes->changes \->backup If I dumped the backup blocks onto the volume then I'd scramble the file system. Restoration would require me to replicate the entire volume at the block level as it was when the backup was made. This would destroy all the other researchers' work done in the past however many hours since that backup was made. I would fire myself for gross incompetence if I were relying on this kind of backup system. It's that bad. It gets worse. What happens when the whole thing fails outright? Total disaster on your 1TB disk. Now it's not just 29 minutes to restore last night's blocks. It's two hours to restore the initial replica and then 30 minutes times however many deltas have been made. Six deltas means 5 hours to do a full rebuild. I can do a complete restore from TSM or rsnapshot in half that time, maybe less depending on how much d
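The freeze/snapshot/thaw cycle Rich describes can be sketched with stock tools. This is a hedged outline, not his exact procedure: the volume group, volume, and mount point names (vg0, dbvol, /srv/db) are hypothetical, and a real database would want its own quiesce command rather than a bare filesystem freeze.

```shell
# Freeze the filesystem so the snapshot captures a consistent image.
fsfreeze --freeze /srv/db

# Snapshot the underlying volume; 10G of COW space absorbs writes
# made while the backup runs.
lvcreate --snapshot --name dbsnap --size 10G /dev/vg0/dbvol

# Thaw immediately; applications stall only for the moment above.
fsfreeze --unfreeze /srv/db

# Back up the read-only snapshot at leisure, then discard it.
mount -o ro /dev/vg0/dbsnap /mnt/dbsnap
tar -czf /backup/db-$(date +%F).tar.gz -C /mnt/dbsnap .
umount /mnt/dbsnap
lvremove --force /dev/vg0/dbsnap
```

The running system sees only the brief freeze; the long tar runs against the frozen snapshot image.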
Re: [Discuss] lvm snapshot cloning
> On Oct 25, 2011, at 7:51 PM, ma...@mohawksoft.com wrote: >> >> The snapshot has no effect on the master, and yes, we've already said >> and >> we already know it is a weakness in LVM that if you don't extend your >> snapshots you lose them. This can be mitigated by monitoring and >> automatic >> volume extension. > > You missed it. This isn't about what happens to master. It's what > happens to b when a disappears. If master<-a<-b and a disappears due to > reaping then b becomes useless. Or b is reaped, too. Either way you're > dealing with data loss. This is why LVM will not do what you originally > asked about. Actually, in LVM 'a' and 'b' are completely independent; each has its own copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice versa. If you know the behaviour of your system, you could allocate a large percentage of the existing volume size (even 100%) and mitigate any risk. You would get your snapshot quickly and still have full backing. > > Monitoring has problems. If the volume fills up faster than the monitor > polls capacity then you lose your data. If the volume fills up faster > than it can be extended then you lose your data. If the volume cannot be > extended because the volume group has no more extents available then you > lose your data. Like I wrote at the start: LVM will quite happily bite > your face off. Allocate more space at the beginning. It will never be bigger than the original volume. > > Now, to address your most recent question: > > How do I back up a 1TB disk. Think about this: how do you intend to do a > restore from this backup? The most important part of a backup system is > being able to restore from backup in a timely fashion. That all depends on whether we are recovering from a "disaster," i.e. loss of the main data, or providing access to a previous incarnation within a working environment. 
In a "working environment" it is possible to use LVM snapshots, backed with enough storage, of course, to receive the deltas from where it is to where it should be. In a disaster, well, brute force will save the day. > > I have in production a compute server with two 8TB file systems and a 9TB > file system, all sitting on LVM volumes. I have an automated backup that > runs every night on this server. It's an incremental file system backup > so I'm only backing up the changes every night. This is, as you might > expect, considerably faster than trying to do full backups of 25TB every night -- > which I can't because it would take three days to do it. This works when you have access to the file system. If you expose an LVM volume with something like iSCSI, you'd best hope that Linux has a file system driver for it. Even so, you will be backing up full files. In a system with large files, like a database server, you'll be backing up a lot of data that doesn't need to be backed up. > > On smaller capacity volumes, in the several hundred GB range, I use > rsnapshot to do incremental file snapshots to a storage server. Again, I > don't back up the raw disk partitions every time. I only back up the > changed files. Most file formats don't change too much. This is, of course, a generalization, with a number of exceptions. I refer back to the database example. > > In both cases -- and in fact with all my backups -- they are file level > backups. The reason being that if I need to restore a single file or > directory then I don't have to rebuild the entire volume to do so. I can > restore as little or as much as I need to recover from a mistake or a > disaster. I can understand this; however, snapshots work too. One of the utilities I am writing allows you to clone a snapshot, assuming you always have one snapshot that serves as a point-in-time reference. You can clone that snapshot and then apply diffs to get back to a specific point in time. 
While it is a judgement call, it is not clear which is more efficient. In the case of a database server, it may take much longer to restore a full file than it does to apply diffs. > > Suppose the case of a live volume that needs to be in a frozen state for > doing a backup. Database servers are prime examples of this. Here, I > would freeze the database, make a snapshot of the underlying volume, and > then thaw the database. Now I can do my backup of the read-only snapshot > volume without interfering with the running system. I would delete the > snapshot when the backup is complete. That is the most direct method of using snapshots, yes. > > If I were using plain LVM and ext3 for my users' home directories then I > would do something similar with read-only snapshots. There would be no > freeze step, and I would keep several days worth of snapshots on the file > server to make error recovery faster than going to tape or network > storage. As it is, I use OpenAFS which has file system snapshots so I > don't need to do any of this and users can go back in time just by looking > in .clone in their home dire
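The monitoring-and-extension mitigation discussed above can be sketched in a few lines of shell. This is a hedged sketch: the snapshot name vg0/snap, the poll interval, and the threshold are all hypothetical, and the polling race Rich describes (the snapshot filling faster than the monitor polls) is exactly what this does not protect against.

```shell
#!/bin/sh
# Poll the snapshot's COW usage and grow it before it overflows.
THRESHOLD=80    # percent full at which we extend
while sleep 30; do
    # lvs reports snapshot usage as e.g. "42.15"; keep the integer part.
    pct=$(lvs --noheadings -o snap_percent vg0/snap | tr -d ' ')
    if [ "${pct%%.*}" -ge "$THRESHOLD" ]; then
        lvextend --size +1G /dev/vg0/snap
    fi
done
```

Newer LVM releases can do this themselves via `snapshot_autoextend_threshold` in lvm.conf, which avoids the polling loop entirely.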
Re: [Discuss] lvm snapshot cloning
On Tue, Oct 25, 2011 at 11:20 PM, wrote: > Obviously, there are pros and cons to various approaches. Your approach is > only faster if you know that the file in question was recently modified. > If that is the case, you lucked out. What happens if the source of the > calculations, which is also needed, hasn't been modified in some time? Are you serious? If the source of the calculations hasn't been modified in some time, then what you do is restore it from the backups. Same as if it was modified recently. The older files weren't backed up recently because they hadn't changed. They didn't get erased from the backups, so they're still available for restoration, just like the recently changed files. Did you think that all backups older than one day were somehow erased? -- John Abreau / Executive Director, Boston Linux & Unix OLD GnuPG KeyID: D5C7B5D9 / Email: abre...@gmail.com OLD GnuPG FP: 72 FB 39 4F 3C 3B D6 5B E0 C8 5A 6E F1 2C BE 99 2011 GnuPG KeyID: 32A492D8 / Email: abre...@gmail.com 2011 GnuPG FP: ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
> On Tue, Oct 25, 2011 at 11:20 PM, wrote: > >> Obviously, there are pros and cons to various approaches. Your approach >> is >> only faster if you know that the file in question was recently >> modified. >> If that is the case, you lucked out. What happens if the source of the >> calculations, which is also needed, hasn't been modified in some time? > > > Are you serious? If the source of the calculations hasn't been modified > in some time, then what you do is restore it from the backups. Same as > if it was modified recently. I was thinking that, if you are doing an incremental backup, it may not be present on the current backup medium, and then you'd have to search whatever catalogue system you have to find where it is. > > The older files weren't backed up recently because they hadn't changed. > They didn't get erased from the backups, so they're still available for > restoration, just like the recently changed files. > > Did you think that all backups older than one day were somehow erased? Typically, you ship some backups off-site to ensure recovery. > > > > > -- > John Abreau / Executive Director, Boston Linux & Unix > OLD GnuPG KeyID: D5C7B5D9 / Email: abre...@gmail.com > OLD GnuPG FP: 72 FB 39 4F 3C 3B D6 5B E0 C8 5A 6E F1 2C BE 99 > 2011 GnuPG KeyID: 32A492D8 / Email: abre...@gmail.com > 2011 GnuPG FP: > ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
On Oct 25, 2011, at 11:20 PM, ma...@mohawksoft.com wrote: > > Actually, in LVM 'a' and 'b' are completely independent, each have their > own copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice > virca. Rather, this is how LVM works *because* of this situation. If LVM supported snapshots of snapshots then it'd be trivially easy to shoot yourself in the foot. > If you know the behaviour of your system, you could allocate a large > percentage of the existing volume size (even 100%) and mitigate any risk. > You would get your snapshot quickly and still have full backing. So for your hypothetical 1TB disk, let's assume that you actually have 1TB of data on it. You would need two more 1TB disks for each of the two snapshots. This would be unscalable to my 25TB compute server. I would need another 25TB+ to implement your scheme. This is a case where I can agree that yes, it is possible to coerce LVM into doing it but that doesn't make it useful. > In a disaster, well, brute force will save the day. My systems work for both individual files and total disaster. I've proven it. >> don't need to do any of this and users can go back in time just by looking >> in .clone in their home directories. I still have nightly backups to tape >> for long-term archives. > > Seems complicated. It isn't. It's a single AFS command to do the nightly snapshot and a second to run the nightly backup against that snapshot. > totally wrong!!! > > lvcreate -s -n disaster -L1024G /dev/vg0/phddata > (my utility) > lvclonesnapshot /dev/mapper/vg0-phdprev-cow /dev/vg0/disaster > > This will apply historical changes to the /dev/vg0/disaster, the volume > may then be used to restore data. Wait... wait... so you're saying that in order to restore some files I need to recreate the "disaster" volume, restore to it, and then I can copy files back over to the real volume? > You have a similar issue with file system backups. You have to find the > last time a particular file was backed up. 
> Yes, and it should be MUCH faster!!! I agree. *Snrk*. Neither of these are true. I don't have to "find" anything. I pick a point in time between the first backup and the most recent, inclusive, and restore whatever I need. Everything is there. I don't even need to look for tapes; TSM does all that for me. 400GB/hour sustained restore throughput is quite good for a network backup system. Remember, ease of restoration is the most important component of a backup system. Yours, apparently, fails to deliver. --Rich P. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
> On Oct 25, 2011, at 11:20 PM, ma...@mohawksoft.com wrote: >> >> Actually, in LVM 'a' and 'b' are completely independent, each have their >> own copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice >> versa. > > Rather, this is how LVM works *because* of this situation. If LVM > supported snapshots of snapshots then it'd be trivially easy to shoot > yourself in the foot. Actually, I'm currently working on a system that does snapshots of snapshots. It's not LVM, obviously, but it quietly resolves an interior copy being removed or failing. It's a very "enterprise" system. > >> If you know the behaviour of your system, you could allocate a large >> percentage of the existing volume size (even 100%) and mitigate any >> risk. >> You would get your snapshot quickly and still have full backing. > > So for your hypothetical 1TB disk, let's assume that you actually have 1TB > of data on it. You would need two more 1TB disks for each of the two > snapshots. This would be unscalable to my 25TB compute server. I would > need another 25TB+ to implement your scheme. This is a case where I can > agree that yes, it is possible to coerce LVM into doing it but that > doesn't make it useful. Well, we all know that disks do not change 100% very quickly, or at all. It's typically a very small percentage per day, even on active systems. So the process is to back up diffs using analysis of two snapshots: a start point and an end point. Just keep recycling the start point. > > >> In a disaster, well, brute force will save the day. > > My systems work for both individual files and total disaster. I've proven it. Yes, backups that maintain data integrity work. That's sort of the job. The issue is reducing the amount of data that needs to be moved each time. With a block level backup, you move only the blocks. With a file level backup you move the whole files. Now, if the files are small, a file level backup will make sense. 
If the files are large, like VMs or databases, a block level backup makes sense. > >>> don't need to do any of this and users can go back in time just by >>> looking >>> in .clone in their home directories. I still have nightly backups to >>> tape >>> for long-term archives. >> >> Seems complicated. > > It isn't. It's a single AFS command to do the nightly snapshot and a > second to run the nightly backup against that snapshot. > > > >> totally wrong!!! >> >> lvcreate -s -n disaster -L1024G /dev/vg0/phddata >> (my utility) >> lvclonesnapshot /dev/mapper/vg0-phdprev-cow /dev/vg0/disaster >> >> This will apply historical changes to the /dev/vg0/disaster, the volume >> may then be used to restore data. > > Wait... wait... so you're saying that in order to restore some files I > need to recreate the "disaster" volume, restore to it, and then I can copy > files back over to the real volume? I can't tell from the snip the whole example, but I think I was saying that I could clone a snapshot, apply historical blocks to it, and then you'd be able to get a specific version of a file from it. Yes. If you are backing up many small files, rsync works well. If you are backing up VMs, databases, or iSCSI targets, a block level strategy works better. > > >> You have a similar issue with file system backups. You have to find the >> last time a particular file was backed up. > >> Yes, and it should be MUCH faster!!! I agree. > > *Snrk*. Neither of these are true. I don't have to "find" anything. I > pick a point in time between the first backup and the most recent, > inclusive, and restore whatever I need. Everything is there. I don't > even need to look for tapes; TSM does all that for me. 400GB/hour > sustained restore throughput is quite good for a network backup system. 400GB/hour? I doubt that number, but ok. It is still close to three hours. > > Remember, ease of restoration is the most important component of a backup > system. Yours, apparently, fails to deliver. 
Not really, we have been discussing technology. We haven't even been discussing user facing stuff. The difference is what you plan to do, I guess. I'm not backing up many small files. Think of it this way. A 2TB drive is less than $80 and about $0.25 a month in power. The economies open up a number of possibilities. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
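Lacking the COW-aware utility described above, the start-point/end-point diff can be approximated with stock tools, at the cost of reading both devices end to end. This is a hedged sketch; the snapshot device names are hypothetical, and a real implementation would read only the exception store instead of scanning everything:

```shell
#!/bin/sh
# Crude block delta between two snapshots of the same origin volume.
# cmp -l lists every differing byte offset (1-based); collapse those
# to 4 KiB block numbers, then copy each changed block from the newer
# snapshot into a delta directory.
OLD=/dev/vg0/snap-start
NEW=/dev/vg0/snap-end
BS=4096
mkdir -p delta
cmp -l "$OLD" "$NEW" \
  | awk -v bs=$BS '{ b = int(($1 - 1) / bs) } NR == 1 || b != last { print b; last = b }' \
  | while read -r block; do
        dd if="$NEW" of="delta/$block" bs=$BS skip="$block" count=1 2>/dev/null
    done
```

Restoring is the reverse: dd each saved block back onto a clone at the recorded offset. The transfer size is proportional to the churn, which is the whole argument for block-level deltas.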
Re: [Discuss] lvm snapshot cloning
On Tue, Oct 25, 2011 at 11:20:40PM -0400, ma...@mohawksoft.com wrote: > > > > Data point: It takes ~19 hours to restore 7.5TB from enterprise-class > > Tivoli Storage Manager over 1GB Ethernet to a 12x1TB SATA (3Gb/s) RAID 6 > > volume. I had to do it this past spring after the RAID controller on that > > volume went stupid and corrupted the whole thing. > > Yes, and it should be MUCH faster!!! I agree. The bottleneck there is the gigabit ethernet: 1000 Mb/s * 3600 s/h * 1 B/8 b = 450,000 MB/h, or 450 GB/h. 7500/450 = 16.7 hours So the absolute best you could have done ignoring all overhead was 16.7 hours, and it took 19. Not awful. On the other hand, going to 10GE doesn't move the bottleneck to the disks -- if you can get 160MB/s per spindle and 10 of the 12 spindles effective, that's: 1600 MB/s * 3600 s/h = 5,760,000 MB/h, or 5760 GB/h which is still more than the 4500GB/h you might get from theoretical no-overhead 10GE. Assume 10GE with the same 87% efficiency as GE and you get 3.9TB/h, or about two hours to do your 7.5TB restore. OK, so 10GE is a win for time, no surprise. However, GE is essentially free (your motherboards have it, your switching infrastructure is in place) but 10GE involves $600/port upgrades to the NIC (if you have room) and $1000/port switch upgrades. Good thing to have next time around, probably not a routine upgrade for most operations. -dsr- -- http://tao.merseine.nu/~dsr/eula.html is hereby incorporated by reference. You can't fight for freedom by taking away rights. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
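Dan's arithmetic is easy to re-check with shell integer math (values truncate rather than round):

```shell
# GE ceiling: 1000 Mb/s over an hour, expressed in GB/h.
echo $(( 1000 / 8 * 3600 / 1000 ))      # 450 GB/h
# Hours (in tenths) to move 7.5 TB at that rate.
echo $(( 7500 * 10 / 450 ))             # 166, i.e. ~16.6 hours
# Ten effective spindles at 160 MB/s, in GB/h.
echo $(( 10 * 160 * 3600 / 1000 ))      # 5760 GB/h
```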
Re: [Discuss] lvm snapshot cloning
On Wed, Oct 26, 2011 at 8:26 AM, wrote: > 400GB/hour? I doubt that number, but ok. It is still close to three hours. > Doubt if you want but I did it last spring. 7.5TB -- ~7500GB -- restored over 19 hours and change. That's 400GB/hour average throughput. Over a network. From a tape-based storage system. That's what an enterprise-class backup system can do. I see where you are going with this. One of the historically canonical examples is database servers that use raw disk for storage. A block-incremental backup mechanism would have been very useful 15-20 years ago when this was more common. Today, hardly anyone does it this way. Today, standard procedures are to either dump the DB to a flat file and back that up or to perform the freeze/snapshot/thaw cycle and back up the snapshot. This may not be the most efficient way to do it. On the other hand, if I have so much data to back up that efficiency would be an issue then I already have sufficient resources in my backup system to make it not be an issue. You mention virtual machines. I can see this if you want to back up the containers. The thing is, using an operating system's native tools is always my preferred choice for making backups. It may be inefficient at times but it ensures that I can always recover the backup correctly. More generally, it avoids issues with being locked into specific technologies. Going back to the example of running a FreeBSD domU on a Linux dom0. With LVM-based block backups I am locked into using LVM-based block restore for recovery. I can't restore this domU onto a FreeBSD or Solaris dom0 or even onto a real FreeBSD physical machine. On the other hand, if I use OS tools to do the backup then I can restore it anywhere that I please. It is clear to me that we have different philosophies about backup systems. In my mind, efficiency is all well and good but it will always take a back seat to the ease of restoring backups and the ease of creating them. 
I've had to recover from too many disasters and too many stupid mistakes -- some of them my own -- to see it any other way. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
On Wed, Oct 26, 2011 at 9:42 AM, Dan Ritter wrote: > So the absolute best you could have done ignoring all overhead was 16.7 > hours, and it took 19. Not awful. > Indeed. TSM took about 20 minutes at the start to build the restore index, but it fairly screamed once it got going moving data around. TSM uses two "threads" during the restore process, one doing the actual read from storage and send to host and one scanning the archive system for the next file to restore. It does a very good job keeping throughput up. > 1600MB/s * 3600s/h = 5760GB/h which is still more than the > 4500GB/h you might get from theoretical no-overhead 10GE. > A little less, actually. It's 11x1TB in RAID 6. I have a hot spare on the controller. Still, close enough for an approximation. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
> From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com] > > If you can get more than 160MB/s (sustained) on anything other than exotic > hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not > possible with COTS hardware that is available. > > Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if > you can manage to read/write to disk cache, you can get cool performance, > but if you are doing backups, you will blow out cache quite quickly. Go measure it before you say any more, because I've spent a lot of time in the last 4 years benchmarking disks. I can say the typical sequential throughput, read or write, for nearly all disks (7.2krpm sata up to 15krpm sas) is 1.0 Gbit/sec. Sustained sequential read/write. For, let's say, the entire disk. Or at least tens of GB. Even laptops (7.2krpm sata dell) are able to sustain this speed. ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
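"Go measure it" is a one-liner with dd. A hedged sketch: /dev/sdb is a hypothetical disk, and `iflag=direct` is the important part, since it bypasses the page cache that would otherwise inflate the number:

```shell
# Sustained sequential read over 10 GiB, bypassing the page cache.
# (A write test works the same way but destroys the target disk's data.)
dd if=/dev/sdb of=/dev/null bs=1M count=10240 iflag=direct
# GNU dd prints a summary such as "... copied, 89.2 s, 120 MB/s";
# that final figure is the sustained rate to compare, not a cached burst.
```

Reading tens of GB, as Ned suggests, keeps any drive-level cache from skewing the result.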
Re: [Discuss] lvm snapshot cloning
On Wed, Oct 26, 2011 at 03:18:06PM -0400, Edward Ned Harvey wrote: > > From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com] > > > > If you can get more than 160MB/s (sustained) on anything other than exotic > > hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not > > possible with COTS hardware that is available. > > > > Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if > > can can manage to read/write to disk cache, you can get cool performance, > > but if you are doing backups, you will blow out cache quite quickly. > > Go measure it before you say anymore. Because I've spent a lot of time in > the last 4 years benchmarking disks. I can say the typical sequential > throughput, read or write, for nearly all disks (7.2krpm sata up to 15krpm > sas) is 1.0 Gbit/sec. Sustained sequential read/write. For let's say, the > entire disk. Or at least tens of GB. > > Even laptops (7.2krpm sata dell) are able to sustain this speed. Erm. 1 Gb/s * 1024 Mb/Gb * 1 B/8 b = 128 MB/s. Anything which can do 160 is clearly capable of doing 128. You two are arguing in different directions. -dsr- ___ Discuss mailing list Discuss@blu.org http://lists.blu.org/mailman/listinfo/discuss
Re: [Discuss] lvm snapshot cloning
ma...@mohawksoft.com wrote:
> Suppose you have a 1TB hard disk. How on earth do you back that up?
> Think now if you want to pipe that up to the cloud for off-site backup.
>
> The snapshot device "knows" what is different. You don't have to really
> back up 1TB every time, you only have to back up the changes to it since
> the last full backup.

If you are at the point of considering developing software to do this,
instead of just using off-the-shelf solutions, then you should consider
using inotify[1]. I believe that using this library you could log to a
database the inodes that have been altered over a given period of time,
which another tool could then use to package up the data and send it to
your local or remote backup server.

Of course, without snapshots you'd need to take other steps to ensure
consistency, if that matters for your application.

It seems DRBD[2] would be another way to address this, though it's not
practical for a remote backup server, as it'll send every change to the
remote server in real time.

Generally, though, it seems like you are trying to reinvent ZFS, which
does both the snapshotting you want, as well as communicating
differential changes to a remote server. But I understand your
objections to ZFS.

1. https://github.com/rvoicilas/inotify-tools/wiki/
2. http://www.drbd.org/

> I can't give too much away...

Why? Are you developing something you plan to patent? Turning it into a
commercial product whose technology you want to keep a trade secret?

If those don't apply, then why not detail your scheme? Better to have the
idea vetted before you've invested a lot of time into building it.

-Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
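[A minimal sketch of the change-logging idea Tom describes, assuming the inotify-tools package from [1] is installed; the watch directory and log path are made-up examples. A real watcher would run indefinitely with -m; this sketch uses a timeout so it terminates, and it guards on inotifywait being present.]

```shell
# Log one line per filesystem change event under $WATCH; a backup tool
# could later read the log to decide what to ship. Paths are hypothetical.
WATCH=/tmp/watched
LOG=/tmp/changes.log
mkdir -p "$WATCH"
: > "$LOG"

if command -v inotifywait >/dev/null 2>&1; then
    # -q quiets startup chatter; without -m, inotifywait exits after the
    # first matching event (or after -t seconds), keeping the sketch finite.
    inotifywait -q -t 5 -r -e modify,create,delete \
        --format '%w%f %e' "$WATCH" >> "$LOG" &
    sleep 1
    touch "$WATCH/example.txt"     # generates a CREATE event
    wait
    cat "$LOG"                     # e.g. "/tmp/watched/example.txt CREATE"
fi
```

Note this identifies changed files, not changed blocks, which is the limitation discussed further down the thread.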
Re: [Discuss] lvm snapshot cloning
> ma...@mohawksoft.com wrote:
>> Suppose you have a 1TB hard disk. How on earth do you back that up?
>> Think now if you want to pipe that up to the cloud for off-site backup.
>>
>> The snapshot device "knows" what is different. You don't have to really
>> back up 1TB every time, you only have to back up the changes to it
>> since the last full backup.
>
> If you are at the point of considering developing software to do this,
> instead of just using off-the-shelf solutions, then you should consider
> using inotify[1]. I believe using this library you could log to a
> database the inodes that have been altered over a given period of time,
> which another tool could then use to package up the data and send it to
> your local or remote backup server.

That's for a file-level backup, which has been done to death.

> Of course without snapshots you'd need to take other steps to ensure
> consistency, if that matters for your application.
>
> Seems DRBD[2] would be another way to address this, though not practical
> for a remote backup server, as it'll send every change to the remote
> server in real time.
>
> Generally though, it seems like you are trying to reinvent ZFS, which
> does both the snapshotting you want, as well as communicating
> differential changes to a remote server. But I understand your
> objections to ZFS.
>
> 1. https://github.com/rvoicilas/inotify-tools/wiki/
> 2. http://www.drbd.org/
>
>> I can't give too much away...
>
> Why? Developing something you plan to patent? Turn into a commercial
> product and you want to keep the technology a trade secret?

Potentially.

> If those don't apply, then why not detail your scheme? Better to have
> the idea vetted before you've invested a lot of time into building it.

Well, let's just say I'm doing three things. I'm doing some research for
my current employer's product. I'm thinking about creating an open source
tool kit for managing a category of resources. Lastly, a "pro" version
($$$) of the aforementioned tool kit.

I have to be careful in that I do not overlap my current employer. (I have
NO intention to do so, but there is technology overlap, so I'm very
sensitive in how I phrase things.)

> -Tom
Re: [Discuss] lvm snapshot cloning
On Wed, Oct 26, 2011 at 11:49 PM, Tom Metro wrote:
> ma...@mohawksoft.com wrote:
>> Suppose you have a 1TB hard disk. How on earth do you back that up?
>> Think now if you want to pipe that up to the cloud for off-site backup.
>>
>> The snapshot device "knows" what is different. You don't have to really
>> back up 1TB every time, you only have to back up the changes to it
>> since the last full backup.
>
> If you are at the point of considering developing software to do this,
> instead of just using off-the-shelf solutions, then you should consider
> using inotify[1]. I believe using this library you could log to a
> database the inodes that have been altered over a given period of time,
> which another tool could then use to package up the data and send it to
> your local or remote backup server.

I'm not sure how inotify would help that much. Mark seems to be focused
on large files (database files, VM images). He knows they've changed, but
because the files are so large, for efficiency he wants to replicate/back
up only the blocks that have changed.

His proposed solution is to piggyback on LVM snapshots, which
automatically keep track of changes at block-level granularity. As I see
it, he's trying to bring back the functionality of the old dump/restore
backup programs, except in a filesystem-agnostic way. Maintaining
dump/restore became intractable when the number of filesystems exploded;
tracking LVM is likely to be easier.

Your suggestion of DRBD seems like another way to go, although, as you
note, the real-time nature of the data stream is an issue. Using LVM to
consolidate repeated changes to a single block is a possible advantage.

Of course, this is all based on guesswork by me about Mark's goals...

Bill Bogstad
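[For reference, the two-snapshot scheme Mark sketched earlier in the thread, as hedged shell. The volume group, volume name, and snapshot sizes (vg0, data, 50G) are hypothetical, and the final step -- replaying one snapshot's COW exceptions onto the other -- is exactly the piece LVM ships no tool for and Mark says he has custom code to do. The block guards on lvcreate and the volume group existing, so it's a no-op elsewhere.]

```shell
VG=vg0          # hypothetical volume group
LV=data         # hypothetical large logical volume
SNAP1=foo       # snapshot names from Mark's earlier description
SNAP2=bar

if command -v lvcreate >/dev/null 2>&1 && vgs "$VG" >/dev/null 2>&1; then
    # Snapshot 1, taken at time T0. -s makes a COW snapshot; -L sizes
    # the exception store, not a full copy of the origin.
    lvcreate -s -L 50G -n "$SNAP1" "/dev/$VG/$LV"

    # ... arbitrary time passes; the origin volume keeps changing ...

    # Snapshot 2, taken at time T1. Both now track the origin.
    lvcreate -s -L 50G -n "$SNAP2" "/dev/$VG/$LV"

    # Missing piece: apply $SNAP1's COW data to $SNAP2 so the two become
    # identical, sparsely populated snapshots. Stock LVM has no command
    # for this; that's where the custom code would go.
fi
```

The payoff over dd is that neither snapshot ever occupies more space than the changes it tracks.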
Re: [Discuss] lvm snapshot cloning
One thing I haven't seen on this thread is the "REFLINK" mechanism
introduced in the recent 1.6 release of OCFS2. (It's described in the
user guide:
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.6/ocfs2-1_6-usersguide.pdf)

The open-source developers have thought long and hard about this
performance problem and come up with an innovative solution for writable
snapshots. It doesn't directly solve some of the issues discussed here
(regarding cloning of snapshots), but it provides a better platform for
solving them. And the filesystem driver has been kernel-embedded for
several Linux releases.

-rich
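[The same copy-on-write cloning is also reachable from GNU cp, which may be the quickest way to try the idea: with --reflink=auto it clones the file's blocks where the filesystem supports it and silently falls back to an ordinary copy elsewhere. Paths below are throwaway examples.]

```shell
# Clone a file copy-on-write where the filesystem supports reflinks
# (e.g. OCFS2 1.6, btrfs); fall back to a plain copy everywhere else.
printf 'some file contents\n' > /tmp/original.dat
cp --reflink=auto /tmp/original.dat /tmp/clone.dat

# The clone is an independent file: on a reflink-capable filesystem it
# shares blocks until one side is written, exactly the sparse behavior
# being discussed for snapshots.
cmp /tmp/original.dat /tmp/clone.dat && echo "identical"
```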
Re: [Discuss] lvm snapshot cloning
Bill Bogstad wrote:
>> ...you should consider using inotify[1].
>
> I'm not sure how inotify would help that much. Mark...wants to
> replicate/backup only the blocks that have changed due to the files'
> large sizes.

You are correct. I was mistakenly thinking that inodes == blocks, which
of course they don't.

I haven't looked at the inotify library in a while, but it isn't out of
the realm of possibility that, in addition to the notification messages
identifying the inode, it may also identify the blocks or byte range
impacted by the I/O operation.

-Tom

--
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
Re: [Discuss] lvm snapshot cloning
On Sun, Oct 30, 2011 at 11:01 AM, Tom Metro wrote:
> Bill Bogstad wrote:
>
> You are correct. I was mistakenly thinking that inodes == blocks, which
> of course they don't.
>
> I haven't looked at the inotify library in a while, but it isn't out of
> the realm of possibility that in addition to the notification messages
> identifying the inode, it may also identify the blocks or byte range
> impacted by the I/O operation.

I checked the docs quickly, and no such luck. In any case, it would have
been more likely that inotify would inform you about byte ranges in the
file that changed, rather than blocks within the filesystem
implementation.

You could possibly use something like this to monitor files for backup
through the filesystem rather than through the block device, but it would
require monitoring all the directories on the filesystem. That is likely
to be too expensive to do through the inotify system in the general case.

Bill Bogstad