Re: [Discuss] lvm snapshot cloning

2011-10-23 Thread Richard Pieri
On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
> 
So, correct me if I'm wrong, is there a utility to create an active 1:1
copy of a snapshot in LVM2?

dd

Really.  LVM volumes are block devices, and the simplest way to make a 1:1 copy 
of a block device is with dd.
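
Something like this (a sketch; the volume group, names, and sizes are made up):

# make a destination LV the same size as the snapshot, then copy block-for-block
lvcreate -L 100G -n snapcopy vg0
dd if=/dev/vg0/snap of=/dev/vg0/snapcopy bs=4M conv=fsync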

--Rich P.



Re: [Discuss] lvm snapshot cloning

2011-10-23 Thread markw
> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
>>
>> So, correct me if I'm wrong, is there a utility to create an active 1:1
>> copy of a snapshot in LVM2?
>
> dd
>
> Really.  LVM volumes are block devices, and the simplest way to make a 1:1
> copy of a block device is with dd.

That may be the "easiest way," but it is certainly not the most efficient way.

Suppose you have a 1TB logical volume. You create a point-in-time snapshot
for testing or backup. At a later point you want to create a copy of that
snapshot to do some work on the data. A snapshot does not contain the data;
it only contains the old data from "copy on write" changes in the
real device.

Since the snapshot was created, the disk has changed very little. Your way
would take up a whole additional 1TB of space. OMG. If you can read/write
30MB/s, it would take you nearly 10 hours to copy.

The real solution, and I have code to do it, is this:

create a snapshot of device A, call it foo.
(arbitrary length of time passes)
create a second snapshot of A, call it bar.

Both these snapshots are now "tracking" the real volume. Now, you use the
COW data from the first snapshot and apply it to the second snapshot. This
will create two identical snapshots, both sparsely populated with only the
differences between themselves and the real volume. The only drawback is
that the two snapshots will contain duplicate COW data.
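
A sketch of the sequence (lvclonesnapshot here is my own utility, not a
stock LVM2 command; names and sizes are made up):

lvcreate -s -n foo -L10G /dev/vg0/A    # first snapshot of A
# ... arbitrary length of time passes ...
lvcreate -s -n bar -L10G /dev/vg0/A    # second snapshot of A
# replay foo's COW exceptions onto bar so both reflect foo's point in time
lvclonesnapshot /dev/mapper/vg0-foo-cow /dev/vg0/bar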


>
> --Rich P.




Re: [Discuss] lvm snapshot cloning

2011-10-23 Thread Richard Pieri
On Oct 23, 2011, at 8:31 PM, ma...@mohawksoft.com wrote:
> Both these snapshots are now "tracking" the real volume. Now, you use the
> COW data from the first snapshot and apply it to the second snapshot. This
> will create two identical snapshots, both sparsely populated with only the
> differences between themselves and the real volume. The only drawback is
> that the two snapshots will contain duplicate COW data.


The "only" drawback is that you're treating LVM as if it were a file system.  
Recall what I wrote yesterday about LVM biting off your face.  That's what's 
happening: LVM is gnawing on your nose right now.

An LVM volume is a dynamic disk partition, and LVM snapshots are transaction 
logs of the low level block changes.  These logs are presented to the OS as 
more block devices.  If you want to clone a snapshot then you have to treat it 
as the block device that it is.  What you really want is file system snapshots 
or a versioning file system.  LVM can accomplish neither because it's just a 
block device driver.

Unfortunately, neither of these exist for any of the main line file systems 
available in the Linux kernel.  ZFS isn't happening and Btrfs isn't ready for 
production.  Both are under Oracle's thumb, and the other options are patches 
against the main kernel source that you have to apply and build yourself.

--Rich P.



Re: [Discuss] lvm snapshot cloning

2011-10-23 Thread Bill Bogstad
On Sun, Oct 23, 2011 at 8:31 PM,   wrote:
>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:
>>>

>
> The real solution, and I have code to do it, is this:
>
> create a snapshot of device A, call it foo.
> (arbitrary length of time passes)
> create a second snapshot of A, call it bar.

If I understand what you are asking for at the end of this sequence,
you would have two block level snapshots of device A which would have
logically the same contents and the underlying implementation would
have essentially the same physical contents (and space requirements)
as if you had done both of the snapshots with no time delay between
them.  Given that devices can have more than one active snapshot at a
time, it would seem theoretically possible to implement such a
feature.   I would call this "cloning a snapshot".  Another
possibility that might meet your requirements would be the ability to
do a "snapshot of a snapshot".  Unfortunately, I've never seen any
suggestion that LVM can do either of these operations.

What do you intend to do with those snapshots?  I'm guessing it would
involve using them in read/write mode.  Perhaps some kind of
multi-branch tree of versions of a device?

In any case, I was curious and found the following web pages which
seem relevant:

http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots  (feature
request page from 2009)

which says that neither snapshot cloning nor taking a snapshot of a
snapshot is currently possible as well as:

http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html

which gives several hacky ways to do snapshot cloning.   I would
suggest more investigation/careful testing before depending on this
though.

Good Luck,
Bill Bogstad


Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Sun, Oct 23, 2011 at 8:31 PM,   wrote:
>>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:

>
>>
>> The real solution, and I have code to do it, is this:
>>
>> create a snapshot of device A, call it foo.
>> (arbitrary length of time passes)
>> create a second snapshot of A, call it bar.
>
> If I understand what you are asking for at the end of this sequence,
> you would have two block level snapshots of device A which would have
> logically the same contents and the underlying implementation would
> have essentially the same physical contents (and space requirements)
> as if you had done both of the snapshots with no time delay between
> them.  Given that devices can have more than one active snapshot at a
> time, it would seem theoretically possible to implement such a
> feature.   I would call this "cloning a snapshot".  Another
> possibility that might meet your requirements would be the ability to
> do a "snapshot of a snapshot".  Unfortunately, I've never seen any
> suggestion that LVM can do either of these operations.
>
> What do you intend to do with those snapshots?  I'm guessing it would
> involve using them in read/write mode.  Perhaps some kind of
> multi-branch tree of versions of a device?
>
> In any case, I was curious and found the following web pages which
> seem relevant:
>
> http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots  (feature
> request page from 2009)
>
> which says that neither snapshot cloning nor taking a snapshot of a
> snapshot is currently possible as well as:
>
> http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html
>
> which gives several hacky ways to do snapshot cloning.   I would
> suggest more investigation/careful testing before depending on this
> though.

I wrote a program to copy the exception table from one COW device to a
snapshot device.  At the end, both snapshots use the same amount of space
(per lvdisplay) and produce the same md5sum.  Which is cool.
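
A quick way to check that (a sketch; the names are made up):

md5sum /dev/vg0/foo /dev/vg0/bar   # identical digests => identical contents
lvs vg0                            # the snapshots' Data% (COW usage) should match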

My only question is, if it were this easy, why am I seeing people lament
the lack of this feature?

Obviously, I need to do some testing, but it looks viable.


>
> Good Luck,
> Bill Bogstad
>




Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Oct 23, 2011, at 8:31 PM, ma...@mohawksoft.com wrote:
>> Both these snapshots are now "tracking" the real volume. Now, you use the
>> COW data from the first snapshot and apply it to the second snapshot. This
>> will create two identical snapshots, both sparsely populated with only the
>> differences between themselves and the real volume. The only drawback is
>> that the two snapshots will contain duplicate COW data.
>
>
> The "only" drawback is that you're treating LVM as if it were a file
> system.  Recall what I wrote yesterday about LVM biting off your face.
> That's what's happening: LVM is gnawing on your nose right now.

Why are you saying this? LVM is designed for much of this. The strategy
described is perfectly valid within the context of read/write snapshots.

>
> An LVM volume is a dynamic disk partition, and LVM snapshots are
> transaction logs of the low level block changes.

Not really true; LVM snapshots are copy-on-write devices. Yes, there is
an exception table behind them to map the COW data.

>  These logs are presented
> to the OS as more block devices.  If you want to clone a snapshot then you
> have to treat it as the block device that it is.

Why? Snapshots are read/write and work fine.

> What you really want is
> file system snapshots or a versioning file system.  LVM can accomplish
> neither because it's just a block device driver.

It's a bit more than a block driver. Yes, it is presented to the higher
levels of the OS as a block device, but it does remapping of blocks, it
has modules for RAID, etc.

>
> Unfortunately, neither of these exist for any of the main line file
> systems available in the Linux kernel.  ZFS isn't happening and Btrfs
> isn't ready for production.  Both are under Oracle's thumb, and the other
> options are patches against the main kernel source that you have to apply
> and build yourself.

Which is why I'm exploring LVM.

>
> --Rich P.




Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Sun, Oct 23, 2011 at 8:31 PM,   wrote:
>>> On Oct 23, 2011, at 7:08 PM, ma...@mohawksoft.com wrote:

>
[snip]
> In any case, I was curious and found the following web pages which
> seem relevant:
>
> http://sourceware.org/lvm2/wiki/FeatureRequests/dm/snapshots  (feature
> request page from 2009)
>
I saw this page, which started me thinking.

> which says that neither snapshot cloning nor taking a snapshot of a
> snapshot is currently possible as well as:
>
> http://www.redhat.com/archives/linux-lvm/2008-November/msg0.html
>
I saw this post and there were so many issues with it, I wouldn't even
call it wrong. :-)


> which gives several hacky ways to do snapshot cloning.   I would
> suggest more investigation/careful testing before depending on this
> though.


Do you use LVM? Want to test some utilities I am writing?



Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread Richard Pieri
On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
> 
> Why are you saying this? LVM is designed for much of this. The strategy
> described is perfectly valid within the context of read/write snapshots.

That's where I say that you are mistaken, because it isn't, not really.  It 
does a fair job of pretending but once you get deep into it you'll find it 
there grinning, waiting to bite your face off.


> Not really true; LVM snapshots are copy-on-write devices. Yes, there is
> an exception table behind them to map the COW data.

You're quibbling over implementation.  In practice, LVM snapshots work like 
ever-expanding transaction logs that are presented to the VFS layer as block 
devices.


> Why? Snapshots are read/write and work fine.

They don't.  See the previous discussion about LVM snapshot performance 
degradation.


> It's a bit more than a block driver. Yes, it is presented to the higher
> levels of the OS as a block device, but it does remapping of blocks, it
> has modules for RAID, etc.

Okay, it's a *fancy* block device driver.  Because at the end of the day, when 
it comes to doing any real work on it, the kernel presents LVM volumes just 
like it does raw disk partitions which are (pause) block devices.  This is a 
lot of why the features that you want will never be implemented.

I'm not saying that this is a failing.  In fact, this is central to LVM's 
versatility.  You can put practically any file system on an LVM volume 
specifically because it is a block device.  For example, I could put a FreeBSD 
domU on a Linux dom0 and give that domU an entire LVM volume to itself.  
FreeBSD doesn't care because the volume is just a block device.  I can stick 
DRBD underneath that volume and replicate it block-for-block to my redundant 
dom0 because it is a block device.  Linux and Xen don't care because (pause) 
the volume is just a block device.  I could never do something like this with 
AdvFS, and I expect that doing it with zpools would be more work than lvcreate, 
add device to config file, boot installer image.

This is what I'm on about.  If you treat LVM volumes as block devices then you 
can do some amazing things with them.  If you treat them as something else then 
you're going to have problems.

--Rich P.



Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
>>
>> Why are you saying this? LVM is designed for much of this. The strategy
>> described is perfectly valid within the context of read/write snapshots.
>
> That's where I say that you are mistaken, because it isn't, not really.
> It does a fair job of pretending but once you get deep into it you'll find
> it there grinning, waiting to bite your face off.

Well, I've spent a lot of my limited spare time over the last few weeks
really digging into the low-level details. I think you are mistaken.

There are two issues: one with LVM itself, and one with snapshots.

LVM does not transparently detect and reroute around bad blocks. This
would be a cool feature, but since RAID is supported, it is an issue that
can be worked around.

With snapshots, it's a bad design in that all snapshots receive all COW
blocks. So if you have two snapshots with the same basic content, there
will be duplicate data. This means that you have to allocate enough
space for the changes N times, depending on the number of snapshots.

Other systems chain snapshots so that at any time, only one snapshot is
receiving COW data, and the subsequent snapshots only forward-reference it.

Anyway, LVM snapshots are workable for the most part.
>
>
>> Not really true; LVM snapshots are copy-on-write devices. Yes, there is
>> an exception table behind them to map the COW data.
>
> You're quibbling over implementation.  In practice, LVM snapshots work
> like ever-expanding transaction logs that are presented to the VFS layer
> as block devices.

Yes.
>
>
>> Why? Snapshots are read/write and work fine.
>
> They don't.  See the previous discussion about LVM snapshot performance
> degradation.

The performance degradation is transient. It is only incurred at the time
of the COW; after that, it is mostly trivial. The problem with many of the
benchmarks is that they basically measure worst-case COW performance, which
is a known issue.

While not guaranteed, most applications do not create a large number of
COW pages all at once. There is a momentary hit periodically, but the
performance decrease, on average, is minimal.

>
>
>> It's a bit more than a block driver. Yes, it is presented to the higher
>> levels of the OS as a block device, but it does remapping of blocks, it
>> has modules for RAID, etc.
>
> Okay, it's a *fancy* block device driver.  Because at the end of the day,
> when it comes to doing any real work on it, the kernel presents LVM
> volumes just like it does raw disk partitions which are (pause) block
> devices.  This is a lot of why the features that you want will never be
> implemented.

I don't understand this viewpoint. The sort of features needed are far
easier to implement at the block device level. A block device has fewer
constraints and more ordered access than a higher-level file system would.

>
> I'm not saying that this is a failing.  In fact, this is central to LVM's
> versatility.  You can put practically any file system on an LVM volume
> specifically because it is a block device.  For example, I could put a
> FreeBSD domU on a Linux dom0 and give that domU an entire LVM volume to
> itself.  FreeBSD doesn't care because the volume is just a block device.
> I can stick DRBD underneath that volume and replicate it block-for-block
> to my redundant dom0 because it is a block device.  Linux and Xen don't
> care because (pause) the volume is just a block device.  I could never do
> something like this with AdvFS, and I expect that doing it with zpools
> would be more work than lvcreate, add device to config file, boot
> installer image.
>
> This is what I'm on about.  If you treat LVM volumes as block devices then
> you can do some amazing things with them.  If you treat them as something
> else then you're going to have problems.

They are being treated as block devices, but the capabilities as
advertised can still be used.
>
> --Rich P.




Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Oct 24, 2011, at 7:41 AM, ma...@mohawksoft.com wrote:
>> Why? Snapshots are read/write and work fine.
>
> They don't.  See the previous discussion about LVM snapshot performance
> degradation.

As was said, the performance degradation is transient: it only happens
when a COW page is created; after that, it is minimal.

--- Create a 100G volume ---
root@dewy:~# lvcreate -L100G -i 2 -n test lvmvolumes
  Using default stripesize 64.00 KiB
  Logical volume "test" created
root@dewy:~# sync

--- Write to the disk ---
root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 6.0213 s, 174 MB/s

real    0m6.023s
user    0m0.000s
sys     0m1.130s

Good performance: ~174 MB/s.

--- Create a snapshot ---

root@dewy:~# lvcreate -s -n testsnap -L2G /dev/lvmvolumes/test
  Logical volume "testsnap" created

--- Write to the disk ---

root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 63.3635 s, 16.5 MB/s

real    1m3.365s
user    0m0.000s
sys     0m1.270s

Worse performance: ~16.5 MB/s.

--- Write to the disk ---
root@dewy:~# time dd if=/dev/zero of=/dev/lvmvolumes/test bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 5.91994 s, 177 MB/s

real    0m5.922s
user    0m0.000s
sys     0m1.180s

Good performance: ~177 MB/s.

Once the COW data is allocated, the performance is fine.



Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread Richard Pieri
On Oct 24, 2011, at 8:41 PM, ma...@mohawksoft.com wrote:
> 
> With snapshots, it's a bad design in that all snapshots receive all COW
> blocks. So if you have two snapshots with the same basic content, there
> will be duplicate data. This means that you have to allocate enough
> space for the changes N times, depending on the number of snapshots.

This appears to be a bad design to you because you are thinking in terms of 
high-level file I/O.  And yes, I agree: it is very inefficient from the 
perspective of the high-level file system.  So don't do that.  Treat LVM 
volumes as simple block devices like I described.

If you won't do that then I can't help you.

--Rich P.



Re: [Discuss] lvm snapshot cloning

2011-10-24 Thread markw
> On Oct 24, 2011, at 8:41 PM, ma...@mohawksoft.com wrote:
>>
>> With snapshots, it's a bad design in that all snapshots receive all COW
>> blocks. So if you have two snapshots with the same basic content, there
>> will be duplicate data. This means that you have to allocate enough
>> space for the changes N times, depending on the number of snapshots.
>
> This appears to be a bad design to you because you are thinking in terms
> of high-level file I/O.  And yes, I agree: it is very inefficient from the
> perspective of the high-level file system.  So don't do that.  Treat LVM
> volumes as simple block devices like I described.

I can't understand your perspective here. We have various RAID drivers, we
have linear drivers, sparse volumes, encrypted volumes, and so on. All
implemented at the block device level.

How are snapshots any more or less complex or problematic than a RAID5 or
encrypted block device?

I've posted proof that any performance degradation on LVM volumes with a
snapshot is only transient. You'll mostly never notice it in the real
world because it's very unusual for large areas to undergo change in a
short period of time. Obviously, this is not an absolute, but it is
generally true.

For instance, writing a whole 1TB volume at 160MB/s (best case in October
2011, SATA/SAS) will take about 2 hours. It could happen in some rare
circumstance, but that's not typical.

I have not been able to find a good alternative. ZFS is stalled with
license issues and Btrfs is not stable. Both are under Oracle. LVM2 works
now and with a little finesse seems to do what is needed.

To address your concerns about it being a block device, I see this as
making the system more capable. You can put any file system you want on
it.

Isn't it better to improve what is stable and working rather than wait for
Oracle?



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Edward Ned Harvey
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
> 
> Other systems chain snapshots so that at any time, only one snapshot is
> receiving COW data, and the subsequent snapshots only forward-reference it.

Also, other systems don't make you pre-allocate disk space for snapshot
usage, and other systems don't automatically destroy your snapshot if you
run out of pre-allocated space.  LVM snapshot is crap compared to other
systems - but if you want to make a single snapshot so you can dump or tar
your ext3/4 filesystem in a consistent state or something, then it's good
for that purpose.  And not much else.




Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Edward Ned Harvey
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
> 
> I have not been able to find a good alternative. ZFS is stalled with
> license issues and Btrfs is not stable. 

What application are you intending to use it for?  Does it need to be linux?
In many situations, even if you're using linux, it might be ok to run either
user-mode ZFS or ZFS on a separate system connected via NFS or ISCSI or
whatever.

Saying btrfs is unstable ... It's a matter of perspective, and I'll
disagree.  I would say it's lacking some important features still, but would
not call it unstable.



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>> bounces+blu=nedharvey@blu.org] On Behalf Of
>>
>> Other systems chain snapshots so that at any time, only one snapshot is
>> receiving COW data, and the subsequent snapshots only forward-reference it.
>
> Also, other systems don't make you pre-allocate disk space for snapshot
> usage, and other systems don't automatically destroy your snapshot if you
> run out of pre-allocated space.  LVM snapshot is crap compared to other
> systems - but if you want to make a single snapshot so you can dump or tar
> your ext3/4 filesystem in a consistent state or something, then it's good
> for that purpose.  And not much else.

As I have said before, most of the time copying the WHOLE volume is
not a practical option.

Even at a very fast rate of 160MB/s, copying a 1TB volume takes HOURS!

As for the pre-allocated space issue, yes, that is bad, but using alerts
and monitoring to increase the allocated space will prevent the snapshot
from being nuked.
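
For example, a minimal watchdog along those lines (a sketch; the volume
name and thresholds are made up, and the lvs field name varies across LVM
versions):

#!/bin/sh
# poll snapshot fullness and grow the COW area before LVM invalidates it
while sleep 60; do
    PCT=$(lvs --noheadings -o snap_percent vg0/snap | tr -d ' ')
    # extend by 1G once the COW area is more than 80% full
    [ "${PCT%%.*}" -ge 80 ] && lvextend -L+1G vg0/snap
done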

I'm not implying that LVM2 is great, but I do assert it is quite functional.


Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>> bounces+blu=nedharvey@blu.org] On Behalf Of
>>
>> I have not been able to find a good alternative. ZFS is stalled with
>> license issues and Btrfs is not stable.
>
> What application are you intending to use it for?  Does it need to be
> linux?

Well, not technically, but from a cost/benefit sort of analysis, I don't
know of a better platform.

> In many situations, even if you're using linux, it might be ok to run
> either
> user-mode ZFS or ZFS on a separate system connected via NFS or ISCSI or
> whatever.

Still concerned over support and Oracle. When that is ironed out, I'll be
able to reconsider it.

>
> Saying btrfs is unstable ... It's a matter of perspective, and I'll
> disagree.  I would say it's lacking some important features still, but
> would
> not call it unstable.

Please note, I did not call it "unstable," I said it was not "stable." The
difference is subtle, but important. The Btrfs branch is not considered
"stable" yet.



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Richard Pieri
On Oct 24, 2011, at 10:42 PM, ma...@mohawksoft.com wrote:
> 
> I can't understand your perspective here. We have various RAID drivers, we
> have linear drivers, sparse volumes, encrypted volumes, and so on. All
> implemented at the block device level.

Then let me paraphrase it:  LVM is a logical partition manager.


> How are snapshots any more or less complex or problematic than a RAID5 or
> encrypted block device?

Practical example:  create your "master" volume (partition) with 1TB.  Create a 
snapshot of the master, call it "a" and give it 100GB.  Create another snapshot 
of the master, call it "b" and give it 1GB.  The snapshots are created such 
that a does COW against master and b is mostly pointers to volume a's blocks so 
that you don't have the duplicated blocks.

Now, delete volume a.  Or copy 100GB+1byte to volume a.  This will trigger 
LVM's reaper which prunes the snapshot to ensure that there is no data loss on 
master.

--Rich P.



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
> On Oct 24, 2011, at 10:42 PM, ma...@mohawksoft.com wrote:
>>
>> I can't understand your perspective here. We have various RAID drivers,
>> we
>> have linear drivers, sparse volumes, encrypted volumes, and so on. All
>> implemented at the block device level.
>
> Then let me paraphrase it:  LVM is a logical partition manager.

No, it's a "Logical Volume Manager." Come on, let's be constructive. LVM
snapshots are, in essence, sparsely allocated block devices with a "master"
device acting as the background fill. And yes, there is COW when blocks get
written in the master or snapshot, but that's about it.

>
>
>> How are snapshots any more or less complex or problematic than a RAID5
>> or
>> encrypted block device?
>
> Practical example:  create your "master" volume (partition) with 1TB.
> Create a snapshot of the master, call it "a" and give it 100GB.  Create
> another snapshot of the master, call it "b" and give it 1GB.  The
> snapshots are created such that a does COW against master and b is mostly
> pointers to volume a's blocks so that you don't have the duplicated
> blocks.

I wish this were true, but from what I've been able to learn, two
different snapshots will each have their own copies of the COW data.

On other systems, yes: master <- b <- a,
where 'a' is older than 'b'.

On LVM it's more like this:

    master
    ^    ^
    a    b

where both snapshots point equally to the master. Each snapshot stores its
data in its own COW image. I don't see any provision for multiple snapshots
to share COW data.

>
> Now, delete volume a.  Or copy 100GB+1byte to volume a.  This will trigger
> LVM's reaper which prunes the snapshot to ensure that there is no data
> loss on master.

The snapshot has no effect on the master, and yes, we've already said and
we already know it is a weakness in LVM that if you don't extend your
snapshots you lose them. This can be mitigated by monitoring and automatic
volume extension.

BTW: Thanks. I have been looking at LVM, and these discussions have forced
me to really look closely at it. I'm still not decided, but if I decide it
is the way we want to go, I will owe a lot of it to the debate practice on
BLU!





Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
I should explain more about what I want to do with LVM; maybe then it will
make more sense.

First, the statement of the problem: hard disks are big and fundamentally
slow.

Suppose you have a 1TB hard disk. How on earth do you back that up? If you
use a 1TB USB hard disk (USB 2.0), at best you'll get about 30MB/s. (You
won't actually get that fast, but it is a good round number for discussion.)

1TB is 1,099,511,627,776 bytes. At 30MB/s, that's a little over 9 1/2 hours
to back up.

If you use a dedicated high speed SATA or SAS drive, you may get a
whopping 160MB/s. That's 1 hour 49 minutes.

Think now if you want to pipe that up to the cloud for off-site backup.

With snapshots, you can back up a consistent state, as if your 2 hour
backup happened instantaneously, but you are still writing a lot of data.

However, there is a better optimization here. The snapshot device "knows"
what is different. You don't really have to back up 1TB every time; you
only have to back up the changes made since the last full backup.
I can't give too much away, but I think you get my drift. If the data only
changes 5% a day, and you can track that 5%, you can make an effective
backup of 1TB of data to the USB drive in about 29 minutes (5% of 1TB is
roughly 51GB, and 51GB at 30MB/s is about 29 minutes).

Think about it.

LVM is more than capable of doing this.
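
For instance, device-mapper will tell you how much of a snapshot's COW area
holds exceptions (a sketch; the device name and numbers are made up):

dmsetup status vg0-snap
# => 0 2147483648 snapshot 104448/4194304 424
# allocated/total COW sectors: the delta to ship is bounded by the
# allocated count, not by the full size of the origin volume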




Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Edward Ned Harvey
> From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com]
> 
> >> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> >> bounces+blu=nedharvey@blu.org] On Behalf Of
> >>
> >> Other systems chain snapshots so that at any time, only one snapshot is
> >> receiving COW data, and the subsequent snapshots only forward-reference
> >> it.
> >
> > Also, other systems don't make you pre-allocate disk space for snapshot
> > usage, and other systems don't automatically destroy your snapshot if you
> > run out of pre-allocated space.  LVM snapshot is crap compared to other
> > systems - but if you want to make a single snapshot so you can dump or tar
> > your ext3/4 filesystem in a consistent state or something, then it's good
> > for that purpose.  And not much else.
>
> As I have said before, most of the time copying the WHOLE volume is
> not a practical option.

I don't know why you would mention that.  How does it relate to anything
that was said above?


> > What application are you intending to use it for?  Does it need to be
> > linux?
> 
> Well, not technically, but from a cost/benefit sort of analysis, I don't
> know of a better platform.

You could use openindiana, or nexenta, or freebsd.  These are stable free
platforms (and commercial) that do zfs.


> > In many situations, even if you're using linux, it might be ok to run
> > either
> > user-mode ZFS or ZFS on a separate system connected via NFS or ISCSI or
> > whatever.
> 
> Still concerned over support and Oracle. When that is ironed out, I'll be
> able to reconsider it.

What specifically is the concern?  Cost of Oracle support?  Quality of
support?  (Compared to redhat support, which is ... on par with oracle
support)...  What specifically is the concern?



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Edward Ned Harvey
> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
> bounces+blu=nedharvey@blu.org] On Behalf Of
> 
> Suppose you have a 1TB hard disk. How on earth do you back that up? If you
> use a 1TB USB hard disk (USB 2.0), at best you'll get about 30MB/s. (You
> won't actually get that fast, but it is a good round number for discussion.)

Yes, what you need is something that does incremental snapshots.  Presently
zfs, soon btrfs, and I heard you say it can be done with LVM although I've
never seen that done.  Also, I think there's probably a solution based on
VSS.  And of course WAFL and other enterprise products.

Additionally, there are lots of "potential" solutions, for example, use FUSE
to mount a filesystem, and because it's running through fuse, you'll be able
to keep track of which blocks change and then you can perform optimal
incrementals etc.  But I'm not aware of anything that is actually
implemented this way.


> 1TB is 1,099,511,627,776 bytes. At 30MB/s, that's a little over 9 1/2 hours
> to back up.
> 
> If you use a dedicated high speed SATA or SAS drive, you may get a
> whopping 160MB/s. That's 1 hour 49 minutes.

Take it for granted, you won't be stuck at USB2 speeds.  USB has universally
(to the extent I care about) been replaced by USB3, since about 2-3 years
ago.  The bottleneck is the speed of the external disk, which I typically
measure at 1Gbit/sec per disk.  (You can sometimes parallelize using
multiple disks depending on your circumstances.)  It's random IO that really
hurts you.


> With snapshots, you can back up a consistent state, as if your 2 hour
> backup happened instantaneously, but you are still writing a lot of data.

Right.  Presently at work we have a bunch of ZFS and Netapp systems, that
replicate globally every hour, and use various forms of WAN acceleration,
compression and dedup.  Although it is optimal, it is very significant,
noticeable strain on the WAN.



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
>> From: discuss-bounces+blu=nedharvey@blu.org [mailto:discuss-
>> bounces+blu=nedharvey@blu.org] On Behalf Of

>> If you use a dedicated high speed SATA or SAS drive, you may get a
>> whopping 160MB/s. That's 1 hour 49 minutes.
>
> Take it for granted, you won't be stuck at USB2 speeds.  USB has universally
> (to the extent I care about) been replaced by USB3, since about 2-3 years
> ago.  The bottleneck is the speed of the external disk, which I typically
> measure at 1Gbit/sec per disk.  (You can sometimes parallelize using
> multiple disks depending on your circumstances.)  It's random IO that really
> hurts you.

If you can get more than 160MB/s (sustained) on anything other than exotic
hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not
possible with COTS hardware.

Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if
you can manage to read/write to the disk cache, you can get cool performance,
but if you are doing backups, you will blow out the cache quite quickly.



Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread Richard Pieri
On Oct 25, 2011, at 7:51 PM, ma...@mohawksoft.com wrote:
> 
> The snapshot has no effect on the master, and yes, we've already said and
> we already know it is a weakness in LVM that if you don't extend your
> snapshots you lose them. This can be mitigated by monitoring and automatic
> volume extension.

You missed it.  This isn't about what happens to master.  It's what happens to 
b when a disappears.  If master<-a<-b and a disappears due to reaping then b 
becomes useless.  Or b is reaped, too.  Either way you're dealing with data 
loss.  This is why LVM will not do what you originally asked about.

Monitoring has problems.  If the volume fills up faster than the monitor polls 
capacity then you lose your data.  If the volume fills up faster than it can be 
extended then you lose your data.  If the volume cannot be extended because the 
volume group has no more extents available then you lose your data.  Like I 
wrote at the start: LVM will quite happily bite your face off.

Now, to address your most recent question:

How do I back up a 1TB disk?  Think about this: how do you intend to do a 
restore from this backup?  The most important part of a backup system is being 
able to restore from backup in a timely fashion.

I have in production a compute server with two 8TB file systems and a 9TB file 
system, all sitting on LVM volumes.  I have an automated backup that runs every 
night on this server.  It's an incremental file system backup so I'm only 
backing up the changes every night.  This is, as you might expect, quite faster 
than trying to do full backups of 25TB every night -- which I can't because it 
would take three days to do it.

On smaller capacity volumes, in the several hundred GB range, I use rsnapshot 
to do incremental file snapshots to a storage server.  Again, I don't back up 
the raw disk partitions every time.  I only back up the changed files.
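
For reference, the shape of that rsnapshot setup (a sketch; paths and hosts
are made up, rsnapshot.conf fields must be tab-separated, and older versions
use the keyword 'interval' instead of 'retain'):

# /etc/rsnapshot.conf (excerpt)
snapshot_root   /srv/backups/
retain          daily   7
retain          weekly  4
backup          root@fileserver:/home/  fileserver/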

In both cases -- and in fact with all my backups -- they are file level 
backups.  The reason being that if I need to restore a single file or directory 
then I don't have to rebuild the entire volume to do so.  I can restore as 
little or as much as I need to recover from a mistake or a disaster.

Suppose the case of a live volume that needs to be in a frozen state for doing 
a backup.  Database servers are prime examples of this.  Here, I would freeze 
the database, make a snapshot of the underlying volume, and then thaw the 
database.  Now I can do my backup of the read-only snapshot volume without 
interfering with the running system.  I would delete the snapshot when the 
backup is complete.
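
Concretely, that sequence might look like this (a sketch; the volume, mount
point, and sizes are made up, and the freeze step depends on the database):

# 1. quiesce the database (e.g. FLUSH TABLES WITH READ LOCK in MySQL,
#    holding that session open)
lvcreate -s -n dbsnap -L20G /dev/vg0/dbdata   # 2. snapshot, near-instant
# 3. thaw the database; writes resume against the origin volume
mount -o ro /dev/vg0/dbsnap /mnt/dbsnap       # 4. back up the frozen view
tar -czf /backup/db-$(date +%F).tar.gz -C /mnt/dbsnap .
umount /mnt/dbsnap                            # 5. clean up
lvremove -f /dev/vg0/dbsnap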

If I were using plain LVM and ext3 for my users' home directories then I would 
do something similar with read-only snapshots.  There would be no freeze step, 
and I would keep several days worth of snapshots on the file server to make 
error recovery faster than going to tape or network storage.  As it is, I use 
OpenAFS which has file system snapshots so I don't need to do any of this and 
users can go back in time just by looking in .clone in their home directories.  
I still have nightly backups to tape for long-term archives.

Now, time to poke holes in your proposal.  I have a physics graduate student 
doing his thesis research project on a shared compute server along with a dozen 
others.  They collectively have 7.5TB of data on there.  This is a real-world 
case on the aforementioned compute server.  Said student accidentally wipes out 
his entire thesis project, 200GB worth of files.  It's 9:30 PM and he needs his 
files by 8am or he fails his thesis defense, doesn't graduate and I'm looking 
for a new job.

With my file level backup system I can have his files restored within a couple 
of hours at the outside without affecting anyone else's work.

With your volume level backup system I would spend the night on Monster looking 
for a new job.  The problem with it is that I can't restore individual files 
because it isn't individual files that are backed up.  It's the disk blocks.  I 
can't just drop those backed-up blocks onto the volume.  Here:

  master -> changes -> changes -> changes
    \-> backup

If I dumped the backup blocks onto the volume then I'd scramble the file 
system.  Restoration would require me to replicate the entire volume at the 
block level as it was when the backup was made.  This would destroy all the 
other researchers' work done in the past however many hours since that backup 
was made.  I would fire myself for gross incompetence if I were relying on this 
kind of backup system.  It's that bad.

It gets worse.  What happens when the whole thing fails outright?  Total 
disaster on your 1TB disk.  Now it's not just 29 minutes to restore last 
night's blocks.  It's two hours to restore the initial replica and then 30 
minutes times however many deltas have been made.  Six deltas means 5 hours to 
do a full rebuild.  I can do a complete restore from TSM or rsnapshot in half 
that time, maybe less depending on how much d

Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread markw
> On Oct 25, 2011, at 7:51 PM, ma...@mohawksoft.com wrote:
>>
>> The snapshot has no effect on the master, and yes, we've already said
>> and
>> we already know it is a weakness in LVM that if you don't extend your
>> snapshots you lose them. This can be mitigated by monitoring and
>> automatic
>> volume extension.
>
> You missed it.  This isn't about what happens to master.  It's what
> happens to b when a disappears.  If master<-a<-b and a disappears due to
> reaping then b becomes useless.  Or b is reaped, too.  Either way you're
> dealing with data loss.  This is why LVM will not do what you originally
> asked about.

Actually, in LVM 'a' and 'b' are completely independent; each has its own
copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice versa.

If you know the behaviour of your system, you could allocate a large
percentage of the existing volume size (even 100%) and mitigate any risk.
You would get your snapshot quickly and still have full backing.
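
e.g., sizing the COW area at the full origin size so it can never fill up
(a sketch; the names and size are made up):

lvcreate -s -n safe -L1T /dev/vg0/data   # snapshot backed by a full-size COW area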


>
> Monitoring has problems.  If the volume fills up faster than the monitor
> polls capacity then you lose your data.  If the volume fills up faster
> than it can be extended then you lose your data.  If the volume cannot be
> extended because the volume group has no more extents available then you
> lose your data.  Like I wrote at the start: LVM will quite happily bite
> your face off.

Allocate more space at the beginning. It will never be bigger than the
original volume.

>
> Now, to address your most recent question:
>
> How do I back up a 1TB disk.  Think about this: how do you intend to do a
> restore from this backup?  The most important part of a backup system is
> being able to restore from backup in a timely fashion.

That all depends on whether we are recovering from a "disaster," i.e. major
data loss, or just need access to a previous incarnation within a working
environment.

In a "working environment" it is possible to use LVM snapshots, backed
with enough storage, of course, to receive the deltas from where it is to
where it should be.

In a disaster, well, brute force will save the day.
>
> I have in production a compute server with two 8TB file systems and a 9TB
> file system, all sitting on LVM volumes.  I have an automated backup that
> runs every night on this server.  It's an incremental file system backup
> so I'm only backing up the changes every night.  This is, as you might
> expect, quite faster than trying to do full backups of 25TB every night --
> which I can't because it would take three days to do it.

This works when you have access to the file system. If you expose an LVM
volume with something like iSCSI, you'd best hope that Linux has a file
system driver for it.

Even so, you will be backing up full files. In a system with large files,
like a database server, you'll be backing up a lot of data that doesn't
need to be backed up.

>
> On smaller capacity volumes, in the several hundred GB range, I use
> rsnapshot to do incremental file snapshots to a storage server.  Again, I
> don't back up the raw disk partitions every time.  I only back up the
> changed files.

Most file formats don't change too much. This is, of course, a
generalization, with a number of exceptions. I refer back to the database
example.

>
> In both cases -- and in fact with all my backups -- they are file level
> backups.  The reason being that if I need to restore a single file or
> directory then I don't have to rebuild the entire volume to do so.  I can
> restore as little or as much as I need to recover from a mistake or a
> disaster.

I can understand this; however, snapshots work too. One of the utilities I
am writing allows you to clone a snapshot.

Assuming you always have one snapshot that is a point-in-time reference,
you can clone that snapshot and then apply diffs to get back to a specific
point in time.

While it is a judgement call, it is not clear which is more efficient. In
the case of a database server, it may take much longer to restore a full
file than it does to apply diffs.

>
> Suppose the case of a live volume that needs to be in a frozen state for
> doing a backup.  Database servers are prime examples of this.  Here, I
> would freeze the database, make a snapshot of the underlying volume, and
> then thaw the database.  Now I can do my backup of the read-only snapshot
> volume without interfering with the running system.  I would delete the
> snapshot when the backup is complete.

That is the most direct method of using snapshots, yes.

>
> If I were using plain LVM and ext3 for my users' home directories then I
> would do something similar with read-only snapshots.  There would be no
> freeze step, and I would keep several days worth of snapshots on the file
> server to make error recovery faster than going to tape or network
> storage.  As it is, I use OpenAFS which has file system snapshots so I
> don't need to do any of this and users can go back in time just by looking
> in .clone in their home dire

Re: [Discuss] lvm snapshot cloning

2011-10-25 Thread John Abreau
On Tue, Oct 25, 2011 at 11:20 PM,   wrote:

> Obviously, there are pros and cons to various approaches. Your approach is
> only faster if you know that the file in question was recently modified.
> If that is the case, you lucked out. What happens if the source of the
> calculations, which is also needed, hasn't been modified in some time?


Are you serious? If the source of the calculations hasn't been modified
in some time, then what you do is restore it from the backups. Same as
if it was modified recently.

The older files weren't backed up recently because they hadn't changed.
They didn't get erased from the backups, so they're still available for
restoration, just like the recently changed files.

Did you think that all backups older than one day were somehow erased?




-- 
John Abreau / Executive Director, Boston Linux & Unix


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread markw
> On Tue, Oct 25, 2011 at 11:20 PM,   wrote:
>
>> Obviously, there are pros and cons to various approaches. Your approach is
>> only faster if you know that the file in question was recently modified.
>> If that is the case, you lucked out. What happens if the source of the
>> calculations, which is also needed, hasn't been modified in some time?
>
>
> Are you serious? If the source of the calculations hasn't been modified
> in some time, then what you do is restore it from the backups. Same as
> if it was modified recently.

I was thinking that, if you are doing an incremental backup, it may not
be present on the current backup medium, and then you'd have to search
whatever catalogue system you have to find where it is.

>
> The older files weren't backed up recently because they hadn't changed.
> They didn't get erased from the backups, so they're still available for
> restoration, just like the recently changed files.
>
> Did you think that all backups older than one day were somehow erased?

Typically, you ship some backups off-site to ensure recovery.

> --
> John Abreau / Executive Director, Boston Linux & Unix




Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Richard Pieri
On Oct 25, 2011, at 11:20 PM, ma...@mohawksoft.com wrote:
> 
> Actually, in LVM 'a' and 'b' are completely independent; each has its
> own copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice
> versa.

Rather, this is how LVM works *because* of this situation.  If LVM supported 
snapshots of snapshots then it'd be trivially easy to shoot yourself in the 
foot.

> If you know the behaviour of your system, you could allocate a large
> percentage of the existing volume size (even 100%) and mitigate any risk.
> You would get your snapshot quickly and still have full backing.

So for your hypothetical 1TB disk, let's assume that you actually have 1TB of 
data on it.  You would need two more 1TB disks, one for each of the two 
snapshots.  This would be unscalable on my 25TB compute server.  I would need 
another 25TB+ to implement your scheme.  This is a case where I can agree that 
yes, it is possible to coerce LVM into doing it, but that doesn't make it 
useful.


> In a disaster, well, brute force will save the day.

My systems work for both individual files and total disaster.  I've proven it.

>> don't need to do any of this and users can go back in time just by looking
>> in .clone in their home directories.  I still have nightly backups to tape
>> for long-term archives.
> 
> Seems complicated.

It isn't.  It's a single AFS command to do the nightly snapshot and a second to 
run the nightly backup against that snapshot.



> totally wrong!!!
> 
> lvcreate -s -n disaster -L1024G /dev/vg0/phddata
> (my utility)
> lvclonesnapshot /dev/mapper/vg0-phdprev-cow /dev/vg0/disaster
> 
> This will apply historical changes to the /dev/vg0/disaster, the volume
> may then be used to restore data.

Wait... wait... so you're saying that in order to restore some files I need to 
recreate the "disaster" volume, restore to it, and then I can copy files back 
over to the real volume?


> You have a similar issue with file system backups. You have to find the
> last time a particular file was backed up.

> Yes, and it should be MUCH faster!!! I agree.

*Snrk*.  Neither of these are true.  I don't have to "find" anything.  I pick a 
point in time between the first backup and the most recent, inclusive, and 
restore whatever I need.  Everything is there.  I don't even need to look for 
tapes; TSM does all that for me.  400GB/hour sustained restore throughput is 
quite good for a network backup system.

Remember, ease of restoration is the most important component of a backup 
system.  Yours, apparently, fails to deliver.

--Rich P.




Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread markw
> On Oct 25, 2011, at 11:20 PM, ma...@mohawksoft.com wrote:
>>
>> Actually, in LVM 'a' and 'b' are completely independent; each has its
>> own copy of the COW data. So, if 'a' gets nuked, 'b' is fine, and vice
>> versa.
>
> Rather, this is how LVM works *because* of this situation.  If LVM
> supported snapshots of snapshots then it'd be trivially easy to shoot
> yourself in the foot.

Actually, I'm currently working on a system that does snapshots of
snapshots. It's not LVM, obviously, but it quietly resolves an interior
copy being removed or failing. It's a very "enterprise" system.

>
>> If you know the behaviour of your system, you could allocate a large
>> percentage of the existing volume size (even 100%) and mitigate any
>> risk.
>> You would get your snapshot quickly and still have full backing.
>
> So for your hypothetical 1TB disk, let's assume that you actually have 1TB
> of data on it.  You would need two more 1TB disks for each of the two
> snapshots.  This would be unscalable to my 25TB compute server.  I would
> need another 25TB+ to implement your scheme.  This is a case where I can
> agree that yes, it is possible to coerce LVM into doing it but that
> doesn't make it useful.

Well, we all know that disks do not change 100% very quickly, or at all.
It's typically a very small percentage per day, even on active systems.

So the process is to back up diffs by comparing two snapshots: a start
point and an end point. Just keep recycling the start point.
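
A minimal sketch of that rotation with stock LVM commands -- volume and
tool names are hypothetical, and "send-cow-diff" stands in for whatever
utility extracts and ships the start snapshot's COW delta:

  # day N: yesterday's snapshot still exists as vg0/start
  lvcreate -s -n end -L 50G /dev/vg0/data        # new end-point snapshot
  # the start snapshot's COW area holds every block the origin has
  # changed since start was taken -- that is the diff to ship
  send-cow-diff /dev/mapper/vg0-start-cow | ssh backuphost store-delta
  lvremove -f /dev/vg0/start                     # recycle the start point
  lvrename vg0 end start                         # end becomes tomorrow's start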

>
>
>> In a disaster, well, brute force will save the day.
>
> My systems work for both individual files and total disaster.  I've proven
> it.

Yes, backups that maintain data integrity work. That's sort of the job.
The issue is reducing the amount of data that needs to be moved each time.

With a block-level backup, you move only the changed blocks. With a
file-level backup you move whole files. Now, if the files are small, a
file-level backup makes sense. If the files are large, like VMs or
databases, a block-level backup makes sense.
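
For illustration only, here is a brute-force way to list which 4 MiB
chunks differ between two block devices, with no COW-table smarts at all
(device names hypothetical; this reads both devices end to end, so it is
the slow path, not the scheme under discussion):

  CHUNK=$((4 * 1024 * 1024))
  SIZE=$(blockdev --getsize64 /dev/vg0/start)
  for ((off = 0; off < SIZE; off += CHUNK)); do
      n=$((off / CHUNK))
      cmp -s <(dd if=/dev/vg0/start bs=$CHUNK skip=$n count=1 2>/dev/null) \
             <(dd if=/dev/vg0/end   bs=$CHUNK skip=$n count=1 2>/dev/null) \
          || echo "chunk $n differs (offset $off)"
  done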


>
>>> don't need to do any of this and users can go back in time just by
>>> looking
>>> in .clone in their home directories.  I still have nightly backups to
>>> tape
>>> for long-term archives.
>>
>> Seems complicated.
>
> It isn't.  It's a single AFS command to do the nightly snapshot and a
> second to run the nightly backup against that snapshot.
>
>
>
>> totally wrong!!!
>>
>> lvcreate -s -n disaster -L1024G /dev/vg0/phddata
>> (my utility)
>> lvclonesnapshot /dev/mapper/vg0-phdprev-cow /dev/vg0/disaster
>>
>> This will apply historical changes to the /dev/vg0/disaster, the volume
>> may then be used to restore data.
>
> Wait... wait... so you're saying that in order to restore some files I
> need to recreate the "disaster" volume, restore to it, and then I can copy
> files back over to the real volume?

I can't tell the whole example from the snip, but I think I was saying
that I could clone a snapshot, apply historical blocks to it, and then
you'd be able to get a specific version of a file from it. Yes.
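
Concretely, something like this sketch -- sizes and paths are made up,
and lvclonesnapshot is the utility quoted above:

  lvcreate -s -n disaster -L 200G /dev/vg0/phddata
  lvclonesnapshot /dev/mapper/vg0-phdprev-cow /dev/vg0/disaster
  mount -o ro /dev/vg0/disaster /mnt/disaster
  cp -a /mnt/disaster/thesis /home/student/   # pull back just the lost files
  umount /mnt/disaster && lvremove -f /dev/vg0/disaster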

If you are backing up many small files, rsync works well. If you are
backing up VMs, databases, or iSCSI targets, a block-level strategy works
better.
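
For the many-small-files case, the usual rsync pattern (what rsnapshot
automates) might look like this sketch, with hypothetical paths:

  # hard-link unchanged files against yesterday's tree; copy only changes
  rsync -a --delete --link-dest=/backup/2011-10-25 \
      /home/ /backup/2011-10-26/
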
>
>
>> You have a similar issue with file system backups. You have to find the
>> last time a particular file was backed up.
>
>> Yes, and it should be MUCH faster!!! I agree.
>
> *Snrk*.  Neither of these are true.  I don't have to "find" anything.  I
> pick a point in time between the first backup and the most recent,
> inclusive, and restore whatever I need.  Everything is there.  I don't
> even need to look for tapes; TSM does all that for me.  400GB/hour
> sustained restore throughput is quite good for a network backup system.

400GB/hour? I doubt that number, but OK. Even at that rate, a 1TB volume is
still close to three hours.

>
> Remember, ease of restoration is the most important component of a backup
> system.  Yours, apparently, fails to deliver.

Not really; we have been discussing technology. We haven't even been
discussing the user-facing side.

The difference is what you plan to do, I guess. I'm not backing up many
small files.

Think of it this way: a 2TB drive is less than $80 and about $0.25 a month
in power. The economics open up a number of possibilities.

___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Dan Ritter
On Tue, Oct 25, 2011 at 11:20:40PM -0400, ma...@mohawksoft.com wrote:
> >
> > Data point: It takes ~19 hours to restore 7.5TB from enterprise-class
> > Tivoli Storage Manager over 1GB Ethernet to a 12x1TB SATA (3Gb/s) RAID 6
> > volume.  I had to do it this past spring after the RAID controller on that
> > volume went stupid and corrupted the whole thing.
> 
> Yes, and it should be MUCH faster!!! I agree.


The bottleneck there is the gigabit ethernet:

1000 Mb/s * 3600 s/h * 1 B/8 b = 450,000 MB/h, or 450 GB/h.
7500 GB / 450 GB/h = 16.7 hours

So the absolute best you could have done ignoring all overhead was 16.7
hours, and it took 19. Not awful.

On the other hand, going to 10GE doesn't move the bottleneck to
the disks -- if you can get 160MB/s per spindle and 10 of the 12
spindles effective, that's:

10 * 160 MB/s * 3600 s/h = 5760 GB/h, which is still more than the
4500 GB/h you might get from theoretical no-overhead 10GE.

Assume 10GE with the same 87% efficiency as GE and you get 3.9TB/h,
or about two hours to do your 7.5TB restore. OK, so 10GE is a win for
time, no surprise. However, GE is essentially free (your motherboards
have it, your switching infrastructure is in place) but 10GE involves
$600/port upgrades to the NIC (if you have room) and $1000/port switch
upgrades. Good thing to have next time around, probably not a routine
upgrade for most operations.
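
The same arithmetic as a tiny shell helper, for plugging in other numbers
(the 0.87 efficiency figure comes from the 16.7/19 ratio above):

  restore_hours() {    # args: data in GB, link speed in Gbit/s, efficiency
      echo "scale=1; $1 / ($2 * 3600 / 8 * $3)" | bc
  }
  restore_hours 7500 1 0.87     # -> 19.1 hours over GE
  restore_hours 7500 10 0.87    # -> 1.9 hours over 10GE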

-dsr-

-- 
http://tao.merseine.nu/~dsr/eula.html is hereby incorporated by reference.
You can't fight for freedom by taking away rights.
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Richard Pieri
On Wed, Oct 26, 2011 at 8:26 AM,  wrote:

> 400GB/hour? I doubt that number, but ok. It is still close to three hours.
>

Doubt if you want but I did it last spring.  7.5TB -- ~7500GB -- restored
over 19 hours and change.  That's 400GB/hour average throughput.  Over a
network.  From a tape-based storage system.  That's what an enterprise-class
backup system can do.

I see where you are going with this.  One of the historically canonical
examples is database servers that use raw disk for storage. A
block-incremental backup mechanism would have been very useful 15-20 years
ago when this was more common.  Today, hardly anyone does it this way.
 Today, standard procedure is either to dump the DB to a flat file and
back that up, or to perform the freeze/snapshot/thaw cycle and back up the
snapshot (a sketch of the latter follows below).  This may not be the most
efficient way to do it.  On the other
hand, if I have so much data to back up that efficiency would be an issue
then I already have sufficient resources in my backup system to make it not
be an issue.
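
As a sketch of that freeze/snapshot/thaw cycle for, say, MySQL on LVM --
names and sizes are hypothetical, and the SLEEP is a crude way to hold
the read lock while the snapshot is taken:

  mysql -e 'FLUSH TABLES WITH READ LOCK; SELECT SLEEP(30);' &  # freeze
  sleep 5                                       # let the lock be acquired
  lvcreate -s -n dbsnap -L 20G /dev/vg0/dbdata  # snapshot while frozen
  wait                                          # lock drops when SLEEP ends
  mount -o ro /dev/vg0/dbsnap /mnt/dbsnap       # thawed; back up at leisure
  tar czf /backup/db-$(date +%F).tar.gz -C /mnt/dbsnap .
  umount /mnt/dbsnap && lvremove -f /dev/vg0/dbsnap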

You mention virtual machines.  I can see this if you want to back up the
containers.  The thing is, using an operating system's native tools is
always my preferred choice for making backups.  It may be inefficient at
times but it ensures that I can always recover the backup correctly.  More
generally, it avoids issues with being locked into specific technologies.
 Going back to the example of running a FreeBSD domU on a Linux dom0.  With
LVM-based block backups I am locked into using LVM-based block restore for
recovery.  I can't restore this domU onto a FreeBSD or Solaris dom0 or even
onto a real FreeBSD physical machine.  On the other hand, if I use OS tools
to do the backup then I can restore it anywhere that I please.

It is clear to me that we have different philosophies about backup systems.
 In my mind, efficiency is all well and good but it will always take a back
seat to the ease of restoring backups and the ease of creating them.  I've
had to recover from too many disasters and too many stupid mistakes -- some
of them my own -- to see it any other way.
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Richard Pieri
On Wed, Oct 26, 2011 at 9:42 AM, Dan Ritter  wrote:

> So the absolute best you could have done ignoring all overhead was 16.7
> hours, and it took 19. Not awful.
>

Indeed.  TSM took about 20 minutes at the start to build the restore index,
but it fairly screamed once it got going.  TSM uses two "threads" during
the restore process: one doing the actual read from storage and send to the
host, and one scanning the archive system for the next file to restore.  It
does a very good job keeping throughput up.


> 1600MB/s * 3600s/h = 5760GB/h which is still more than the
> 4500GB/h you might get from theoretical no-overhead 10GE.
>

A little less, actually.  It's 11x1TB in RAID 6.  I have a hot spare on the
controller.  Still, close enough for an approximation.
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Edward Ned Harvey
> From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com]
>
> If you can get more than 160MB/s (sustained) on anything other than exotic
> hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not
> possible with COTS hardware that is available.
> 
> Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if
> you can manage to read/write to disk cache, you can get cool performance,
> but if you are doing backups, you will blow out cache quite quickly.

Go measure it before you say any more.  Because I've spent a lot of time in
the last 4 years benchmarking disks.  I can say the typical sequential
throughput, read or write, for nearly all disks (7.2krpm SATA up to 15krpm
SAS) is 1.0 Gbit/sec.  Sustained sequential read/write.  For, let's say, the
entire disk.  Or at least tens of GB.

Even laptops (7.2krpm SATA Dell) are able to sustain this speed.
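
Easy enough to check on any idle disk with dd -- the device name here is
hypothetical, and 10 GiB is large enough that the page cache stops
flattering the number:

  dd if=/dev/sdb of=/dev/null bs=1M count=10240   # prints MB/s when done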

___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Dan Ritter
On Wed, Oct 26, 2011 at 03:18:06PM -0400, Edward Ned Harvey wrote:
> > From: ma...@mohawksoft.com [mailto:ma...@mohawksoft.com]
> >
> > If you can get more than 160MB/s (sustained) on anything other than exotic
> > hardware, I'd be surprised. 1Gbit/sec per disk sustained is currently not
> > possible with COTS hardware that is available.
> > 
> > Transfer rate is not "sustained," and "peak" is not "sustained." Yes, if
> > you can manage to read/write to disk cache, you can get cool performance,
> > but if you are doing backups, you will blow out cache quite quickly.
> 
> Go measure it before you say any more.  Because I've spent a lot of time in
> the last 4 years benchmarking disks.  I can say the typical sequential
> throughput, read or write, for nearly all disks (7.2krpm SATA up to 15krpm
> SAS) is 1.0 Gbit/sec.  Sustained sequential read/write.  For, let's say, the
> entire disk.  Or at least tens of GB.
> 
> Even laptops (7.2krpm SATA Dell) are able to sustain this speed.

Erm.

1 Gb/s * 1024 M/G * 1 B/8 b = 128 MB/s

Anything which can do 160 is clearly capable of doing 128.

You two are arguing in different directions.

-dsr-
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-26 Thread Tom Metro
ma...@mohawksoft.com wrote:
> Suppose you have a 1TB hard disk. How on earth do you back that up? 
> Think now if you want to pipe that up to the cloud for off-site backup.
> 
> The snapshot device "knows"
> what is different. You don't have to really backup 1TB every time, you
> only have to backup the changes to it since the last full backup.

If you are at the point of considering developing software to do this,
instead of just using off-the-shelf solutions, then you should consider
using inotify[1]. I believe using this library you could log to a
database the inodes that have been altered over a given period of time,
which another tool could then use to package up the data and send it to
your local or remote backup server.
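
A minimal sketch of the monitoring side, using the inotifywait tool from
inotify-tools (paths hypothetical):

  # append one line per changed file; a later job packages the listed files
  inotifywait -m -r -e modify -e create -e delete --format '%w%f' \
      /srv/data >> /var/log/changed-files.log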

Of course without snapshots you'd need to take other steps to ensure
consistency, if that matters for your application.

Seems DRBD[2] would be another way to address this, though not practical
for a remote backup server, as it'll send every change to the remote
server in real time.

Generally, though, it seems like you are trying to re-invent ZFS, which
does both the snapshotting you want, as well as communicating
differential changes to a remote server. But I understand your
objections to ZFS.
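
For comparison, the nightly ZFS delta is about two commands (dataset and
host names hypothetical):

  zfs snapshot tank/data@2011-10-27
  zfs send -i tank/data@2011-10-26 tank/data@2011-10-27 | \
      ssh backuphost zfs receive backup/data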

1. https://github.com/rvoicilas/inotify-tools/wiki/
2. http://www.drbd.org/


> I can't give too much away...

Why? Developing something you plan to patent? Turning it into a commercial
product whose technology you want to keep a trade secret?

If those don't apply, then why not detail your scheme? Better to have the
idea vetted before you've invested a lot of time in building it.

 -Tom

-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-27 Thread markw
> ma...@mohawksoft.com wrote:
>> Suppose you have a 1TB hard disk. How on earth do you back that up?
>> Think now if you want to pipe that up to the cloud for off-site backup.
>>
>> The snapshot device "knows"
>> what is different. You don't have to really backup 1TB every time, you
>> only have to backup the changes to it since the last full backup.
>
> If you are at the point of considering developing software to do this,
> instead of just using off-the-shelf solutions, then you should consider
> using inotify[1]. I believe using this library you could log to a
> database the inodes that have been altered over a given period of time,
> which another tool could then use to package up the data and send it to
> your local or remote backup server.

That's for file-level backup, which has been done to death.
>
> Of course without snapshots you'd needs to take other steps to insure
> consistency, if that matters for your application.
>
> Seems DRBD[2] would be another way to address this, though not practical
> for a remote backup server, as it'll send every change to the remote
> server in real time.
>
> Generally though, it seems like you are trying to re-invent ZFS, which
> does both the snapshotting you want, as we'll as communicating
> differential changes to a remote server. But I understand your
> objections to ZFS.
>
> 1. https://github.com/rvoicilas/inotify-tools/wiki/
> 2. http://www.drbd.org/
>
>
>> I can't give too much away...
>
> Why? Developing something you plan to patent? Turning it into a commercial
> product whose technology you want to keep a trade secret?

Potentially.
>
> If those don't apply, then why not detail your scheme? Better to have the
> idea vetted before you've invested a lot of time in building it.

Well, let's just say I'm doing three things:

I'm doing some research for my current employer's product.
I'm thinking about creating an open source toolkit for managing a
category of resources.
Lastly, a "pro" version ($$$) of the aforementioned toolkit.

I have to be careful that I do not overlap my current employer. (I have
NO intention to do so, but there is technology overlap, so I'm very
sensitive about how I phrase things.)

>
>  -Tom
>
> --
> Tom Metro
> Venture Logic, Newton, MA, USA
> "Enterprise solutions through open source."
> Professional Profile: http://tmetro.venturelogic.com/
>


___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-27 Thread Bill Bogstad
On Wed, Oct 26, 2011 at 11:49 PM, Tom Metro  wrote:
> ma...@mohawksoft.com wrote:
>> Suppose you have a 1TB hard disk. How on earth do you back that up?
>> Think now if you want to pipe that up to the cloud for off-site backup.
>>
>> The snapshot device "knows"
>> what is different. You don't have to really backup 1TB every time, you
>> only have to backup the changes to it since the last full backup.
>
> If you are at the point of considering developing software to do this,
> instead of just using off-the-shelf solutions, then you should consider
> using inotify[1]. I believe using this library you could log to a
> database the inodes that have been altered over a given period of time,
> which another tool could then use to package up the data and send it to
> your local or remote backup server.

I'm not sure how inotify would help that much.   Mark seems to be
focused on large files (database files, VM images).  He knows they've
changed, but because the files are so large he wants, for efficiency, to
replicate or back up only the blocks that have changed.  His
proposed solution is to piggyback on LVM snapshots, which automatically
keep track of changes at block-level granularity.  As I see it,
he's trying to bring back the functionality of the old dump/restore
backup programs, except in a filesystem-agnostic way.   Maintaining
dump/restore became intractable when the number of filesystems
exploded.  Tracking LVM is likely to be easier.
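
That block-level change tracking is already visible from userspace; for
example, dmsetup reports how much of a snapshot's COW area is allocated
(device name hypothetical, output abridged):

  dmsetup status vg0-snap
  # -> 0 2097152 snapshot 8192/1048576 32
  #    i.e. allocated/total COW sectors: the delta LVM is already tracking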

Your suggestion of DRBD seems like another way to go.   Although, as
you note, the real-time nature of the data stream is an issue.  Using
LVM to consolidate repeated changes to the same block is a possible advantage.

Of course, this is all based on guesswork by me about Mark's goals...

Bill Bogstad
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-27 Thread Rich Braun
One thing I haven't seen on this thread is the "REFLINK" mechanism invented in
the most recent 1.6 release of OCFS2.  (It's described in the user guide:
http://oss.oracle.com/projects/ocfs2/dist/documentation/v1.6/ocfs2-1_6-usersguide.pdf).
 The open-source developers have thought long and hard about this performance
problem and come up with an innovative solution for writeable snapshots.

It doesn't directly solve some of the issues discussed here (regarding cloning
of snapshots) but it provides a better platform for solving them.  And the
filesystem driver has been kernel-embedded for several Linux releases.
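
For what it's worth, the user guide documents a reflink(1) tool shipped
with ocfs2-tools, and recent coreutils cp can do the same on
reflink-capable filesystems -- paths here are hypothetical:

  reflink /vmstore/guest.img /vmstore/guest-clone.img
  # or, equivalently:
  cp --reflink=always /vmstore/guest.img /vmstore/guest-clone.img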

-rich


___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-31 Thread Tom Metro
Bill Bogstad wrote:
>> ...you should consider using inotify[1].
> 
> I'm not sure how inotify would help that much. Mark...wants to
> replicate/backup only the blocks that have changed due to the files'
> large sizes.

You are correct. I was mistakenly thinking that inodes==blocks, which of
course they aren't.

I haven't looked at the inotify library in a while, but it isn't out of
the realm of possibility that, in addition to identifying the inode, the
notification messages may also identify the blocks or byte range
impacted by the I/O operation.

 -Tom


-- 
Tom Metro
Venture Logic, Newton, MA, USA
"Enterprise solutions through open source."
Professional Profile: http://tmetro.venturelogic.com/
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss


Re: [Discuss] lvm snapshot cloning

2011-10-31 Thread Bill Bogstad
On Sun, Oct 30, 2011 at 11:01 AM, Tom Metro  wrote:
> Bill Bogstad wrote:
>
> You are correct. I was mistakenly thinking that inodes==blocks, which of
> course they don't.
>
> I haven't looked at the inotify library in a while, but it isn't out of
> the realm of possibility that in addition to the notification messages
>  identifying the inode, it may also identify the blocks or byte range
> impacted by the I/O operation.

I checked the docs quickly and no such luck.   In any case, it would
have been more likely for inotify to inform you about byte ranges
in the file that changed, rather than blocks within the filesystem
implementation.   You could possibly use something like this to
monitor files for backup through the filesystem rather than through the
block device, but it would require watching every directory on
the filesystem.   That is likely to be too expensive to do through the
inotify system in the general case.

Bill Bogstad
___
Discuss mailing list
Discuss@blu.org
http://lists.blu.org/mailman/listinfo/discuss