Re: [Hampshire] Open source network backup with de-dupe.

2010-07-19 Thread Damian Brasher
Adrian Bridgett wrote:

> I've come to the conclusion that there aren't any decent open source
> backup products.  Yes, I do actually have it on my todo list to write
> one :-)

:-) It's hard. I've not been able to find a decent contract for my company, so
I have had to cease development of DIASER, which is a shame. We reached beta-2
but have had to create a cut-off point. So the whole caboodle may well be up
for sale in a few weeks' time, as Interlinux Ltd has brought the project's IP
and finances to safety by entering a dormant state - this is a gradual process.

I don't want the community to use something that can't be or isn't well
maintained or sustained.

 -- Damian

-- 
http://interlinux.co.uk

DIASER RoadMap http://bit.ly/1Vtdp5





Re: [Hampshire] Open source network backup with de-dupe.

2010-07-17 Thread Adrian Bridgett
On Thu, Jul 15, 2010 at 21:11:25 +0100 (+0100), Keith Edmunds wrote:
> However, Chris is right: you cannot *know* that two files are the same
> unless you compare them, byte by byte. If hashes are good enough for you,
> just back up the hashes and save lots of time and disk space!

My understanding on this point is that in fact a hash _is_ good enough
- or rather, the odds of a hash not being good enough are sufficiently
low (cf. corruption on hard disks etc.) that it's irrelevant.  For
instance, see:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=122945
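
A back-of-envelope birthday-bound calculation makes "sufficiently low"
concrete (my own sketch in Python; the 160-bit hash and the 10^12-file
pool size are illustrative assumptions, not figures from the bug report):

    # P(any collision among n random b-bit hashes) is roughly
    # n^2 / 2^(b+1), valid while n is much smaller than 2^(b/2).
    def collision_probability(n_items: int, hash_bits: int = 160) -> float:
        """Approximate birthday-bound probability of any collision."""
        return n_items ** 2 / 2 ** (hash_bits + 1)

    p = collision_probability(10 ** 12)   # a petabyte-scale pool of files
    print(f"P(collision) ~ {p:.1e}")      # ~3.4e-25

For comparison, quoted unrecoverable-read-error rates for hard disks are
around 1e-14 per bit, many orders of magnitude more likely than that.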

Adrian
-- 
bitcube.co.uk - Expert Linux infrastructure consultancy
Puppet, Debian, Red Hat, Ubuntu, CentOS



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-17 Thread Vic

> Being able to take snapshots every few minutes and sync them to a
> remote datacentre really is rather nice

Yep. I've done that with LVM snapshots and rsync. Very handy :-)
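
For anyone who wants to try it, a rough Python sketch of that pattern
(volume group, sizes, mount point and destination are made up; it needs
root, an existing LVM setup and an existing /mnt/snap directory):

    import subprocess

    VG, LV, SNAP = "vg0", "data", "data-snap"
    DEST = "backup@dr-site:/srv/mirrors/data/"

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Freeze a point-in-time view of the volume.
    run("lvcreate", "--snapshot", "--size", "5G",
        "--name", SNAP, f"{VG}/{LV}")
    try:
        run("mount", "-o", "ro", f"/dev/{VG}/{SNAP}", "/mnt/snap")
        try:
            # Push the consistent snapshot, not the live filesystem.
            run("rsync", "-a", "--delete", "/mnt/snap/", DEST)
        finally:
            run("umount", "/mnt/snap")
    finally:
        # Snapshots consume copy-on-write space, so drop them promptly.
        run("lvremove", "-f", f"{VG}/{SNAP}")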

Vic.




Re: [Hampshire] Open source network backup with de-dupe.

2010-07-17 Thread Adrian Bridgett
On Thu, Jul 15, 2010 at 09:15:23 +0100 (+0100), James Courtier-Dutton wrote:
> > I've come to the conclusion that there aren't any decent open source
> > backup products.  Yes, I do actually have it on my todo list to write
[snip]
> So, in summary, it is not good enough to replace the system currently
> at my customer's site, which cost over £100,000!

There's a good reason they can charge 100K :-)

VSS and other snapshotting technologies (particularly those built into
decent storage arrays) are the way forward if you can afford it.

Being able to take snapshots every few minutes and sync them to a
remote datacentre really is rather nice :D  See also "continuous data
protection".

TBH my personal attitude is that snapshots are great for whole-box restores,
and file-level backups are good for digging out a single file.  Good sysadmin
practice should almost remove the need to ever use backups in
enterprise environments.  Testing on pre-production environments, then
rolling out onto production boxes (or flip-flopping environments, where
you clone A to B, upgrade B, then flip the service over to B) works a
treat.

Backups then become emergency-only ("oh crap, we've been hacked" and/or
"the database is corrupted"), and it's a question of how much data loss
you can afford, time-wise.

Adrian
-- 
bitcube.co.uk - Expert Linux infrastructure consultancy
Puppet, Debian, Red Hat, Ubuntu, CentOS



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread Keith Edmunds
On Thu, 15 Jul 2010 20:56:01 +0100, james.dut...@gmail.com said:

> Say you change one byte in a large file. rsync will not send the
> entire file again, it will only send the changes.

That's what BackupPC does.

However, Chris is right: you cannot *know* that two files are the same
unless you compare them, byte by byte. If hashes are good enough for you,
just back up the hashes and save lots of time and disk space!

-- 
Keith Edmunds

Tiger Computing Ltd, "The Linux Specialists"
Helping businesses make the most of Linux
http://www.tiger-computing.co.uk



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread James Courtier-Dutton
On 15 July 2010 20:44, Chris Dennis  wrote:
> On 15/07/10 15:39, James Courtier-Dutton wrote:
>>
>> Take one central-site PC called "A".
>> Take two remote-site PCs called "B" and "C".
>>
>> B has already sent a full backup to A.
>> C wishes to send a full backup to A, but lots of the data on C is the same
>> as on B.
>> C generates hashes of its files, and only sends the hashes to A.
>> A responds to C saying which hashes it has not already got from B.
>> C then only sends a subset of the data, i.e. data that was not already
>> sent from B.
>>
>> Thus, a lot of WAN bandwidth is saved.
>
> The problem is that hash collisions can occur.  Two files with the same hash
> are /probably/ the same file, but probably isn't good enough -- a backup
> system has to be 100% sure.  And the only way to be certain is to get both
> files and compare them byte by byte.
>

There are algorithms that detect collisions without having to send the
entire file.
For example, rsync uses them.
Say you change one byte in a large file. rsync will not send the
entire file again, it will only send the changes.
If I followed your statement, I would have to stop using rsync.
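
To illustrate the mechanism, a simplified sketch of the rolling weak
checksum rsync's delta transfer is built around (real rsync pairs this
weak sum with a strong per-block checksum to guard against collisions):

    def weak_sum(block: bytes) -> int:
        """rsync-style weak checksum of one block (two 16-bit halves)."""
        a = sum(block) & 0xFFFF
        b = sum((len(block) - i) * c for i, c in enumerate(block)) & 0xFFFF
        return (b << 16) | a

    def roll(s: int, old: int, new: int, blocklen: int) -> int:
        """Slide the window one byte without rescanning the whole block."""
        a = ((s & 0xFFFF) - old + new) & 0xFFFF
        b = ((s >> 16) - blocklen * old + a) & 0xFFFF
        return (b << 16) | a

    data = b"abcdefgh"
    s = weak_sum(data[0:4])
    assert roll(s, data[0], data[4], 4) == weak_sum(data[1:5])

Because the sum rolls cheaply, the sender can find matching blocks at
any offset, and only the blocks with no match cross the wire.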

Kind Regards

James


Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread Chris Dennis

On 15/07/10 15:39, James Courtier-Dutton wrote:


> Take one central-site PC called "A".
> Take two remote-site PCs called "B" and "C".
>
> B has already sent a full backup to A.
> C wishes to send a full backup to A, but lots of the data on C is the same
> as on B.
> C generates hashes of its files, and only sends the hashes to A.
> A responds to C saying which hashes it has not already got from B.
> C then only sends a subset of the data, i.e. data that was not already
> sent from B.
>
> Thus, a lot of WAN bandwidth is saved.


The problem is that hash collisions can occur.  Two files with the same
hash are /probably/ the same file, but probably isn't good enough -- a
backup system has to be 100% sure.  And the only way to be certain is to
get both files and compare them byte by byte.


BackupPC uses hashes for file names, but also checks for hash collisions 
and deals with them when they happen.
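
Roughly like this (my own Python illustration of that
check-then-disambiguate idea, not BackupPC's actual code or pool layout):

    import filecmp, hashlib, os, shutil

    def pool_file(pool: str, src: str) -> str:
        """Store src in the pool, trusting a hash match only after a
        byte-by-byte comparison; real collisions get a numeric suffix."""
        with open(src, "rb") as f:
            h = hashlib.md5(f.read()).hexdigest()
        n = 0
        while True:
            cand = os.path.join(pool, h if n == 0 else f"{h}_{n}")
            if not os.path.exists(cand):
                shutil.copy2(src, cand)   # new content: add to pool
                return cand
            if filecmp.cmp(src, cand, shallow=False):
                return cand               # verified duplicate: reuse it
            n += 1                        # genuine collision: next slot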




> There is also the possibility of doing this on a per-site basis. So, one
> machine at the site de-dupes all the data for that site, and then just
> sends the de-duped data over the WAN link.


That could work, but needs more software at the client end.  Does rsync 
do anything like that?


cheers

Chris
--
Chris Dennis  cgden...@btinternet.com
Fordingbridge, Hampshire, UK



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread James Courtier-Dutton
On 15 July 2010 15:14, Keith Edmunds  wrote:
> Hi James
>
> You're being unrealistic.
>
>> The documentation gives no explanation of what WAN bandwidth it will use.
>
> How can it? It depends on how much data you back up; more accurately, it
> depends on how much data has changed since the last backup.
>
>> It reads as if it gets all the data into a central location, and then
>> de-dupes it.
>
> It does.
>
>> This is not good for WAN bandwidth at all. If the same file is on two
>> computers, I only want one computer to send the file once.
>
> Explain how the server can ascertain that the data is the same on both
> clients without getting a full copy of the data. Note: not ascertain that
> it may be the same, but that it IS the same.
>

Take one central-site PC called "A".
Take two remote-site PCs called "B" and "C".

B has already sent a full backup to A.
C wishes to send a full backup to A, but lots of the data on C is the same
as on B.
C generates hashes of its files, and only sends the hashes to A.
A responds to C saying which hashes it has not already got from B.
C then only sends a subset of the data, i.e. data that was not already
sent from B.

Thus, a lot of WAN bandwidth is saved.
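
As a minimal Python sketch of that exchange (the file names, pool
contents and helper names are all invented for illustration):

    import hashlib

    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    # On C: hash every file and send only the manifest of digests.
    local_files = {"/etc/motd": b"hello", "/srv/app.tar": b"big blob"}
    manifest = {path: digest(data) for path, data in local_files.items()}

    # On A: report which digests are not already in the pool.
    pool = {digest(b"hello")}            # A already holds this from B
    needed = {h for h in manifest.values() if h not in pool}

    # On C: upload only the content A asked for.
    upload = {p: d for p, d in local_files.items() if digest(d) in needed}
    print(sorted(upload))                # only /srv/app.tar crosses the WAN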

There is also the possibility of doing this on a per-site basis. So, one
machine at the site de-dupes all the data for that site, and then just
sends the de-duped data over the WAN link.

Another way to view this: when sending files from C to A, A compares the
hashes sent by C with its entire file store, and not just against the
single file that C is sending.

Kind Regards

James



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread Keith Edmunds
Hi James

You're being unrealistic.

> The documentation gives no explanation of what WAN bandwidth it will use.

How can it? It depends on how much data you back up; more accurately, it
depends on how much data has changed since the last backup.

> It reads as if it gets all the data into a central location, and then
> de-dupes it.

It does.

> This is not good for WAN bandwidth at all. If the same file is on two
> computers, I only want one computer to send the file once.

Explain how the server can ascertain that the data is the same on both
clients without getting a full copy of the data. Note: not ascertain that
it may be the same, but that it IS the same.

> Also, it sounds very much like a Linux only solution

In your original post you made no mention of the requirement that it would
back up Windows systems, and given that this is a Linux ML, it's not
unreasonable to discuss Linux solutions. However, BackupPC does back up
Windows systems, although it doesn't use VSS.

> So, in summary, it is not good enough to replace the system currently
> at my customer's site, which cost over £100,000!
> That product uses RPC to trigger a VSS snapshot on the remote Windows
> machine so that backups work better.
> It also handles WAN bandwidth better.

So keep using it then. Those who want to use Windows servers usually
understand that such a decision will be costly; those who don't understand
that usually find it out quite quickly.



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-15 Thread James Courtier-Dutton
On 14 July 2010 19:56, Adrian Bridgett  wrote:
> On Wed, Jul 14, 2010 at 19:45:00 +0100 (+0100), Keith Edmunds wrote:
>> On Wed, 14 Jul 2010 12:25:10 +0100, james.dut...@gmail.com said:
>>
>> > Does anyone know of any open source backup programs that do de-dupe
>> > for the express purpose of reducing traffic over the WAN?
>>
>> BackupPC. Recommended.
>
> Snap :-)
>
> + dedupes between backups and across boxes
> + nice gui
> - file layout is sadly not rsyncable from the raw FS
>
> I've come to the conclusion that there aren't any decent open source
> backup products.  Yes, I do actually have it on my todo list to write
> one :-)
>
> PS: hantslug.org.uk is backed up using backuppc
>

The documentation gives no explanation of what WAN bandwidth it will use.
It reads as if it gets all the data into a central location, and then
de-dupes it.
This is not good for WAN bandwidth at all. If the same file is on two
computers, I only want one computer to send the file once.
Also, it sounds very much like a Linux-only solution, because its
explanation of how it does Windows backups does not use VSS, and it will
therefore have problems with locked files.
So, in summary, it is not good enough to replace the system currently
at my customer's site, which cost over £100,000!
That product uses RPC to trigger a VSS snapshot on the remote Windows
machine so that backups work better.
It also handles WAN bandwidth better.

Kind Regards

James


Re: [Hampshire] Open source network backup with de-dupe.

2010-07-14 Thread Adrian Bridgett
On Wed, Jul 14, 2010 at 19:45:00 +0100 (+0100), Keith Edmunds wrote:
> On Wed, 14 Jul 2010 12:25:10 +0100, james.dut...@gmail.com said:
> 
> > Does anyone know of any open source backup programs that do de-dupe
> > for the express purpose of reducing traffic over the WAN?
> 
> BackupPC. Recommended.

Snap :-)

+ dedupes between backups and across boxes
+ nice gui
- file layout is sadly not rsyncable from the raw FS

I've come to the conclusion that there aren't any decent open source
backup products.  Yes, I do actually have it on my todo list to write
one :-)

PS: hantslug.org.uk is backed up using backuppc

Adrian
-- 
bitcube.co.uk - Expert Linux infrastructure consultancy
Puppet, Debian, Red Hat, Ubuntu, CentOS



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-14 Thread Keith Edmunds
On Wed, 14 Jul 2010 12:25:10 +0100, james.dut...@gmail.com said:

> If two sites have the same data, and one site has already sent the
> initial backup seeding to the central site, the second site should not
> need to also send the same data.

Further to my earlier reply: yes, it does need to send the data (otherwise
the server won't know it's the same). However, BackupPC only *stores* one
copy of the data.
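
The storage side can be pictured like this (an illustrative
content-addressed pool in the spirit of BackupPC's hard-linked pool,
not its actual layout or hash choice; pool and backup directories must
share a filesystem for hard links to work):

    import hashlib, os, shutil

    def store(src: str, pool: str, backup_dir: str) -> None:
        """Contents live once in the pool; each backup gets a hard link."""
        os.makedirs(pool, exist_ok=True)
        os.makedirs(backup_dir, exist_ok=True)
        with open(src, "rb") as f:
            h = hashlib.sha256(f.read()).hexdigest()
        pooled = os.path.join(pool, h)
        if not os.path.exists(pooled):
            shutil.copy2(src, pooled)     # first copy pays the disk cost
        os.link(pooled, os.path.join(backup_dir, os.path.basename(src)))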

-- 
Keith Edmunds

Tiger Computing Ltd, "The Linux Specialists"
Helping businesses make the most of Linux
http://www.tiger-computing.co.uk



Re: [Hampshire] Open source network backup with de-dupe.

2010-07-14 Thread Keith Edmunds
On Wed, 14 Jul 2010 12:25:10 +0100, james.dut...@gmail.com said:

> Does anyone know of any open source backup programs that do de-dupe
> for the express purpose of reducing traffic over the WAN?

BackupPC. Recommended.

-- 
Keith Edmunds

Tiger Computing Ltd, "The Linux Specialists"
Helping businesses make the most of Linux
http://www.tiger-computing.co.uk
