Re: [BackupPC-users] Matching files against the pool remotely.

2010-01-04 Thread Malik Recoing .
Les Mikesell  gmail.com> writes:

> 
> Shawn Perry wrote:
> > Take a look at how Unison does its comparisons.
> > 
> 
> It's not impossible - it just can't be done with the existing tools
> and storage scheme.
> 

This does indeed look like a complicated problem.

There is an rsync patch which adds an option, --link-by-hash=DIR, that is pretty
similar to BackupPC's pool. It creates hard links between identical files,
organized by MD4 sum, on the fly during the sync process. I tested it and I'm
looking at the C code right now.
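For illustration, the pooling idea behind --link-by-hash can be sketched in a
few lines. This is only a hypothetical sketch, not the patch's C code: MD5
stands in for the patch's MD4 (which many modern crypto builds no longer ship),
and collision handling is omitted.

```python
import hashlib
import os

def pool_by_hash(src, pool_dir):
    """Hash a file's content and hard-link it into a pool directory keyed
    by that hash, so identical files end up sharing one inode.  A sketch
    of the --link-by-hash idea; the real patch uses MD4, runs inside
    rsync's transfer loop, and must still disambiguate hash collisions."""
    h = hashlib.md5()
    with open(src, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    os.makedirs(pool_dir, exist_ok=True)
    pooled = os.path.join(pool_dir, h.hexdigest())
    if os.path.exists(pooled):
        # Content already pooled: make src another hard link to it.
        os.remove(src)
        os.link(pooled, src)
    else:
        os.link(src, pooled)
    return pooled
```

Syncing two files with the same content then leaves a single inode in the pool,
which is exactly the disk saving BackupPC's pool provides.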


I'm sure it would take fewer than five lines to make it skip a remote file
whose hash matches a file in the pool. It implies a patched rsync, but only on
the BackupPC side (the client), which may be acceptable in specific cases like
mine where bandwidth is critical. This means the rsyncd side would remain a
regular rsync and would send checksums of whole files, without following the
BackupPC_Link logic. That in turn would imply changing the way BackupPC_Link
manages the pool, and so on. And there is also the duplicate-hash problem.

My best idea so far would be to use a patched rsync and a separate pool
directory for the rsyncd method only. Matching against this pool would be done
only if the classic match by path failed. The matching file would also not
simply be hard-linked, but would instead serve as the basis file for an rsync
update, so if there are several candidates one can safely use the first file
found. This way the feature could be integrated into BackupPC with nothing more
than some host configuration.
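The lookup order proposed above could look roughly like this. The helper and
directory names are hypothetical, not BackupPC code; the point is that the
result only seeds rsync's delta transfer, so a bad candidate costs bandwidth
rather than correctness.

```python
import os

def find_basis_file(rel_path, remote_hash, prev_backup_dir, rsyncd_pool_dir):
    """Pick a basis file for the rsync update: first the classic match by
    path in the previous backup, then a match by content hash in the
    separate rsyncd pool.  Because the match only seeds rsync's delta
    transfer, a wrong candidate (e.g. a hash collision) costs extra
    bandwidth but never corrupts the backup, so the first candidate found
    is safe to use."""
    by_path = os.path.join(prev_backup_dir, rel_path)
    if os.path.isfile(by_path):
        return by_path
    by_hash = os.path.join(rsyncd_pool_dir, remote_hash)
    if os.path.isfile(by_hash):
        return by_hash
    return None  # no candidate: fall back to a full transfer
```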

Malik.


--
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
___
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/


Re: [BackupPC-users] Matching files against the pool remotely.

2009-12-19 Thread Les Mikesell
Shawn Perry wrote:
> Take a look at how Unison does its comparisons.
> 

It's not impossible - it just can't be done with the existing tools and storage
scheme.

-- 
   Les Mikesell
lesmikes...@gmail.com



Re: [BackupPC-users] Matching files against the pool remotely.

2009-12-18 Thread Shawn Perry
Take a look at how Unison does its comparisons.

On Fri, Dec 18, 2009 at 9:12 AM, Les Mikesell  wrote:
> Malik Recoing. wrote:
>>
>>>> I know a file will be skipped if it is present in the previous backup,
>>>> but what happens if the file has been backed up for another host?
>>> It is required to be uploaded first as otherwise there's nothing to
>>> compare it to (yeah, I know, that's a pain[1]).
>>>
>>> It might theoretically be sufficient to let the remote side calculate a
>>> hash and compare it against the files in the pool with matching hashes,
>>> and then let rsync do full compares against all the matching hashes in the
>>> pool (since hash collisions happen), but I don't believe anyone has tried
>>> to code this up yet, and it would only be of limited use in systems that
>>> were network bandwidth constrained rather than disk bandwidth constrained.
>>
>> I'm quite sure it would be an improvement for both. Globally there would be
>> no overhead. Better still: the hash calculation would effectively be
>> "clustered" by delegating it to the clients. The matching of identical
>> hashes is done by BackupPC_Link anyway, so BackupPC_Link would become
>> pointless in an rsync-only configuration. Disk and network traffic would
>> both be reduced, as many files wouldn't be transferred at all.
>
> There are two problems: one is that the remote agent is a standard rsync
> binary that knows nothing about backuppc's hashes; the other is that
> hash collisions are normal and expected - and disambiguated by a full
> data comparison.
>
>> I thought of a similar solution. When your clients are mostly "full system
>> tree" backups, you can keep ready-to-copy backups of the different OS
>> trees. When a new client is added, you copy the corresponding OS directory
>> as if it were the first full backup.
>
> Yes, if your remote machines are essentially clones of each other, you
> could create their pc directories as clones with a tool that knows how
> to make a tree of hardlinks.
>
> A better solution might be to have a local machine at the site running
> backuppc and work out some way to get an offsite copy.  If bandwidth is
> such an issue, you are also going to have trouble doing a restore.  But,
> if you've followed this mail list very long you'd know that the 'offsite
> copy' problem doesn't have a good solution yet either.
>
> --
>   Les Mikesell
>    lesmikes...@gmail.com



Re: [BackupPC-users] Matching files against the pool remotely.

2009-12-18 Thread Les Mikesell
Malik Recoing. wrote:
>
>>> I know a file will be skipped if it is present in the previous backup,
>>> but what happens if the file has been backed up for another host?
>> It is required to be uploaded first as otherwise there's nothing to
>> compare it to (yeah, I know, that's a pain[1]).
>>
>> It might theoretically be sufficient to let the remote side calculate a
>> hash and compare it against the files in the pool with matching hashes,
>> and then let rsync do full compares against all the matching hashes in the
>> pool (since hash collisions happen), but I don't believe anyone has tried
>> to code this up yet, and it would only be of limited use in systems that
>> were network bandwidth constrained rather than disk bandwidth constrained.
> 
> I'm quite sure it would be an improvement for both. Globally there would be
> no overhead. Better still: the hash calculation would effectively be
> "clustered" by delegating it to the clients. The matching of identical
> hashes is done by BackupPC_Link anyway, so BackupPC_Link would become
> pointless in an rsync-only configuration. Disk and network traffic would
> both be reduced, as many files wouldn't be transferred at all.

There are two problems: one is that the remote agent is a standard rsync 
binary that knows nothing about backuppc's hashes; the other is that 
hash collisions are normal and expected - and disambiguated by a full 
data comparison.

> I thought of a similar solution. When your clients are mostly "full system
> tree" backups, you can keep ready-to-copy backups of the different OS
> trees. When a new client is added, you copy the corresponding OS directory
> as if it were the first full backup.

Yes, if your remote machines are essentially clones of each other, you 
could create their pc directories as clones with a tool that knows how 
to make a tree of hardlinks.
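One such tool is GNU `cp -al`; the same effect takes only a few lines of
Python. This is a sketch with hypothetical paths: it recreates the directory
structure and hard-links every regular file, but ignores BackupPC specifics
such as attrib files.

```python
import os

def clone_tree_with_hardlinks(src, dst):
    """Recreate src's directory structure under dst, hard-linking every
    regular file instead of copying it (the same effect as GNU `cp -al`),
    so the clone costs almost no extra disk space."""
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = dst if rel == "." else os.path.join(dst, rel)
        os.makedirs(target, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))
```

After cloning, both trees reference the same inodes, so the new host's "first
full backup" deduplicates against the template for free.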

A better solution might be to have a local machine at the site running 
backuppc and work out some way to get an offsite copy.  If bandwidth is 
such an issue, you are also going to have trouble doing a restore.  But, 
if you've followed this mailing list very long you'd know that the 'offsite
copy' problem doesn't have a good solution yet either.

-- 
   Les Mikesell
lesmikes...@gmail.com





Re: [BackupPC-users] Matching files against the pool remotely.

2009-12-18 Thread Malik Recoing .
Tim Connors  gmail.com> writes:
> 
> On Fri, 18 Dec 2009, Malik Recoing. wrote:
> 
> > I know a file will be skipped if it is present in the previous backup,
> > but what happens if the file has been backed up for another host?
> 
> It is required to be uploaded first as otherwise there's nothing to
> compare it to (yeah, I know, that's a pain[1]).
> 
> It might theoretically be sufficient to let the remote side calculate a
> hash and compare it against the files in the pool with matching hashes,
> and then let rsync do full compares against all the matching hashes in the
> pool (since hash collisions happen), but I don't believe anyone has tried
> to code this up yet, and it would only be of limited use in systems that
> were network bandwidth constrained rather than disk bandwidth constrained.

I'm quite sure it would be an improvement for both. Globally there would be no
overhead. Better still: the hash calculation would effectively be "clustered"
by delegating it to the clients. The matching of identical hashes is done by
BackupPC_Link anyway, so BackupPC_Link would become pointless in an rsync-only
configuration. Disk and network traffic would both be reduced, as many files
wouldn't be transferred at all.

If such a feature existed, it would give BackupPC a "magic" touch, backing up a
whole tree of well-known files in a minute, even over a slow network.

What a pity I'm not fluent in Perl...


> [1] I just worked around this myself by copying a large set of files onto
> sneakernet (my USB key), copying them onto a directory on the local backup
> server, backing that directory up, then moving the corresponding directory
> in the backup tree into the previous backup of the remote system, so it
> will be picked up and compared against the same files when that remote
> system is next backed up.  I find out tomorrow whether that actually
> worked :)
> 

I thought of a similar solution. When your clients are mostly "full system
tree" backups, you can keep ready-to-copy backups of the different OS trees.
When a new client is added, you copy the corresponding OS directory as if it
were the first full backup.

Malik.







Re: [BackupPC-users] Matching files against the pool remotely.

2009-12-18 Thread Tim Connors
On Fri, 18 Dec 2009, Malik Recoing. wrote:

> The Holy Doc says (Barratt:Design:operation:2): "it checks each file in the
> backup to see if it is identical to an existing file from any previous
> backup of any PC. It does this without needing to write the file to disk."
>
> But it doesn't say "without the need to upload the file in memory".
>
> I know a file will be skipped if it is present in the previous backup, but
> what happens if the file has been backed up for another host?

It is required to be uploaded first as otherwise there's nothing to
compare it to (yeah, I know, that's a pain[1]).

It might theoretically be sufficient to let the remote side calculate a
hash and compare it against the files in the pool with matching hashes,
and then let rsync do full compares against all the matching hashes in the
pool (since hash collisions happen), but I don't believe anyone has tried
to code this up yet, and it would only be of limited use in systems that
were network bandwidth constrained rather than disk bandwidth constrained.
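That two-stage match can be sketched as follows: the remote-supplied hash
narrows the pool down to a handful of candidates, and each candidate still
needs a full comparison because collisions happen. In this sketch, MD5 stands
in for whatever hash the remote side would compute, and `filecmp` stands in
for the block-checksum exchange rsync would actually run over the wire.

```python
import filecmp
import hashlib
import os
from collections import defaultdict

def index_pool(pool_dir):
    """Group pool files by content hash; a bucket can legitimately hold
    files with different contents, since hash collisions are expected."""
    index = defaultdict(list)
    for name in sorted(os.listdir(pool_dir)):
        path = os.path.join(pool_dir, name)
        with open(path, "rb") as f:
            index[hashlib.md5(f.read()).hexdigest()].append(path)
    return index

def match_against_pool(candidate_file, remote_hash, index):
    """The remote hash selects a short candidate list; each candidate
    still gets a full byte-by-byte comparison before it is trusted."""
    for pooled in index.get(remote_hash, []):
        if filecmp.cmp(candidate_file, pooled, shallow=False):
            return pooled
    return None
```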

[1] I just worked around this myself by copying a large set of files onto
sneakernet (my USB key), copying them onto a directory on the local backup
server, backing that directory up, then moving the corresponding directory
in the backup tree into the previous backup of the remote system, so it
will be picked up and compared against the same files when that remote
system is next backed up.  I find out tomorrow whether that actually
worked :)


-- 
TimC
Computer screens simply ooze buckets of yang.
To balance this, place some women around the corners of the room.
-- Kaz Cooke, Dumb Feng Shui



[BackupPC-users] Matching files against the pool remotely.

2009-12-18 Thread Malik Recoing .
Hello,

I'm trying to optimize BackupPC for use over the internet with a lot of clients
(say 100 per server). The clients run rsyncd and are connected via DSL lines of
variable speed.

Many discussions on this list have helped me a lot, but I can't figure out one
thing: does BackupPC use rsync features to skip a file already in the pool
_before_ it is uploaded? Or does it need to upload the file first, and only
then match it against the pool, possibly replacing it with a hard link?

In the first case this saves bandwidth and disk; in the second case, only disk
space. Is BackupPC able to match a file remotely?

The Holy Doc says (Barratt:Design:operation:2): "it checks each file in the
backup to see if it is identical to an existing file from any previous backup
of any PC. It does this without needing to write the file to disk."

But it doesn't say "without the need to upload the file into memory".
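The "without writing to disk" part can be pictured as a streaming compare
against an already-pooled file. This is purely an illustrative sketch, not
BackupPC's actual code: chunks are checked as they arrive, nothing is written,
and the first mismatch aborts early.

```python
def stream_matches(chunks, pool_path):
    """Compare an incoming stream of byte chunks against a pooled file
    without writing the stream anywhere: read the pooled file in step
    with the arriving chunks and bail out on the first mismatch."""
    with open(pool_path, "rb") as pooled:
        for chunk in chunks:
            if pooled.read(len(chunk)) != chunk:
                return False
        return pooled.read(1) == b""  # lengths must match too
```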

I know a file will be skipped if it is present in the previous backup, but what
happens if the file has been backed up for another host?

Thank you for enlightening me.

Malik.
