Re: better RSYNC mirroring , for .debs and others

2000-03-10 Thread Edmund GRIMLEY EVANS
>>> I'm not arguing the rest of your points, but I'm curious about
>>> this one. IIRC, the last thing a full bootstrap of GCC does,
>>> after building stage one binaries with the native compiler,
>>
>> Hum, It *used* to do this, can't seem to get it to do it today though
>> oh well
>>
>> IIRC it only applied to debug information, it included timestamps or
>> some such.
>
> There is a small header at the beginning of an object file which is
> different each time, because it contains a time stamp.
>
> This is why 'make compare' removes the first 16 bytes of the object
> files before comparing.
>
> for file in *$(objext); do \
>   tail +16c ./$$file > tmp-foo1; \
>   tail +16c stage$$stage/$$file > tmp-foo2 \
>     && (cmp tmp-foo1 tmp-foo2 > /dev/null 2>&1 || echo $$file differs >> .bad_compare) || true; \
> done

Yes, I've seen that stuff fly (or crawl) past while building gcc, but
I don't understand in what circumstances it's required. I can't get
gcc to include timestamps when I try it.

I think the main spurious diff culprit in debs is gzip (and tar z).

It occurred to me that perhaps debs should be built with a
hacked/wrapped gzip to avoid this problem.
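
One reason gzip output differs between otherwise identical builds is that the gzip header records a modification time and, optionally, the original file name. A minimal sketch of what a "wrapped gzip" might do, assuming it is enough to pin those header fields (the function name gzip_deterministic is made up for illustration):

```python
import gzip
import io


def gzip_deterministic(data: bytes, level: int = 9) -> bytes:
    """Compress data with the gzip header timestamp pinned to zero."""
    buf = io.BytesIO()
    # filename="" keeps the original file name out of the header;
    # mtime=0 replaces the current time that would otherwise be recorded.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb",
                       compresslevel=level, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()
```

On the command line, gzip -n has a similar effect. This only removes the header noise; it does nothing about real differences in the compressed payload.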

I don't know how autobuilders, etc, work, but I can imagine there may
be circumstances where it would be useful to be able to tell
immediately that a package was not changed when it was rebuilt after
installing a new version of some header file (or whatever) that might,
potentially, have caused the package to change. Or is this all
properly handled by some different mechanism that I can read about
somewhere?

Edmund



Re: better RSYNC mirroring , for .debs and others

2000-03-10 Thread Jacob Kuntz
you're quite right. why are we using rsync anyway? in its current state
it's a waste of resources except for block-oriented files like cdimages.
wouldn't it make more sense to use something like mirror or wget until
debdiff matures? are mirror admins required to use rsync?

another thought: would it be possible to implement an alternative Packages.gz
that works more like a database? ie, fixed length fields, etc? that would
make rsync noticeably more effective.
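
a rough sketch of the fixed-length idea, to show why an update stays local to one record. the field widths, field choice and function names here are all made up for illustration, not a proposed format:

```python
import struct

# Hypothetical layout: 32-byte package name, 24-byte version, 4-byte size.
RECORD = struct.Struct("<32s24sI")


def pack_entry(name: str, version: str, size: int) -> bytes:
    """Encode one index entry; short strings are padded with NUL bytes."""
    return RECORD.pack(name.encode(), version.encode(), size)


def unpack_entry(raw: bytes):
    """Decode one fixed-length entry back into (name, version, size)."""
    name, version, size = RECORD.unpack(raw)
    return name.rstrip(b"\0").decode(), version.rstrip(b"\0").decode(), size


def update_entry(index: bytearray, slot: int,
                 name: str, version: str, size: int) -> None:
    """Overwrite one record in place; every other byte keeps its offset."""
    start = slot * RECORD.size
    index[start:start + RECORD.size] = pack_entry(name, version, size)
```

because every record has the same length, changing one version string leaves all other records at the same byte offsets, which is the property a block-matching tool benefits from.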

Tom Rothamel ([EMAIL PROTECTED]) wrote:
> I do happen to think that rsync is an inefficient solution to archive
> mirroring, however, as it seems it would need to scan and checksum a
> huge number of files every time it runs. Probably a better way would
> be to have dinstall[1] generate a list of changes it makes to the
> archive, and have people mirroring the archive use those lists to
> figure out what needs to be downloaded.
>
> This would also have the benefit of making it easy to ensure that
> archive mirrors are always in a consistent state. (i.e., Packages.gz is
> updated after new packages have been downloaded, but before old
> packages are deleted.)
 

-- 
(jacob kuntz)        [EMAIL PROTECTED] [EMAIL PROTECTED],underworld}.net
(megabite systems)   think free speech, not free beer.



Re: better RSYNC mirroring , for .debs and others

2000-03-10 Thread Jason Gunthorpe

On Fri, 10 Mar 2000, Jacob Kuntz wrote:

> wouldn't it make more sense to use something like mirror or wget until
> debdiff matures? are mirror admins required to use rsync?

Sadly rsync is far, far better than mirror or wget, both of which are
verging on useless for an archive of our size.

We use rsync not for its ability to do binary file diffs, but because it
largely works.

Sadly my project to get a real mirroring system written is on hold (alas).

Jason



better RSYNC mirroring , for .debs and others

2000-03-09 Thread Andrea Mennucc1

hi everybody

I have implemented an idea for reducing download stress for everybody
who mirrors a lot of data using rsync, such as the people mirroring
Debian GNU/Linux: currently, many Debian leaf mirrors use rsync
to mirror from the main .debian.org hosts.

rsync contains a wonderful algorithm to speed up downloads when mirroring
files which have only minor differences;
the only problem is that this algorithm is ALMOST NEVER used
when mirroring a Debian repository.
Indeed, whenever a new version of a package enters the Debian repository,
the package file has a different name, so rsync just does a
full download.
To summarize, rsync currently gives some speedup only
when it downloads Packages.gz files, or when it skips an already existing
package.

Well, I have just implemented a simple
way to use the algorithm even when downloading the .debs.

Here is a simple example.

Suppose the remote side has
  $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
whereas locally we have
  /debian/dist/bin/dpkg_1.deb

When rsync looks for a local version of
  /debian/dist/bin/dpkg_2.deb
and there is none, rsync does
  ls -t /debian/dist/bin/dpkg_*
and picks the most recent file it finds.

This way, rsync will use the file /debian/dist/bin/dpkg_1.deb
to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
(using its fabulous algorithm).
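
The lookup in the example can be sketched in a few lines. This is only an illustration of the "ls -t dpkg_*" step, not the patched rsync code; the function name and the rule of splitting the file name at the first underscore are assumptions:

```python
import glob
import os
from typing import Optional


def find_alternative_version(wanted_path: str) -> Optional[str]:
    """Return a local file usable as a basis for downloading wanted_path."""
    if os.path.exists(wanted_path):
        return wanted_path  # the exact file is already there
    directory, filename = os.path.split(wanted_path)
    package = filename.split("_", 1)[0]  # "dpkg_2.deb" -> "dpkg"
    pattern = os.path.join(glob.escape(directory),
                           glob.escape(package) + "_*")
    candidates = [p for p in glob.glob(pattern) if os.path.isfile(p)]
    if not candidates:
        return None
    # Same choice as `ls -t`: the most recently modified file wins.
    return max(candidates, key=os.path.getmtime)
```

Called with /debian/dist/bin/dpkg_2.deb on a mirror that only holds dpkg_1.deb, it returns the path of dpkg_1.deb; with no dpkg_* file at all it returns None and a full download is the only option.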

BIG PRO: my new rsync is totally compatible with the old one.

Conclusion:
this idea would make all Debian mirror people happier
(especially if they mirror unstable; consider that, often,
when a new version of a package is released, only small changes are made;
sometimes only the .postinst, or the like, has really changed.
This may, though, be masked by the compression, alas; but see TODO.)

I attach two files: the first is a diff showing where, in
the rsync 2.4.1 source tree, I have made modifications;
the second is a .tgz of all the new and modified files you
need to build the new rsync.
To build, first download
the source code (see rsync.samba.org/rsync/download.html),
then unpack the file rsync.diffsrc.tgz in the source tree
and build.

You may also get the compiled binary directly from
 ftp://tonelli.sns.it/pub/rsync/rsync
and all the new code together in
 ftp://tonelli.sns.it/pub/rsync

TODO:
there are some potentially good ideas here:

1) The idea is to add modules to rsync:
  a gzip module, a deb module, an rpm module, ...;
  currently, modules just look for an older local version of the file.

  In a future version, each module would
  apply to a certain type of file and create
  another file to pass to rsync,
  chosen so that it is likely to give more speedup:
  e.g., the gzip module would unzip files before doing comparisons,
  and the deb module would unzip the data.tar.gz part of a package.

  CONS: this would not be backward compatible, of course.

  The idea is that a module may provide the following calls:
   find_alternative_version_MOD()
   receive_file_MOD()
   send_file_MOD()

  Currently, only find_alternative_version_deb() is implemented.

  If rsync uses only the find_alternative_version_MOD()
  calls, then it is backward compatible with the usual version
  (in a sense, it is doing what the option --compare-dest already does,
  only in a smarter way).

  I have not yet implemented any receive_file_MOD() or
  send_file_MOD(): these would need a change in the protocol.
  I hope that the rsync authors will give permission.

1b) My idea (not sure) is that rsync may work if provided with named pipes
  instead of files: indeed, according to the technical report,
  it needs to read the local and remote files only once,
  and then it writes the local file without ever seeking backwards;
  so the above modules would not need to actually
  use disk space and create temporary files.


2) For faster apt-get downloading,
  it may be possible to do the same trick WHEN UPGRADING
  INSTALLED PACKAGES! Here is the idea:
   apt-get creates a local version of the package
   (using dpkg-repack)
   and then runs rsync against it to get the remote version.
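
To make the module idea in item 1 more concrete, here is a rough sketch of the three calls as an interface. It is written in Python only for brevity (the real patch is C), and the class names, the module_for() dispatcher and the "last entry is newest" ordering of local_files are all assumptions made for illustration:

```python
from typing import Iterable, List, Optional


class Module:
    """Hypothetical base class for a per-file-type rsync module."""

    suffix = ""

    def find_alternative_version(self, wanted: str,
                                 local_files: Iterable[str]) -> Optional[str]:
        return None

    def receive_file(self, path: str) -> bytes:
        raise NotImplementedError("would need a protocol change")

    def send_file(self, path: str) -> bytes:
        raise NotImplementedError("would need a protocol change")


class DebModule(Module):
    suffix = ".deb"

    def find_alternative_version(self, wanted, local_files):
        # local_files is assumed to be ordered oldest to newest.
        prefix = wanted.split("_", 1)[0] + "_"
        matches = [f for f in local_files
                   if f != wanted and f.startswith(prefix)
                   and f.endswith(self.suffix)]
        return matches[-1] if matches else None


def module_for(path: str, modules: List[Module]) -> Optional[Module]:
    """Pick the first module whose suffix matches the file name."""
    for module in modules:
        if module.suffix and path.endswith(module.suffix):
            return module
    return None
```

Only the lookup call does anything here, which mirrors the current state: as long as the other two calls stay unimplemented, nothing changes on the wire.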
 


-- 
Andrea C. Mennucci,   Scuola Normale Superiore, Pisa, Italy
? modules
? zlib/dummy
Index: Makefile.in
===
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
<   lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
>   lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include <modules/modules.h>
> #endif
>

Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread Jacob Kuntz
tom rothamel is working on a project called debdiff that works towards the
same goal. please read his announcement thread, which is archived at
http://www.debian.org/Lists-Archives/debian-devel-0002/msg00391.htm.

i like the idea of rsync modules, but the concept your project misses is that
even a small addition or subtraction in the beginning of a file ruins
rsync's speed bonus because it then has to send everything. take a look at
tom's code. i think you'll find it interesting.

-- 
(jacob kuntz)        [EMAIL PROTECTED] [EMAIL PROTECTED],underworld}.net
(megabite systems)

Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread Jason Gunthorpe

On Thu, 9 Mar 2000, Andrea Mennucc1 wrote:

> rsync contains a wonderful algorithm to speedup downloads when mirroring
> files which have only minor differences;
> only problem is, this algorithm is ALMOST NEVER used
> when mirroring a debian repository

Small detail here: .debs, like .gz files, are basically not rsyncable. gzip
effectively randomizes the contents of the files, making the available
differences very, very small. This is particularly true for .debs when you
add in the fact that gcc never produces binary identical output on
consecutive runs.

Please *do not* run a client with this type of patch connected to any of
our servers. It will send the load sky high for no good reason. rsync is
already responsible for silly amounts of load; do not make it worse.
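
A small experiment makes this easy to check on real files. The sketch below counts how many fixed-size blocks of an old file reappear anywhere in a new one, first on the raw bytes and then on zlib-compressed copies. It is a simplification: it uses plain substring search instead of rsync's rolling checksum, zlib instead of the gzip binary, and the function names and 64-byte block size are arbitrary choices:

```python
import zlib


def reusable_blocks(old: bytes, new: bytes, block_size: int = 64) -> int:
    """Count full blocks of `old` that occur at any offset in `new`."""
    blocks = (old[i:i + block_size]
              for i in range(0, len(old) - block_size + 1, block_size))
    return sum(1 for block in blocks if block in new)


def raw_vs_compressed(old: bytes, new: bytes, block_size: int = 64):
    """Return (matches on raw data, matches on compressed data)."""
    raw = reusable_blocks(old, new, block_size)
    packed = reusable_blocks(zlib.compress(old), zlib.compress(new),
                             block_size)
    return raw, packed
```

Feeding it the unpacked contents of two versions of a package gives a rough idea of how much an rsync-style transfer could reuse before and after compression.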

Jason



Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread David Starner
On Thu, Mar 09, 2000 at 12:26:30PM -0700, Jason Gunthorpe wrote:
> differences very, very small. This is particularly true for .debs when you
> add in the fact that gcc never produces binary identical output on
> consecutive runs.

I'm not arguing the rest of your points, but I'm curious about 
this one. IIRC, the last thing a full bootstrap of GCC does,
after building stage one binaries with the native compiler,
stage two binaries with the stage one binaries and stage
three binaries with the stage two binaries, is compare the
stage two and stage three binaries. If they're not the
same, then you have a problem. I don't see how this fits
with what you're saying.

-- 
David Starner - [EMAIL PROTECTED]
Only a nerd would worry about wrong parentheses with
square brackets. But that's what mathematicians are.
   -- Dr. Burchard, math professor at OSU



Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread Jason Gunthorpe

On Thu, 9 Mar 2000, David Starner wrote:

> I'm not arguing the rest of your points, but I'm curious about
> this one. IIRC, the last thing a full bootstrap of GCC does,
> after building stage one binaries with the native compiler,

Hum, it *used* to do this; I can't seem to get it to do it today though.
Oh well.

IIRC it only applied to debug information; it included timestamps or
some such.

Jason



Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread Marcus Brinkmann
On Thu, Mar 09, 2000 at 12:46:05PM -0700, Jason Gunthorpe wrote:
 
> On Thu, 9 Mar 2000, David Starner wrote:
>
>> I'm not arguing the rest of your points, but I'm curious about
>> this one. IIRC, the last thing a full bootstrap of GCC does,
>> after building stage one binaries with the native compiler,
>
> Hum, It *used* to do this, can't seem to get it to do it today though
> oh well
>
> IIRC it only applied to debug information, it included timestamps or
> some such.

There is a small header at the beginning of an object file which is
different each time, because it contains a time stamp.

This is why 'make compare' removes the first 16 bytes of the object
files before comparing.

for file in *$(objext); do \
  tail +16c ./$$file > tmp-foo1; \
  tail +16c stage$$stage/$$file > tmp-foo2 \
    && (cmp tmp-foo1 tmp-foo2 > /dev/null 2>&1 || echo $$file differs >> .bad_compare) || true; \
done
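
The same comparison can be sketched outside of make. This is only an illustration of "ignore a fixed-size header, then compare the rest"; the function name is made up, and the header size is left as a parameter because the exact offset depends on the object format:

```python
def same_after_header(path_a: str, path_b: str,
                      header_bytes: int = 16) -> bool:
    """Compare two files byte for byte, ignoring a leading header."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        # Skip the part that is expected to carry the time stamp.
        a.seek(header_bytes)
        b.seek(header_bytes)
        return a.read() == b.read()
```

Two object files that differ only in the time stamp then compare as equal, which is what the stage comparison relies on.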

Marcus

-- 
`Rhubarb is no Egyptian god.' Debian http://www.debian.org   Check Key server
Marcus Brinkmann              GNU    http://www.gnu.org      for public PGP Key
[EMAIL PROTECTED], [EMAIL PROTECTED]                         PGP Key ID 36E7CD09
http://homepage.ruhr-uni-bochum.de/Marcus.Brinkmann/   [EMAIL PROTECTED]



Re: better RSYNC mirroring , for .debs and others

2000-03-09 Thread Tom Rothamel
On 9 Mar 2000 12:56:29 -0500, Jacob Kuntz wrote:
> tom rothamel is working on a project called debdiff that works towards the
> same goal. please read his announcement thread, which is archived at
> http://www.debian.org/Lists-Archives/debian-devel-0002/msg00391.htm.

The code associated with this is now available at 
http://onegeek.org/~tom/software/ddiff/, for what it's worth.

I do happen to think that rsync is an inefficient solution to archive
mirroring, however, as it seems it would need to scan and checksum a
huge number of files every time it runs. Probably a better way would
be to have dinstall[1] generate a list of changes it makes to the
archive, and have people mirroring the archive use those lists to
figure out what needs to be downloaded.

This would also have the benefit of making it easy to ensure that
archive mirrors are always in a consistent state. (i.e., Packages.gz is
updated after new packages have been downloaded, but before old
packages are deleted.)
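
A rough sketch of how a mirror might apply such a list, to show the ordering I have in mind. No change-list format exists yet, so the (action, path) tuples and the three callbacks are purely hypothetical:

```python
from typing import Callable, Iterable, Tuple

Change = Tuple[str, str]  # (action, path); action is "add" or "remove"


def apply_changes(changes: Iterable[Change],
                  download: Callable[[str], None],
                  install_index: Callable[[], None],
                  delete: Callable[[str], None]) -> None:
    """Apply one change list so the mirror is consistent at every step."""
    changes = list(changes)
    # 1. Fetch every new file while the old index is still in place.
    for action, path in changes:
        if action == "add":
            download(path)
    # 2. Switch to the new Packages.gz; everything it names now exists.
    install_index()
    # 3. Only now drop files that the new index no longer references.
    for action, path in changes:
        if action == "remove":
            delete(path)
```

At no point does the installed index name a file that is missing from the mirror, and nothing has to be scanned or checksummed to find out what changed.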

[1] At least, I think that's it. I'm not really sure how things work
on the Debian end... I probably won't know for sure until
hell freezes over^W^W^Wnew-maintainer reopens.

-- 
Tom Rothamel - http://onegeek.org/~tom/ -- Using GNU/Linux
The Moon is Waxing Crescent (16% of Full)