Re: better RSYNC mirroring , for .debs and others
>>> I'm not arguing the rest of your points, but I'm curious about this
>>> one. IIRC, the last thing a full bootstrap of GCC does, after building
>>> stage one binaries with the native compiler,
>>
>> Hum, it *used* to do this, can't seem to get it to do it today though,
>> oh well. IIRC it only applied to debug information; it included
>> timestamps or some such.
>
> There is a small header at the beginning of an object file which is
> different each time, because it contains a time stamp. This is why
> 'make compare' removes the first 16 bytes of the object files before
> comparing.
>
>     for file in *$(objext); do \
>       tail +16c ./$$file > tmp-foo1; \
>       tail +16c stage$$stage/$$file > tmp-foo2 \
>         && (cmp tmp-foo1 tmp-foo2 > /dev/null 2>&1 || echo $$file differs >> .bad_compare) || true; \
>     done

Yes, I've seen that stuff fly (or crawl) past while building gcc, but I
don't understand in what circumstances it's required. I can't get gcc to
include timestamps when I try it.

I think the main spurious diff culprit in debs is gzip (and tar z). It
occurred to me that perhaps debs should be built with a hacked/wrapped
gzip to avoid this problem.

I don't know how autobuilders, etc, work, but I can imagine there may be
circumstances where it would be useful to be able to tell immediately
that a package was not changed when it was rebuilt after installing a new
version of some header file (or whatever) that might, potentially, have
caused the package to change. Or is this all properly handled by some
different mechanism that I can read about somewhere?

Edmund
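A minimal sketch of the gzip point above, assuming Python's standard gzip module stands in for the gzip used when building a .deb. The gzip header carries a modification-time field, so two compressions of identical input differ unless that field is pinned. The helper name gzip_bytes and the sample timestamps are made up for illustration; this is not how dpkg builds packages.

```python
import gzip
import io


def gzip_bytes(data: bytes, mtime: int) -> bytes:
    """Compress data into a gzip stream with an explicit header timestamp."""
    buf = io.BytesIO()
    # An empty filename keeps the optional original-name field out of the header.
    with gzip.GzipFile(filename="", fileobj=buf, mode="wb", mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"same package contents\n" * 100

# Two "builds" one minute apart: the input is identical, the header is not.
build_a = gzip_bytes(payload, mtime=952600000)
build_b = gzip_bytes(payload, mtime=952600060)
print("byte-identical across builds:", build_a == build_b)

# Pinning the timestamp (what a wrapped gzip could do) makes rebuilds match.
pinned_a = gzip_bytes(payload, mtime=0)
pinned_b = gzip_bytes(payload, mtime=0)
print("byte-identical with pinned mtime:", pinned_a == pinned_b)
```

On the command line, gzip's -n option similarly leaves the name and timestamp out of the header.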
Re: better RSYNC mirroring , for .debs and others
you're quite right. why are we using rsync anyway? in its current state
it's a waste of resources except for block-oriented files like cd images.
wouldn't it make more sense to use something like mirror or wget until
debdiff matures? are mirror admins required to use rsync?

another thought: would it be possible to implement an alternative
Packages.gz that works more like a database? ie, fixed length fields,
etc? that would make rsync noticeably more effective.

Tom Rothamel ([EMAIL PROTECTED]) wrote:
> I do happen to think that rsync is an inefficient solution to archive
> mirroring, however, as it seems it would need to scan and checksum a
> huge number of files every time it runs. Probably a better way would be
> to have dinstall[1] generate a list of changes it makes to the archive,
> and have people mirroring the archive use those lists to figure out
> what needs to be downloaded. This would also have the benefit of making
> it easy to ensure that archive mirrors are always in a consistent
> state. (ie, Packages.gz is updated after new packages have been
> downloaded, but before old packages are deleted.)

--
(jacob kuntz) [EMAIL PROTECTED] [EMAIL PROTECTED],underworld}.net
(megabite systems)
think free speech, not free beer.
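A rough sketch of the fixed-length-field idea, to show why an update would stay local to one record. The field names and widths (FIELDS), pack_record and changed_records are all hypothetical; no such Packages format exists in the thread, and the index would have to be stored uncompressed for this to help.

```python
from typing import Dict, List

# Hypothetical layout: every record occupies exactly RECORD_SIZE bytes.
FIELDS = (("package", 32), ("version", 24), ("md5sum", 32))
RECORD_SIZE = sum(width for _, width in FIELDS)


def pack_record(record: Dict[str, str]) -> bytes:
    """Serialize one record, padding each field with spaces to its fixed width."""
    out = b""
    for name, width in FIELDS:
        value = record[name].encode("ascii")
        if len(value) > width:
            raise ValueError("field %s longer than %d bytes" % (name, width))
        out += value.ljust(width, b" ")
    return out


def changed_records(old: bytes, new: bytes) -> List[int]:
    """Return indexes of records whose bytes differ between two index files."""
    count = max(len(old), len(new)) // RECORD_SIZE
    return [
        i for i in range(count)
        if old[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
        != new[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
    ]


entries = [
    {"package": "apt", "version": "0.3.18", "md5sum": "0" * 32},
    {"package": "dpkg", "version": "1.6.9", "md5sum": "1" * 32},
    {"package": "rsync", "version": "2.4.1-1", "md5sum": "2" * 32},
]
old_index = b"".join(pack_record(e) for e in entries)

# A longer version string does not move any other record.
entries[1] = {"package": "dpkg", "version": "1.6.10", "md5sum": "3" * 32}
new_index = b"".join(pack_record(e) for e in entries)

print("records that changed:", changed_records(old_index, new_index))
```

Only the dpkg record differs here, so a block-oriented transfer would have one small region to resend instead of everything after the edit.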
Re: better RSYNC mirroring , for .debs and others
On Fri, 10 Mar 2000, Jacob Kuntz wrote:

> wouldn't it make more sense to use something like mirror or wget until
> debdiff matures? are mirror admins required to use rsync?

Sadly rsync is far, far better than mirror or wget, both of which are
verging on useless for an archive of our size. We use rsync not for its
ability to do binary file diffs, but because it largely works.

Sadly my project to get a real mirroring system written is on hold
(alas).

Jason
better RSYNC mirroring , for .debs and others
hi everybody

I have implemented a good idea for reducing download stress for
everybody who is mirroring a lot of data using rsync, like the people
who are mirroring Debian GNU/Linux.

Currently, many Debian leaf mirrors are using rsync for mirroring from
the main .debian.org hosts.

rsync contains a wonderful algorithm to speed up downloads when
mirroring files which have only minor differences; the only problem is,
this algorithm is ALMOST NEVER used when mirroring a debian repository.
Indeed, whenever a new version of a package is entered in the debian
repository, this package has a different name: for this reason rsync
does just a full download.

Summarizing, rsync currently does some speedup only when it downloads
Packages.gz files, or when it skips an already existing package.

Well, I have just implemented a simple way to use the algorithm even
when downloading the .debs.

Here is a simple example. Suppose the current situation is
    $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
whereas locally we have
    /debian/dist/bin/dpkg_1.deb
When rsync looks for a local version of
    /debian/dist/bin/dpkg_2.deb
and there is none, then rsync does
    ls -t /debian/dist/bin/dpkg_*
and looks for the most recent file it finds. This way, rsync will use
the file /debian/dist/bin/dpkg_1.deb to try to speed up the download of
$REMOTE::/pub/debian/dist/bin/dpkg_2.deb (using its fabulous algorithm).

BIG PRO: my new rsync is totally compatible with the old one.

Conclusion: this idea would make all debian mirror-people happier
(especially if they mirror unstable; consider that, often, when a new
version of a package is released, only small changes are made...
sometimes only the .postinst, or such, is really changed; this may,
though, be masked by the compression, alas: but see TODO).

I attach two files: the first file is a diff, showing where, in the
rsync 2.4.1 source code tree, I have made some modifications; the second
is a .tgz of all the new and modified files you need to build the new
rsync. To build, first you need to download the source code (see
rsync.samba.org/rsync/download.html), then you unpack the file
rsync.diffsrc.tgz in the source tree, and build.

You may also get the compiled binary directly as
ftp://tonelli.sns.it/pub/rsync/rsync
and the new code altogether in
ftp://tonelli.sns.it/pub/rsync

TODO: there are some potentially good ideas here:

1) The idea is to add modules to rsync: a gzip module, a deb module, an
   rpm module...; currently, modules just look for an older local
   version of the file. In a future version, any module would apply to a
   certain type of file, and create another file to pass to rsync, so
   that this other file may lead to more speedup: e.g., the gzip module
   would unzip files before doing comparisons, and the deb module would
   unzip the data.tar.gz part of a package.

   CONS: this would not be backward compatible, of course.

   The idea is, a module may provide the following calls:
       find_alternative_version_MOD()
       receive_file_MOD()
       send_file_MOD()

   Currently, only find_alternative_version_deb() is implemented. If
   rsync uses only the find_alternative_version_MOD() calls, then it is
   backward compatible with the usual version (in a sense, it is doing
   what the option --compare-dest already does, only in a smarter way).

   I have not yet implemented any receive_file_MOD() or send_file_MOD():
   these would need a change in the protocol. I hope that the rsync
   authors will give permission.

1b) My idea (not sure) is that rsync may work if provided with named
   pipes instead of files: indeed, according to the technical report, it
   needs to read the local and remote files only once, and then it
   writes the local file without ever seeking backwards; then the above
   modules would not need to actually use disk space and create
   temporary files.

2) For faster apt-get downloading, it may be possible to do the same
   trick WHEN UPGRADING INSTALLED PACKAGES! Here is the idea: apt-get
   creates a local version of the package (using dpkg-repack) and does
   the rsync to get the remote version.

--
Andrea C. Mennucci, Scuola Normale Superiore, Pisa, Italy

? modules
? zlib/dummy
Index: Makefile.in
===================================================================
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
<       lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
>       lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===================================================================
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
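The lookup described in the example (fall back to the newest local dpkg_* file when dpkg_2.deb is missing) can be sketched as follows. This is an illustration in Python, not the C code from the patch; the function name only echoes find_alternative_version_deb(), and the name_version file naming is the assumption taken from the example paths.

```python
import glob
import os
import tempfile
from typing import Optional


def find_alternative_version(wanted_path: str) -> Optional[str]:
    """Pick a local basis file for wanted_path, e.g. dpkg_1.deb for dpkg_2.deb."""
    if os.path.exists(wanted_path):
        return wanted_path  # the exact file is already present
    directory, filename = os.path.split(wanted_path)
    if "_" not in filename:
        return None  # not a name_version style file name
    package = filename.split("_", 1)[0]
    pattern = os.path.join(glob.escape(directory), glob.escape(package) + "_*")
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    # Same choice as the first entry of `ls -t`: the newest mtime wins.
    return max(candidates, key=os.path.getmtime)


with tempfile.TemporaryDirectory() as mirror:
    for name, mtime in (("dpkg_0.deb", 1000), ("dpkg_1.deb", 2000), ("apt_1.deb", 3000)):
        path = os.path.join(mirror, name)
        with open(path, "wb") as handle:
            handle.write(b"placeholder")
        os.utime(path, (mtime, mtime))
    basis = find_alternative_version(os.path.join(mirror, "dpkg_2.deb"))
    print("basis file for dpkg_2.deb:", os.path.basename(basis))
```

Choosing by modification time mirrors the `ls -t` step; comparing Debian version strings instead would be a different design and is not what the post describes.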
Re: better RSYNC mirroring , for .debs and others
tom rothamel is working on a project called debdiff that works towards
the same goal. please read his announcement thread, which is archived at
http://www.debian.org/Lists-Archives/debian-devel-0002/msg00391.htm.

i like the idea of rsync modules, but the concept your project misses is
that even a small addition or subtraction in the beginning of a file
ruins rsync's speed bonus because it then has to send everything. take a
look at tom's code. i think you'll find it interesting.

Andrea Mennucc1 ([EMAIL PROTECTED]) wrote:
> hi everybody
>
> I have implemented a good idea for reducing download stress for
> everybody who is mirroring a lot of data using rsync, like, the people
> who are mirroring Debian GNU/Linux
> [...]

--
(jacob kuntz) [EMAIL PROTECTED] [EMAIL PROTECTED],underworld}.net
(megabite systems)
Re: better RSYNC mirroring , for .debs and others
On Thu, 9 Mar 2000, Andrea Mennucc1 wrote:

> rsync contains a wonderful algorithm to speedup downloads when
> mirroring files which have only minor differences; only problem is,
> this algorithm is ALMOST NEVER used when mirroring a debian repository

Small detail here: .debs, like .gz files, are basically not rsyncable.
gzip effectively randomizes the contents of the files, making the
available differences very, very small. This is particularly true for
.debs when you add in the fact that gcc never produces binary identical
output on consecutive runs.

Please *do not* run a client with this type of patch connected to any of
our servers. It will send the load sky high for no good reason; rsync is
already responsible for silly amounts of load, do not make it worse.

Jason
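A small experiment for the compression point above, under stated assumptions: zlib's deflate stands in for gzip, the index-like test data is synthetic, and reusable_fraction is a simplified stand-in for rsync's block matching (exact substring search instead of rolling and strong checksums). It measures how many fixed-size blocks of the old file can still be found anywhere in the new file, before and after compression.

```python
import hashlib
import zlib


def reusable_fraction(old: bytes, new: bytes, block_size: int = 256) -> float:
    """Fraction of fixed-size blocks of `old` found at any byte offset in `new`."""
    starts = range(0, len(old) - block_size + 1, block_size)
    blocks = [old[i:i + block_size] for i in starts]
    if not blocks:
        return 0.0
    return sum(1 for block in blocks if block in new) / len(blocks)


def make_index(count: int) -> bytes:
    """Build synthetic, Packages-like text with one stanza per package."""
    records = []
    for i in range(count):
        digest = hashlib.sha256(str(i).encode("ascii")).hexdigest()[:32]
        records.append("Package: pkg%05d\nChecksum: %s\n\n" % (i, digest))
    return "".join(records).encode("ascii")


old_plain = make_index(1000)
cut = old_plain.index(b"\n\n") + 2  # end of the first stanza
# One short line inserted near the start; everything after it is unchanged.
new_plain = old_plain[:cut] + b"Priority: optional\n" + old_plain[cut:]

old_packed = zlib.compress(old_plain, 9)
new_packed = zlib.compress(new_plain, 9)

print("reusable blocks, uncompressed: %.3f" % reusable_fraction(old_plain, new_plain))
print("reusable blocks, compressed:   %.3f" % reusable_fraction(old_packed, new_packed))
```

The uncompressed pair keeps every block except the one containing the insertion. How much of the compressed pair survives depends on the data and the compressor settings, which is why no figure is quoted here.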
Re: better RSYNC mirroring , for .debs and others
On Thu, Mar 09, 2000 at 12:26:30PM -0700, Jason Gunthorpe wrote:

> differences very, very small. This is particularly true for .debs when
> you add in the fact that gcc never produces binary identical output on
> consecutive runs.

I'm not arguing the rest of your points, but I'm curious about this one.
IIRC, the last thing a full bootstrap of GCC does, after building stage
one binaries with the native compiler, stage two binaries with the stage
one binaries and stage three binaries with the stage two binaries, is
compare the stage two and stage three binaries. If they're not the same,
then you have a problem. I don't see how this fits with what you're
saying.

--
David Starner - [EMAIL PROTECTED]
Only a nerd would worry about wrong parentheses with square brackets.
But that's what mathematicians are.
  -- Dr. Burchard, math professor at OSU
Re: better RSYNC mirroring , for .debs and others
On Thu, 9 Mar 2000, David Starner wrote:

> I'm not arguing the rest of your points, but I'm curious about this
> one. IIRC, the last thing a full bootstrap of GCC does, after building
> stage one binaries with the native compiler,

Hum, it *used* to do this, can't seem to get it to do it today though,
oh well. IIRC it only applied to debug information; it included
timestamps or some such.

Jason
Re: better RSYNC mirroring , for .debs and others
On Thu, Mar 09, 2000 at 12:46:05PM -0700, Jason Gunthorpe wrote:
> On Thu, 9 Mar 2000, David Starner wrote:
>
>> I'm not arguing the rest of your points, but I'm curious about this
>> one. IIRC, the last thing a full bootstrap of GCC does, after building
>> stage one binaries with the native compiler,
>
> Hum, it *used* to do this, can't seem to get it to do it today though,
> oh well. IIRC it only applied to debug information; it included
> timestamps or some such.

There is a small header at the beginning of an object file which is
different each time, because it contains a time stamp. This is why
'make compare' removes the first 16 bytes of the object files before
comparing.

    for file in *$(objext); do \
      tail +16c ./$$file > tmp-foo1; \
      tail +16c stage$$stage/$$file > tmp-foo2 \
        && (cmp tmp-foo1 tmp-foo2 > /dev/null 2>&1 || echo $$file differs >> .bad_compare) || true; \
    done

Marcus

--
`Rhubarb is no Egyptian god.' Debian http://www.debian.org  Check Key server
Marcus Brinkmann              GNU    http://www.gnu.org     for public PGP Key
[EMAIL PROTECTED], [EMAIL PROTECTED]                        PGP Key ID 36E7CD09
http://homepage.ruhr-uni-bochum.de/Marcus.Brinkmann/        [EMAIL PROTECTED]
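The same comparison can be sketched outside of make. This is a hedged Python equivalent of the idea, not the GCC build logic: same_after_header is a made-up helper, and its 16-byte default simply follows the figure given above, so treat the exact offset as an assumption.

```python
import os
import tempfile


def same_after_header(path_a: str, path_b: str, header_len: int = 16) -> bool:
    """Compare two files byte for byte, ignoring the first header_len bytes."""
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        file_a.seek(header_len)
        file_b.seek(header_len)
        return file_a.read() == file_b.read()


with tempfile.TemporaryDirectory() as workdir:
    stage2 = os.path.join(workdir, "stage2.o")
    stage3 = os.path.join(workdir, "stage3.o")
    body = b"identical object code" * 10
    # Fake object files: a 16-byte header that differs, then the same body.
    with open(stage2, "wb") as handle:
        handle.write(b"TIMESTAMP-000001" + body)
    with open(stage3, "wb") as handle:
        handle.write(b"TIMESTAMP-000002" + body)
    print("equal including header:", same_after_header(stage2, stage3, header_len=0))
    print("equal after header:    ", same_after_header(stage2, stage3))
```

Reading both files whole is fine for object files; for very large inputs a chunked comparison would be the safer choice.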
Re: better RSYNC mirroring , for .debs and others
On 9 Mar 2000 12:56:29 -0500, Jacob Kuntz wrote:

> tom rothamel is working on a project called debdiff that works towards
> the same goal. please read his announcement thread, which is archived
> at http://www.debian.org/Lists-Archives/debian-devel-0002/msg00391.htm.

The code associated with this is now available at
http://onegeek.org/~tom/software/ddiff/, for what it's worth.

I do happen to think that rsync is an inefficient solution to archive
mirroring, however, as it seems it would need to scan and checksum a
huge number of files every time it runs. Probably a better way would be
to have dinstall[1] generate a list of changes it makes to the archive,
and have people mirroring the archive use those lists to figure out what
needs to be downloaded. This would also have the benefit of making it
easy to ensure that archive mirrors are always in a consistent state.
(ie, Packages.gz is updated after new packages have been downloaded, but
before old packages are deleted.)

[1] At least, I think that's it. I'm not really sure how things work on
the Debian end... I probably won't know for sure until hell freezes
over^W^W^Wnew-maintainer reopens.

--
Tom Rothamel - http://onegeek.org/~tom/ -- Using GNU/Linux
The Moon is Waxing Crescent (16% of Full)
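The ordering described above (new packages first, then Packages.gz, then deletions) can be sketched as a small planner. The change-list format of ("add" | "del", path) pairs, the INDEX_FILES tuple and plan_mirror_update are all hypothetical; dinstall produces no such list in the thread, so this only illustrates the consistency rule.

```python
from typing import Iterable, List, Tuple

# Assumed names of index files that must only change once packages are in place.
INDEX_FILES = ("Packages", "Packages.gz")


def plan_mirror_update(changes: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order a change list so the mirror stays consistent at every step."""
    fetch_packages, fetch_indexes, deletions = [], [], []
    for action, path in changes:
        if action == "add":
            is_index = path.rsplit("/", 1)[-1] in INDEX_FILES
            target = fetch_indexes if is_index else fetch_packages
            target.append(("fetch", path))
        elif action == "del":
            deletions.append(("delete", path))
        else:
            raise ValueError("unknown action: %r" % (action,))
    # New packages, then the index that refers to them, then old packages.
    return fetch_packages + fetch_indexes + deletions


change_list = [
    ("del", "dists/unstable/main/binary-i386/dpkg_1.deb"),
    ("add", "dists/unstable/main/binary-i386/Packages.gz"),
    ("add", "dists/unstable/main/binary-i386/dpkg_2.deb"),
]
steps = plan_mirror_update(change_list)
for step in steps:
    print(step)
```

With this order, the index never names a package the mirror has not yet fetched, and an old package is removed only after the index has stopped referring to it.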