Hi everybody,

I have implemented
an idea for reducing the download load on everybody who
mirrors a lot of data using rsync,
such as the people who mirror Debian GNU/Linux.
Currently, many Debian "leaf mirrors" use rsync
to mirror from the main .debian.org hosts.

rsync contains a wonderful algorithm that speeds up downloads when mirroring
files which differ only slightly from a copy that already exists locally.
The only problem is that this algorithm is ALMOST NEVER used
when mirroring a Debian repository:
whenever a new version of a
package enters the Debian repository,
the package file has a different name, and for this reason rsync just does a
full download.
In summary, rsync currently saves transfer only
when it downloads Packages.gz files, or when it skips a package that
already exists locally.

Well, I have just implemented a simple
way to use the algorithm even when downloading the .debs.

Here is a simple example.

Suppose the remote host currently has
    $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
whereas locally we have
    /debian/dist/bin/dpkg_1.deb

When rsync looks for a local version of
    /debian/dist/bin/dpkg_2.deb
and there is none, rsync does
  ls -t     /debian/dist/bin/dpkg_*
and picks the most recent file it finds.

This way, rsync will use the file /debian/dist/bin/dpkg_1.deb
to try to speed up the download of $REMOTE::/pub/debian/dist/bin/dpkg_2.deb
(using its fabulous algorithm)
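
The lookup in the example above can be sketched in C roughly as follows.
This is only an illustration and not the code from the attached modules/deb.c:
the function name newest_sibling_version(), the PATH_BUF limit, and the rule
"everything up to the first underscore is the package name" are assumptions
made for the sketch. Like find_alternative_version() in the patch, it returns
a malloc'ed path that the caller must free, or NULL.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <time.h>

enum { PATH_BUF = 4096 };  /* illustrative path-length limit (assumption) */

/* Hypothetical helper: return a malloc'ed path to the most recently
   modified regular file that lives in the same directory as `fname` and
   shares its "<package>_" prefix, or NULL if there is none. */
char *newest_sibling_version(const char *fname)
{
    assert(fname != NULL);

    const char *slash = strrchr(fname, '/');
    const char *base = slash ? slash + 1 : fname;
    const char *underscore = strchr(base, '_');
    if (underscore == NULL)
        return NULL;               /* not a "<package>_<version>" name */
    size_t prefix_len = (size_t)(underscore - base) + 1;  /* keep the '_' */

    /* Split off the directory part of fname. */
    char dir[PATH_BUF];
    size_t dir_len = slash ? (size_t)(slash - fname) : 0;
    if (dir_len >= sizeof dir)
        return NULL;
    if (slash == NULL)
        strcpy(dir, ".");
    else if (dir_len == 0)
        strcpy(dir, "/");
    else {
        memcpy(dir, fname, dir_len);
        dir[dir_len] = '\0';
    }

    DIR *d = opendir(dir);
    if (d == NULL)
        return NULL;

    char best[PATH_BUF] = "";
    time_t best_mtime = 0;
    int found = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        char path[PATH_BUF];
        struct stat st;
        if (strncmp(e->d_name, base, prefix_len) != 0)
            continue;              /* different package */
        if (strcmp(e->d_name, base) == 0)
            continue;              /* the wanted file itself */
        int n = snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;              /* path too long for the buffer */
        if (stat(path, &st) != 0 || !S_ISREG(st.st_mode))
            continue;              /* only regular files qualify */
        if (!found || st.st_mtime > best_mtime) {
            found = 1;
            best_mtime = st.st_mtime;
            strcpy(best, path);
        }
    }
    closedir(d);

    if (!found)
        return NULL;
    char *result = malloc(strlen(best) + 1);
    if (result != NULL)
        strcpy(result, best);
    return result;
}
```

With only /debian/dist/bin/dpkg_1.deb present locally, calling
newest_sibling_version("/debian/dist/bin/dpkg_2.deb") would return that path,
and a name without an underscore would give NULL.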

BIG PRO: my new "rsync" is fully compatible with the old one, since it
changes only which local file is used for comparison, not the protocol.

Conclusion:
this idea would make all Debian mirror maintainers happier,
especially if they mirror "unstable". Consider that often,
when a new version of a package is released, only small changes are made;
sometimes only the .postinst, or some such file, really changes.
(This may, though, be masked by the compression, alas; but see TODO.)

I attach two files. The first is a diff showing where, in
the "rsync 2.4.1" source tree, I have made modifications.
The second is a .tgz of all the new and modified files you
need to build the new rsync.
To build, first download
the source code (see rsync.samba.org/rsync/download.html),
then unpack the file rsync.diffsrc.tgz in the source tree,
and build.

You may also get the compiled binary directly at
 ftp://tonelli.sns.it/pub/rsync/rsync
and all the new code together in
 ftp://tonelli.sns.it/pub/rsync

TODO:
there are some potentially good ideas here:

1) the idea is to add "modules" to rsync:
  a "gzip" module, a "deb" module, an "rpm" module...;
  currently, modules just look for an older local version of the file.

  In a future version, each module would
  apply to a certain type of file, and create
  another file to pass to "rsync",
  such that this other file is likely to give more speedup:
  e.g., the "gzip" module would unzip files before doing comparisons,
  and the "deb" module would unzip the data.tar.gz part of a package.
 CONS: this would not be backward compatible, of course
  
  The idea is that a module may provide the following calls:
   find_alternative_version_MOD()
   receive_file_MOD()
   send_file_MOD()

 Currently, only find_alternative_version_deb() is implemented.

 If rsync uses only the find_alternative_version_MOD()
 calls, then it is "backward compatible" with the usual version
 (in a sense, it is doing what the option --compare-dest already does,
  only in a smarter way).

 I have not yet implemented any receive_file_MOD() or
 send_file_MOD(): these would need a change in the protocol.
 I hope that the rsync authors will give permission.
1b) My idea (I am not sure) is that "rsync" may work if given "named pipes"
 instead of files: indeed, according to the technical report,
 it needs to read the local and remote files only once,
 and then it writes the local file without ever seeking backwards.
 If so, the above modules would not need to actually
 use disk space and create temporary files.
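
 To make the named-pipe idea concrete, here is a small C sketch of a
 producer process feeding data to a reader through a FIFO, so that nothing
 is stored in a temporary file. It does not show that rsync itself works on
 FIFOs (that remains untested); the function read_through_fifo() and its
 parameters are hypothetical, and the child process merely stands in for
 something like a decompressor.

```c
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: create a FIFO at `path`, fork a child that writes
   `data` into it, and read it back strictly sequentially in the parent.
   Returns the number of bytes read into `out`, or -1 on error. */
long read_through_fifo(const char *path, const char *data,
                       char *out, size_t out_size)
{
    assert(path != NULL && data != NULL && out != NULL);

    unlink(path);                       /* drop any stale FIFO */
    if (mkfifo(path, 0600) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* Child: the "module" producing data, e.g. an unzipped stream. */
        int wfd = open(path, O_WRONLY);
        if (wfd < 0)
            _exit(1);
        size_t len = strlen(data), off = 0;
        while (off < len) {
            ssize_t w = write(wfd, data + off, len - off);
            if (w <= 0)
                _exit(1);
            off += (size_t)w;
        }
        close(wfd);
        _exit(0);
    }

    /* Parent: the consumer; it only reads forward and never seeks. */
    long total = -1;
    int rfd = open(path, O_RDONLY);
    if (rfd >= 0) {
        ssize_t n;
        total = 0;
        while ((size_t)total < out_size &&
               (n = read(rfd, out + total, out_size - (size_t)total)) > 0)
            total += n;
        close(rfd);
    } else {
        kill(pid, SIGKILL);             /* child would block in open() */
    }
    waitpid(pid, NULL, 0);
    unlink(path);
    return total;
}
```

 The only disk object is the FIFO entry itself; the data passes through the
 kernel's pipe buffer.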


2) for faster apt-get downloading,
 it may be possible to do the same trick WHEN UPGRADING
 INSTALLED PACKAGES!  Here is the idea:
  "apt-get creates a local version of the package
  (using dpkg-repack)
  and runs rsync against it to get the remote version"
 


-- 
Andrea C. Mennucci,   Scuola Normale Superiore, Pisa, Italy
? modules
? zlib/dummy
Index: Makefile.in
===================================================================
RCS file: /cvsroot/rsync/Makefile.in,v
retrieving revision 1.39
diff -r1.39 Makefile.in
24c24
<       lib/fnmatch.h lib/getopt.h lib/mdfour.h
---
>       lib/fnmatch.h lib/getopt.h lib/mdfour.h modules/modules.h
32c32,33
< OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ)
---
> MODULES_OBJ = modules/modules.o modules/deb.o
> OBJS=$(OBJS1) $(OBJS2) $(DAEMON_OBJ) $(LIBOBJ) $(ZLIBOBJ) $(MODULES_OBJ)
Index: generator.c
===================================================================
RCS file: /cvsroot/rsync/generator.c,v
retrieving revision 1.16
diff -r1.16 generator.c
19a20,23
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
> 
311c315,349
<                       fnamecmp = fnamecmpbuf;
---
>                 {
>                   fnamecmp = fnamecmpbuf;
>                   if (verbose > 1)
>                     rprintf(FINFO,"recv_generator  opens %s\n",fnamecmp);
>                 }
>       }
> #ifndef NODEBIANVERSIONER
>       /* by A Mennucci. GPL
>          this piece will look for a previous version 
>          of the same file
>       I think that rsync is somewhat a "spaghetti code":
>       look at how many extern declarations it uses....
>       and it is crazy that this check has to be done in two separate places
>       */
>       if (statret == -1) {
>         char *nf;
>         int saveerrno = errno;
>         nf=find_alternative_version(fname);
>         if ( nf != NULL)
>           {
>             statret = link_stat(nf,&st);
>             if (!S_ISREG(st.st_mode))
>               statret = -1;
>             if (statret == -1)
>               {
>                 perror("stat of suggested older version failed:");
>                 errno = saveerrno;
>               }
>             else
>               {
>                 fnamecmp = fnamecmpbuf;
>                 strcpy(fnamecmp, nf);
>               }
>             free (nf);
>           }
312a351
> #endif
Index: receiver.c
===================================================================
RCS file: /cvsroot/rsync/receiver.c,v
retrieving revision 1.28
diff -r1.28 receiver.c
18a19,21
> #ifndef NODEBIANVERSIONER
> #include "modules/modules.h"
> #endif
21a25
> 
375a380,401
> #ifndef NODEBIANVERSIONER
>               /* by A Mennucci.
>                  this piece will look for a previous version 
>                  of the same file */
>               if ((fd1 == -1)) {
>                 char *nf;
>                 nf=find_alternative_version(fname);
>                 if (nf!= NULL)
>                   {
>                     fnamecmp = fnamecmpbuf;
>                     strcpy(fnamecmpbuf,nf);
>                     fd1 = do_open(nf, O_RDONLY, 0);
>                     if(fd1==-1) 
>                       perror("file candidate");
>                     free(nf);
>                   }
>               }
>               if (fd1 != -1 )
>                 rprintf(FINFO,
>                         "((candidate local oldfile for %s is %s))\n",
>                         fname,fnamecmp);
> #endif

Attachment: rsync.diffsrc.tgz
Description: GNU Unix tar archive
