Re: A success story with apt and rsync

2003-07-31 Thread Otto Wyss
 From time to time the question arises on different forums whether it is
 possible to efficiently use rsync with apt-get. Recently there has been a
 thread here on debian-devel and it was also mentioned in Debian Weekly News
 June 24th, 2003. However, I only saw different small parts of a huge and
 complex problem set discussed in different places; I haven't found an
 overview of the whole situation anywhere.
 
Sorry that I write so late but I'm not reading debian-devel regularly.

I started on a solution for distributing Debian mirrors by rsync about 2
years ago. The only impact (if impact is the right word) of my
solution on Debian is the use of the rsync patch for gzip. Everything
else is solved by my Perl script, so you might find ideas for your apt
solution there. See http://dpartialmirror.sourceforge.net/.

O. Wyss

-- 
See http://wxguide.sourceforge.net/ for ideas on how to design your app.




Re: A success story with apt and rsync

2003-07-07 Thread Michael Karcher
On Sun, Jul 06, 2003 at 01:29:06AM +0200, Andrew Suffield wrote:
 It should put them in the package in the order they came from
 readdir(), which will depend on the filesystem. This is normally the
 order in which they were created,
As long as the file system uses an inefficient approach for directories like
the ext2/ext3 linked lists. If directories are hash tables (like on
reiserfs) even creating another file in the same directory may totally mess
up the order.

Michael Karcher




Re: A success story with apt and rsync

2003-07-07 Thread Goswin Brederlow
Michael Karcher [EMAIL PROTECTED] writes:

 On Sun, Jul 06, 2003 at 01:29:06AM +0200, Andrew Suffield wrote:
  It should put them in the package in the order they came from
  readdir(), which will depend on the filesystem. This is normally the
  order in which they were created,
 As long as the file system uses an inefficient approach for directories like
 the ext2/ext3 linked lists. If directories are hash tables (like on
 reiserfs) even creating another file in the same directory may totally mess
 up the order.
 
 Michael Karcher

ext2/ext3 has hashed dirs too if you configure it.

MfG
Goswin




Re: A success story with apt and rsync

2003-07-06 Thread Jonathan Oxer
On Sun, 2003-07-06 at 09:27, Goswin Brederlow wrote:
 4. (and this is the knockout) rsync support for apt-get is NOT
 WANTED. rsync uses too many resources (cpu and, more importantly, IO) on
 the server side, and widespread use of rsync for apt-get would choke
 the rsync mirrors and do more harm than good.

One way to alleviate this would be to generate the deltas on the server
side only once, when first requested, then cache them on disk to be
served out like any other static file; the client would then reconstruct
the new package using rsync.

I've been thinking for a while about trying to build this into
Apt-cacher.

Jonathan




Re: A success story with apt and rsync

2003-07-06 Thread Martijn van Oosterhout
On Sun, Jul 06, 2003 at 12:37:00PM +1200, Corrin Lakeland wrote:
  4. (and this is the knockout) rsync support for apt-get is NOT
  WANTED. rsync uses too many resources (cpu and, more importantly, IO) on
  the server side, and widespread use of rsync for apt-get would choke
  the rsync mirrors and do more harm than good.
 
 When I was looking into this I heard about some work on caching the
 rolling checksums to eliminate server load. I didn't find any code.

That would be because the cached checksums would take at least 8 times
the space of the original files: to spare the server the rolling-checksum
pass, you would have to store a checksum for every byte offset of every
file. You need the backward rsync approach, which was patented last I
heard.

-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 the West won the world not by the superiority of its ideas or values or
 religion but rather by its superiority in applying organized violence.
 Westerners often forget this fact, non-Westerners never do.
   - Samuel P. Huntington




Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont

On Sun, 6 Jul 2003, Andrew Suffield wrote:

 It should put them in the package in the order they came from
 readdir(), which will depend on the filesystem. This is normally the
 order in which they were created, and should not vary when
 rebuilding. As such, sorting the list probably doesn't change the
 network traffic, but will slow dpkg-deb down on packages with large
 directories in them.

Yes, when saying "random order" I obviously meant "in the order
readdir() returns them". It's random for me.  :-)))

It can easily be different on different filesystems, or even on the same
type of filesystem with different parameters (e.g. blocksize).

I even think it can be different after a simple rebuild in exactly the
same environment. For example, configure and libtool like to create files
with the PID in their name, which can be from 3 to 5 digits long. If you
create file X and then Y, remove X, and then create Z, then most likely
if Z's name is no longer than X's, it will be returned first by
readdir(), while if its name is longer, Y will be returned first and Z
afterwards. So I can imagine situations where the order of the files
depends on the PIDs of the build processes.

However, I guess our goal is not only to produce similar packages from
exactly the same source. It's quite important to produce a similar package
even after a version upgrade. For example, you have a foobar-0.9 package,
and now upgrade to foobar-1.0. The author may have completely rewritten
the Makefile, which yields nearly the same executable and the same data
files, but in a completely different random order.


However, I think sorting the files costs practically nothing. My system is
not a very new one: 375MHz Celeron, IDE disks, 384MB RAM etc. However:

/usr/lib$ du -s .
1,1G    .
/usr/lib$ find . -type f | wc -l    # okay, it's now in memory cache
  18598
/usr/lib$ time find . > /dev/null 2>&1

real    0m0.285s
user    0m0.100s
sys     0m0.150s
[EMAIL PROTECTED]:/usr/lib$ time sortdir find . > /dev/null 2>&1

real    0m1.683s
user    0m1.390s
sys     0m0.250s


IMHO a step which takes one and a half seconds before compressing 18000
files totalling more than 1 gigabyte shouldn't be a problem.




cheers,
Egmont




Re: A success story with apt and rsync

2003-07-06 Thread Andrew Suffield
On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
 
 On Sun, 6 Jul 2003, Andrew Suffield wrote:
 
  It should put them in the package in the order they came from
  readdir(), which will depend on the filesystem. This is normally the
  order in which they were created, and should not vary when
  rebuilding. As such, sorting the list probably doesn't change the
  network traffic, but will slow dpkg-deb down on packages with large
  directories in them.
 
 Yes, when saying "random order" I obviously meant "in the order
 readdir() returns them". It's random for me.  :-)))
 
 It can easily be different on different filesystems, or even on the same
 type of filesystem with different parameters (e.g. blocksize).

I can't think of any reason why changing the blocksize would affect
this. Most filesystems return files in the sequence in which they were
added to the directory. ext2, ext3, and reiser all do this; xfs is the
only one likely to be used on a Debian system which doesn't.

 I even think it can be different after a simple rebuild in exactly the
 same environment. For example, configure and libtool like to create files
 with the PID in their name, which can be from 3 to 5 digits long. If you
 create file X and then Y, remove X, and then create Z, then most likely
 if Z's name is no longer than X's, it will be returned first by
 readdir(), while if its name is longer, Y will be returned first and Z
 afterwards. So I can imagine situations where the order of the files
 depends on the PIDs of the build processes.

This lengthy bit of handwaving has no connection with reality.

 However, I think sorting the files costs practically nothing. My system is
 not a very new one: 375MHz Celeron, IDE disks, 384MB RAM etc. However:
 
 /usr/lib$ du -s .
 1,1G    .
 /usr/lib$ find . -type f | wc -l    # okay, it's now in memory cache
   18598
 /usr/lib$ time find . > /dev/null 2>&1
 
 real    0m0.285s
 user    0m0.100s
 sys     0m0.150s
 [EMAIL PROTECTED]:/usr/lib$ time sortdir find . > /dev/null 2>&1
 
 real    0m1.683s
 user    0m1.390s
 sys     0m0.250s
 
 
 IMHO a step which takes one and a half seconds before compressing 18000
 files totalling more than 1 gigabyte shouldn't be a problem.

This test only shows that you don't understand what is going on; it
has no relation to the problems that can occur.

On ext2, as an example, stat()ting or open()ing the files in a large
directory in the order returned by readdir() will be vastly quicker than
in some other sequence (like, say, bytewise lexicographic) due to the
way in which the filesystem looks up inodes. This has caused
significant performance issues for bugs.debian.org in the past.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'  | Imperial College,
   `- --  | London, UK




Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont
Hi,

On 6 Jul 2003, Goswin Brederlow wrote:

 2. most of the time you have no old file to rsync against. Only
 mirrors will have an old file and they already use rsync.

This is definitely true if you install your system from CDs and then
upgrade it. However, if you keep on upgrading from testing/unstable then
you'll have more and more packages under /var/cache/apt/archives, so there
is a better and better chance that an older version is found there. Or,
alternatively, if you are sitting behind a slow modem and apt-get
upgrade says it will upgrade extremely-huge-package, then you can still
easily insert your CD, copy the old version of extremely-huge-package
to /var/cache/apt/archives, and hit ENTER in apt-get afterwards.

 3. rsyncing against the previous version is only possible via some
 dirty hack as an apt module. apt would have to be changed to give
 modules access to its cache structure, or at least pass any previous
 version as an argument. Some mirror scripts already use older versions
 as templates for new versions.

Yes, this is what I've hacked together based on other people's great work.
It is (as I've said, too) a dirty hack. If a more experienced apt coder
can replace my hard-coded path with a mechanism that tells this path to
the module, then this hack won't even be dirty.

 4. (and this is the knockout) rsync support for apt-get is NOT
 WANTED. rsync uses too many resources (cpu and, more importantly, IO) on
 the server side, and widespread use of rsync for apt-get would choke
 the rsync mirrors and do more harm than good.

It might be not wanted by administrators; however, I guess it is wanted
by many of the users (at least by me :-)). I don't see the huge load on
the server (since I'm the only one rsyncing from it), but I do see the
huge difference in the download time. If my download weren't faster
because of an overloaded server, I would switch back to FTP or whatever
works better for me as an end user.

I understand that rsync causes a high load on the server when several
users are connected, and so it is not suitable as a general replacement
for ftp; however, I think it is suitable as an alternative. I also don't
expect the Debian team itself to set up a public rsync server for the
packages. However, some mirrors might want to set up an rsync server,
either for the public or, for example, for a university's students.

A similar hack could simply be used by people who have an account on a
machine with high bandwidth. For example, if I used Debian and Debian had
rsyncable packages, but no public rsync server was available, I'd
personally mirror Debian to a machine at the university using FTP and
would use rsync from that server to my home machine, to save traffic
where the bandwidth is the bottleneck.

So I don't think it's a bad idea to set up some public rsync servers
worldwide. The maximum number of connections can be capped so that cpu
usage is limited somehow. It's obvious that if a user often gets
"connection refused" then he will switch back to ftp or http. Hence I
guess that the capacity of the public rsync servers and the number of
users using rsync would somehow balance automatically; it doesn't have
to be coordinated centrally. So IMHO let anybody set up an rsync server
if he wants to, and let the users use rsync if they want to (but don't
put an rsync:// line in the default sources.list).


 All together I think an extended bittorrent module for apt-get is by
 far the better solution, but it will take some more time and design
 before it can be implemented.

It is very promising and I really hope that it will be a good protocol
with a good implementation and integration into apt. But until this is
realized, we could still have rsync as an alternative, if Debian packages
were packed in a slightly different way.



bye,

Egmont




Re: A success story with apt and rsync

2003-07-06 Thread Theodore Ts'o
On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote:
 On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
  Yes, when saying "random order" I obviously meant "in the order
  readdir() returns them". It's random for me.  :-)))
  
  It can easily be different on different filesystems, or even on the same
  type of filesystem with different parameters (e.g. blocksize).
 
 I can't think of any reason why changing the blocksize would affect
 this. Most filesystems return files in the sequence in which they were
 added to the directory. ext2, ext3, and reiser all do this; xfs is the
 only one likely to be used on a Debian system which doesn't.

Err, no.  If the htree (hash tree) indexing feature is turned on for
ext2 or ext3 filesystems, they will return entries sorted by the hash
of the filename --- effectively a random order.  (The hash also
includes a random, per-filesystem secret in order to avoid denial of
service attacks by malicious users who might otherwise try to create
huge numbers of files whose names produce hash collisions.)

I would be very, very surprised if reiserfs returned files in creation
order.  The fundamental problem is that the
readdir()/telldir()/seekdir() API is fundamentally busted.  Yes,
Dennis Ritchie and Ken Thompson do make mistakes, and have made many;
in this particular case, they made a whopper.  

Seekdir()/telldir() assumes a linear directory structure which you can
seek into, such that the results of readdir() are repeatable.  Posix
only allows files which are created or deleted in the interval to be
undefined; all other files must be returned in the same order as the
original readdir() stream, even if days or weeks elapse between the
readdir(), telldir(), and seekdir() calls.
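
To make that contract concrete, here is a minimal sketch of the usage
pattern a filesystem has to support (illustration only, not taken from
any particular program):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *d = opendir(".");
    struct dirent *e;
    long pos;

    if (!d)
        return 1;
    readdir(d);              /* consume one entry */
    pos = telldir(d);        /* remember the stream position */

    while (readdir(d))       /* read the rest of the directory */
        ;

    /* ... days or weeks may pass, files may come and go ... */

    seekdir(d, pos);         /* POSIX: the surviving entries must now
                                come back in the original order */
    while ((e = readdir(d)))
        printf("%s\n", e->d_name);

    closedir(d);
    return 0;
}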

Any filesystem which tries to use a B-tree like system, where leaf
nodes can be split, is going to have extreme problems trying to keep
these guarantees.  For this reason, most filesystem designers choose
to return files in b-tree order, and *not* the order in which files
were added to the directory.

It is a really, really bad idea to assume that files will be returned
in the same order as they were created.

 On ext2, as an example, stat()ting or open()ing the files in a large
 directory in the order returned by readdir() will be vastly quicker than
 in some other sequence (like, say, bytewise lexicographic) due to the
 way in which the filesystem looks up inodes. This has caused
 significant performance issues for bugs.debian.org in the past.

If you are using HTREE, and want to do a readdir() scan followed by
something which opens or stats all of the files, you will very badly
want to sort the returned directory entries by inode number
(de->d_inode).  Otherwise, the order returned by readdir() will be
effectively random, with the resulting loss of performance which you
alluded to, because the filesystem needs to randomly seek and read
all around the inode table.
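
As a concrete illustration, a userspace scan along those lines could
look like this (a minimal sketch with most error handling omitted, not
actual code from e2fsprogs or bugs.debian.org):

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent { ino_t ino; char name[256]; };

static int by_ino(const void *a, const void *b)
{
    const struct ent *x = a, *y = b;
    return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(path);
    size_t n = 0, cap = 1024, i;
    struct ent *ents;
    struct dirent *de;
    struct stat st;
    char full[4400];

    if (!d) { perror("opendir"); return 1; }
    ents = malloc(cap * sizeof *ents);

    while ((de = readdir(d))) {
        if (n == cap)
            ents = realloc(ents, (cap *= 2) * sizeof *ents);
        ents[n].ino = de->d_ino;   /* userspace name for de->d_inode */
        snprintf(ents[n].name, sizeof ents[n].name, "%s", de->d_name);
        n++;
    }
    closedir(d);

    /* sort by inode number: one forward sweep over the inode table */
    qsort(ents, n, sizeof *ents, by_ino);

    for (i = 0; i < n; i++) {
        snprintf(full, sizeof full, "%s/%s", path, ents[i].name);
        if (stat(full, &st) == 0)
            printf("%10lu  %s\n", (unsigned long)st.st_ino, ents[i].name);
    }
    free(ents);
    return 0;
}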

Why can't this be done in the kernel?  Because if the directory is 200
megabytes, then the kernel would need to allocate and hold on to 200
megabytes until userspace called closedir().  There is simply no
lightweight way to work around the problems caused by the broken API
which Ken Thompson and Dennis Ritchie designed.

The good news is that this particular optimization of sorting by inode
number should work for all filesystems, and should speed up xfs as
well as ext2/3 with HTREE.

- Ted




Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont

On Sun, 6 Jul 2003, Andrew Suffield wrote:

 On ext2, as an example, stat()ting or open()ing the files in a large
 directory in the order returned by readdir() will be vastly quicker than
 in some other sequence (like, say, bytewise lexicographic) due to the
 way in which the filesystem looks up inodes. This has caused
 significant performance issues for bugs.debian.org in the past.

You're right, I didn't get the point of the story when I simply ran find
using the sortdir wrapper, but now I understand the problem.

However, I'm still unsure whether it is good to keep files unsorted,
especially if we consider effective syncing of packages. On my home
computer I've never heard the sound of my disk during the package-creation
phase (even though we've been using sortdir for more than half a year, and
I've compiled hundreds of packages), but I do hear it when e.g. the source
is decompressed. At the 'dpkg-deb --build' phase only the processor is the
bottleneck.

This might vary under different circumstances. I'm unaware of them in the
case of Debian; e.g. I have no information about what hardware your
packages are built on, whether there are any other cpu-intensive or
disk-intensive applications running on these machines, etc. I can easily
imagine that using sortdir can drastically decrease performance if another
disk-intensive process is running. However, my experience didn't show a
noticeable performance decrease when this was the only process accessing
the disk...

But hey, let's stop for a minute :-) Building the package only uses the
memory cache for most of the packages, doesn't it? The files it packs
together have just recently been created, and there are not many
packages whose uncompressed size is close to or bigger than the amount of
RAM in today's machines...

And for the large packages the build itself might take thousands of times
as long as reading the files in sorted order.

Does anyone know what RPM does? I know that listing the contents of a
package always produces alphabetical order but I don't know whether the
filelist is sorted on the fly or the files really appear alphabetically in
the cpio archive.


So I guess we've already seen the pros and cons of sorting the files. (One
thing is missing: we still don't know how efficient rsync is if two
rsyncable tar.gz files contain the same files but in a different order.)
The decision is clearly not mine but the Debian developers'. However, if
you ask me, I still vote for sorting the files :-))




bye,

Egmont




Re: A success story with apt and rsync

2003-07-06 Thread Andrew Suffield
On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
 On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote:
  On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
   Yes, when saying "random order" I obviously meant "in the order
   readdir() returns them". It's random for me.  :-)))
   
   It can easily be different on different filesystems, or even on the same
   type of filesystem with different parameters (e.g. blocksize).
  
  I can't think of any reason why changing the blocksize would affect
  this. Most filesystems return files in the sequence in which they were
  added to the directory. ext2, ext3, and reiser all do this; xfs is the
  only one likely to be used on a Debian system which doesn't.
 
 Err, no.  If the htree (hash tree) indexing feature is turned on for
 ext2 or ext3 filesystems, they will return entries sorted by the hash
 of the filename --- effectively a random order.  (The hash also
 includes a random, per-filesystem secret in order to avoid denial of
 service attacks by malicious users who might otherwise try to create
 huge numbers of files whose names produce hash collisions.)

I can only presume this is new or obscure, since everything I tried
had the traditional behaviour. Can't see how to turn it on, either.

 I would be very, very surprised if reiserfs returned files in creation
 order.

Some trivial testing indicates that it does. Heck if I know how or why.

 It is really, really bad assumption to assume that files will be
 returned in the same order as they were created.

However, there's no real need to - that was just an example. As long
as the sequence is more or less stable (which it should be, for
btrees; don't know about htree) then rsync won't be perturbed.

  On ext2, as an example, stat()ting or open()ing the files in a large
  directory in the order returned by readdir() will be vastly quicker than
  in some other sequence (like, say, bytewise lexicographic) due to the
  way in which the filesystem looks up inodes. This has caused
  significant performance issues for bugs.debian.org in the past.
 
 If you are using HTREE, and want to do a readdir() scan followed by
 something which opens or stats all of the files, you will very badly
 want to sort the returned directory entries by inode number
 (de->d_inode).  Otherwise, the order returned by readdir() will be
 effectively random, with the resulting loss of performance which you
 alluded to, because the filesystem needs to randomly seek and read
 all around the inode table.

Hmm, that's going to cause some trouble if htree becomes common. Is
there any way to test for this at runtime?

 The good news is that this particular optimization of sorting by inode
 number should work for all filesystems, and should speed up xfs as
 well as ext2/3 with HTREE.

What about ext[23] without htree? Mucking with the order returned by
readdir() has historically caused problems there...

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'  | Imperial College,
   `- --  | London, UK




Re: A success story with apt and rsync

2003-07-06 Thread Matt Zimmerman
On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:

 On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
  Err, no.  If the htree (hash tree) indexing feature is turned on for
  ext2 or ext3 filesystems, they will return entries sorted by the hash
  of the filename --- effectively a random order.  (The hash also
  includes a random, per-filesystem secret in order to avoid denial of
  service attacks by malicious users who might otherwise try to create
  huge numbers of files whose names produce hash collisions.)
 
 I can only presume this is new or obscure, since everything I tried
 had the traditional behaviour. Can't see how to turn it on, either.

I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.

-- 
 - mdz




Re: A success story with apt and rsync

2003-07-06 Thread Andrew Suffield
On Sun, Jul 06, 2003 at 07:28:09PM -0400, Matt Zimmerman wrote:
 On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:
 
  On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
   Err, no.  If the htree (hash tree) indexing feature is turned on for
   ext2 or ext3 filesystems, they will return entries sorted by the hash
   of the filename --- effectively a random order.  (The hash also
   includes a random, per-filesystem secret in order to avoid denial of
   service attacks by malicious users who might otherwise try to create
   huge numbers of files whose names produce hash collisions.)
  
  I can only presume this is new or obscure, since everything I tried
  had the traditional behaviour. Can't see how to turn it on, either.
 
 I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.

My /home has that enabled and readdir() returns files in creation order.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'  | Imperial College,
   `- --  | London, UK




Re: A success story with apt and rsync

2003-07-06 Thread Theodore Ts'o
On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:
 
 I can only presume this is new or obscure, since everything I tried
 had the traditional behaviour. Can't see how to turn it on, either.
 

It's new for 2.5.  Backports to 2.4 are available here:

http://thunk.org/tytso/linux/extfs-2.4-update/extfs-update-2.4.21

For those who are interested, the broken out patches can be found here:

http://thunk.org/tytso/linux/extfs-2.4-update/broken-out-2.4.21/to-apply

Once you have an htree-enabled kernel, you enable the feature on a
filesystem by using the following command:

tune2fs -O dir_index /dev/hdXX

Optionally, you can reorganize all of the directories to use btrees by
using the command "e2fsck -fD /dev/hdXX".  Otherwise, only directories
that grow beyond a single block after you set the dir_index flag will
use htrees.  The dir_index feature is a fully compatible extension, so
it's perfectly safe to mount a filesystem with htrees on a non-htree
kernel.  A non-htree kernel will just ignore the b-tree information,
and if it attempts to modify a hash-tree directory, it will just
invalidate the htree interior node information, so that the directory
becomes unindexed until "e2fsck -fD" is run over the filesystem, which
optimizes all of the directories by reindexing them.

Why would you want to use htrees?  Because they speed up large
directories.  A lot.  Try creating 400,000 zero-length files in a
single directory.  It will take under 30 seconds with htree enabled,
and well over an hour without.

  The good news is that this particular optimization of sorting by inode
  number should work for all filesystems, and should speed up xfs as
  well as ext2/3 with HTREE.
 
 What about ext[23] without htree? Mucking with the order returned by
 readdir() has historically caused problems there...

It'll be fine; in fact, in some cases you'll see a slight speed up.
The key is that you'll get the best performance by reading/modifying
the inode data structures in sorted order by inode number.  This way,
you make a single sweep through the inode table, without needing any
extraneous seeks.  Using the natural sort order of readdir() on
non-htree ext2/3 systems mostly approximated this --- although if
files are deleted and created from the directory, this is not
guaranteed.  So sorting by inode number will never hurt, and may help.

- Ted




Re: A success story with apt and rsync

2003-07-06 Thread Theodore Ts'o
On Mon, Jul 07, 2003 at 01:01:34AM +0100, Andrew Suffield wrote:
  
  I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.
 
 My /home has that enabled and readdir() returns files in creation order.
 

Then you don't have an htree-capable kernel, or the directory isn't
indexed.  Directories that fit in a single block are not indexed; nor
are directories larger than a block that were created before directory
indexing was enabled, or that were modified by a non-htree-capable
kernel.

You can use the lsattr command to see if the indexed (I) flag is set
on a particular directory:

% lsattr -d /home/tytso
--I-- /home/tytso

- Ted




A success story with apt and rsync

2003-07-05 Thread Koblinger Egmont
Hi,

From time to time the question arises on different forums whether it is
possible to efficiently use rsync with apt-get. Recently there has been a
thread here on debian-devel and it was also mentioned in Debian Weekly News
June 24th, 2003. However, I only saw different small parts of a huge and
complex problem set discussed in different places; I haven't found an
overview of the whole situation anywhere.

Being one of the developers of the Hungarian distribution ``UHU-Linux'',
I have spent some time in the last few days collecting as much information
as possible, putting the patches together, and coding a little to fill
some minor gaps. Here I'd like to summarize and share all my experiences.

Our distribution uses dpkg/apt for package management. We are not Debian
based though; even our build procedure, which leads to deb packages, is
completely different from Debian's, except for the last step (which is
obviously a ``dpkg-deb --build'').

Some of our resources are quite tight. This is especially true for the
bandwidth of the home machines of the developers and testers. Most of us
live behind a 384kbit/s ADSL line. From time to time we rebuild all our
packages to see if they still compile with our current packages. Such a
full rebuild produces 1.5GB of new packages, all with a new filename since
the release numbers are automatically bumped. Our goal was that an upgrade
after such a full rebuild should require only a reasonable amount of
network traffic instead of one and a half gigabytes. Before telling how we
succeeded, I'd like to demonstrate the result.


One of my favorite games is quadra. The size of the package is nearly 3MB.
I've purged it from my system and then performed an ``apt-get install
quadra''. Apt-get printed this, amongst other things:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.8 [2931kB]
Fetched 2931kB in 59s (49,0kB/s)

The download speed and time correspond to the 384kbit/s bandwidth.

I recompiled the package on the server. Then I typed ``apt-get update''
followed by ``apt-get install quadra'' again. This time apt-get printed
this:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.9 [2931kB]
Fetched 2931kB in 3s (788kB/s)

Yes, downloading only took three seconds instead of one minute. Obviously
these two files do not differ only in their filename: they contain their
release number, timestamps of files, and perhaps other pieces of data
which make them different. Needless to say, a small change in the package
would only slightly increase the download time.

Speedup is usually approx. 2x--3x for packages containing lots of small
files, but can be extremely high for packages containing bigger files.


The rest of my mail covers the implementation details.


rsyncable gzip files
--------------------


A small change in a file causes its gzipped version to get out of sync,
and hence rsync doesn't see any common parts in the two versions. There's
a patch by Rusty Russell floating around on the net which adds an
--rsyncable option to gzip. It is already included in Debian. This way
gzipped files have synchronization points, making rsync's job much easier.
The patch is available (amongst others) at [1a] and [1b].

The documentation in the original patch says ``This reduces compression by
about 1 percent in most cases''. Debian's version says ``This increases
size by less than 1 percent in most cases''. The size increase was 0.7%
for all our packages, but 1.2% for our most important packages (the core
distrib, about 300--400MB).

This 1% is very low if you think of it as 1%. If you think of it as losing
6MB on every CD, then, well, it could have been smaller. But if you think
of what you gain with it, then it is definitely worth it.
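
To illustrate the principle (this is only a sketch of the idea, not
Rusty Russell's actual patch; the window size and the reset rule are
assumptions made for the example): the compressor keeps a rolling sum of
the last few kilobytes of input and resets its state whenever that sum
hits a magic value, so the reset points depend only on the nearby file
contents:

#include <stdio.h>

#define RSYNC_WIN 4096

int main(void)
{
    unsigned char window[RSYNC_WIN] = {0};
    unsigned long sum = 0, pos = 0;
    int c;

    while ((c = getchar()) != EOF) {
        /* rolling sum of the last RSYNC_WIN bytes of input */
        sum += (unsigned char)c;
        sum -= window[pos % RSYNC_WIN];
        window[pos % RSYNC_WIN] = (unsigned char)c;
        pos++;

        if (sum % RSYNC_WIN == 0)
            /* a real implementation would flush and reset the
               compressor here (e.g. a zlib Z_FULL_FLUSH), so that
               compression restarts from a known state */
            printf("reset point at byte %lu\n", pos);
    }
    return 0;
}

Since the reset points are chosen from the data itself, an insertion or
deletion early in the file only perturbs the compressed output up to the
next reset point.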

The same patch also exists for zlib (see [2a] or [2b]). However, while for
gzip you can control this behaviour with a command-line option, it is not
so trivial to do with a library. The official patch disables rsyncable
support by default. You can enable it by changing zlib_rsync = 0 to
zlib_rsync = 1 in zlib's source, or you can control it from your
running application. As I didn't like these approaches, I added a small
patch so that setting the ZLIB_RSYNC environment variable turns on
rsyncable support. This patch is at [3].

As dpkg seems to link statically against zlib, we had to recompile dpkg
after installing this patched zlib. After that we changed our build script
so that it invokes ``dpkg-deb --build'' with the ZLIB_RSYNC environment
variable set to some value.


order of files
--------------

dpkg-deb puts the files into the .deb package in random order. I hate this
misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in
mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]),
which makes the files appear in the package in alphabetical order. I don't
know how efficient rsync is if you split a file into some dozens or even
hundreds of parts and shuffle them, and then synchronize this one with the
original version. Anyway, I'm sure that sorting the files cannot hurt
rsync, it can only help. I only guess that it really does help a lot.
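
To give an idea of how such a wrapper can work, here is a simplified
sketch of an LD_PRELOAD shim in the spirit of sortdir (this is not the
actual sortdir code from [4a]/[4b]; the bookkeeping and the limits are
invented for the example). It slurps the whole directory on the first
readdir() call, sorts the entries by name, and hands them back one by
one:

/* build: gcc -shared -fPIC sortdir_shim.c -o sortdir_shim.so -ldl
 * use:   LD_PRELOAD=./sortdir_shim.so dpkg-deb --build ...      */
#define _GNU_SOURCE
#include <dirent.h>
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>

#define MAX_STREAMS 16
#define MAX_ENTRIES 65536

struct stream {
    DIR *dir;
    struct dirent *ents[MAX_ENTRIES];
    size_t n, next;
};
static struct stream streams[MAX_STREAMS];

static int by_name(const void *a, const void *b)
{
    return strcmp((*(struct dirent *const *)a)->d_name,
                  (*(struct dirent *const *)b)->d_name);
}

struct dirent *readdir(DIR *dirp)
{
    static struct dirent *(*real_readdir)(DIR *);
    struct stream *s = NULL;
    struct dirent *e;
    size_t i;

    if (!real_readdir)
        real_readdir = (struct dirent *(*)(DIR *))dlsym(RTLD_NEXT, "readdir");

    /* find the (or allocate a) sorted entry list for this stream */
    for (i = 0; i < MAX_STREAMS; i++)
        if (streams[i].dir == dirp)
            s = &streams[i];
    if (!s) {
        for (i = 0; i < MAX_STREAMS && streams[i].dir; i++)
            ;
        if (i == MAX_STREAMS)           /* table full: don't sort */
            return real_readdir(dirp);
        s = &streams[i];
        s->dir = dirp;
        s->n = s->next = 0;
        /* slurp the whole directory, then sort it by name */
        while (s->n < MAX_ENTRIES && (e = real_readdir(dirp))) {
            s->ents[s->n] = malloc(sizeof *e);
            memcpy(s->ents[s->n], e, sizeof *e);
            s->n++;
        }
        qsort(s->ents, s->n, sizeof s->ents[0], by_name);
    }
    if (s->next == s->n) {              /* done: clean up, free the slot */
        for (i = 0; i < s->n; i++)
            free(s->ents[i]);
        s->dir = NULL;
        return NULL;
    }
    return s->ents[s->next++];
}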

Re: A success story with apt and rsync

2003-07-05 Thread Andrew Suffield
On Sat, Jul 05, 2003 at 11:56:41PM +0200, Koblinger Egmont wrote:
 order of files
 
 dpkg-deb puts the files into the .deb package in random order. I hate this
 misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in
 mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]),
 which makes the files appear in the package in alphabetical order. I don't
 know how efficient rsync is if you split a file into some dozens or even
 hundreds of parts and shuffle them, and then synchronize this one with the
 original version. Anyway, I'm sure that sorting the files cannot hurt
 rsync, it can only help. I only guess that it really does help a lot.

It should put them in the package in the order they came from
readdir(), which will depend on the filesystem. This is normally the
order in which they were created, and should not vary when
rebuilding. As such, sorting the list probably doesn't change the
network traffic, but will slow dpkg-deb down on packages with large
directories in them.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'  | Imperial College,
   `- --  | London, UK




Re: A success story with apt and rsync

2003-07-05 Thread Goswin Brederlow
Koblinger Egmont [EMAIL PROTECTED] writes:

 Hi,
 
 From time to time the question arises on different forums whether it is
 possible to efficiently use rsync with apt-get. Recently there has been a
 thread here on debian-devel and it was also mentioned in Debian Weekly News
 June 24th, 2003. However, I only saw different small parts of a huge and
 complex problem set discussed at different places, I haven't find an
 overview of the whole situation anywhere.
...

I worked on an rsync patch for apt-get some years ago and raised some
design questions, some the same as you did in the deleted parts. Let's
summarize what I still remember:

1. debs are gzipped so any change (even a change in timestamps) results
in a different gzip. The rsyncable patch for gzip helps a lot there. So
let's consider that fixed.

2. most of the time you have no old file to rsync against. Only
mirrors will have an old file and they already use rsync.

3. rsyncing against the previous version is only possible via some
dirty hack as an apt module. apt would have to be changed to give
modules access to its cache structure, or at least pass any previous
version as an argument. Some mirror scripts already use older versions
as templates for new versions.

4. (and this is the knockout) rsync support for apt-get is NOT
WANTED. rsync uses too many resources (cpu and, more importantly, IO) on
the server side, and widespread use of rsync for apt-get would choke
the rsync mirrors and do more harm than good.

 conclusion
 ----------
 
 The good news is that it is working perfectly.
 
 The bad news is that you can't hack it on your home computer as long as your
 distribution doesn't provide rsync-friendly packages. Maybe one could set up
 a public rsync server with high bandwidth that keeps syncing the official
 packages and repacks them with rsync-friendly gzip/zlib and sorting the
 files.

There is a growing lobby to use gzip --rsyncable for Debian packages
by default. It's coming.


So what can be done?


Doogie is thinking about extending the Bittorrent protocol for use as
an apt-get method. I talked with him on irc about some design ideas
and so far it looks really good, if he can get some mirrors to host it.

The bittorrent protocol organises multiple downloaders so that they
also upload to each other, thereby reducing the traffic on the main
server. The extension of the protocol should also utilise http/ftp
mirrors as sources for the files, thereby spreading the load evenly
over multiple servers.

Bittorrent calculates a hash for each block of a file, very similar to
what rsync needs to work. Via another small extension, rolling
checksums for each block could be included in the protocol and a
client-side rsync could be done. (I heard this variant of rsync is
patented in the US but never saw real proof of it.)
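
For reference, the weak rolling checksum rsync uses looks roughly like
this (a sketch following the published rsync algorithm, not code taken
from the rsync sources); the point is that sliding the window one byte
is O(1):

#include <stddef.h>
#include <stdint.h>

/* weak checksum of buf[0..len), computed from scratch */
uint32_t weak_sum(const unsigned char *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        a += buf[i];
        b += (uint32_t)(len - i) * buf[i];
    }
    return (a & 0xffff) | (b << 16);
}

/* slide the window one byte in O(1): drop `out`, append `in` */
uint32_t weak_roll(uint32_t sum, unsigned char out, unsigned char in,
                   size_t len)
{
    uint32_t a = sum & 0xffff, b = sum >> 16;

    a = (a - out + in) & 0xffff;
    b = (b - (uint32_t)len * out + a) & 0xffff;
    return a | (b << 16);
}

A client rolls this sum over its old file byte by byte, looks each
value up in a table of block checksums, and computes the expensive
strong checksum only on a match.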


All together I think an extended bittorrent module for apt-get is by
far the better solution, but it will take some more time and design
before it can be implemented.

MfG
Goswin




Re: A success story with apt and rsync

2003-07-05 Thread Adam Heath
On 6 Jul 2003, Goswin Brederlow wrote:

 Doogie is thinking about extending the Bittorrent protocol for use as
 an apt-get method. I talked with him on irc about some design ideas
 and so far it looks really good, if he can get some mirrors to host it.

My plans are to require no additional software to be installed on any server.
This means all files will be pre-generated, and mirrored.  This also means
that a tracker won't be available on that particular mirror, but the block
checksums will still be available.

 The bittorrent protocol organises multiple downloaders so that they
 also upload to each other, thereby reducing the traffic on the main
 server. The extension of the protocol should also utilise http/ftp
 mirrors as sources for the files, thereby spreading the load evenly
 over multiple servers.

What this means is that clients will be able to fetch blocks from normal http
and ftp mirrors.  This will be used to start fetching data before connections
have been opened with peers.

 Bittorrent calculates a hash for each block of a file, very similar to
 what rsync needs to work. Via another small extension, rolling
 checksums for each block could be included in the protocol and a
 client-side rsync could be done. (I heard this variant of rsync is
 patented in the US but never saw real proof of it.)

 All together I think an extended bittorrent module for apt-get is by
 far the better solution, but it will take some more time and design
 before it can be implemented.

Also, for better sharing, users will have the option of leaving a running
server on their machines.

Additionally, part of my work will include extensions to the tracker to
support tracker peers, and tracker clusters.

Another extension is which tracker to use.  When fetching the .torrent
meta-data, my client will attempt to contact a tracker on the server the
.torrent resides on.  If none is found, it'll fall back to the one encoded in
the .torrent.  This provides for localization of connections, and better
latency.




Re: A success story with apt and rsync

2003-07-05 Thread Corrin Lakeland

On Sunday 06 July 2003 11:27, Goswin Brederlow wrote:
 Koblinger Egmont [EMAIL PROTECTED] writes:
  Hi,
 
  From time to time the question arises on different forums whether it is
 
  possible to efficiently use rsync with apt-get. Recently there has been a
  thread here on debian-devel and it was also mentioned in Debian Weekly
  News June 24th, 2003. However, I only saw different small parts of a huge
  and complex problem set discussed in different places; I haven't found an
  overview of the whole situation anywhere.

 ...

 Let's summarize what I still remember:

 2. most of the time you have no old file to rsync against. Only
 mirrors will have an old file and they already use rsync.

/var/cache/apt/ ?

 4. (and this is the knockout) rsync support for apt-get is NOT
 WANTED. rsync uses too many resources (cpu and, more importantly, IO) on
 the server side, and widespread use of rsync for apt-get would choke
 the rsync mirrors and do more harm than good.

When I was looking into this I heard about some work on caching the
rolling checksums to eliminate server load. I didn't find any code.

 Doogie is thinking about extending the Bittorrent protocol for use as
 an apt-get method. I talked with him on irc about some design ideas
 and so far it looks really good, if he can get some mirrors to host it.

Sounds interesting.  bittorrent allocates people to peer in a round-robin
fashion, which is really stupid.  If two people have similar IPs they
would make better peers.

 Via another small extension, rolling
 checksums for each block could be included in the protocol and a
 client-side rsync could be done. (I heard this variant of rsync is
 patented in the US but never saw real proof of it.)

Likewise on both counts.

Corrin