Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont

On Sun, 6 Jul 2003, Andrew Suffield wrote:

 It should put them in the package in the order they came from
 readdir(), which will depend on the filesystem. This is normally the
 order in which they were created, and should not vary when
 rebuilding. As such, sorting the list probably doesn't change the
 network traffic, but will slow dpkg-deb down on packages with large
 directories in them.

Yes, when I said random order I obviously meant the order in which
readdir() returns them. It's random to me.  :-)))

It can easily be different on different filesystems, or even on the same
type of filesystem with different parameters (e.g. block size).

I even think it can be different after a simple rebuild in exactly the
same environment. For example, configure and libtool like to create files
with the PID in their name, which can be anywhere from 3 to 5 digits long.
If you create file X and then Y, remove X and then create Z, then most
likely Z will be returned first by readdir() if its name is no longer than
X's (it can reuse X's freed directory slot), while if its name is longer,
Y will be returned first and Z afterwards. So I can imagine situations
where the order of the files depends on the PIDs of the build processes.
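
A tiny demonstration of the effect (the file names and PIDs are made up;
what you actually see depends on the filesystem):

$ mkdir /tmp/ordertest && cd /tmp/ordertest
$ touch conftest.12345 ltmain.678   # create X (5-digit PID), then Y
$ rm conftest.12345                 # delete X, freeing its directory slot
$ touch conftest.99                 # Z: shorter name, can reuse X's slot
$ ls -U                             # -U = unsorted, i.e. readdir() order
# on ext2 conftest.99 typically comes back before ltmain.678 here;
# give Z a longer name and it comes back last instead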

However, I guess our goal is not only to produce similar packages from
exactly the same source. It's quite important to produce a similar package
even after a version upgrade. For example, you have a foobar-0.9 package
and now upgrade to foobar-1.0. The author may have completely rewritten
the Makefile, which yields nearly the same executable and the same data
files, but in a completely different random order.


However, I think sorting the files costs practically nothing. My system
is not a very new one: 375MHz Celeron, IDE disks, 384MB RAM etc. For
example:

/usr/lib$ du -s .
1,1G    .
/usr/lib$ find . -type f | wc -l  # okay, it's now in memory cache
  18598
/usr/lib$ time find . > /dev/null 2>&1

real    0m0.285s
user    0m0.100s
sys     0m0.150s
/usr/lib$ time sortdir find . > /dev/null 2>&1

real    0m1.683s
user    0m1.390s
sys     0m0.250s


IMHO a step which takes one and a half seconds before compressing 18,000
files totalling more than 1 gigabyte shouldn't be a problem.




cheers,
Egmont




Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont
Hi,

On 6 Jul 2003, Goswin Brederlow wrote:

 2. most of the time you have no old file to rsync against. Only
 mirrors will have an old file and they already use rsync.

This is definitely true if you install your system from CDs and then
upgrade it. However, if you keep on upgrading from testing/unstable, then
you'll have more and more packages under /var/cache/apt/archives, so there
is a growing chance that an older version is found there. Or,
alternatively, if you are sitting behind a slow modem and apt-get upgrade
says it will upgrade extremely-huge-package, then you can still easily
insert your CD, copy the old version of extremely-huge-package to
/var/cache/apt/archives and hit ENTER in apt-get afterwards.
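
Concretely, something like this (the mount point and the package file
name are made up for the example):

$ mount /cdrom
$ cp /cdrom/pool/main/e/extremely-huge-package/extremely-huge-package_1.0-1_i386.deb \
     /var/cache/apt/archives/
$ apt-get upgrade   # the rsync method now finds an old version to sync against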

 3. rsyncing against the previous version is only possible via some
 dirty hack as apt module. apt would have to be changed to provide
 modules access to its cache structure or at least pass any previous
 version as argument. Some mirror scripts already use older versions as
 templates for new versions.

Yes, this is what I've hacked together based on other people's great work.
It is (as I've said) a dirty hack. If a more experienced apt coder can
replace my hard-coded path with a mechanism that passes this path to the
module, then this hack won't even be dirty.
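
The core idea of the hack, done by hand (the file names and the rsync
module name are made up):

$ cd /var/cache/apt/archives
$ cp quadra_1.1.8-2.8_uhu.deb quadra_1.1.8-2.9_uhu.deb    # seed with the old version
$ rsync rsync.uhulinux.hu::uhu/quadra_1.1.8-2.9_uhu.deb . # only the differences travel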

 4. (and this is the knockout) rsync support for apt-get is NOT
 WANTED. rsync uses too many resources (cpu and, more relevantly, IO) on
 the server side and a widespread use of rsync for apt-get would choke
 the rsync mirrors and do more harm than good.

It might not be wanted by administrators; however, I guess it is wanted by
many of the users (at least by me :-)). I don't see the huge load on the
server (since I'm the only one rsyncing from it), but I do see the huge
difference in download time. If my download weren't faster because of an
overloaded server, I would switch back to FTP or whatever works better for
me as an end user.

I understand that rsync causes a high load on the server when several
users are connected, so it is not suitable as a general replacement for
ftp, but I think it is suitable as an alternative. I also don't expect the
Debian team itself to set up a public rsync server for the packages.
However, some mirrors might want to set up an rsync server, either for the
public or, for example, a university for its students.

A similar hack could simply be used by people who have an account on a
machine with high bandwidth. For example, if I used Debian and Debian had
rsyncable packages but no public rsync server was available, I'd
personally mirror Debian to a machine at the university using FTP and use
rsync from that server to my home machine, saving traffic where bandwidth
is the bottleneck.

So I don't think it's a bad idea to set up some public rsync servers
worldwide. The maximum number of connections can be configured so that CPU
usage is limited. It's obvious that if a user often gets connection
refused, he will switch back to ftp or http. Hence I guess the capacity of
the public rsync servers and the number of users using rsync would balance
themselves automatically; it doesn't have to be coordinated centrally. So
IMHO let anybody set up an rsync server if he wants to, and let the users
use rsync if they want to (but don't put an rsync:// line in the default
sources.list).
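
A minimal sketch of such a limit in rsyncd.conf (the module name and path
are made up):

$ cat /etc/rsyncd.conf
max connections = 30            # further clients simply get refused
[debian]
        path = /srv/mirror/debian
        comment = mirror with rsyncable packages
        read only = yes
$ rsync --daemon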


 All together I think an extended bittorrent module for apt-get is by
 far the better solution but it will take some more time and designing
 before it can be implemented.

It is very promising and I really hope that it will be a good protocol
with a good implementation and integration with apt. But until that is
realized, we could still have rsync as an alternative, if Debian packages
were packed in a slightly different way.



bye,

Egmont




Re: A success story with apt and rsync

2003-07-06 Thread Koblinger Egmont

On Sun, 6 Jul 2003, Andrew Suffield wrote:

 On ext2, as an example, stat()ting or open()ing a directory containing
 many files in the order returned by readdir() will be vastly quicker than
 in some other sequence (like, say, bytewise lexicographic) due to the
 way in which the filesystem looks up inodes. This has caused
 significant performance issues for bugs.debian.org in the past.

You're right, I missed that point when I simply ran find using the sortdir
wrapper, but now I understand the problem.

However, I'm still unsure whether it is good to keep files unsorted,
especially if we consider efficient syncing of packages. On my home
computer I've never heard the sound of my disk during the package-creation
phase (even though we've been using sortdir for more than half a year, and
I've compiled hundreds of packages), but I do hear it when e.g. the source
is decompressed. At the 'dpkg-deb --build' phase only the processor is the
bottleneck.

This might vary under different circumstances. I'm unaware of them in the
case of Debian; e.g. I have no information about what hardware your
packages are built on, or whether any other cpu-intensive or disk-intensive
applications run on these machines. I can easily imagine that using
sortdir drastically decreases performance if another disk-intensive
process is running. However, in my experience there was no noticeable
performance decrease when this was the only process accessing the disk...

But hey, let's stop for a minute :-) Building the package only uses the
memory cache for most of the packages, doesn't it? The files it packs
together have just recently been created and there are not so many
packages whose uncompressed size is close to or bigger than the amount of
RAM in today's machines...

And for the large packages the build itself probably takes a thousand
times as long as reading the files in sorted order.

Does anyone know what RPM does? I know that listing the contents of a
package always produces alphabetical order but I don't know whether the
filelist is sorted on the fly or the files really appear alphabetically in
the cpio archive.
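
One way to check on an RPM-based box (the package file name is made up):

$ rpm2cpio foobar-1.0-1.i386.rpm | cpio -t | head   # order as stored in the cpio archive
$ rpm -qlp foobar-1.0-1.i386.rpm | head             # order as rpm itself reports it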


So I guess we've already seen the pros and cons of sorting the files. (One
thing is missing: we still don't know how efficient rsync is if two
rsyncable tar.gz files contain the same files but in a different order; a
rough way to measure it is sketched below.) The decision is clearly not
mine but the Debian developers'. However, if you ask me, I still vote for
sorting the files :-))
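
Something along these lines would give the missing data point (using the
patched gzip; the directory and file names are made up, and the 'Literal
data' line of --stats shows how much would really travel over the wire):

$ find tree -type f        | tar cf a.tar -T -    # readdir()-ish order
$ find tree -type f | sort | tar cf b.tar -T -    # same files, sorted
$ gzip --rsyncable a.tar b.tar
$ cp a.tar.gz old-copy.tar.gz                     # the client's old file
$ rsync --no-whole-file --stats b.tar.gz old-copy.tar.gz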




bye,

Egmont




A success story with apt and rsync

2003-07-05 Thread Koblinger Egmont
Hi,

From time to time the question arises on different forums whether it is
possible to efficiently use rsync with apt-get. Recently there was a
thread here on debian-devel, and it was also mentioned in Debian Weekly
News, June 24th, 2003. However, I only saw small parts of a huge and
complex problem set discussed at different places; I haven't found an
overview of the whole situation anywhere.

Being one of the developers of the Hungarian distribution ``UHU-Linux'', I
spent some time in the last few days collecting as much information as
possible, putting the patches together, and coding a little bit to fill
some minor gaps. Here I'd like to summarize and share all my experiences.

Our distribution uses dpkg/apt for package management. We are not
Debian-based though; even our build procedure that produces deb packages is
completely different from Debian's, except for the last step (which is
obviously ``dpkg-deb --build'').

Some of our resources are quite tight. This is especially true for the
bandwidth of the home machines of the developers and testers. Most of us
live behind a 384kbit/s ADSL line. From time to time we rebuild all our
packages to see if they still compile with our current packages. Such a full
rebuild produces 1.5GB of new packages, all with a new filename since the
release numbers are automatically bumped. Our goal was to ensure that an
upgrade after such a full rebuild requires only a reasonable amount of
network traffic instead of one and a half gigabytes. Before telling how we
succeeded, I'd like to demonstrate the result.


One of my favorite games is quadra. The size of the package is nearly 3MB.
I've purged it from my system and then performed an ``apt-get install
quadra''. Apt-get printed this, amongst other things:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.8 [2931kB]
Fetched 2931kB in 59s (49,0kB/s)

The download speed and time correspond to the 384kbit/s bandwidth.

I recompiled the package on the server. Then I typed ``apt-get update''
followed by ``apt-get install quadra'' again. This time apt-get printed
this:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.9 [2931kB]
Fetched 2931kB in 3s (788kB/s)

Yes, downloading only took three seconds instead of one minute. Obviously
these two files do not differ only in their filename: they contain the
release number, file timestamps and perhaps other pieces of data which
make them different. Needless to say, a small change in the package would
only slightly increase the download time.

Speedup is usually approx. 2x--3x for packages containing lots of small
files, but can be extremely high for packages containing bigger files.


The rest of my mail describes the implementation details.


rsyncable gzip files
--------------------

A small change in a file causes its gzipped version to get out of sync,
and hence rsync doesn't see any common parts between the old and the new
compressed files. There's a patch by Rusty Russell floating around on the
net which adds an --rsyncable option to gzip; it is already included in
Debian. This way gzipped files have synchronization points, making rsync's
job much easier. The patch is available (amongst other places) at [1a] and
[1b].

The documentation in the original patch says ``This reduces compression by
about 1 percent most cases''. Debian's version says ``This increases size by
less than 1 percent most cases''. The size increase was 0.7% for all our
packages, but 1.2% for our most important packages (the core distribution,
about 300--400MB).
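
These numbers are easy to reproduce on any tarball (the file name is made
up):

$ gzip -9 -c data.tar > plain.tar.gz
$ gzip -9 --rsyncable -c data.tar > rsyncable.tar.gz
$ ls -l plain.tar.gz rsyncable.tar.gz   # the rsyncable one is typically well under 1% larger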

This 1% is very low if you think of it as 1%. If you think of it as losing
6MB on every CD, then, well, it could have been smaller. But if you think
of what you gain with it, then it is definitely worth it.

The same patch also exists for zlib (see [2a] or [2b]). However, while for
gzip you can control this behaviour with a command-line option, it is not
so trivial to do so with a library. The official patch disables rsyncable
support by default. You can enable it by changing zlib_rsync = 0 to
zlib_rsync = 1 in zlib's source, or you can control it from the running
application. As I didn't like these approaches, I added a small patch so
that setting the ZLIB_RSYNC environment variable turns on rsyncable
support. This patch is at [3].

As dpkg seems to statically link against zlib, we had to recompile dpkg
after installing this patched zlib. After this we changed our build script
so that it invokes ``dpkg-deb --build'' with the ZLIB_RSYNC environment
variable set to some value.
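
With those pieces in place, the last step of a build looks roughly like
this (the directory and package names are made up; any non-empty value of
ZLIB_RSYNC enables the support in our patch):

$ ZLIB_RSYNC=1 dpkg-deb --build debian/tmp ../quadra_1.1.8-2.9_uhu.deb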


order of files
--------------

dpkg-deb puts the files into the .deb package in random order. I hate this
misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in
mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]),
which makes the files appear in the package in alphabetical order. I don't
know how efficient rsync is if you split a file into some dozens or even
hundreds of parts, shuffle them, and then synchronize that with the
original version.
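
For completeness, with sortdir in front the final build command becomes
roughly (paths made up):

$ ZLIB_RSYNC=1 sortdir dpkg-deb --build debian/tmp ../foobar_1.0-1_uhu.deb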