Re: A success story with apt and rsync
On Sun, 6 Jul 2003, Andrew Suffield wrote:

> It should put them in the package in the order they came from
> readdir(), which will depend on the filesystem. This is normally the
> order in which they were created, and should not vary when
> rebuilding. As such, sorting the list probably doesn't change the
> network traffic, but will slow dpkg-deb down on packages with large
> directories in them.

Yes, when saying random order I obviously meant the order readdir()
returns them in. It's random for me. :-)))

It can easily be different on different filesystems, or even on the same
type of filesystem with different parameters (e.g. blocksize). I even
think it can be different after a simple rebuild in exactly the same
environment. For example, configure and libtool like to create files
with the PID in their name, which can take from 3 to 5 digits. If you
create file X and then Y, remove X and then create Z, then it is most
likely that if Z's name is shorter than or equal in length to X's name,
it will be returned first by readdir(), while if its name is longer,
then Y will be returned first and Z afterwards. So I can imagine
situations where the order of the files depends on the PIDs of the
build processes.

However, I guess our goal is not only to produce similar packages from
exactly the same source. It's quite important to produce similar
packages even after a version upgrade. For example, you have a
foobar-0.9 package and now upgrade to foobar-1.0. The author may have
completely rewritten the Makefile, which yields nearly the same
executable and the same data files, but a completely different "random"
order.

However, I think sorting the files costs practically nothing. My system
is not a very new one: 375MHz Celeron, IDE disks, 384MB RAM etc.
However:

/usr/lib$ du -s .
1,1G    .
/usr/lib$ find . -type f | wc -l    # okay, it's now in memory cache
18598
/usr/lib$ time find . > /dev/null 2>&1

real    0m0.285s
user    0m0.100s
sys     0m0.150s
[EMAIL PROTECTED]:/usr/lib$ time sortdir find . > /dev/null 2>&1

real    0m1.683s
user    0m1.390s
sys     0m0.250s

IMHO a step which takes one and a half seconds before compressing 18000
files totaling more than a gigabyte shouldn't be a problem.

cheers,

Egmont
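[A quick sanity check of the cost claim above — a synthetic sketch, not the sortdir library itself; the file names are invented, only the count matches the /usr/lib figure quoted in the message:]

```python
import time

# Synthetic file names standing in for a real readdir() listing;
# the count (18598) matches the /usr/lib figure from the transcript.
names = ["lib%05d.so.0.%d" % (i, i % 7) for i in range(18598)]

start = time.perf_counter()
ordered = sorted(names)
elapsed = time.perf_counter() - start

# Even in pure Python this is a matter of milliseconds, far below
# the cost of compressing the corresponding gigabyte of data.
print("sorted %d names in %.4f seconds" % (len(ordered), elapsed))
```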
Re: A success story with apt and rsync
Hi,

On 6 Jul 2003, Goswin Brederlow wrote:

> 2. most of the time you have no old file to rsync against. Only
> mirrors will have an old file and they already use rsync.

This is definitely true if you install your system from CDs and then
upgrade it. However, if you keep on upgrading from testing/unstable
then you'll have more and more packages under /var/cache/apt/archives,
so there is a better and better chance that an older version is found
there.

Or, alternatively, if you are sitting behind a slow modem and apt-get
upgrade says it will upgrade extremely-huge-package, then you can still
easily insert your CD, copy the old version of extremely-huge-package
to /var/cache/apt/archives, and hit ENTER to apt-get afterwards.

> 3. rsyncing against the previous version is only possible via some
> dirty hack as apt module. apt would have to be changed to provide
> modules access to its cache structure or at least pass any previous
> version as argument. Some mirror scripts already use older versions
> as templates for new versions.

Yes, this is what I've hacked together based on other people's great
work. It is (as I've said too) a dirty hack. If a more experienced apt
coder can replace my hard-coded path with a mechanism that tells this
path to the module, then this hack won't even be dirty.

> 4. (and this is the knockout) rsync support for apt-get is NOT
> WANTED. rsync uses too much resources (cpu and more relevant IO) on
> the server side and a widespread use of rsync for apt-get would
> choke the rsync mirrors and do more harm than good.

It might not be wanted by administrators; however, I guess it is wanted
by many of the users (at least by me :-)). I don't see the huge load on
the server (since I'm the only one rsyncing from it), but I see the
huge difference in the download time. If my download weren't faster
because of an overloaded server, I would switch back to FTP or anything
that is better for me as an end user.
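[The lookup described under point 3 — find a previously downloaded version of a package to use as the rsync basis — could be sketched like this. This is a hypothetical helper, not the actual apt module code; it only assumes Debian's name_version_arch.deb filename layout:]

```python
import glob
import os

def previous_version(pkg, arch, cache="/var/cache/apt/archives"):
    """Return the most recently modified cached .deb of this package,
    or None.  Such a file is a candidate basis for rsync to diff
    against, instead of downloading the new version from scratch."""
    pattern = os.path.join(cache, "%s_*_%s.deb" % (pkg, arch))
    candidates = glob.glob(pattern)
    if not candidates:
        return None
    # Newest by mtime; any old version helps, the newest helps most.
    return max(candidates, key=os.path.getmtime)
```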
I understand that rsync causes a high load on the server when several
users are connected, and so it is not suitable as a general replacement
for ftp; however, I think it is suitable as an alternative. I also
don't expect the Debian team itself to set up a public rsync server for
the packages. However, some mirrors might want to set up an rsync
server, either for the public or, for example, a university for its
students.

A similar hack could simply be used by people who have an account on a
machine with high bandwidth. For example, if I used Debian and Debian
had rsyncable packages but no public rsync server was available, I'd
personally mirror Debian to a machine at the university using FTP and
use rsync from that server to my home machine, saving traffic where the
bandwidth is the bottleneck.

So I don't think it's a bad idea to set up some public rsync servers
worldwide. The maximum number of connections can be configured so that
cpu usage is limited somehow. It's obvious that if a user often gets
"connection refused" then he will switch back to ftp or http. Hence I
guess that the capacity of the public rsync servers and the number of
users using rsync would somehow balance automatically; it doesn't have
to be coordinated centrally. So IMHO let anybody set up an rsync server
if he wants to, and let the users use rsync if they want to (but don't
put an rsync:// line in the default sources.list).

> All together I think an extended bittorrent module for apt-get is by
> far the better solution but it will take some more time and designing
> before it can be implemented.

It is very promising and I really hope that it will be a good protocol
with a good implementation and integration into apt. But until this is
realized, we could still have rsync as an alternative, if Debian
packages were packed in a slightly different way.

bye,

Egmont
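[The connection cap mentioned above is a standard rsyncd.conf setting. A hedged example — the module name and paths are invented for illustration:]

```ini
# Hypothetical /etc/rsyncd.conf for a package mirror.
[debian]
    path = /srv/mirror/debian
    comment = package mirror, rsync access
    read only = yes
    # Cap simultaneous clients so CPU and IO load stay bounded;
    # clients over the limit are refused and can fall back to
    # ftp/http, giving the self-balancing behaviour described above.
    max connections = 20
    lock file = /var/run/rsyncd.lock
```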
Re: A success story with apt and rsync
On Sun, 6 Jul 2003, Andrew Suffield wrote:

> On ext2, as an example, stat()ting or open()ing a directory of files
> in the order returned by readdir() will be vastly quicker than in
> some other sequence (like, say, bytewise lexicographic) due to the
> way in which the filesystem looks up inodes. This has caused
> significant performance issues for bugs.debian.org in the past.

You're right, I didn't get this point in the story when I simply ran
find using the sortdir wrapper, but now I understand the problem.
However, I'm still unsure whether it is good to keep files unsorted,
especially if we consider efficient syncing of packages.

On my home computer I've never heard the sound of my disk during the
package-creation phase (even though we've been using sortdir for more
than half a year, and I've compiled hundreds of packages), but I do
hear it when e.g. the source is decompressed. At the 'dpkg-deb --build'
phase only the processor is the bottleneck.

This might vary under different circumstances. I'm unaware of them in
the case of Debian, e.g. I have no information about what hardware your
packages are created on, whether there are any other cpu-intensive or
disk-intensive applications running on these machines, etc. I can
easily imagine that using sortdir can drastically decrease performance
if another disk-intensive process is running. However, my experience
didn't show a noticeable performance decrease when this was the only
process accessing the disk...

But hey, let's stop for a minute :-) Building the package mostly reads
from the memory cache, doesn't it? The files it packs together have
just recently been created, and there are not that many packages whose
uncompressed size is close to or bigger than the amount of RAM in
today's machines... And for the large packages the build itself might
take thousands of times as long as reading the files in sorted order.

Does anyone know what RPM does?
I know that listing the contents of a package always produces
alphabetical order, but I don't know whether the file list is sorted on
the fly or the files really appear alphabetically in the cpio archive.

So I guess we've already seen the pros and cons of sorting the files.
(One thing is missing: we still don't know how efficient rsync is if
two rsyncable tar.gz files contain the same files but in a different
order.) The decision is clearly not mine but the Debian developers'.
However, if you ask me, I still vote for sorting the files :-))

bye,

Egmont
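[The sorted-order build discussed in this thread can be sketched with Python's tarfile module. This is a toy stand-in, not dpkg-deb (which is C; the real fix in the thread is the sortdir LD_PRELOAD wrapper):]

```python
import os
import tarfile

def build_sorted_tar(src_dir, out_path):
    """Collect every file under src_dir, then add them in a
    deterministic, per-directory sorted order -- so two builds of
    the same tree produce members in the same sequence regardless
    of what readdir() happens to return."""
    paths = []
    for root, dirs, files in os.walk(src_dir):
        dirs.sort()                      # deterministic recursion, too
        for name in sorted(files):
            paths.append(os.path.join(root, name))
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.relpath(path, src_dir))
```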
A success story with apt and rsync
Hi,

From time to time the question arises on different forums whether it is
possible to use rsync efficiently with apt-get. Recently there has been
a thread here on debian-devel, and it was also mentioned in Debian
Weekly News, June 24th, 2003. However, I only saw different small parts
of a huge and complex problem set discussed in different places; I
haven't found an overview of the whole situation anywhere. Being one of
the developers of the Hungarian distribution ``UHU-Linux'', I spent
some time in the last few days collecting as much information as
possible, putting the patches together, and coding a little bit to fill
some minor gaps. Here I'd like to summarize and share all my
experiences.

Our distribution uses dpkg/apt for package management. We are not
Debian based though; even our build procedure, which leads to deb
packages, is completely different from Debian's, except for the last
step (which is obviously a ``dpkg-deb --build'').

Some of our resources are quite tight. This is especially true for the
bandwidth of the home machines of the developers and testers. Most of
us live behind a 384kbit/s ADSL line. From time to time we rebuild all
our packages to see if they still compile with our current packages.
Such a full rebuild produces 1.5GB of new packages, all with a new
filename since the release numbers are automatically bumped. Our goal
was that an upgrade after such a full rebuild should require only a
reasonable amount of network traffic instead of one and a half
gigabytes.

Before telling how we succeeded, I'd like to demonstrate the result.
One of my favorite games is quadra. The size of the package is nearly
3MB. I've purged it from my system and then performed an ``apt-get
install quadra''. Apt-get printed this, amongst others:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.8 [2931kB]
Fetched 2931kB in 59s (49,0kB/s)

The download speed and time correspond to the 384kbit/s bandwidth. I
recompiled the package on the server.
Then I typed ``apt-get update'' followed by ``apt-get install quadra''
again. This time apt-get printed this:

Get:1 rsync://rsync.uhulinux.hu ./ quadra 1.1.8-2.9 [2931kB]
Fetched 2931kB in 3s (788kB/s)

Yes, downloading only took three seconds instead of one minute.
Obviously these two files do not only differ in their filename: they
contain their release number, timestamps of files, and perhaps other
pieces of data which make them different. Needless to say, a small
change in the package would only slightly increase the download time.
The speedup is usually approx. 2x--3x for packages containing lots of
small files, but can be extremely high for packages containing bigger
files.

The rest of my mail describes the implementation details.

rsyncable gzip files

A small change in a file causes its gzipped version to get out of sync,
and hence rsync doesn't see any common parts. There's a patch by Rusty
Russell floating around on the net which adds an --rsyncable option to
gzip; it is already included in Debian. This way gzipped files have
synchronization points, making rsync's job much easier. The patch is
available (amongst others) at [1a] and [1b].

The documentation in the original patch says ``This reduces compression
by about 1 percent most cases''. Debian's version says ``This increases
size by less than 1 percent most cases''. The size increase was 0.7%
for all our packages, but 1.2% for our most important packages (the
core distrib is about 300--400MB). This 1% is very low if you think of
it as 1%. If you think of it as losing 6MB on every CD, then, well, it
could have been smaller. But if you think of what you gain with it,
then it is definitely worth it.

The same patch also exists for zlib (see [2a] or [2b]). However, while
for gzip you can control this behaviour with a command-line option, it
is not so trivial to do with a library. The official patch disables
rsyncable support by default.
You can enable it by changing zlib_rsync = 0 to zlib_rsync = 1 within
zlib's source, or you can control it from your running application. As
I didn't like these approaches, I added a small patch so that setting
the ZLIB_RSYNC environment variable turns on the rsyncable support.
This patch is at [3]. As dpkg seems to link statically against zlib, we
had to recompile dpkg after installing this patched zlib. After this we
changed our build script so that it invokes ``dpkg-deb --build'' with
the ZLIB_RSYNC environment variable set to some value.

order of files

dpkg-deb puts the files in the .deb package in random order. I hate
this misfeature since it's hard to eye-grep anything from ``dpkg -L''
or F3 in mc. We run ``dpkg-deb --build'' using the sortdir library
([4a], [4b]), which makes the files appear in the package in
alphabetical order. I don't know how efficient rsync is if you split a
file into some dozens or even hundreds of parts, shuffle them, and then
synchronize it with the original version.
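[A crude experiment on that last question — this is a toy model of rsync's block matching, not rsync itself, and it works on uncompressed data. Because the rolling checksum lets rsync recognize a block at *any* offset in the new file, blocks that merely moved are still found:]

```python
import hashlib
import random

BLOCK = 512

def block_hashes(data):
    """Hashes of every aligned BLOCK-sized piece of data."""
    return {hashlib.md5(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data) - BLOCK + 1, BLOCK)}

def found_anywhere(old, new):
    """Fraction of old's aligned blocks that occur at any offset in
    new -- the kind of match rsync's rolling checksum can exploit."""
    wanted = block_hashes(old)
    seen = set()
    for i in range(len(new) - BLOCK + 1):
        h = hashlib.md5(new[i:i + BLOCK]).hexdigest()
        if h in wanted:
            seen.add(h)
    return len(seen) / len(wanted)

# Build a file from 64 distinct chunks, then shuffle the chunks:
# same content, completely different order.
rng = random.Random(0)
chunks = [bytes([rng.randrange(256) for _ in range(2048)])
          for _ in range(64)]
old = b"".join(chunks)
rng.shuffle(chunks)
new = b"".join(chunks)

# Every block survives the shuffle, since each block still appears
# somewhere in the reordered file.
print("matched: %.0f%%" % (100 * found_anywhere(old, new)))
```

(This says nothing yet about what gzip does on top; reordering the input of a compressor moves its synchronization points too, which is exactly why the files inside the tar should end up in a stable order in the first place.)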