Re: A success story with apt and rsync
> From time to time the question arises on different forums whether it is possible to efficiently use rsync with apt-get. Recently there has been a thread here on debian-devel and it was also mentioned in Debian Weekly News, June 24th, 2003. However, I only saw different small parts of a huge and complex problem set discussed at different places; I haven't found an overview of the whole situation anywhere.

Sorry that I write so late, but I don't read debian-devel regularly. I started a solution for distributing Debian mirrors via rsync about 2 years ago. The only "impact" (if impact is the right word) of my solution on Debian is the use of the rsync patch for gzip. Everything else is solved by my Perl script, so you might find ideas for your apt solution there. See http://dpartialmirror.sourceforge.net/.

O. Wyss

-- 
See http://wxguide.sourceforge.net/ for ideas how to design your app.
Re: A success story with apt and rsync
Michael Karcher <[EMAIL PROTECTED]> writes:
> On Sun, Jul 06, 2003 at 01:29:06AM +0200, Andrew Suffield wrote:
> > It should put them in the package in the order they came from readdir(), which will depend on the filesystem. This is normally the order in which they were created,
> As long as the file system uses an inefficient approach for directories, like the ext2/ext3 linked lists. If directories are hash tables (as on reiserfs), even creating another file in the same directory may totally mess up the order.
> 
> Michael Karcher

ext2/ext3 has hashed directories too, if you configure it.

MfG Goswin
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 01:29:06AM +0200, Andrew Suffield wrote:
> It should put them in the package in the order they came from readdir(), which will depend on the filesystem. This is normally the order in which they were created,

As long as the file system uses an inefficient approach for directories, like the ext2/ext3 linked lists. If directories are hash tables (as on reiserfs), even creating another file in the same directory may totally mess up the order.

Michael Karcher
Re: A success story with apt and rsync
On Mon, Jul 07, 2003 at 01:01:34AM +0100, Andrew Suffield wrote:
> > > I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.
> My /home has that enabled and readdir() returns files in creation order.

Then you don't have an htree-capable kernel, or the directory isn't indexed. Directories that fit in a single block are not indexed; neither are directories larger than a block that were created before directory indexing was enabled, or that have since been modified by a non-htree-capable kernel. You can use the lsattr command to see whether the indexed (I) flag is set on a particular directory:

    % lsattr -d /home/tytso
    --I-- /home/tytso

- Ted
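(lsattr reads the per-inode attribute flags, so the same test can be done from inside a program -- which is essentially the runtime check Andrew asks about later in this thread. The following is only a sketch, not code from this thread: it assumes Linux's FS_IOC_GETFLAGS ioctl, known as EXT2_IOC_GETFLAGS in 2003-era headers, and the ext2 flag bit EXT2_INDEX_FL = 0x00001000.)

    /* htree-check.c -- report whether a directory carries the htree/dir_index
     * ("I") flag.  Illustrative sketch only, not from this thread. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>             /* FS_IOC_GETFLAGS */

    #define EXT2_INDEX_FL 0x00001000  /* shown as 'I' by lsattr */

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s directory\n", argv[0]);
            return 2;
        }
        int fd = open(argv[1], O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror(argv[1]); return 2; }

        int flags = 0;                /* the kernel copies out an int here */
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
            perror("ioctl");          /* e.g. not an ext2/ext3 filesystem */
            close(fd);
            return 2;
        }
        close(fd);
        printf("%s is %sindexed\n", argv[1],
               (flags & EXT2_INDEX_FL) ? "" : "not ");
        return (flags & EXT2_INDEX_FL) ? 0 : 1;
    }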
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:
> I can only presume this is new or obscure, since everything I tried had the traditional behaviour. Can't see how to turn it on, either.

It's new for 2.5. Backports to 2.4 are available here: http://thunk.org/tytso/linux/extfs-2.4-update/extfs-update-2.4.21

For those who are interested, the broken-out patches can be found here: http://thunk.org/tytso/linux/extfs-2.4-update/broken-out-2.4.21/to-apply

Once you have an htree-enabled kernel, you enable a filesystem to use the feature with the following command:

    tune2fs -O dir_index /dev/hdXX

Optionally, you can reorganize all of the directories to use btrees by using the command "e2fsck -fD /dev/hdXX". Otherwise, only directories that are expanded beyond a single block after you set the dir_index flag will use htrees.

dir_index is a fully compatible extension, so it's perfectly safe to mount a filesystem with htrees on a non-htree kernel. A non-htree kernel will just ignore the b-tree information, and if it attempts to modify a hash-tree directory, it will just invalidate the htree interior node information, so that the directory becomes unindexed until "e2fsck -fD" is run over the filesystem, which optimizes all of the directories by reindexing them.

Why would you want to use htrees? Because they speed up large directories. A lot. Try creating 400,000 zero-length files in a single directory. It will take under 30 seconds with htree enabled, and well over an hour without.

> > The good news is that this particular optimization of sorting by inode number should work for all filesystems, and should speed up xfs as well as ext2/3 with HTREE.
> What about ext[23] without htree? Mucking with the order returned by readdir() has historically caused problems there...

It'll be fine; in fact, in some cases you'll see a slight speed-up. The key is that you'll get the best performance by reading/modifying the inode data structures in sorted order by inode number. This way, you make a single sweep through the inode table, without needing any extraneous seeks. Using the natural sort order of readdir() on non-htree ext2/3 systems mostly approximated this --- although if files are deleted from and added to the directory, this is not guaranteed. So sorting by inode number will never hurt, and may help.

- Ted
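(Ted's 400,000-file timing claim is easy to try yourself. Here is a throwaway sketch of the sort of program one might use -- not from this thread, and the file-name pattern is arbitrary. Run it under time(1) in an empty scratch directory, once on a filesystem with dir_index and once without.)

    /* mkmany.c -- create 400,000 zero-length files in the current
     * directory, to compare htree vs. non-htree timings.  Sketch only. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char name[32];
        for (int i = 0; i < 400000; i++) {
            snprintf(name, sizeof name, "f%06d", i);
            int fd = open(name, O_CREAT | O_EXCL | O_WRONLY, 0644);
            if (fd < 0) { perror(name); return 1; }
            close(fd);
        }
        return 0;
    }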
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 07:28:09PM -0400, Matt Zimmerman wrote:
> On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:
> > On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
> > > Err, no. If the htree (hash tree) indexing feature is turned on for ext2 or ext3 filesystems, they will be returned sorted by the hash of the filename --- effectively a random order. (The hash also includes a random per-filesystem secret in order to avoid denial of service attacks by malicious users who might otherwise try to create huge numbers of files containing hash collisions.)
> > I can only presume this is new or obscure, since everything I tried had the traditional behaviour. Can't see how to turn it on, either.
> I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.

My /home has that enabled and readdir() returns files in creation order.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'                          | Imperial College,
   `-     -><-                  | London, UK
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 11:36:34PM +0100, Andrew Suffield wrote:
> On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
> > Err, no. If the htree (hash tree) indexing feature is turned on for ext2 or ext3 filesystems, they will be returned sorted by the hash of the filename --- effectively a random order. (The hash also includes a random per-filesystem secret in order to avoid denial of service attacks by malicious users who might otherwise try to create huge numbers of files containing hash collisions.)
> I can only presume this is new or obscure, since everything I tried had the traditional behaviour. Can't see how to turn it on, either.

I believe htree == dir_index, so tune2fs(8) and mke2fs(8) have the answer.

-- 
 - mdz
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 05:48:24PM -0400, Theodore Ts'o wrote:
> On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote:
> > On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
> > > Yes, when saying "random order" I obviously meant "in the order readdir() returns them". It's random for me. :-)))
> > > It can easily be different on different filesystems, or even on the same type of filesystem with different parameters (e.g. blocksize).
> > I can't think of any reason why changing the blocksize would affect this. Most filesystems return files in the sequence in which they were added to the directory. ext2, ext3, and reiser all do this; xfs is the only one likely to be used on a Debian system which doesn't.
> Err, no. If the htree (hash tree) indexing feature is turned on for ext2 or ext3 filesystems, they will be returned sorted by the hash of the filename --- effectively a random order. (The hash also includes a random per-filesystem secret in order to avoid denial of service attacks by malicious users who might otherwise try to create huge numbers of files containing hash collisions.)

I can only presume this is new or obscure, since everything I tried had the traditional behaviour. Can't see how to turn it on, either.

> I would be very, very surprised if reiserfs returned files in creation order.

Some trivial testing indicates that it does. Heck if I know how or why.

> It is a really, really bad assumption to assume that files will be returned in the same order as they were created.

However, there's no real need to - that was just an example. As long as the sequence is more or less stable (which it should be, for btrees; don't know about htree) then rsync won't be perturbed.

> > On ext2, as an example, stat()ting or open()ing a large directory of files in the order returned by readdir() will be vastly quicker than in some other sequence (like, say, bytewise lexicographic) due to the way in which the filesystem looks up inodes. This has caused significant performance issues for bugs.debian.org in the past.
> If you are using HTREE, and want to do a readdir() scan followed by something which opens or stat's all of the files, you will very badly want to sort the returned directory entries by inode number (de->d_inode). Otherwise, the order returned by readdir() will be effectively random, with the resulting loss of performance which you alluded to, because the filesystem needs to randomly seek and read all around the inode table.

Hmm, that's going to cause some trouble if htree becomes common. Is there any way to test for this at runtime?

> The good news is that this particular optimization of sorting by inode number should work for all filesystems, and should speed up xfs as well as ext2/3 with HTREE.

What about ext[23] without htree? Mucking with the order returned by readdir() has historically caused problems there...

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'                          | Imperial College,
   `-     -><-                  | London, UK
Re: A success story with apt and rsync
On Sun, 6 Jul 2003, Andrew Suffield wrote:
> On ext2, as an example, stat()ting or open()ing a large directory of files in the order returned by readdir() will be vastly quicker than in some other sequence (like, say, bytewise lexicographic) due to the way in which the filesystem looks up inodes. This has caused significant performance issues for bugs.debian.org in the past.

You're right, I didn't get this point in the story when I simply ran find using the sortdir wrapper, but now I understand the problem. However, I'm still unsure whether it is good to keep files unsorted, especially if we consider effective syncing of packages.

On my home computer I've never heard the sound of my disk during the package-creating phase (even though we've been using sortdir for more than half a year, and I've compiled hundreds of packages), but I do hear it when e.g. the source is decompressed. At the 'dpkg-deb --build' phase only the processor is the bottleneck. This might vary under different circumstances. I'm unaware of them in the case of Debian, e.g. I have no information about what hardware your packages are created on, whether there are any other cpu-intensive or disk-intensive applications running on these machines, etc. I can easily imagine that using sortdir can drastically decrease performance if another disk-intensive process is running. However, my experience didn't show a noticeable performance decrease when this was the only process accessing the disk...

But hey, let's stop for a minute :-) Building the package only uses the memory cache for most of the packages, doesn't it? The files it packs together have just recently been created, and there are not so many packages whose uncompressed size is close to or bigger than the amount of RAM in today's machines... And for the large packages the build itself might take thousands of times as much time as reading the files in sorted order.

Does anyone know what RPM does? I know that listing the contents of a package always produces alphabetical order, but I don't know whether the file list is sorted on the fly or the files really appear alphabetically in the cpio archive.

So I guess we've already seen the pros and cons of sorting the files. (One thing is missing: we still don't know how efficient rsync is if two rsyncable tar.gz files contain the same files but in a different order.) The decision is clearly not mine but the Debian developers'. However, if you ask me, I still vote for sorting the files :-))

bye, Egmont
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 10:12:03PM +0100, Andrew Suffield wrote:
> On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
> > Yes, when saying "random order" I obviously meant "in the order readdir() returns them". It's random for me. :-)))
> > It can easily be different on different filesystems, or even on the same type of filesystem with different parameters (e.g. blocksize).
> I can't think of any reason why changing the blocksize would affect this. Most filesystems return files in the sequence in which they were added to the directory. ext2, ext3, and reiser all do this; xfs is the only one likely to be used on a Debian system which doesn't.

Err, no. If the htree (hash tree) indexing feature is turned on for ext2 or ext3 filesystems, they will be returned sorted by the hash of the filename --- effectively a random order. (The hash also includes a random per-filesystem secret in order to avoid denial of service attacks by malicious users who might otherwise try to create huge numbers of files containing hash collisions.) I would be very, very surprised if reiserfs returned files in creation order.

The fundamental problem is that the readdir()/telldir()/seekdir() API is fundamentally busted. Yes, Dennis Ritchie and Ken Thompson do make mistakes, and have made many; in this particular case, they made a whopper. Seekdir()/telldir() assumes a linear directory structure which you can seek into, such that the results of readdir() are repeatable. Posix only allows files which are created or deleted in the interval to be undefined; all other files must be returned in the same order as the original readdir() stream, even if days or weeks elapse between the readdir(), telldir(), and seekdir() calls. Any filesystem which tries to use a B-tree-like scheme, where leaf nodes can be split, is going to have extreme problems trying to keep these guarantees. For this reason, most filesystem designers choose to return files in b-tree order, and *not* in the order in which files were added to the directory. It is a really, really bad assumption to assume that files will be returned in the same order as they were created.

> On ext2, as an example, stat()ting or open()ing a large directory of files in the order returned by readdir() will be vastly quicker than in some other sequence (like, say, bytewise lexicographic) due to the way in which the filesystem looks up inodes. This has caused significant performance issues for bugs.debian.org in the past.

If you are using HTREE, and want to do a readdir() scan followed by something which opens or stat's all of the files, you will very badly want to sort the returned directory entries by inode number (de->d_inode). Otherwise, the order returned by readdir() will be effectively random, with the resulting loss of performance which you alluded to, because the filesystem needs to randomly seek and read all around the inode table.

Why can't this be done in the kernel? Because if the directory is 200 megabytes, the kernel would need to allocate and hold on to 200 megabytes until userspace called closedir(). There is simply no lightweight way to work around the problems caused by the broken API which Ken Thompson and Dennis Ritchie designed.

The good news is that this particular optimization of sorting by inode number should work for all filesystems, and should speed up xfs as well as ext2/3 with HTREE.

- Ted
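(To make the suggested optimization concrete, here is a minimal sketch of the scan-then-sort-then-stat pattern Ted describes: slurp the directory with readdir(), sort the entries by d_ino, then stat them in that order so the walk through the inode table is mostly sequential. This is an illustration only, not code from this thread.)

    /* scan-sorted.c -- stat every entry of a directory in ascending
     * inode-number order.  Illustrative sketch. */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/stat.h>

    struct ent { ino_t ino; char name[256]; };

    static int by_ino(const void *a, const void *b)
    {
        ino_t x = ((const struct ent *)a)->ino;
        ino_t y = ((const struct ent *)b)->ino;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(path);
        if (!d) { perror(path); return 1; }

        /* pass 1: collect names and inode numbers from readdir() */
        struct ent *v = NULL;
        size_t n = 0, cap = 0;
        struct dirent *de;
        while ((de = readdir(d)) != NULL) {
            if (n == cap) {
                cap = cap ? cap * 2 : 1024;
                v = realloc(v, cap * sizeof *v);
                if (!v) { perror("realloc"); return 1; }
            }
            v[n].ino = de->d_ino;
            snprintf(v[n].name, sizeof v[n].name, "%s", de->d_name);
            n++;
        }
        closedir(d);

        /* pass 2: sort by inode number, then stat in that order,
         * giving one mostly-sequential sweep over the inode table */
        qsort(v, n, sizeof *v, by_ino);
        if (chdir(path) < 0) { perror("chdir"); return 1; }
        struct stat st;
        for (size_t i = 0; i < n; i++)
            if (stat(v[i].name, &st) < 0)
                perror(v[i].name);
        free(v);
        return 0;
    }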
Re: A success story with apt and rsync
Hi,

On 6 Jul 2003, Goswin Brederlow wrote:
> 2. most of the time you have no old file to rsync against. Only mirrors will have an old file and they already use rsync.

This is definitely true if you install your system from CDs and then upgrade it. However, if you keep on upgrading from testing/unstable then you'll have more and more packages under /var/cache/apt/archives, so there will be a better and better chance that an older version is found there. Or, alternatively, if you are sitting behind a slow modem and "apt-get upgrade" says it will upgrade "extremely-huge-package", then you can still easily insert your CD, copy the old version of "extremely-huge-package" to /var/cache/apt/archives and hit ENTER to apt-get afterwards.

> 3. rsyncing against the previous version is only possible via some dirty hack as an apt module. apt would have to be changed to provide modules access to its cache structure or at least pass any previous version as an argument. Some mirror scripts already use older versions as templates for new versions.

Yes, this is what I've hacked together based on other people's great work. It is (as I've said too) a dirty hack. If a more experienced apt coder can replace my hard-coded path with a mechanism that tells this path to the module, then this hack won't even be dirty.

> 4. (and this is the knockout) rsync support for apt-get is NOT WANTED. rsync uses too much resources (cpu and, more relevantly, IO) on the server side and widespread use of rsync for apt-get would choke the rsync mirrors and do more harm than good.

It might not be wanted by administrators; however, I guess it is wanted by many of the users (at least by me :-)). I don't see the huge load on the server (since I'm the only one rsyncing from it), but I see the huge difference in the download time. If my download weren't faster because of an overloaded server, I would switch back to FTP or anything that works better for me as an end user.

I understand that rsync causes a high load on the server when several users are connected, and so it is not suitable as a general replacement for ftp; however, I think it is suitable as an alternative. I also don't expect the Debian team itself to set up a public rsync server for the packages. However, some mirrors might want to set up an rsync server either for the public or, for example, a university for its students. A similar hack could simply be used by people who have an account on a machine with high bandwidth. For example, if I used Debian and Debian had rsyncable packages, but no public rsync server was available, I'd personally mirror Debian to a machine at the university using FTP and would use rsync from that server to my home machine to save traffic where the bandwidth is a bottleneck.

So I don't think it's a bad idea to set up some public rsync servers worldwide. The maximum number of connections can be capped so that cpu usage is limited somehow. It's obvious that if a user often gets his connection refused then he will switch back to ftp or http. Hence I guess that the capacity of the public rsync servers and the number of users using rsync would somehow be automatically balanced; it doesn't have to be coordinated centrally. So IMHO let anybody set up an rsync server if he wants to, and let the users use rsync if they want to (but don't put an rsync:// line in the default sources.list).

> All together I think an extended bittorrent module for apt-get is by far the better solution but it will take some more time and designing before it can be implemented.
It is very promising and I really hope that it will be a good protocol with a good implementation and integration into apt. But until this is realized, we could still have rsync as an alternative, if Debian packages were packed in a slightly different way.

bye, Egmont
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 10:28:07PM +0200, Koblinger Egmont wrote:
> On Sun, 6 Jul 2003, Andrew Suffield wrote:
> > It should put them in the package in the order they came from readdir(), which will depend on the filesystem. This is normally the order in which they were created, and should not vary when rebuilding. As such, sorting the list probably doesn't change the network traffic, but will slow dpkg-deb down on packages with large directories in them.
> Yes, when saying "random order" I obviously meant "in the order readdir() returns them". It's random for me. :-)))
> It can easily be different on different filesystems, or even on the same type of filesystem with different parameters (e.g. blocksize).

I can't think of any reason why changing the blocksize would affect this. Most filesystems return files in the sequence in which they were added to the directory. ext2, ext3, and reiser all do this; xfs is the only one likely to be used on a Debian system which doesn't.

> I even think it can be different after a simple rebuild in exactly the same environment. For example, configure and libtool like to create files with the PID in their name, which can take from 3 to 5 digits. If you create file X and then Y, remove X and then create Z, then it is most likely that if Z's name is shorter than or equal to the length of filename X, it will be returned first by readdir(), while if its name is longer, then Y will be returned first and Z afterwards. So I can imagine situations where the order of the files depends on the PIDs of the build processes.

This lengthy bit of handwaving has no connection with reality.

> However, I think sorting the files costs really nothing. My system is not a very new one, 375MHz Celeron, IDE disks, 384MB RAM etc... However:
> 
> /usr/lib$ du -s .
> 1,1G    .
> /usr/lib$ find . -type f | wc -l    # okay, it's now in memory cache
> 18598
> /usr/lib$ time find . >/dev/null 2>&1
> real    0m0.285s
> user    0m0.100s
> sys     0m0.150s
> [EMAIL PROTECTED]:/usr/lib$ time sortdir find . >/dev/null 2>&1
> real    0m1.683s
> user    0m1.390s
> sys     0m0.250s
> 
> IMHO a step which takes one and a half seconds before compressing 18000 files totalling more than 1 gigabyte shouldn't be a problem.

This test only shows that you don't understand what is going on; it has no relation to the problems that can occur. On ext2, as an example, stat()ting or open()ing a large directory of files in the order returned by readdir() will be vastly quicker than in some other sequence (like, say, bytewise lexicographic) due to the way in which the filesystem looks up inodes. This has caused significant performance issues for bugs.debian.org in the past.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'                          | Imperial College,
   `-     -><-                  | London, UK
Re: A success story with apt and rsync
On Sun, 6 Jul 2003, Andrew Suffield wrote:
> It should put them in the package in the order they came from readdir(), which will depend on the filesystem. This is normally the order in which they were created, and should not vary when rebuilding. As such, sorting the list probably doesn't change the network traffic, but will slow dpkg-deb down on packages with large directories in them.

Yes, when saying "random order" I obviously meant "in the order readdir() returns them". It's random for me. :-)))

It can easily be different on different filesystems, or even on the same type of filesystem with different parameters (e.g. blocksize). I even think it can be different after a simple rebuild in exactly the same environment. For example, configure and libtool like to create files with the PID in their name, which can take from 3 to 5 digits. If you create file X and then Y, remove X and then create Z, then it is most likely that if Z's name is shorter than or equal to the length of filename X, it will be returned first by readdir(), while if its name is longer, then Y will be returned first and Z afterwards. So I can imagine situations where the order of the files depends on the PIDs of the build processes.

However, I guess our goal is not only to produce similar packages from exactly the same source. It's quite important to produce a similar package even after a version upgrade. For example, you have a foobar-0.9 package and now upgrade to foobar-1.0. The author may have completely rewritten the Makefile, which yields nearly the same executables and the same data files, but a completely different "random" order.

However, I think sorting the files costs really nothing. My system is not a very new one, 375MHz Celeron, IDE disks, 384MB RAM etc... However:

    /usr/lib$ du -s .
    1,1G    .
    /usr/lib$ find . -type f | wc -l    # okay, it's now in memory cache
    18598
    /usr/lib$ time find . >/dev/null 2>&1
    real    0m0.285s
    user    0m0.100s
    sys     0m0.150s
    [EMAIL PROTECTED]:/usr/lib$ time sortdir find . >/dev/null 2>&1
    real    0m1.683s
    user    0m1.390s
    sys     0m0.250s

IMHO a step which takes one and a half seconds before compressing 18000 files totalling more than 1 gigabyte shouldn't be a problem.

cheers, Egmont
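(For readers who haven't met it, the sortdir trick is an LD_PRELOAD shim that interposes readdir() and hands the entries back in alphabetical order, so an unmodified dpkg-deb or find sees a sorted directory. What follows is only a minimal reconstruction of the idea, not the actual sortdir library referenced in this thread; it caches a single DIR* stream at a time, and on an LFS build you would also need to interpose readdir64().)

    /* sortdir-lite.c -- toy LD_PRELOAD shim making readdir() return
     * entries in alphabetical order.  A sketch of the idea only, NOT
     * the real sortdir library.
     * Build: gcc -shared -fPIC -o sortdir-lite.so sortdir-lite.c -ldl
     * Use:   LD_PRELOAD=./sortdir-lite.so find . */
    #define _GNU_SOURCE
    #include <dirent.h>
    #include <dlfcn.h>
    #include <stdlib.h>
    #include <string.h>

    static struct dirent *(*real_readdir)(DIR *);
    static DIR *cached;              /* stream the cache belongs to */
    static struct dirent **ents;     /* sorted copies of its entries */
    static size_t count, next_idx;

    static int by_name(const void *a, const void *b)
    {
        return strcmp((*(struct dirent *const *)a)->d_name,
                      (*(struct dirent *const *)b)->d_name);
    }

    struct dirent *readdir(DIR *dirp)
    {
        if (!real_readdir)
            real_readdir = (struct dirent *(*)(DIR *))
                               dlsym(RTLD_NEXT, "readdir");

        if (dirp != cached) {        /* new stream: slurp, sort, cache */
            for (size_t i = 0; i < count; i++)
                free(ents[i]);
            free(ents);
            ents = NULL;
            count = next_idx = 0;

            struct dirent *e;
            while ((e = real_readdir(dirp)) != NULL) {
                ents = realloc(ents, (count + 1) * sizeof *ents);
                ents[count] = malloc(sizeof *e);
                memcpy(ents[count], e, sizeof *e);
                count++;
            }
            qsort(ents, count, sizeof *ents, by_name);
            cached = dirp;
        }
        return next_idx < count ? ents[next_idx++] : NULL;
    }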
Re: A success story with apt and rsync
On Sun, Jul 06, 2003 at 12:37:00PM +1200, Corrin Lakeland wrote:
> > 4. (and this is the knockout) rsync support for apt-get is NOT WANTED. rsync uses too much resources (cpu and, more relevantly, IO) on the server side and widespread use of rsync for apt-get would choke the rsync mirrors and do more harm than good.
> When I was looking into this I heard about some work on caching the rolling checksums to eliminate server load. I didn't find any code.

That would be because the checksums would take at least 8 times the space of the original files. You need backward-rsync, which was patented, last I heard.

-- 
Martijn van Oosterhout   http://svana.org/kleptog/
> "the West won the world not by the superiority of its ideas or values or religion but rather by its superiority in applying organized violence. Westerners often forget this fact, non-Westerners never do."
> - Samuel P. Huntington
Re: A success story with apt and rsync
On Sun, 2003-07-06 at 09:27, Goswin Brederlow wrote:
> 4. (and this is the knockout) rsync support for apt-get is NOT WANTED. rsync uses too much resources (cpu and, more relevantly, IO) on the server side and widespread use of rsync for apt-get would choke the rsync mirrors and do more harm than good.

One way to alleviate this would be to generate the deltas only once on the server side when first requested, then cache them on disk to be served out like any other static file, so the client can reconstruct the new package using rsync. I've been thinking for a while about trying to build this into Apt-cacher.

Jonathan
Re: A success story with apt and rsync
On Sunday 06 July 2003 11:27, Goswin Brederlow wrote:
> Koblinger Egmont <[EMAIL PROTECTED]> writes:
> > Hi,
> > From time to time the question arises on different forums whether it is possible to efficiently use rsync with apt-get. Recently there has been a thread here on debian-devel and it was also mentioned in Debian Weekly News, June 24th, 2003. However, I only saw different small parts of a huge and complex problem set discussed at different places, I haven't found an overview of the whole situation anywhere.
> ...
> Let's summarize what I still remember:
> 2. most of the time you have no old file to rsync against. Only mirrors will have an old file and they already use rsync.

/var/cache/apt/ ?

> 4. (and this is the knockout) rsync support for apt-get is NOT WANTED. rsync uses too much resources (cpu and, more relevantly, IO) on the server side and widespread use of rsync for apt-get would choke the rsync mirrors and do more harm than good.

When I was looking into this I heard about some work on caching the rolling checksums to eliminate server load. I didn't find any code.

> Doogie is thinking about extending the Bittorrent protocol for use as an apt-get method. I talked with him on irc about some design ideas and so far it looks really good if he can get some mirrors to host it.

Sounds interesting. bittorrent allocates people to peer off in a round-robin fashion, which is really stupid. If two people have similar IPs they should make better peers.

> Via another small extension rolling checksums for each block could be included in the protocol and a client-side rsync can be done. (I heard this variant of rsync would be patented in the US but never saw real proof of it.)

Likewise on both counts.

Corrin
Re: A success story with apt and rsync
On 6 Jul 2003, Goswin Brederlow wrote:
> Doogie is thinking about extending the Bittorrent protocol for use as an apt-get method. I talked with him on irc about some design ideas and so far it looks really good if he can get some mirrors to host it.

My plan is to require no additional software to be installed on any server. This means all files will be pre-generated, and mirrored. This also means that a tracker won't be available on that particular mirror, but the block checksums will still be available.

> The bittorrent protocol organises multiple downloaders so that they also upload to each other and thereby reduces the traffic on the main server. The extension of the protocol should also utilise http/ftp mirrors as sources for the files, thereby spreading the load over multiple servers evenly.

What this means is that clients will be able to fetch blocks from normal http and ftp mirrors. This will be used to start fetching data before connections have been opened with peers.

> Bittorrent calculates a hash for each block of a file, very similar to what rsync needs to work. Via another small extension rolling checksums for each block could be included in the protocol and a client-side rsync can be done. (I heard this variant of rsync would be patented in the US but never saw real proof of it.)
> All together I think an extended bittorrent module for apt-get is by far the better solution but it will take some more time and designing before it can be implemented.

Also, for better sharing, users will have the option of leaving a running server on their machines. Additionally, part of my work will include extensions to the tracker to support tracker peers and tracker clusters.

Another extension concerns which tracker to use. When fetching the .torrent meta-data, my client will attempt to contact a tracker on the server the .torrent resides on. If none is found, it'll fall back to the one encoded in the .torrent. This provides for localization of connections, and better latency.
Re: A success story with apt and rsync
Koblinger Egmont <[EMAIL PROTECTED]> writes:
> Hi,
> From time to time the question arises on different forums whether it is possible to efficiently use rsync with apt-get. Recently there has been a thread here on debian-devel and it was also mentioned in Debian Weekly News, June 24th, 2003. However, I only saw different small parts of a huge and complex problem set discussed at different places, I haven't found an overview of the whole situation anywhere.
> ...

I worked on an rsync patch for apt-get some years ago and raised some design questions, some the same as you did in the deleted parts. Let's summarize what I still remember:

1. debs are gzipped, so any change (even a change in timestamps) results in a different gzip. The rsyncable patch for gzip helps a lot there. So let's consider that fixed.

2. most of the time you have no old file to rsync against. Only mirrors will have an old file and they already use rsync.

3. rsyncing against the previous version is only possible via some dirty hack as an apt module. apt would have to be changed to give modules access to its cache structure or at least pass any previous version as an argument. Some mirror scripts already use older versions as templates for new versions.

4. (and this is the knockout) rsync support for apt-get is NOT WANTED. rsync uses too much resources (cpu and, more relevantly, IO) on the server side, and widespread use of rsync for apt-get would choke the rsync mirrors and do more harm than good.

> conclusion
> --
> The good news is that it is working perfectly.
> The bad news is that you can't hack it on your home computer as long as your distribution doesn't provide rsync-friendly packages. Maybe one could set up a public rsync server with high bandwidth that keeps syncing the official packages and repacks them with rsync-friendly gzip/zlib and sorting the files.

There is a growing lobby to use gzip --rsyncable for debian packages by default. It's coming.

So what can be done? Doogie is thinking about extending the Bittorrent protocol for use as an apt-get method. I talked with him on irc about some design ideas and so far it looks really good if he can get some mirrors to host it.

The bittorrent protocol organises multiple downloaders so that they also upload to each other, thereby reducing the traffic on the main server. The extension of the protocol should also utilise http/ftp mirrors as sources for the files, thereby spreading the load over multiple servers evenly.

Bittorrent calculates a hash for each block of a file, very similar to what rsync needs to work. Via another small extension, rolling checksums for each block could be included in the protocol and a client-side rsync can be done. (I heard this variant of rsync would be patented in the US but never saw real proof of it.)

Altogether I think an extended bittorrent module for apt-get is by far the better solution, but it will take some more time and design before it can be implemented.

MfG Goswin
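(To make "rolling checksums for each block" concrete: rsync's weak checksum is a pair of running sums over a window of bytes, and sliding the window one byte costs O(1), which is what lets a client find matching blocks at any offset in a file. Below is an illustrative sketch of that checksum, following the scheme described in the rsync technical report; it is not code from this thread, and the window size is arbitrary.)

    /* rolling.c -- rsync-style weak rolling checksum, illustrative sketch.
     * a = plain sum of the window's bytes, b = position-weighted sum;
     * the 32-bit digest packs the low 16 bits of each. */
    #include <stdint.h>
    #include <stdio.h>

    struct rollsum { uint32_t a, b; size_t len; };

    /* checksum of buf[0..len-1], computed from scratch */
    static void rs_init(struct rollsum *s, const uint8_t *buf, size_t len)
    {
        s->a = s->b = 0;
        s->len = len;
        for (size_t i = 0; i < len; i++) {
            s->a += buf[i];
            s->b += (uint32_t)(len - i) * buf[i];
        }
    }

    /* slide the window right by one byte: drop `out`, take in `in` */
    static void rs_roll(struct rollsum *s, uint8_t out, uint8_t in)
    {
        s->a += in - out;                      /* new plain sum    */
        s->b += s->a - (uint32_t)s->len * out; /* new weighted sum */
    }

    static uint32_t rs_digest(const struct rollsum *s)
    {
        return (s->a & 0xffff) | (s->b << 16);
    }

    int main(void)
    {
        const uint8_t data[] = "the quick brown fox jumps over the lazy dog";
        const size_t W = 16;                   /* block/window size */
        struct rollsum s, check;

        rs_init(&s, data, W);
        printf("window 0: %08x\n", rs_digest(&s));

        /* rolling must agree with recomputing each window from scratch */
        for (size_t i = 1; i + W <= sizeof data - 1; i++) {
            rs_roll(&s, data[i - 1], data[i + W - 1]);
            rs_init(&check, data + i, W);
            printf("window %zu: %08x %s\n", i, rs_digest(&s),
                   rs_digest(&s) == rs_digest(&check) ? "(ok)" : "(MISMATCH)");
        }
        return 0;
    }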
Re: A success story with apt and rsync
On Sat, Jul 05, 2003 at 11:56:41PM +0200, Koblinger Egmont wrote:
> order of files
> 
> dpkg-deb puts the files in the .deb package in random order. I hate this misfeature since it's hard to eye-grep anything from ``dpkg -L'' or F3 in mc. We run ``dpkg-deb --build'' using the sortdir library ([4a], [4b]), which makes the files appear in the package in alphabetical order. I don't know how efficient rsync is if you split a file into some dozens or even hundreds of parts and shuffle them, and then synchronize this with the original version. Anyway, I'm sure that sorting the files cannot hurt rsync, it can only help. I only guess that it really does help a lot.

It should put them in the package in the order they came from readdir(), which will depend on the filesystem. This is normally the order in which they were created, and should not vary when rebuilding. As such, sorting the list probably doesn't change the network traffic, but will slow dpkg-deb down on packages with large directories in them.

-- 
  .''`.  ** Debian GNU/Linux ** | Andrew Suffield
 : :' :  http://www.debian.org/ | Dept. of Computing,
 `. `'                          | Imperial College,
   `-     -><-                  | London, UK