Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-08 Thread Robert Tiberius Johnson
On Sun, 2002-04-07 at 19:36, Adam Heath wrote:
 54 days' worth of Packages (sid/main/i386) gives 900k of xdeltas.

Thanks for the info.  I think that keeping 54 days of diffs (or xdeltas)
is unnecessary -- most of the benefit is accrued by keeping only 20 days
or so.  But I need real stats on frequency of updates to verify this for
certain.

 The problem with xdelta, though, is that it requires both the old and new
 versions to be available on the same side of the link, to do its magic.

That's true, but I don't see a problem with that.  If I understand
Debian correctly, occasionally a master server scans over all the newly
uploaded packages and produces a new Packages file.  At that time, the
server will have a copy of the old and new Packages files, and can
produce the delta, placing it in a file whose name is based on the hash
of the old Packages file.  The master server can then delete the oldest
delta file and replace the old Packages file with the new one.  Mirrors
pick up these changes, and clients with the old Packages file will then
be able to download the delta, since they can compute the same hash.
Let me know if I got something wrong here.
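A minimal sketch of that server-side step in Python: the delta is a unified diff compressed with bzip2, and its filename embeds the MD5 of the *old* file, so a client holding that file can compute the same name.  (The filename pattern and the use of difflib here are illustrative assumptions, not an existing tool.)

```python
import bz2
import difflib
import hashlib

def make_delta(old_packages: bytes, new_packages: bytes) -> tuple:
    """Build a compressed diff from the old Packages file to the new one.

    Returns (filename, delta_bytes).  The filename is derived from the
    MD5 of the old file, so any client holding that exact file can
    request the delta without asking the server anything first.
    """
    old_lines = old_packages.decode().splitlines(keepends=True)
    new_lines = new_packages.decode().splitlines(keepends=True)
    diff = "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile="Packages.old",
                                        tofile="Packages.new"))
    name = "Packages_diff_%s.bz2" % hashlib.md5(old_packages).hexdigest()
    return name, bz2.compress(diff.encode())
```

The master server would run this once per Packages rebuild and drop the oldest delta, as described above.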

Best,
Rob



-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]




Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Martijn van Oosterhout
On Sat, Apr 06, 2002 at 10:19:21AM -0500, Jeff Licquia wrote:
 On Sat, 2002-04-06 at 03:13, Otto Wyss wrote:
  Please show us any figures first before you assert this.
  
  I know rsync imposes some load for the computing of the md5sums, but
  sending only the difference outweighs it repeatedly. 
 
 It's my understanding that rsync imposes a large computational burden on
 the server in exchange for a large bandwidth savings.  At a certain
 number of rsync clients, this burden can become too onerous for the
 server to handle.
 
 Also, the benefits almost all accrue to the client.  The server gains a
 small benefit (bandwidth savings), and pays a cost that's both high and
 hard to manage.  (Our users wouldn't stand for connection limits, I
 don't think.)
 
 I don't have any figures to show to prove this.  Then again, neither do
 you, so I guess we're even.

A large mirror in Australia does provide an rsync server to access Debian
packages.  When Red Hat 7.0 came out, so many people tried to rsync it at the
same time that the machine promptly fell over. 

It's not clear how low the connection limit is now, but it's low enough to be
irritating.

Almost all the processing for rsync is on the server side, and the
server-friendly variant is patented or something (IIRC).
-- 
Martijn van Oosterhout kleptog@svana.org   http://svana.org/kleptog/
 Ignorance continues to thrive when intelligent people choose to do
 nothing.  Speaking out against censorship and ignorance is the imperative
 of all intelligent people.






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Otto Wyss
 A large mirror in Australia does provide an rsync server to access Debian
 packages.  When Red Hat 7.0 came out, so many people tried to rsync it at the
 same time that the machine promptly fell over. 
 
What amazes me is that nobody is able or willing to provide any figures.
So I guess no provider of an rsync server is interested in this subject
and therefore it can't be a big problem. 

I'm asking any provider of an FTP/rsync Debian server whether comparable
figures could be extracted from the server logs, or whether anyone could
measure how much CPU load the download of the Packages/Packages.gz files
really imposes.

O. Wyss

-- 
Author of Debian partial mirror synch script
(http://dpartialmirror.sourceforge.net/)






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Robert Tiberius Johnson
On Sun, 2002-04-07 at 11:16, Otto Wyss wrote:
 What amazes me is that nobody is able or willing to provide any figures.
 So I guess no provider of an rsync server is interested in this subject
 and therefore it can't be a big problem. 

Here are some experiments, and a mathematical analysis of different
approaches.  The only missing piece of data that I had to fudge is: how
often do people run apt-get update?  If anyone can give me server logs
containing If-Modified-Since fields, that would be great.

Quick Summary:
--
Diffs compressed with bzip2 generated the smallest difference files, and
hence the smallest downloads.  Using the scheme described below, I
estimate that mirrors retaining 20 days' worth of diffs will need about
159K of disk space per Packages file and the average 'apt-get update'
will transfer about 24.2K per changed Packages file.

xdelta would have slightly higher bandwidth and disk space requirements,
but would be applicable to binary files (such as debs).  rsync has no
disk space requirements, but uses 10 times as much bandwidth and
requires more memory, cpu power, etc, on the server.  rsync also has the
advantage of already being implemented, but managing a series of diffs
seems like a trivial shell script.

So in my opinion, diff/bzip2 or xdelta looks like the way to go.

Example: Diffs between unstable main Packages file from Apr 6 and 7:
---
diff+bzip2: 12987 bytes
diff+gzip:  13890 bytes
xdelta: 15176 bytes
rsync: 163989 bytes (*)
(*) rsyncing uncompressed Packages files with 
rsync -a --no-whole-file --stats -z -e ssh

The Scheme (proposed earlier by others)
--
Assuming Debian admins tend to update relatively frequently, the
following diff-based scheme seems to offer the best compromise between
mirror disk space and download bandwidth:

For the 20 most recent Packages files, the server stores the diff
between each pair of consecutive Packages files.  apt-get then simply
does:

do
{
  m = md5sum of my Packages file
  d = fetch Packages_diff_${m}.bz2
  if (d does not exist)
  {
    fetch Packages.gz
    break
  }
  patch my Packages file with d
} while (d is not an empty file)
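The loop above can be sketched as a small Python function.  The fetch and patch operations are injected as callables, since apt's actual transport and a diff applier are outside the scope of this sketch; the filename pattern is the same assumption as above.

```python
import bz2
import hashlib

def update_packages(packages: bytes, fetch, apply_patch) -> bytes:
    """Follow the chain of per-version diffs until current.

    `fetch(name)` returns a file's bytes, or None if the mirror lacks
    it; `apply_patch(old, diff)` returns the patched Packages file.
    Falls back to the full Packages file when no diff exists for the
    version we hold.
    """
    while True:
        md5 = hashlib.md5(packages).hexdigest()
        delta = fetch("Packages_diff_%s.bz2" % md5)
        if delta is None:
            # Our version is too old (or unknown): full download.
            full = fetch("Packages.gz")
            return full if full is not None else packages
        diff = bz2.decompress(delta)
        if not diff:
            return packages  # empty diff: we are already current
        packages = apply_patch(packages, diff)
```

A mirror that has deleted some older diffs simply triggers the fallback branch, which is the bandwidth penalty mentioned below.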

This scheme easily allows mirrors to tweak the parameters to best suit
their own disk space and bandwidth limitations, and they are not
required to have any cgi-scripts or extra services running.  For
example, a mirror that's tight on disk space can just delete some of the
older diffs, but it will incur a slight bandwidth penalty as a result.

The only disadvantage to using diffs (compared to rsync or some other
dynamic scheme) is the additional disk space requirement.  The disk
space requirement is very small, and disk space is cheaper than cpu
time, memory, and bandwidth.

Analysis:
-

The analysis uses gp, a great math tool that's available in Debian.

I. Diff vs. xdelta
--
By looking at debian-devel-changes, I figured that between Feb. 1 and
April 1, an average of 75 packages were uploaded each day.  There are
around 8000 packages listed in testing main, so the probability that any
given package changes on any given day is p=75/8000.  Thus the expected
number of packages that change in s days is (1-(1-p)^s)*8000.  For
example, the expected number of changed packages in 60 days is 3453. 
Comparing the Packages files from Feb. 7 and April 7 shows 3884 changed
packages, so the model seems reasonably accurate.

My experiments with diff, xdelta, bzip2, etc. concluded that if you diff
two Packages files with 75 changed packages between them, and then
compress the diff with bzip2 -9, the resulting file is about 7936 bytes,
or roughly 106 bytes per changed package.  Thus the average size of a
compressed diff between Packages files separated by s days is

diffsize(s)=(1-(1-p)^s)*8000*106

The xdelta of the same files is about 25% larger.  If this scheme is
extended to all deb files, not just Packages files, it may just be more
convenient to use xdelta, though.
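As a quick numerical check of the model (the constants 75/8000 and 106 bytes per changed package are the measurements quoted above):

```python
def expected_changed(s: float, p: float = 75 / 8000, n: int = 8000) -> float:
    """Expected number of packages changed over s days: each of n
    packages changes on any given day with probability p."""
    return (1 - (1 - p) ** s) * n

def diffsize(s: float) -> float:
    """Estimated size in bytes of a bzip2-compressed diff spanning
    s days, at roughly 106 bytes per changed package."""
    return expected_changed(s) * 106
```

expected_changed(60) reproduces the 3453 figure quoted above, and diffsize(1) comes out near the ~7.9K measured for a one-day diff.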

II. Successive diffs vs. all-at-once diffs
--
This analysis applies to either diff or xdelta; it doesn't matter.

A. Disk space
The next question is, should we diff consecutive Packages files, or
should we compute diffs between the last 20 Packages files and today's
Packages file?  The latter will allow apt-get to fetch just one diff and
be done with it.  However, it uses more disk space on the mirrors.  The
former may require apt-get to fetch several patches in order to update
its Packages file, but will use less disk space on the mirrors.

There is actually a spectrum of choices here.  A mirror may store diffs
between every 3, 4, 5, etc. Packages files.  So, if a client has the
Packages file from 14 days ago, it will first be given a patch bringing
its Packages file to 9 days ago, then 4, and then to the current
Packages file.  If a server stores diffs between Packages files
separated by s days, and stores d days back, then it will need 


Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Richard Atterer
On Sun, Apr 07, 2002 at 08:16:28PM +0200, Otto Wyss wrote:
 What amazes me is that nobody is able or willing to provide any
 figures. So I guess no provider of an rsync server is interested in
 this subject and therefore it can't be a big problem.

It is a problem on cdimage.d.o, which is also ftp.uk.d.o. A single CD
image rsync means a load of 1 for 10 minutes, and once 5 people or so
rsync in parallel, the machine gets quite sluggish.

Using rsync just for the Packages file would probably work, but forget
about also using it for the packages.

Cheers,

  Richard

-- 
  __   _
  |_) /|  Richard Atterer |  CS student at the Technische  |  GnuPG key:
  | \/¯|  http://atterer.net  |  Universität München, Germany  |  0x888354F7
  ¯ '` ¯






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Jeff Licquia
On Sun, 2002-04-07 at 13:16, Otto Wyss wrote:
  A large mirror in Australia does provide an rsync server to access Debian
  packages.  When Red Hat 7.0 came out, so many people tried to rsync it at the
  same time that the machine promptly fell over. 
  
 What amazes me is that nobody is able or willing to provide any figures.
 So I guess no provider of an rsync server is interested in this subject
 and therefore it can't be a big problem. 

...or, more likely, they are too busy maintaining their rsync servers to
respond (or even follow the traffic on a list like this one).

The rest of us are trying to impress upon you the possibility that it
might be a big problem, as we've heard that it is in the past.  As
flimsy as anecdotal evidence is, it certainly beats proof by assertion.






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Nathan E Norman
On Sun, Apr 07, 2002 at 09:11:27PM -0500, Jeff Licquia wrote:
 On Sun, 2002-04-07 at 13:16, Otto Wyss wrote:
   A large mirror in Australia does provide an rsync server to access Debian
   packages.  When Red Hat 7.0 came out, so many people tried to rsync it at the
   same time that the machine promptly fell over. 
   
  What amazes me is that nobody is able or willing to provide any figures.
  So I guess no provider of an rsync server is interested in this subject
  and therefore it can't be a big problem. 
 
 ...or, more likely, they are too busy maintaining their rsync servers to
 respond (or even follow the traffic on a list like this one).
 
 The rest of us are trying to impress upon you the possibility that it
 might be a big problem, as we've heard that it is in the past.  As
 flimsy as anecdotal evidence is, it certainly beats proof by assertion.

Agreed.  I used to run debian.midco.net (which sadly no longer exists
now that I no longer work at midco.net).  That machine was a dual
processor PII with 70 GB of RAID disk; it was a news server for a
while before it was pressed into service as a mirror.  IOW, it was a
decent machine in its day.  d.m.n was a primary push mirror and
provided anon rsync access to the world, but with a 15 connection
limit.  Any more than that and Apache became resource-starved, and
when you're trying to act as a primary HTTP mirror for apt, that's not
good.

I don't have stats as d.m.n has been dead for almost two years now,
but I can assure you that rsync, while quite cool, can be dangerous
in large doses.

Regards,

-- 
Nathan Norman - Micromuse Ltd.  mailto:[EMAIL PROTECTED]
Gil-galad was an Elven-king.|  The Fellowship
Of him the harpers sadly sing:  |of
the last whose realm was fair and free  | the Ring
between the Mountains and the Sea.  |  J.R.R. Tolkien




Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Adam Heath
On 7 Apr 2002, Robert Tiberius Johnson wrote:

 On Sun, 2002-04-07 at 11:16, Otto Wyss wrote:
  What amazes me is that nobody is able or willing to provide any figures.
  So I guess no provider of an rsync server is interested in this subject
  and therefore it can't be a big problem.

 Here are some experiments, and a mathematical analysis of different
 approaches.  The only missing piece of data that I had to fudge is: how
 often do people run apt-get update?  If anyone can give me server logs
 containing If-Modified-Since fields, that would be great.

 Quick Summary:
 --
 Diffs compressed with bzip2 generated the smallest difference files, and
 hence the smallest downloads.  Using the scheme described below, I
 estimate that mirrors retaining 20 days' worth of diffs will need about
 159K of disk space per Packages file and the average 'apt-get update'
 will transfer about 24.2K per changed Packages file.

54 days' worth of Packages (sid/main/i386) gives 900k of xdeltas.

 xdelta would have slightly higher bandwidth and disk space requirements,
 but would be applicable to binary files (such as debs).  rsync has no
 disk space requirements, but uses 10 times as much bandwidth and
 requires more memory, cpu power, etc, on the server.  rsync also has the
 advantage of already being implemented, but managing a series of diffs
 seems like a trivial shell script.

xdelta would not work with debs.  It doesn't understand archives.

xdelta does understand already-compressed files, and will actually decompress
the files first, before generating the diff.

The problem with xdelta, though, is that it requires both the old and new
versions to be available on the same side of the link, to do its magic.

Is someone interested in modifying xdelta to read archives (ar, tar, cpio)?






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-07 Thread Adam Heath
On 7 Apr 2002, Robert Tiberius Johnson wrote:

 On Sun, 2002-04-07 at 11:16, Otto Wyss wrote:
  What amazes me is that nobody is able or willing to provide any figures.
  So I guess no provider of an rsync server is interested in this subject
  and therefore it can't be a big problem.

Btw, thanks for this very good analysis.  It's going to be very helpful when I
implement all this.






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-06 Thread Otto Wyss
  Some questions that need to be asked:
  How many of our mirrors are rsyncable?
 How much load can the servers handle?
 How much more load does rsync do than a fast http server like tux?
 
Please show us any figures first before you assert this.

I know rsync imposes some load for the computing of the md5sums, but
sending only the difference outweighs it repeatedly. 

O. Wyss

-- 
Author of Debian partial mirror synch script
(http://dpartialmirror.sourceforge.net/)






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-06 Thread Jeff Licquia
On Sat, 2002-04-06 at 03:13, Otto Wyss wrote:
 Please show us any figures first before you assert this.
 
 I know rsync imposes some load for the computing of the md5sums, but
 sending only the difference outweighs it repeatedly. 

It's my understanding that rsync imposes a large computational burden on
the server in exchange for a large bandwidth savings.  At a certain
number of rsync clients, this burden can become too onerous for the
server to handle.

Also, the benefits almost all accrue to the client.  The server gains a
small benefit (bandwidth savings), and pays a cost that's both high and
hard to manage.  (Our users wouldn't stand for connection limits, I
don't think.)

I don't have any figures to show to prove this.  Then again, neither do
you, so I guess we're even.


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]




Rsyncable GZIP (was Re: Package metadata server)

2002-04-05 Thread Rob Bradford
On Fri, 2002-04-05 at 12:37, Glenn McGrath wrote:
 (picked up from http://www.debianplanet.org/article.php?sid=633)
 

Hehe.

 The current method of checking for updates is to retrieve a new
 Packages.gz file and discard the old Packages.gz file. The problem with
 this method is that commonly less than 1% of the Packages.gz file has
 changed. A number of solutions have been proposed to overcome this
 problem, these include - Compressing the Packages.gz in an rsync friendly
 manner.

I definitely like the idea of rsyncable gzipping; I think it's
something that could be relatively easily implemented following the woody
release.  I don't think we should use this for all our gzipped files in
the archive, but the Packages.gz file would be an excellent place to use
it.

Some questions that need to be asked:
How much extra work would this involve?
How many of our mirrors are rsyncable?
Is there *really* a benefit?
What is the phase of the moon?

I know this has come up before but lets come to a conclusion.

Cheers
-- 
Rob 'robster' Bradford
http://robster.org.uk




Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-05 Thread Erich Schubert
 Some questions that need to be asked:
 How many of our mirrors are rsyncable?
How much load can the servers handle?
How much more load does rsync do than a fast http server like tux?

I think the proposed way of providing diffs for certain common
versions is much nicer.  Sending a request for
Packages-diff-mymd5sum.gz
and fetching that file is certainly faster and causes much less load
on the server than rsync.  It will actually even require less data to
be transferred.
When the file doesn't exist, I can always fall back to fetching the whole
file, or maybe rsync then...

Keeping 14 daily diffs probably takes just a few megs on the server.
Providing rsync will probably need as much in main memory...

But we HAD this discussion already a few days ago...

Greetings,
Erich






Re: Rsyncable GZIP (was Re: Package metadata server)

2002-04-05 Thread Rob Bradford
 I think the proposed way of providing diffs for certain common
 versions is much nicer. Sending a request for
 Packages-diff-mymd5sum.gz

I like the sound of this way, as you say rsync could cause load problems
on the servers. Now someone just needs to hack apt =)

Cheers
-- 
Rob 'robster' Bradford
http://robster.org.uk

