Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-03-02 Thread Jonathan Dieter
On Fri, 2018-03-02 at 13:18 +0100, Vít Ondruch wrote:
> Jonathan,
> 
> Have you experimented with casync [1]?

Yes, I did.  Unfortunately, casync creates thousands of tiny files,
which adds load to the mirrors, and downloading even a few hundred of
them takes significantly longer than using HTTP ranges.

Jonathan



Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-03-02 Thread Vít Ondruch
Jonathan,

Have you experimented with casync [1]?

Vít


[1] https://github.com/systemd/casync



On 13.2.2018 at 10:52, Igor Gnatenko wrote:
> CCing the rpm-ecosystem@ ML since it's the main location where this message
> should have gone 😉
>
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> > 
> > I've come up with a method of splitting repodata into chunks that can
> > be downloaded and combined with chunks that are already on the local
> > system to create a byte-for-byte copy of the compressed repodata.
> > Tools and scripts are at:
> > https://www.jdieter.net/downloads/
> > 
>
> > Background:
> > With DNF, we're currently downloading ~20MB of repository data every
> > time the updates repository changes.
>
> > When casync was released, I wondered if we could use it to only
> > download the deltas for the repodata.  At Flock last summer, I ran some
> > tests against the uncompressed repodata and saw a reduction of 30-40%
> > from one day to the next, which seemed low, but was a good starting
> > point.
>
> > Unfortunately, due to the way casync separates each file into thousands
> > of compressed chunks, building each file required thousands of (serial)
> > downloads which, even on a decent internet connection, took *forever*.
>
> > When I talked through the idea with Kevin and Patrick, they also
> > pointed out that our mirrors might not be too keen on the idea of
> > adding thousands of tiny files that change every day.
>
>
> > The Solution(?):
> > One potential solution to the "multitude of files" problem is to merge
> > the chunks back into a single file, and use HTTP ranges to only
> > download the parts of the file we want.  An added bonus is that most
> > web servers are configured to support hundreds of ranges in one
> > request, which greatly reduces the number of requests we have to make.
>
> > The other problem with casync is that its chunk separation is naïve,
> > which is why we were only achieving 30-40% savings.  But we know what
> > the XML file is supposed to look like, so we can separate the chunks on
> > the tag boundaries in the XML.
>
> > So I've ditched casync altogether and put together a proof-of-concept
> > (tentatively named zchunk) that takes an XML file, compresses each tag
> > separately, and then concatenates all of them into one file.  The tool
> > also creates an index file that tells you the sha256sum for each
> > compressed chunk and the location of the chunk in the file.
>
> > I've also written a small script that will download a zchunk off the
> > internet.  If you don't specify an old file, it will just download
> > everything, but if you specify an old file, it will download the index
> > of the new file and compare the sha256sums of each chunk.  Any
> > checksums that match will be taken from the old file, and the rest will
> > be downloaded.
>
> > In testing, I've seen savings ranging from 10% (December 17 to today)
> > to 95% (yesterday to today).
>
>
> > Remaining problems:
> >  * Zchunk files are bigger than their gzip equivalents.  This ranges
> >    from 5% larger for filelists.xml to 300% larger for primary.xml.
> >    This can be greatly reduced by chunking primary.xml based on srpm
> >    rather than rpm, which brings the size increase for primary.xml down
> >    to roughly 30%.
>
> >  * Many changes to the metadata can mean a large number of ranges
> >    requested.  I ran a check on our mirrors, and three (out of around
> >    150 that had the file I was testing) don't honor range requests at
> >    all, and three others only honor a small number in a single request.
> >    A further seven didn't respond at all (not sure if that had
> >    anything to do with the range requests), and the rest supported
> >    between 256 and 512 ranges in a single request.  We can reduce the
> >    number of ranges requested by always ordering our packages by date.
> >    This would ensure that new packages are grouped at the end of the
> >    xml where they will be grabbed in one contiguous range.
>
> This would "break" DNF, because libsolv assigns Ids based on the order of
> packages in the metadata. So if something requires "webserver" and there is
> "nginx" and "httpd" providing it (without versions), then the lowest Id is
> picked (not going into the details of this). Which means that, depending on
> when the last update for one or the other was submitted, users will get
> different results. This is unacceptable from my POV.
>
> >  * Zchunk files use zlib (it gives better compression than xz with such
> >    small chunks), but, because they use a custom zdict, they are not gz
> >    files.  This means that we'll need new tools to read and write them.
> >    (And I am volunteering to do the work here)
>
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.
>
> > The tools:
> > The proof-of-concept tools are all sitting in
> > https://www.jdieter.net/downloads/zchunk-scripts/
>
> > They are full of ugly hacks, especially when it comes to parsing the
> > XML, there's little to no error reporting, and I didn't comment them
> > well at all, but they should work.

Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-16 Thread Jonathan Dieter
On Tue, 2018-02-13 at 10:52 +0100, Igor Gnatenko wrote:
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.

So I've investigated zstd, and here are my results:

Latest F27
primary.gz - 3.1MB

zlib zchunk (including custom dict)
primary.zck - 4.2MB ~35% increase

zstd zchunk (including dict generated from last three Fedora GA primaries)
primary.zck - 3.7MB ~20% increase

Using zstd for filelists.xml has roughly the same increase as with
zlib, which is expected as the chunks are larger and thus get better
compression even without a dict.
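
For reference, a dictionary like the one mentioned above can be trained from
the uncompressed chunks of a few older primary.xml files.  This is only a
rough sketch assuming the python-zstandard bindings; the dictionary size and
the chunk handling are simplified for illustration:

    import zstandard

    def build_dict_and_compress(old_chunks, new_chunks, dict_size=112640):
        # old_chunks: uncompressed chunks from a few older primary.xml files
        zdict = zstandard.train_dictionary(dict_size, old_chunks)
        cctx = zstandard.ZstdCompressor(level=19, dict_data=zdict)
        # ship the dictionary bytes alongside the compressed chunks
        return zdict.as_bytes(), [cctx.compress(c) for c in new_chunks]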

I also looked briefly at lz4, but it seems that its major advantage
is speed, and I'm not sure that metadata decompression speed is our
main bottleneck in dnf.

With these numbers, I think it makes sense to move forward with zstd
instead of zlib.

Jonathan



Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-14 Thread Igor Gnatenko

On Tue, 2018-02-13 at 12:09 +, Michael Schroeder wrote:
> On Tue, Feb 13, 2018 at 10:52:14AM +0100, Igor Gnatenko wrote:
> > This would "break" DNF, because libsolv assigns Ids based on the order of
> > packages in the metadata. So if something requires "webserver" and there is
> > "nginx" and "httpd" providing it (without versions), then the lowest Id is
> > picked (not going into the details of this).
> 
> No, this is not correct. Libsolv doesn't use the Id to pick a package,
> precisely so that it is independent of the package order of the repository.

Oh, could you link me to the source code for this part? I tried to look it up,
but haven't found it. Thanks!
--
-Igor Gnatenko



Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-13 Thread Michael Schroeder
On Tue, Feb 13, 2018 at 10:52:14AM +0100, Igor Gnatenko wrote:
> This would "break" DNF, because libsolv assigns Ids based on the order of
> packages in the metadata. So if something requires "webserver" and there is
> "nginx" and "httpd" providing it (without versions), then the lowest Id is
> picked (not going into the details of this).

No, this is not correct. Libsolv doesn't use the Id to pick a package,
precisely so that it is independent of the package order of the repository.

Cheers,
  Michael.

-- 
Michael Schroeder   m...@suse.de
SUSE LINUX GmbH,   GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}


Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-13 Thread Jonathan Dieter
On Tue, 2018-02-13 at 10:52 +0100, Igor Gnatenko wrote:
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> >  * Many changes to the metadata can mean a large number of ranges
> >    requested.  I ran a check on our mirrors, and three (out of around
> >    150 that had the file I was testing) don't honor range requests at
> >    all, and three others only honor a small number in a single request.
> >    A further seven didn't respond at all (not sure if that had
> >    anything to do with the range requests), and the rest supported
> >    between 256 and 512 ranges in a single request.  We can reduce the
> >    number of ranges requested by always ordering our packages by date.
> >    This would ensure that new packages are grouped at the end of the
> >    xml where they will be grabbed in one contiguous range.
> 
> This would "break" DNF, because libsolv assigns Ids based on the order of
> packages in the metadata. So if something requires "webserver" and there is
> "nginx" and "httpd" providing it (without versions), then the lowest Id is
> picked (not going into the details of this). Which means that, depending on
> when the last update for one or the other was submitted, users will get
> different results. This is unacceptable from my POV.

That's fair enough, though how hard would it be to change libsolv to
assign Ids based on alphabetical order rather than metadata order (or
possibly to reorder the XML before sending it to libsolv)?

To be clear, this optimization would reduce the number of range
requests we have to send to the server, but would not hugely change the
amount we download, so I don't think it's very high priority.


> >  * Zchunk files use zlib (it gives better compression than xz with such
> >    small chunks), but, because they use a custom zdict, they are not gz
> >    files.  This means that we'll need new tools to read and write them.
> >    (And I am volunteering to do the work here)
> 
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.

I'll take a look at both of those.


> As someone who has tried to work on this problem, I really appreciate what
> you have done here. We started out using zsync and the results were quite
> good, but zsync is dead and has a ton of bugs. It also requires archives to
> be `--rsyncable`. So my question is: why not add an idx file alongside the
> existing files instead of inventing a new format? The problem is that we
> will have to distribute the old format too (for compatibility reasons).

I'm not sure if it was clear, but I'm basically making --rsyncable
archives with more intelligent divisions between the independent
blocks, which is why it gives better delta performance... you're not
getting *any* redundant data.

I did originally experiment with xz files (a series of concatenated xz
files is still a valid xz file), but the files were 20% larger than
zlib with custom zdict.
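
To illustrate the concatenation property (pure stdlib, nothing
zchunk-specific):

    import lzma

    part_a = lzma.compress(b"<package>first chunk</package>")
    part_b = lzma.compress(b"<package>second chunk</package>")
    # Two xz streams back to back are still a valid xz file, and
    # lzma.decompress() handles the concatenated streams transparently.
    assert lzma.decompress(part_a + part_b) == (
        b"<package>first chunk</package><package>second chunk</package>")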

The zdict helps us reduce file size by allowing all the chunks to use
the same common strings that will not change (mainly tag names), but
custom zdicts aren't allowed by gzip.
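
Roughly, "custom zdict" means seeding zlib with the strings every chunk
shares.  A minimal sketch of the idea (the dictionary contents below are made
up; the real dictionary is built from the actual metadata):

    import zlib

    ZDICT = b'<package type="rpm"><name><arch><version <checksum type="sha256">'

    def compress_chunk(chunk):
        c = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                             zlib.Z_DEFAULT_STRATEGY, ZDICT)
        return c.compress(chunk) + c.flush()

    def decompress_chunk(data):
        # The same dictionary is required for decompression, which is why
        # the result is a plain zlib stream rather than a .gz file.
        d = zlib.decompressobj(zdict=ZDICT)
        return d.decompress(data) + d.flush()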

I've also toyed with the idea of supporting embedded indexes in zchunk
files so we don't have to keep two files for every local zchunk file.
We'd still want separate idx files on the webserver, though; otherwise
we're looking at an extra HTTP request to get the size of the index in
the zchunk.  If we embed the index in the file, we must create a new
format, as we don't want the index concatenated with the rest of the
uncompressed file when decompressing.


> I'm not sure that trying to do optimizations by XML tags is a very good idea,
> especially because I hope that in the future we will stop distributing XMLs
> and start distributing solv/solvx.

zchunk.py shouldn't care what type of data it's chunking, but it needs
to be able to chunk the same way every time.  Currently it only knows
how to do that with XML, because we can split it based on tag
boundaries, and grouping based on source rpm gives us even better
compression without sacrificing any flexibility.
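
A toy version of that splitting, not zchunk.py itself; the tag names follow
the usual createrepo primary.xml layout and error handling is omitted:

    import re

    def chunk_by_srpm(primary_xml_text):
        # Split on </package> boundaries, then bucket packages by source rpm
        # so that subpackages of one srpm land in the same compressed chunk.
        packages = re.findall(r"<package .*?</package>", primary_xml_text, re.S)
        groups = {}
        for pkg in packages:
            m = re.search(r"<rpm:sourcerpm>(.*?)</rpm:sourcerpm>", pkg)
            groups.setdefault(m.group(1) if m else "", []).append(pkg)
        return ["".join(pkgs).encode("utf-8")
                for _, pkgs in sorted(groups.items())]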

dl_zchunk.py and unzchunk.py neither know nor care what type of file
they're working with.

Thanks so much for the feedback, and especially for the pointers to lz4
and zstd.  Hopefully they'll get us closer to matching our current gz
size.

Jonathan



Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-13 Thread Miroslav Suchý
On 13.2.2018 at 10:52, Igor Gnatenko wrote:
> I've come up with a method of splitting repodata into chunks that can
> be downloaded and combined with chunks that are already on the local
> system to create a byte-for-byte copy of the compressed repodata. 
> Tools and scripts are at:
> https://www.jdieter.net/downloads/

Debian has had this for ages, and they recently reworked it. It is a nice read
about why and how:
http://www.chiark.greenend.org.uk/~cjwatson/blog/no-more-hash-sum-mismatch-errors.html

Miroslav





Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-13 Thread Neal Gompa
On Tue, Feb 13, 2018 at 4:52 AM, Igor Gnatenko wrote:
>
> CCing the rpm-ecosystem@ ML since it's the main location where this message
> should have gone 😉
>
> On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
>> 
>> I've come up with a method of splitting repodata into chunks that can
>> be downloaded and combined with chunks that are already on the local
>> system to create a byte-for-byte copy of the compressed repodata.
>> Tools and scripts are at:
>> https://www.jdieter.net/downloads/
>> 
>>
>> Background:
>> With DNF, we're currently downloading ~20MB of repository data every
>> time the updates repository changes.
>>
>> When casync was released, I wondered if we could use it to only
>> download the deltas for the repodata.  At Flock last summer, I ran some
>> tests against the uncompressed repodata and saw a reduction of 30-40%
>> from one day to the next, which seemed low, but was a good starting
>> point.
>>
>> Unfortunately, due to the way casync separates each file into thousands
>> of compressed chunks, building each file required thousands of (serial)
>> downloads which, even on a decent internet connection, took *forever*.
>>
>> When I talked through the idea with Kevin and Patrick, they also
>> pointed out that our mirrors might not be too keen on the idea of
>> adding thousands of tiny files that change every day.
>>
>>
>> The Solution(?):
>> One potential solution to the "multitude of files" problem is to merge
>> the chunks back into a single file, and use HTTP ranges to only
>> download the parts of the file we want.  An added bonus is that most
>> web servers are configured to support hundreds of ranges in one
>> request, which greatly reduces the number of requests we have to make.
>>
>> The other problem with casync is that its chunk separation is naïve,
>> which is why we were only achieving 30-40% savings.  But we know what
>> the XML file is supposed to look like, so we can separate the chunks on
>> the tag boundaries in the XML.
>>
>> So I've ditched casync altogether and put together a proof-of-concept
>> (tentatively named zchunk) that takes an XML file, compresses each tag
>> separately, and then concatenates all of them into one file.  The tool
>> also creates an index file that tells you the sha256sum for each
>> compressed chunk and the location of the chunk in the file.
>>
>> I've also written a small script that will download a zchunk off the
>> internet.  If you don't specify an old file, it will just download
>> everything, but if you specify an old file, it will download the index
>> of the new file and compare the sha256sums of each chunk.  Any
>> checksums that match will be taken from the old file, and the rest will
>> be downloaded.
>>
>> In testing, I've seen savings ranging from 10% (December 17 to today)
>> to 95% (yesterday to today).
>>

This is fantastic. It'd be really helpful for users in Mageia, where
it's more common for users to have limited bandwidth.

>>
>> Remaining problems:
>>  * Zchunk files are bigger than their gzip equivalents.  This ranges
>>    from 5% larger for filelists.xml to 300% larger for primary.xml.
>>    This can be greatly reduced by chunking primary.xml based on srpm
>>    rather than rpm, which brings the size increase for primary.xml down
>>    to roughly 30%.
>>
>>  * Many changes to the metadata can mean a large number of ranges
>>    requested.  I ran a check on our mirrors, and three (out of around
>>    150 that had the file I was testing) don't honor range requests at
>>    all, and three others only honor a small number in a single request.
>>    A further seven didn't respond at all (not sure if that had
>>    anything to do with the range requests), and the rest supported
>>    between 256 and 512 ranges in a single request.  We can reduce the
>>    number of ranges requested by always ordering our packages by date.
>>    This would ensure that new packages are grouped at the end of the
>>    xml where they will be grabbed in one contiguous range.
>
> This would "break" DNF, because libsolv assigns Ids based on the order of
> packages in the metadata. So if something requires "webserver" and there is
> "nginx" and "httpd" providing it (without versions), then the lowest Id is
> picked (not going into the details of this). Which means that, depending on
> when the last update for one or the other was submitted, users will get
> different results. This is unacceptable from my POV.
>

Even without this optimization, we'd still have some savings.

>>  * Zchunk files use zlib (it gives better compression than xz with such
>>    small chunks), but, because they use a custom zdict, they are not gz
>>    files.  This means that we'll need new tools to read and write them.
>>    (And I am volunteering to do the work here)
>
> What about zstd? Also, the latest version of lz4 has support for
> dictionaries too.
>

I would suggest using zstd or lz4 too. I had already filed an RFE for
createrepo_

Re: [Rpm-ecosystem] A proof-of-concept for delta'ing repodata

2018-02-13 Thread Igor Gnatenko

CCing the rpm-ecosystem@ ML since it's the main location where this message
should have gone 😉

On Mon, 2018-02-12 at 23:53 +0200, Jonathan Dieter wrote:
> 
> I've come up with a method of splitting repodata into chunks that can
> be downloaded and combined with chunks that are already on the local
> system to create a byte-for-byte copy of the compressed repodata. 
> Tools and scripts are at:
> https://www.jdieter.net/downloads/
> 
> 
> Background:
> With DNF, we're currently downloading ~20MB of repository data every
> time the updates repository changes.
> 
> When casync was released, I wondered if we could use it to only
> download the deltas for the repodata.  At Flock last summer, I ran some
> tests against the uncompressed repodata and saw a reduction of 30-40%
> from one day to the next, which seemed low, but was a good starting
> point.
> 
> Unfortunately, due to the way casync separates each file into thousands
> of compressed chunks, building each file required thousands of (serial)
> downloads which, even on a decent internet connection, took *forever*.
> 
> When I talked through the idea with Kevin and Patrick, they also
> pointed out that our mirrors might not be too keen on the idea of
> adding thousands of tiny files that change every day.
> 
> 
> The Solution(?):
> One potential solution to the "multitude of files" problem is to merge
> the chunks back into a single file, and use HTTP ranges to only
> download the parts of the file we want.  An added bonus is that most
> web servers are configured to support hundreds of ranges in one
> request, which greatly reduces the number of requests we have to make.
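
A minimal sketch of such a multi-range fetch, using only the Python standard
library; the URL, byte offsets, and helper name below are made up for
illustration, and a real client still has to split the multipart/byteranges
response into its parts:

    import urllib.request

    def fetch_ranges(url, ranges):
        # ranges is a list of (start, end) byte offsets, inclusive
        spec = ", ".join("%d-%d" % pair for pair in ranges)
        request = urllib.request.Request(url, headers={"Range": "bytes=" + spec})
        with urllib.request.urlopen(request) as response:
            # A server that honors the request answers 206 Partial Content;
            # with more than one range the body is a multipart/byteranges
            # document that the caller still has to split apart.
            return (response.status, response.headers.get("Content-Type"),
                    response.read())

    # Hypothetical mirror URL and offsets:
    status, ctype, body = fetch_ranges(
        "https://mirror.example.org/repodata/primary.zck",
        [(0, 1023), (52000, 60999), (131072, 140000)])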
> 
> The other problem with casync is that its chunk separation is naïve,
> which is why we were only achieving 30-40% savings.  But we know what
> the XML file is supposed to look like, so we can separate the chunks on
> the tag boundaries in the XML.
> 
> So I've ditched casync altogether and put together a proof-of-concept
> (tentatively named zchunk) that takes an XML file, compresses each tag
> separately, and then concatenates all of them into one file.  The tool
> also creates an index file that tells you the sha256sum for each
> compressed chunk and the location of the chunk in the file.
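
A rough sketch of that layout (not the real zchunk.py; the file names and the
index line format here are assumptions for illustration only): each chunk is
compressed on its own, the compressed pieces are concatenated, and the index
records the sha256 and byte location of every compressed chunk.

    import hashlib
    import zlib

    def write_zchunk(chunks, data_path, index_path):
        # chunks: list of bytes, one per XML tag (or per srpm group)
        offset = 0
        with open(data_path, "wb") as data, open(index_path, "w") as index:
            for chunk in chunks:
                comp = zlib.compress(chunk, 9)
                digest = hashlib.sha256(comp).hexdigest()
                index.write("%s %d %d\n" % (digest, offset, len(comp)))
                data.write(comp)
                offset += len(comp)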
> 
> I've also written a small script that will download a zchunk off the
> internet.  If you don't specify an old file, it will just download
> everything, but if you specify an old file, it will download the index
> of the new file and compare the sha256sums of each chunk.  Any
> checksums that match will be taken from the old file, and the rest will
> be downloaded.
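
The reuse logic boils down to something like the following sketch (again not
the real dl_zchunk.py; the index format mirrors the hypothetical one above):
any chunk whose checksum is already present locally is copied from the old
file, and only the remaining byte ranges need to be requested.

    def plan_download(new_index_lines, old_digests):
        # old_digests maps sha256 hex digest -> chunk bytes from the old file
        reused, needed = [], []
        for line in new_index_lines:
            digest, offset, length = line.split()
            if digest in old_digests:
                reused.append(old_digests[digest])
            else:
                needed.append((int(offset), int(offset) + int(length) - 1))
        return reused, needed  # 'needed' are (start, end) ranges to fetch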
> 
> In testing, I've seen savings ranging from 10% (December 17 to today)
> to 95% (yesterday to today).
> 
> 
> Remaining problems:
>  * Zchunk files are bigger than their gzip equivalents.  This ranges
>    from 5% larger for filelists.xml to 300% larger for primary.xml.
>    This can be greatly reduced by chunking primary.xml based on srpm
>    rather than rpm, which brings the size increase for primary.xml down
>    to roughly 30%.
> 
>  * Many changes to the metadata can mean a large number of ranges
>    requested.  I ran a check on our mirrors, and three (out of around
>    150 that had the file I was testing) don't honor range requests at
>    all, and three others only honor a small number in a single request.
>    A further seven didn't respond at all (not sure if that had
>    anything to do with the range requests), and the rest supported
>    between 256 and 512 ranges in a single request.  We can reduce the
>    number of ranges requested by always ordering our packages by date.
>    This would ensure that new packages are grouped at the end of the
>    xml where they will be grabbed in one contiguous range.

This would "break" DNF, because libsolv assigns Ids based on the order of
packages in the metadata. So if something requires "webserver" and there is
"nginx" and "httpd" providing it (without versions), then the lowest Id is
picked (not going into the details of this). Which means that, depending on
when the last update for one or the other was submitted, users will get
different results. This is unacceptable from my POV.

>  * Zchunk files use zlib (it gives better compression than xz with such
>    small chunks), but, because they use a custom zdict, they are not gz
>    files.  This means that we'll need new tools to read and write them.
>    (And I am volunteering to do the work here)

What about zstd? Also, the latest version of lz4 has support for
dictionaries too.

> The tools:
> The proof-of-concept tools are all sitting in
> https://www.jdieter.net/downloads/zchunk-scripts/
> 
> They are full of ugly hacks, especially when it comes to parsing the
> XML, there's little to no error reporting, and I didn't comment them
> well at all, but they should work.
> 
If all you want to do is download zchunks, you need to run dl_zchunk.py
with the URL you want to download (ending in .zck)