Hi,

Here are some of my thoughts about Jonathan's zchunk compression:

Basics:
-------
The basic idea is to split the input file into chunks so that a metadata
update can re-use identical chunks and only download the chunks that changed.
Zchunk can either split on fixed strings like '<package', or use
an algorithmic approach with a rolling checksum.

So let's split on '<package' boundaries, keeping chunks with the same
src rpm together (that's what the zchunk code does with the -s option):

input: RC3-primary.xml           8904471 bytes
split into 1844 chunks with an average size of 4828 bytes
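
To make the splitting concrete, here's a minimal Python sketch (not the
actual zchunk code) that splits on a fixed '<package' marker; the grouping
by source rpm that the -s option does is left out for brevity:

  def split_on_marker(data: bytes, marker: bytes = b"<package") -> list:
      # Each chunk (except the leading header) starts at a '<package' boundary.
      chunks = []
      start = 0
      while True:
          pos = data.find(marker, start + 1)
          if pos == -1:
              chunks.append(data[start:])
              return chunks
          chunks.append(data[start:pos])
          start = pos

  with open("RC3-primary.xml", "rb") as f:
      chunks = split_on_marker(f.read())
  print(len(chunks), sum(len(c) for c in chunks) // len(chunks))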

So what about compression? We still want to use compression, and this
is where things get more complicated:

gzip -9 compression:             1197114 bytes  (ratio: 7.4)
xz compression:                   887040 bytes  (ratio: 10.03)
zstd -16 compression:             953835 bytes  (ratio: 9.34)

We could compress each chunk individually, but that alone would lead to
a very bad compression ratio. Fortunately, zstd helps us with
its dictionary support: we first train a compression dictionary
on all the chunks and then compress each chunk with that dictionary:

$ zstd -16 --train chunks* -o dictionary

compressed dictionary size: 30595 bytes
1844 chunks compressed to an average of 605.35 bytes each
chunked and compressed file: 30595 + 605.35 * 1844 = 1146860 bytes

As you can see, the compression is a bit better than gzip -9, nice!
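
For illustration, roughly the same thing in Python, assuming the
python-zstandard bindings; the dictionary size and compression level are
just example values:

  import zstandard as zstd

  # chunks is the list of uncompressed chunks from the splitting step above
  dict_data = zstd.train_dictionary(112640, chunks)          # train a dictionary
  cctx = zstd.ZstdCompressor(level=16, dict_data=dict_data)  # compress with it
  compressed = [cctx.compress(c) for c in chunks]

  total = len(dict_data.as_bytes()) + sum(len(c) for c in compressed)
  print("dictionary:", len(dict_data.as_bytes()), "total:", total)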

The download algorithm works like this: get the list of chunks of
the new file and check if we can reuse chunks from the old file.
If we need to download new chunks, download them plus the dictionary.
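
A rough sketch of that planning step, with made-up names (the per-chunk
checksums would come from the zchunk header):

  def plan_download(new_index, old_chunks):
      # new_index:  list of chunk checksums from the new file's header
      # old_chunks: dict checksum -> chunk taken from the old local file
      reuse, fetch = [], []
      for csum in new_index:
          (reuse if csum in old_chunks else fetch).append(csum)
      # if fetch is non-empty we also have to download the (new) dictionary
      return reuse, fetch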


Thoughts:
---------
The basic algorithms and implementation are sound and work nicely.
Kudos to Jonathan for doing such an amazing job.

Here are some points I have (please correct me if I'm wrong anywhere):

 1) The current implementation can't reuse chunks when the dictionary
    changes. That's a rather big limitation. A dictionary is a must
    if we want to go with small chunks.

    We can also go with no dictionary and large chunks, which is more or
    less the zchunk default. For the example above the buzhash algorithm
    would split the file into 193 chunks instead of the "package level"
    1844 chunks. Large chunks mean good compression, but the amount of
    data that can be reused will probably be much smaller. In that case
    we (SUSE) might as well stay with zsync and gzip -9 --rsyncable ;)
    
    From an algorithmic point of view, having different dictionaries is
    not a problem: you'd just need to store the checksums over the
    uncompressed chunks instead (there's a small sketch of this below,
    after point 4). But there's a big drawback: you can't reconstruct
    the identical file. That's because you need to re-compress the chunks
    you reuse with the new dictionary, and this may lead to different
    data if the zstd algorithm differs from the one used when creating
    the repository.

    We have the same problem with deltarpms; the recompression is the
    weak step. Repository creation is usually done on a system that runs
    a different distribution version than the target, which makes this
    even more likely.

    So we can reconstruct a zchunk file that yields the same data when
    uncompressed, but it might not be the identical zchunk file. This may
    not be a problem at all; we just need to make sure that the
    verification step works.

 2) What to put into repomd.xml? We'll need the old primary.xml.gz for
    compatibility reasons. It's good security practice to minimize the
    attack surface, so we should put the zchunk header checksum into
    repomd.xml so that it can be verified before running the zchunk
    code. So primary.xml.zck with extra attributes for the header? Or an
    extra element that describes the zchunk header?

 3) I don't think signature support in zchunk is useful ;)

 4) Nitpick: Why does zchunk use sha1 checksums for the chunks? Either
    it's something that needs to be cryptographically sound, in which
    case sha1 is the wrong choice. Or it's just meant for identifying
    chunks, in which case md5 (or some other checksum) is probably
    faster and smaller. Either way, you really don't need 20 bytes like
    with sha1.
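
Here's the small sketch promised in point 1: matching chunks by the
checksum of their uncompressed content so they can be reused across a
dictionary change. It's purely illustrative (python-zstandard bindings,
sha256 instead of zchunk's sha1, made-up helper names) and shows why the
rebuilt file decompresses to the same data but need not be byte-identical:

  import hashlib
  import zstandard as zstd

  def rebuild_with_new_dict(new_index, old_compressed, old_dict, new_dict, fetched):
      # new_index:      checksums of the *uncompressed* chunks of the new file
      # old_compressed: compressed chunks of the old local file
      # fetched:        dict checksum -> freshly downloaded compressed chunk
      dctx = zstd.ZstdDecompressor(dict_data=old_dict)
      old_by_csum = {}
      for comp in old_compressed:
          plain = dctx.decompress(comp)
          old_by_csum[hashlib.sha256(plain).hexdigest()] = plain

      cctx = zstd.ZstdCompressor(level=16, dict_data=new_dict)
      out = []
      for csum in new_index:
          if csum in old_by_csum:
              # re-compressing with the new dictionary may give different bytes,
              # so verification has to check the uncompressed chunk checksums
              out.append(cctx.compress(old_by_csum[csum]))
          else:
              out.append(fetched[csum])
      return out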

Ok, that's enough for now.

Thanks,
  Michael.

-- 
Michael Schroeder                                   m...@suse.de
SUSE LINUX GmbH,           GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}