Hi,
here are some of my thoughts about Jonathan's zchunk compression:

Basics:
-------

The basic idea is to split the input file into chunks so that a metadata
update can reuse identical chunks and only download the changed ones.
Zchunk can either split on fixed strings like '<package', or use an
algorithmic approach with a rolling checksum.

So let's split on the '<package' boundary, keeping chunks with the same
src rpm together (that's what the zchunk code does with the -s option):

input: RC3-primary.xml, 8904471 bytes
split into 1844 chunks with an average of 4828 bytes

So what about compression? We still want to use compression, and this is
where things get more complicated:

gzip -9 compression:  1197114 bytes (ratio: 7.4)
xz compression:        887040 bytes (ratio: 10.03)
zstd -16 compression:  953835 bytes (ratio: 9.3)

We could compress each chunk individually, but that would lead to a very
bad compression ratio. Zstd's dictionary support helps us here: we first
train a compression dictionary on all the chunks and then compress each
chunk with that dictionary:

$ zstd -16 --train chunks* -o dictionary

compressed dictionary size: 30595 bytes
1844 chunks compressed to an average of 605.35 bytes
chunked and compressed file: 30595 + 605.35 * 1844 = 1146860 bytes

As you can see, the compression is a bit better than gzip -9. Nice!

The download algorithm works like this: get the list of chunks of the
new file and check which chunks can be reused from the old file. If we
need new chunks, download them together with the dictionary.

Thoughts:
---------

The basic algorithms and implementation are sound and work nicely. Kudos
to Jonathan for doing such an amazing job. Here are some points I have
(please correct me if I'm wrong anywhere):

1) The current implementation can't reuse chunks when the dictionary
changes. That's a rather big limitation: a dictionary is a must if we
want to go with small chunks. We can also go with no dictionary and
large chunks, which is more or less the zchunk default.
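To make the split-and-reuse idea concrete, here's a rough Python sketch
(not zchunk's actual code; the function names are made up, and it
checksums uncompressed chunks and ignores compression entirely):

```python
import hashlib

def split_on_marker(xml: bytes, marker: bytes = b"<package") -> list[bytes]:
    """Split the file at every occurrence of `marker`, keeping the
    marker at the start of each chunk (roughly the string mode)."""
    parts = xml.split(marker)
    chunks = [parts[0]] if parts[0] else []
    chunks += [marker + p for p in parts[1:]]
    return chunks

def chunks_to_fetch(old: bytes, new: bytes) -> list[int]:
    """Indices of new-file chunks that are not present in the old
    file and therefore have to be downloaded."""
    have = {hashlib.sha1(c).digest() for c in split_on_marker(old)}
    return [i for i, c in enumerate(split_on_marker(new))
            if hashlib.sha1(c).digest() not in have]
```

For two primary.xml files that differ in a single package, only that
package's chunk shows up in the fetch list, so just that chunk (plus the
dictionary) needs to be downloaded.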
For the example above, the buzhash algorithm would split the file into
193 chunks instead of the "package level" 1844 chunks. Large chunks mean
good compression, but the amount of data that can be reused will
probably be much smaller. In that case we (SUSE) might as well stay with
zsync and gzip -9 --rsyncable ;)

From an algorithmic point of view, having different dictionaries is not
a problem: you'd just need to store the checksums over the uncompressed
chunks instead. But there's a big drawback: you can't reconstruct the
identical file. That's because you need to re-compress the reused chunks
with the new dictionary, and this may produce different data if the zstd
algorithm differs from the one used when the repository was created. We
have the same problem with deltarpms; the recompression is the weak
step. Repository creation is usually done on a system that runs a
different distribution version than the target, which makes this even
more likely. So we can reconstruct a zchunk file that yields the same
data when uncompressed, but it might not be the identical zchunk file.
This may not be a problem at all, we just need to be sure that the
verification step works.

2) What to put into repomd.xml? We'll need the old primary.xml.gz for
compatibility reasons. It's good security practice to minimize the
attack surface, so we should put the zchunk header checksum into
repomd.xml so that it can be verified before running the zchunk code. So
primary.xml.zck with extra attributes for the header? Or an extra
element that describes the zchunk header?

3) I don't think signature support in zchunk is useful ;)

4) Nitpick: why does zchunk use sha1 checksums for the chunks? Either
the checksum needs to be cryptographically sound, in which case sha1 is
the wrong choice, or it's just meant for identifying chunks, in which
case md5 (or some other checksum) is probably faster and smaller. Either
way, you really don't need 20 bytes like with sha1.

Ok, that's enough for now.

Thanks,
  Michael.
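P.S.: to illustrate the rolling-checksum splitting, here's a toy
content-defined chunker in Python. It uses a simple Rabin-Karp style
rolling hash with made-up parameters, not zchunk's actual buzhash:

```python
def split_rolling(data: bytes, window: int = 48,
                  mask: int = (1 << 13) - 1) -> list[bytes]:
    """Declare a chunk boundary whenever the rolling checksum of the
    last `window` bytes has its low 13 bits zero (avg chunk ~8 KiB)."""
    B, M = 257, 1 << 32
    Bw = pow(B, window, M)  # weight of the outgoing byte
    h, start, chunks = 0, 0, []
    for i, byte in enumerate(data):
        h = (h * B + byte) % M          # shift the new byte in
        if i >= window:
            h = (h - data[i - window] * Bw) % M  # drop the old byte
        if i + 1 >= window and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

The nice property is that a boundary depends only on the last `window`
bytes, so inserting data near the start only changes the first chunk or
two; later boundaries fall at the same content positions, which is what
makes reuse possible without fixed split strings.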
-- 
Michael Schroeder                                   m...@suse.de
SUSE LINUX GmbH, GF Jeff Hawn, HRB 16746 AG Nuernberg
main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);}
_______________________________________________
Rpm-ecosystem mailing list
Rpm-ecosystem@lists.rpm.org
http://lists.rpm.org/mailman/listinfo/rpm-ecosystem