Hi, everyone.

Gentoo's tbz2/xpak package format is quite old. We've made a few incompatible changes in the past (most notably, allowing non-bzip2 compression and multi-instance naming) but the core design has stayed the same. I think we should consider changing it, for the reasons outlined below.

The rough format description can be found in xpak(5). Basically, it's a regular compressed tarball with a binary metadata blob appended to the end. As such, it looks like a regular compressed tarball to the compression tools (with some ignored junk at the end). The metadata is in an entirely custom format and needs dedicated tools to manipulate.

The current format has a few advantages that are probably worth preserving:

+ The binary package is a single flat file.

+ It is reasonably compatible with a regular compressed tarball, so users can unpack it using standard tools (except for the metadata).

+ The metadata is uncompressed and can be quickly found without touching the compressed data.

+ The metadata can be updated (e.g. as a result of a pkgmove) without touching the compressed data.

However, it has a few disadvantages as well:

- The metadata is in an entirely custom binary format, requiring dedicated tools to read or edit.

- The metadata format relies on the customary behavior of compression tools, which ignore junk following the compressed data.

- By placing the metadata at the end of the file, we make it rather hard to read the metadata from a remote location (via FTP, HTTP) without fetching the whole file. [NB: it's technically possible but probably not worth the effort]

- By requiring the custom format to be at the end of the file, we make it impossible to trivially cover it with an OpenPGP signature without introducing another custom format.

- While the format might allow for some extensibility, it's rather an evolutionary dead end.

I think the key points of the new format should be:

1. It should reuse common file formats as much as possible, inventing as little custom code as possible.

2. It should allow for easy introspection and editing by users without dedicated tools.

3. The metadata should allow for lookup without fetching the whole binary package.

4. The format should allow for some extensions without having to reinvent the wheel every time.

5. It would be nice to preserve the existing advantages.


My proposal
===========

Basic format
------------

The base of the format is a regular compressed tarball. There's no junk appended to it; instead, the metadata is stored inside it as /var/db/pkg/${PF}. The contents are as compatible with the actual vdb format as possible.

This has the following advantages:

+ The binary package is still stored as a single file.

+ It uses a standard compressed .tar format, with minimal customization.

+ The user can easily inspect and modify the packages with standard tools (tar and the compressor).

+ If we can maintain a reasonable level of vdb compatibility, the user can even emergency-install a package without causing too much hassle (as it will be recorded in vdb); ideally Portage would detect this vdb entry and support fixing the install afterwards.


Optimizing for easy recognition
-------------------------------

In order to make it possible for magic-based tools such as file(1) to easily distinguish Gentoo binary packages from regular tarballs, we could (ab)use the volume label field, e.g. use:

  $ tar -V 'gpkg: app-foo/bar-1' -c ...

This will add a volume label as the first file entry inside the tarball, which does not affect extraction but can be trivially matched via magic rules.

Note: this is meant to be used as a method for fast binary package recognition; I don't think we should reject (hand-modified) binary packages that lack this label.


Optimizing for metadata reading/manipulation performance
--------------------------------------------------------

The main problem with using a single tarball for both metadata and data is that normally you'd have to decompress everything to reliably unpack the metadata, and recompress everything to update it. This problem can be addressed by a few optimization tricks.

Firstly, all metadata files are packed into the archive before the data files.
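
To make the layout concrete, here is a rough Python sketch of a writer that emits the volume label, then the metadata, then the data, and of a reader that relies on that ordering. It is only an illustration under assumptions: the function names build_package and read_metadata are made up, gzip is picked arbitrarily as the compressor, the metadata path follows the /var/db/pkg/${PF} wording above, and the label is written by setting the GNU 'V' typeflag by hand because Python's tarfile module has no named constant for it.

```python
import io
import tarfile

LABEL_PREFIX = "gpkg: "   # prefix taken from the tar -V example above
VDB_ROOT = "var/db/pkg/"  # metadata location as worded in the proposal
GNU_VOLHDR = b"V"         # GNU tar typeflag for a volume label


def build_package(cpv, metadata, data):
    """Pack label, then metadata, then data into one gzip-compressed tarball.

    Hypothetical helper: `metadata` and `data` map names to bytes.
    """
    pf = cpv.split("/", 1)[1]
    out = io.BytesIO()
    with tarfile.open(fileobj=out, mode="w:gz", format=tarfile.GNU_FORMAT) as tar:
        # The volume label is a zero-length header-only entry at the very start.
        label = tarfile.TarInfo(LABEL_PREFIX + cpv)
        label.type = GNU_VOLHDR
        tar.addfile(label)
        # Metadata members first, data members strictly afterwards.
        members = [(f"{VDB_ROOT}{pf}/{key}", val) for key, val in metadata.items()]
        members += list(data.items())
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return out.getvalue()


def read_metadata(fileobj):
    """Return (label, metadata), stopping at the first non-metadata member."""
    label, metadata = None, {}
    # "r|gz" reads the archive as a forward-only stream, so nothing past the
    # point where we stop is fetched or decompressed.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.type == GNU_VOLHDR:
                label = member.name  # optional: a missing label is not an error
            elif member.name.startswith(VDB_ROOT):
                if member.isfile():
                    key = member.name.rsplit("/", 1)[1]
                    metadata[key] = tar.extractfile(member).read()
            else:
                break  # past the metadata: leave the rest untouched
    return label, metadata
```

Calling read_metadata(io.BytesIO(build_package("app-foo/bar-1", {"SLOT": b"0\n"}, {"usr/bin/bar": b"..."}))) returns the label string and a dict holding only the SLOT entry, and the archive itself remains an ordinary .tar.gz as far as stock tools are concerned.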
With a slightly customized unpacker, we can stop decompressing as soon as we're past the metadata and avoid decompressing the whole archive. This will also make it possible to read metadata from remote files without fetching far past the compressed metadata block.

Secondly, if we're up for some more tricks, we could technically split the tarball into metadata and data blocks that are compressed separately. This will need a bit of archiver customization, but it will make it possible to decompress the metadata part without even touching the compressed data, and to replace it without recompressing the data.

What's important is that both proposed tricks maintain backwards compatibility with regular compressed tarballs. That is, the user will still be able to extract the package with regular archiving tools.


Adding OpenPGP signatures
-------------------------

This is the main XXX here. Technically, the most obvious solution is to cover the entire tarball with an OpenPGP signature. However, this has the disadvantage that verification requires fetching the whole file. I will look into the possibility of having partial signatures.

-- 
Best regards,
Michał Górny