Hi, everyone.

Gentoo's tbz2/xpak package format is quite old.  We've made a few
incompatible changes in the past (most notably, allowing non-bzip2
compression and multi-instance naming) but the core design stayed
the same.  I think we should consider changing it, for the reasons
outlined below.

The rough format description can be found in xpak(5).  Basically, it's
a regular compressed tarball with a binary metadata blob appended
to the end.  As such, it looks like a regular compressed tarball
to the compression tools (with some ignored junk at the end).
The metadata is in an entirely custom format and needs dedicated tools
to manipulate.
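
To illustrate what "dedicated tools" means in practice, here is a
minimal Python sketch of a reader for the appended blob.  The layout it
assumes is my reading of xpak(5) and should be checked against the man
page: the file ends with a 4-byte big-endian xpak length followed by
"STOP", and the xpak segment itself is "XPAKPACK", index length, data
length, index, data, "XPAKSTOP".  The helper names read_xpak and
build_xpak are made up for this example.

```python
import io
import struct


def read_xpak(fileobj):
    """Return {name: bytes} parsed from the xpak blob at the end of a file."""
    # Trailer (assumed layout): 4-byte big-endian xpak length, then b"STOP".
    fileobj.seek(-8, io.SEEK_END)
    trailer = fileobj.read(8)
    if trailer[4:] != b"STOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", trailer[:4])

    # The xpak segment sits immediately before the trailer.
    fileobj.seek(-(xpak_len + 8), io.SEEK_END)
    xpak = fileobj.read(xpak_len)
    if xpak[:8] != b"XPAKPACK" or xpak[-8:] != b"XPAKSTOP":
        raise ValueError("malformed xpak segment")
    index_len, data_len = struct.unpack(">II", xpak[8:16])
    index = xpak[16:16 + index_len]
    data = xpak[16 + index_len:16 + index_len + data_len]

    # Each index entry: name length, name, data offset, data length.
    entries = {}
    pos = 0
    while pos < len(index):
        (name_len,) = struct.unpack_from(">I", index, pos)
        pos += 4
        name = index[pos:pos + name_len].decode()
        pos += name_len
        offset, length = struct.unpack_from(">II", index, pos)
        pos += 8
        entries[name] = data[offset:offset + length]
    return entries


def build_xpak(entries):
    """Build an xpak blob plus trailer (only used to exercise read_xpak)."""
    index = b""
    data = b""
    for name, value in entries.items():
        encoded = name.encode()
        index += struct.pack(">I", len(encoded)) + encoded
        index += struct.pack(">II", len(data), len(value))
        data += value
    xpak = (b"XPAKPACK" + struct.pack(">II", len(index), len(data))
            + index + data + b"XPAKSTOP")
    return xpak + struct.pack(">I", len(xpak)) + b"STOP"
```

Note that read_xpak only seeks relative to the end of the file, so
the compressed tarball in front of the blob is never read.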


The current format has a few advantages that would probably be worth
preserving:

+ The binary package is a single flat file.

+ It is reasonably compatible with a regular compressed tarball,
so users can unpack it using standard tools (except for metadata).

+ The metadata is uncompressed and can be quickly found without touching
the compressed data.

+ The metadata can be updated (e.g. as a result of pkgmove) without
touching the compressed data.


However, it has a few disadvantages as well:

- The metadata is in an entirely custom binary format, requiring
dedicated tools to read or edit.

- The metadata format relies on the customary behavior of compression
tools that ignore junk following the compressed data.

- By placing the metadata at the end of the file, we make it rather hard
to read the metadata from a remote location (via FTP, HTTP) without
fetching the whole file.  [NB: it's technically possible but probably
not worth the effort]

- By requiring the custom format to be at the end of the file, we make
it impossible to trivially cover it with an OpenPGP signature without
introducing another custom format.

- While the format might allow for some extensibility, it's rather
an evolutionary dead end.


I think the key points of the new format should be:

1. It should reuse common file formats as much as possible, inventing
as little custom code as possible.

2. It should allow for easy introspection and editing by users without
dedicated tools.

3. The metadata should allow for lookup without fetching the whole
binary package.

4. The format should allow for some extensions without having to
reinvent the wheel every time.

5. It would be nice to preserve the existing advantages.


My proposal
===========

Basic format
------------
The base of the format is a regular compressed tarball.  There's no junk
appended to it; instead, the metadata is stored inside it as
/var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
format as possible.

This has the following advantages:

+ The binary package is still stored as a single file.

+ It uses a standard compressed .tar format, with minimal customization.

+ The user can easily inspect and modify the packages with standard
tools (tar and the compressor).

+ If we can maintain a reasonable level of vdb compatibility, the user
can even emergency-install a package without causing too much hassle (as
it will be recorded in vdb); ideally Portage would detect this vdb entry
and support fixing the install afterwards.


Optimizing for easy recognition
-------------------------------
In order to make it possible for magic-based tools such as file(1) to
easily distinguish Gentoo binary packages from regular tarballs, we
could (ab)use the volume label field, e.g. use:

  $ tar -V 'gpkg: app-foo/bar-1' -c ...

This will add a volume label as the first file entry inside the tarball,
which does not affect extracting but can be trivially matched via magic
rules.

Note: this is meant to be used as a method for fast binary package
recognition; I don't think we should reject (hand-modified) binary
packages that lack this label.
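
As a rough illustration of how cheap the check could be, the Python
sketch below looks only at the first 512-byte header block of the
decompressed stream.  It assumes the GNU tar convention that a volume
label is an entry with typeflag "V" whose name field holds the label;
the "gpkg: " prefix is just the one from the example above, and all
function names are made up for this sketch.

```python
import gzip
import io
import tarfile

BLOCK = 512              # size of one tar header block
LABEL_PREFIX = "gpkg: "  # prefix from the tar -V example, not a settled convention


def read_volume_label(stream):
    """Return the volume label of an uncompressed tar stream, or None."""
    header = stream.read(BLOCK)
    # Byte 156 is the typeflag; GNU tar marks a volume header with b"V".
    if len(header) < BLOCK or header[156:157] != b"V":
        return None
    # The label is stored in the NUL-padded name field (bytes 0..99).
    return header[:100].split(b"\0", 1)[0].decode("utf-8", "replace")


def is_labelled_gpkg(compressed):
    """Check a gzip-compressed tarball by reading just its first header block."""
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        label = read_volume_label(stream)
    return label is not None and label.startswith(LABEL_PREFIX)


def make_demo_tarball(label=None):
    """Build a small .tar.gz in memory, optionally starting with a label."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz",
                      format=tarfile.GNU_FORMAT) as tar:
        if label is not None:
            vol = tarfile.TarInfo(label)
            vol.type = b"V"  # tarfile has no named constant for this type
            tar.addfile(vol)
        payload = b"hello\n"
        info = tarfile.TarInfo("usr/share/doc/bar-1/README")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


if __name__ == "__main__":
    print(is_labelled_gpkg(make_demo_tarball("gpkg: app-foo/bar-1")))
    print(is_labelled_gpkg(make_demo_tarball()))
```

A package without the label simply fails the fast check, and a caller
could then fall back to looking for the metadata directory instead of
rejecting the file.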


Optimizing for metadata reading/manipulation performance
--------------------------------------------------------
The main problem with using a single tarball for both metadata and data
is that normally you'd have to decompress everything to reliably unpack
metadata, and recompress everything to update it.  This problem can be
addressed by a few optimization tricks.

Firstly, all metadata files are packed into the archive before data
files.  With a slightly customized unpacker, we can stop decompressing
as soon as we're past metadata and avoid decompressing the whole
archive.  This will also make it possible to read metadata from remote
files without fetching far past the compressed metadata block.
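
The first trick can be sketched in Python with the standard tarfile
module in streaming mode.  This is only an illustration under a few
assumptions: metadata members are recognized by the var/db/pkg/ prefix,
gzip stands in for whatever compressor is chosen, and the function
names are made up.

```python
import io
import tarfile

VDB_PREFIX = "var/db/pkg/"  # metadata location from the proposal, no leading slash


def build_package(metadata, files):
    """Write metadata members first, then data members, into a .tar.gz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in list(metadata.items()) + list(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_metadata(fileobj):
    """Collect metadata members and stop at the first non-metadata member."""
    found = {}
    # "r|*" reads the archive as a forward-only stream with transparent
    # decompression, so input is only consumed as far as we iterate.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if not member.name.startswith(VDB_PREFIX):
                break  # past the metadata: leave the rest untouched
            if member.isfile():
                found[member.name] = tar.extractfile(member).read()
    return found


if __name__ == "__main__":
    package = build_package(
        {"var/db/pkg/bar-1/CATEGORY": b"app-foo\n",
         "var/db/pkg/bar-1/PF": b"bar-1\n"},
        {"usr/bin/bar": b"#!/bin/sh\necho bar\n"},
    )
    for name, value in read_metadata(io.BytesIO(package)).items():
        print(name, value)
```

Since the reader only needs a file-like object with read(), the same
function would in principle work on top of a streamed HTTP response,
which is what makes the remote case cheap.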

Secondly, if we're up for some more tricks, we could technically split
the tarball into metadata and data blocks compressed separately.  This
will need a bit of archiver customization but it will make it possible
to decompress the metadata part without even touching compressed data,
and to replace it without recompressing data.

What's important is that both tricks proposed maintain backwards
compatibility with regular compressed tarballs.  That is, the user will
still be able to extract it with regular archiving tools.
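
The second trick (separately compressed metadata and data blocks) can
be sketched as follows.  The sketch assumes gzip, relying on the fact
that concatenated gzip members decompress as one stream; the metadata
part is written as bare tar blocks without an end-of-archive marker so
that the data part can follow it.  All names are made up for this
example.

```python
import gzip
import io
import tarfile
import zlib

END_OF_ARCHIVE = bytes(2 * 512)  # two zero-filled blocks terminate a tar archive


def tar_blocks(members):
    """Serialize members as tar blocks without the end-of-archive marker."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
        end = raw.tell()  # position before close() appends the zero blocks
    return raw.getvalue()[:end]


def build_split_package(metadata, files):
    """Return one .tar.gz whose metadata and data are separate gzip members."""
    return (gzip.compress(tar_blocks(metadata))
            + gzip.compress(tar_blocks(files) + END_OF_ARCHIVE))


def read_metadata_member(fileobj, chunk_size=4096):
    """Decompress only the first gzip member.

    Returns (uncompressed tar blocks, length of the compressed member).
    """
    decomp = zlib.decompressobj(31)  # 31 = expect a gzip header and trailer
    out = bytearray()
    consumed = 0
    while not decomp.eof:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            raise ValueError("truncated gzip member")
        consumed += len(chunk)
        out += decomp.decompress(chunk)
    # unused_data holds bytes that were read past the end of the member.
    return bytes(out), consumed - len(decomp.unused_data)


def replace_metadata(package, new_metadata):
    """Swap the metadata member, reusing the compressed data bytes as-is."""
    _, meta_len = read_metadata_member(io.BytesIO(package))
    return gzip.compress(tar_blocks(new_metadata)) + package[meta_len:]
```

The result of build_split_package() or replace_metadata() can still be
opened with tarfile.open(..., mode="r:gz") like any other tarball,
while read_metadata_member() stops at the end of the first member.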


Adding OpenPGP signatures
-------------------------
This is the main open issue (XXX) here.

Technically, the most obvious solution is to cover the entire tarball
with an OpenPGP signature.  However, this has the disadvantage that
the verification requires fetching the whole file.

I will look into the possibility of having partial signatures.


-- 
Best regards,
Michał Górny
