Hi,

I have been looking into the aliasing problems in dpkg on behalf of Freexian's Debian funding. To that end I proposed a possible way forward last year (https://lists.debian.org/debian-dpkg/2022/11/msg00007.html), but the feedback I got was not particularly helpful in determining consensus. A little later, Simon Richter also looked into the problem (https://lists.debian.org/debian-dpkg/2022/12/msg00023.html), but remained silent after the initial post. Little has happened since then.

Now Raphael Hertzog proposed to use the DEP process to get this thing unstuck, and with the help of Emilio Pozuelo Monfort I created a draft for discussion. I allocated number 17 via debian-project@l.d.o.

What follows is the draft text. Please consider it a best-effort attempt at reconciling feedback wherever I could. At the time of this writing it certainly is not consensus, but consensus is what I seek here. Without further ado, the full DEP text follows after my name, while it also is available at https://salsa.debian.org/dep-team/deps/-/merge_requests/5

Helmut

Introduction
============

At its core, `dpkg` assumes that every filename uniquely refers to a file on disk. The situation where two distinct filenames refer to the same file on disk is referred to as aliasing. Violating this assumption leads to undefined behaviour such as file loss. The assumption is commonly violated when a leading directory component contains a symbolic link.

A situation where this is known to cause file loss happens when a file is moved from one binary package to another binary package while at the same time changing the filename in a way that retains its final location. In this situation, `dpkg` may first unpack the file at its new location from the replacing package and then remove the replaced package, thus unknowingly removing the aliased file. Other components such as `dpkg-divert` or `update-alternatives` are likely affected in similar ways.

The purpose of this DEP is selecting and implementing a change to `dpkg` to improve the way it handles such situations that affect typical Debian installations.

Naive solution
==============

In theory, `dpkg` could resolve this automatically. For every file it touches, it could canonicalize the location using the actual filesystem and check whether any other installed file has the same canonicalized location. Unfortunately, `dpkg` cannot know which filenames can collide, so it would check every filename in its database. For canonicalization, it would `stat()` every component of every filename. This easily amounts to a million or more `stat()` calls on larger installations. Caching could reduce the impact somewhat, but since Debian introduces aliases during maintainer scripts, it would have to invalidate the cache after maintainer scripts have been run. The resulting performance would be unacceptable.

Proposal
========

In order to handle aliasing efficiently, `dpkg` gains new options `--add-alias <symlink>`, `--remove-alias <symlink>` and `--list-aliases`.
When creating symbolic links that cause aliasing effects, the creating entity is supposed to inform `dpkg` using an appropriate invocation. Doing so records the aliasing information in a new mapping inside its administrative directory. No existing administrative files are modified as a result of this operation.

When `dpkg` operates on paths, it can compute a canonicalized version using a pure function without the need to `stat()` files on disk, thus greatly improving performance. Canonicalized paths are only needed when determining whether a file conflict exists. In all other cases, original paths continue to be used, as symbolic links will be followed by filesystem operations.

The `--add-alias` operation records the target of the symbolic link, which must exist prior to invocation. The `--remove-alias` operation fails if any files are still installed in the aliased location.

Rejected proposals
==================

Hardcoding aliases into dpkg
----------------------------

It was suggested to include a static aliasing mapping into the `dpkg` source code. Since `dpkg` is used by multiple projects in different ways (not necessarily Debian-derivatives), this approach would break other consumers. Also note that Debian's `dpkg` can be used to operate on an installation using different aliases via the `--root` flag. As such, the alias mapping needs to be a property of the installation.

Modifying package lists in place
--------------------------------

`dpkg` could rewrite the extracted `.list` files from `control.tar` and store paths in canonicalized form. Canonicalization would happen when a `control.tar` is extracted. It would also happen either as a one-time conversion during the upgrade of `dpkg` or whenever a `.list` file is read. Given canonicalized list files, string comparison on files would support conflict detection. Other pieces to be updated in a similar way include `alternatives`, `diversions`, `statoverride`, and `triggers`.
This would affect the output of `dpkg -S`, which would then output canonicalized paths. Packages generated by `dpkg-repack` would have their contents canonicalized as well.

Managing the aliasing mapping using a control file
--------------------------------------------------

It was suggested that the mapping could be managed via a special control file `canonical`. Given that aliasing is not a common operation, the benefit of handling it declaratively is minor. Beyond that, aliasing can also happen as a customization issued by an administrator. Therefore, a command-line-based approach is preferred.

Having dpkg move files and create symbolic links
------------------------------------------------

When instructed with `--add-alias`, `dpkg` could also create the corresponding symbolic links and move the affected files to their new location. While that would be convenient, doing so in an atomic way is non-trivial. Sometimes, the underlying filesystem does not fully conform to POSIX (e.g. `overlayfs`) and such corner cases need to be managed individually. Since such an implementation already exists outside `dpkg` and its complexity is non-trivial, the moving of files shall remain external. In case aliases are set up in a bootstrap setting, no moves are necessary.

Implement aliasing after metadata tracking
------------------------------------------

The [metadata tracking](https://wiki.debian.org/Teams/Dpkg/Spec/MetadataTracking) feature enhances `dpkg` with knowledge about filesystem metadata for installed files. This includes knowledge of symbolic links, which would help with tracking aliasing. Unfortunately, progress on this is fairly slow and we think that aliasing support is more urgent.

Proposal internals
==================

A new file `aliases` is added to the administrative directory. Pairs of lines containing link name and destination indicate an alias. Within this file, no link name or destination may contain another link name.
The `--add-alias` and `--remove-alias` options change this file only and must ensure that these properties are retained.

This leads to a trivial algorithm for canonicalizing paths. A given path can be scanned for recorded link names as a sub path and have them replaced with the recorded destination. This process is repeated until a scan passes without performing a substitution. Usually, two scan passes will be sufficient.

Much of the internal work has been prototyped by [Simon Richter](https://salsa.debian.org/sjr/dpkg/-/tree/wip-canonical-paths) and can be used. It demonstrates how the `fsys_namenode` can be augmented with a canonicalized path and how `fsys_hash_find_node` can be extended with a new flag to differentiate between lookups considering aliases or exact names. It differs from what is proposed here in the API to configure aliases and in possibly storing partially canonicalized versions of file names.
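
To make the canonicalization algorithm from the "Proposal internals" section more concrete, here is a minimal Python sketch. It is not taken from `dpkg` or from Simon Richter's prototype. The function names (`split_path`, `parse_aliases`, `check_aliases`, `canonicalize`) are hypothetical, and it rests on several assumptions of mine: "sub path" means a leading run of whole path components, "contain" is read the same way, and the symbolic link itself (e.g. `/bin`) is left untouched while only paths beneath it are rewritten.

```python
def split_path(path):
    """Split an absolute path into a tuple of its non-empty components."""
    return tuple(c for c in path.split("/") if c)


def parse_aliases(text):
    """Parse pairs of lines (link name, then destination) into a dict."""
    lines = [line for line in text.splitlines() if line]
    if len(lines) % 2:
        raise ValueError("aliases must be recorded as pairs of lines")
    return dict(zip(lines[0::2], lines[1::2]))


def starts_with(parts, prefix):
    """True if `prefix` is a leading sub path of `parts` (or equal to it)."""
    return parts[:len(prefix)] == prefix


def check_aliases(aliases):
    """Reject mappings where a link name or destination contains a link name.

    Assumption: "contains" means "has it as a leading sub path".
    """
    for link in aliases:
        lp = split_path(link)
        for other_link, dest in aliases.items():
            if other_link != link and starts_with(split_path(other_link), lp):
                raise ValueError(f"{other_link} contains link name {link}")
            if starts_with(split_path(dest), lp):
                raise ValueError(f"{dest} contains link name {link}")


def canonicalize(path, aliases):
    """Pure function: no stat() calls, only the recorded mapping is used."""
    table = {split_path(k): split_path(v) for k, v in aliases.items()}
    parts = split_path(path)
    changed = True
    while changed:  # repeat until a scan pass performs no substitution
        changed = False
        for link, dest in table.items():
            # Assumption: only paths strictly below the link are rewritten.
            if len(parts) > len(link) and starts_with(parts, link):
                parts = dest + parts[len(link):]
                changed = True
    return "/" + "/".join(parts)
```

With a mapping such as `/bin` to `/usr/bin`, `canonicalize("/bin/sh", aliases)` and `canonicalize("/usr/bin/sh", aliases)` both yield `/usr/bin/sh`, so a plain string comparison of canonicalized paths is enough to detect the conflict. Matching on whole components avoids rewriting unrelated names such as `/binary/x`.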