This is an interesting proposal and discussion. I assume that the people who worked on it looked into various possibilities for its implementation and decided on the current one, but I have a few questions:

- Since there are people concerned about the increased size of the
  binary, and since none of the fields are mandatory, would it be
  beneficial to use a package URL (PURL[1]) instead? That way, a few
  bytes can be saved (a few values are included in the same key). E.g.

  { "type":"rpm", "os":"fedora", "osVersion":"33", "name":"coreutils", "version":"4711.0815.fc13", "architecture":"arm32", "osCpe": "cpe:/o:fedoraproject:fedora:33", "debugInfoUrl": "https://debuginfod.fedoraproject.org/"}

  would become:

  { "purl":"pkg:rpm/fedora/coreutils@4711.0815.fc13?arch=arm32&distro=fedora-13", "osCpe": "cpe:/o:fedoraproject:fedora:33", "debugInfoUrl": "https://debuginfod.fedoraproject.org/"}

- There are a few existing formats for software identification and SBOMs:
  - SPDX[2], used in the example spec in the proposal
  - SWID tags[3][4]
  - OWASP CycloneDX[5]
  - CPE[6], used in the example JSON above
  - PURL[1]
  - probably more :D
  A few of those are too verbose to be considered, but in particular
  CycloneDX can be expressed in JSON and supports PURL.

- What level of trustworthiness would the generated JSON have? Is it
  relevant or in scope? Some of the existing formats support
  signatures, e.g. CycloneDX supports JSF[7].

Thanks!

[1]: https://github.com/package-url/purl-spec
[2]: https://spdx.github.io/spdx-spec/
[3]: http://standards.iso.org/iso/19770/-2/2015-current/schema.xsd
[4]: https://csrc.nist.gov/schema/swid/2015-extensions/swid-2015-extensions-1.0.xsd
[5]: https://cyclonedx.org/docs/1.3/json/#components
[6]: https://csrc.nist.gov/projects/security-content-automation-protocol/specifications/cpe
[7]: https://cyberphone.github.io/doc/security/jsf.html

On Mon, Oct 25, 2021 at 9:09 PM Ben Cotton <bcot...@redhat.com> wrote:
> https://fedoraproject.org/wiki/Changes/Package_information_on_ELF_objects
>
> == Summary ==
> All binaries (executables and shared libraries) are annotated with an
> ELF note that identifies the rpm for which this file was built.
> This allows binaries to be identified when they are distributed
> without any of the rpm metadata. `systemd-coredump` uses this to log
> package versions when reporting crashes.
>
> == Owner ==
> * Name: [[User:Zbyszek|Zbigniew Jędrzejewski-Szmek]]
> * Email: zbys...@in.waw.pl
> * Name: Lennart Poettering
> * Email: mzsrq...@0pointer.net
>
>
> == Detailed Description ==
> People mix binaries (programs and libraries) from different
> distributions (for example using Fedora containers on Debian or vice
> versa), distribute binaries without packaging metadata (for
> example by stripping everything except the binary from a container
> image, also removing `/usr/lib/.build-id/*`), compile their own rpm
> packages (for internal distribution and installation), and compile and
> distribute their own binaries. Sometimes we need to introspect a
> binary and figure out its provenance, for example when a program
> crashes and we are looking at a core dump, but also when we have a
> binary without the packaging metadata. When the need to introspect a
> binary arises, we have some very good mechanisms to show the
> provenance: when a file is installed through the package manager we
> can directly list the providing package, but even without this we can
> use build-ids embedded in the binary to uniquely identify the
> originating build. But those mechanisms work best when we're in the
> realm of a single distribution. In particular, build-ids can be easily
> tied to a source rpm, but only when the source rpm is part of
> the distribution and the build-id was registered in the appropriate
> database which maps build-ids to real package names. When we move
> outside of the realm of a single distribution, it can be hard to
> figure out where a given binary originates from. If we know that a
> binary is from a given distribution, we may be able to use some
> distro-specific mechanism to figure out this information.
> But those mechanisms will be different for different distributions
> and will often require network access. With this change we aim to
> provide a mechanism that is very simple, provides "human-readable"
> origin information without further processing, is portable across
> distros, and works without network access.
>
> The directly motivating use case is display of core dumps. Right now
> we have build-ids, but those are just opaque hexadecimal numbers that
> are not meaningful to users. We would like to immediately list
> versions of packages involved in the crash (including both the program
> and any libraries it links to). It is not enough to query the rpm
> database to do the equivalent of `rpm -qf …`: very often programs
> crash after some packages have been upgraded and the binaries loaded
> into memory are not the binaries that are currently present on disk,
> or when through some mishap, the binaries on disk do not match the
> installed rpms. A mechanism that works without rpm database lookup or
> network access allows this information to be shown immediately in
> `coredumpctl` listings and journal entries about the crash. This
> includes crashes that happen in the initrd and sandboxed containers.
>
> A second motivating use case is when users distribute their own
> binaries and would like to collect crash information. Build-ids are a
> solution that is technically possible, but easy to get wrong in
> practice: users would need to immediately record the build-id after
> the build and store the mapping to program names, versions, and build
> number in some database. It's much easier to be able to record
> something during the build in the build product itself.
>
> A third motivating use case is the general mixing of Fedora binaries
> with programs and libraries from different distributions, both with
> our binaries being used as the base for foreign binaries, and the
> other way around.
> Whilst most distributions provide some mechanism to figure out the
> source build information, those mechanisms vary by distribution and
> may not be easy to access from a "foreign" system.
> Such mixing is expected with containers, flatpaks, snaps, Python
> binary wheels, anaconda packages, and quite often when somebody
> compiles a binary and puts it up on the web for other people to
> download.
>
> We propose a new mechanism which is designed to be very simple but
> extensible: a small JSON document is embedded in a section in the ELF
> binary. This document can be easily read by a human if necessary, but
> it is also well-defined and can be processed programmatically. For
> example, `systemd-coredump` will immediately make use of this to
> display package ''nevra'' information for crashes. The format is also
> easy to generate, so it can be added to any build system, either using
> the helpers that we provide or even reimplemented from scratch.
>
> For the case where we mix binaries from different distros (the third
> motivating use case above), this approach is the most useful when this
> system is used by all distros and even non-distro builds. The more
> widely it is used, the more useful it becomes. The specification was
> developed in collaboration with Debian developers, and we hope that
> Fedora and Debian will lead the way for this to become as widely used
> as build-ids. But even if the information is only available from some
> distros, it is still useful, except that fallback mechanisms need to
> be implemented.
>
> === Existing system: `.note.gnu.build-id` ===
>
> We already have build-ids: every ELF object has a `.note.gnu.build-id`
> note, and given a core file, we can read the build-id and look it up
> in the rpm database
> (`dnf repoquery --whatprovides debuginfo(build-id) = …`)
> to map it to a package name.
> Build-ids are unique and compact and very generic and work as expected
> in general.
> But they have some downsides:
> * build-ids are not very informative for users. Before the build-id is
> converted back to the appropriate package, it's completely opaque.
> * build-ids require a working rpm database or an internet connection
> to map to the package name.
>
> Three important cases:
> * minimal containers: the rpm database is not installed in the
> containers. The information about build-ids needs to be stored
> externally, so package name information is not available immediately,
> but only after offline processing. The new note doesn't depend on the
> rpm db in any way.
> * handling of a core from a container, where the container and host
> have different distros
> * self-built and external packages: unless a lot of care is taken to
> keep access to the debuginfo packages, this information may be lost.
> The new note is available even if the repository metadata gets lost.
> Users can easily provide equivalent information in a format that makes
> sense in their own environment. It should work even when rpms and debs
> and other formats are mixed, e.g. during container image creation.
>
> === New system: `.note.package` ===
>
> The new note is created and propagated similarly to
> `.note.gnu.build-id`. The difference is that we inject the information
> about package ''nevra'' from the build system.
>
> The implementation is very simple: `%{build_ldflags}` are extended
> with a command to insert a custom note as a separate section in an ELF
> object. See
> [https://github.com/systemd/package-notes/blob/main/hello.spec hello.spec]
> for an example. This is done in the default macros, so all
> packages that use the prescribed link flags will be affected.
>
> The note is a compact json string. This allows the format to be
> trivially extensible (new fields can be added at will) and easy to
> process (json is extremely popular and parsers are widely available).
> Using a single field rather than a set of separated notes is more
> space-efficient.
> With multiple fields the padding and alignment requirements cause
> unnecessary overhead.
>
> The system was designed with cross-distro collaboration and is
> flexible enough to identify binaries from different packaging formats
> and build systems (rpms, debs, custom binaries).
>
> See https://systemd.io/COREDUMP_PACKAGE_METADATA/ for a detailed
> description of the format.
>
> One of the advantages of using an ELF note, as opposed to say a series
> of extended attributes on the binary itself, is that the ELF note gets
> automatically captured and copied into a core file by the kernel.
> Extended attributes would have to be copied manually, which might not
> even be possible because the binary on disk may have been removed by
> the time the crash is analyzed.
>
> The overhead is about 200 bytes for each ELF object.
> Overall we have about 33200 files in `/usr/s?bin/` and about 36600
> `.so` files (F35, single architecture, results from
> `dnf repoquery -l 2>/dev/null | rg '^/usr/s?bin/' | sort -u | wc -l`,
> `dnf repoquery -l 2>/dev/null | rg '^/usr/lib64/.*\.so$' |sort -u|wc -l`).
> If we do this for the whole distro, we get 69800 × 200 = 13 MB.
> For a typical installation, we can expect about 300–400 kB.
> Thus the additional space used is negligible (also see the
> Feedback section for more discussion).
>
> Precise measurements TBD once this is turned on and we have real
> measurements for a larger number of builds.
>
> === Examples ===
> <pre>
> $ objdump -s -j .note.package build/libhello.so
>
> build/libhello.so: file format elf64-x86-64
>
> Contents of section .note.package:
>  02ec 04000000 63000000 7e1afeca 46444f00  ....c...~...FDO.
>  02fc 7b227479 7065223a 2272706d 222c226e  {"type":"rpm","n
>  030c 616d6522 3a226865 6c6c6f22 2c227665  ame":"hello","ve
>  031c 7273696f 6e223a22 302d312e 66633335  rsion":"0-1.fc35
>  032c 2e783836 5f363422 2c226f73 43706522  .x86_64","osCpe"
>  033c 3a226370 653a2f6f 3a666564 6f726170  :"cpe:/o:fedorap
>  034c 726f6a65 63743a66 65646f72 613a3333  roject:fedora:33
>  035c 227d0000                             "}..
> </pre>
>
> <pre>
> $ readelf --notes build/hello | grep "description data" | sed -e "s/\s*description data: //g" -e "s/ //g" | xxd -p -r | jq
> readelf: build/hello: Warning: Gap in build notes detected from 0x1091 to 0x10de
> readelf: build/hello: Warning: Gap in build notes detected from 0x1091 to 0x10af
> readelf: build/hello: Warning: Gap in build notes detected from 0x1091 to 0x119f
> {
>   "type": "rpm",
>   "name": "hello",
>   "version": "0-1.fc35.x86_64",
>   "osCpe": "cpe:/o:fedoraproject:fedora:33"
> }
> </pre>
>
> <pre>
> $ coredumpctl info
> PID: 44522 (fsverity)
> ...
> Package: fsverity-utils/1.3-1
> build-id: ac89bf7175b04d7eec7f6544a923f45be111f0be
> Message: Process 44522 (fsverity) of user 1000 dumped core.
>
> Found module /home/bluca/git/fsverity-utils/libfsverity.so.0 with build-id: fa40fdfb79aea84167c98ca8a89add9ac4f51069
> Metadata for module /home/bluca/git/fsverity-utils/libfsverity.so.0 owned by FDO found: {
>   "packageType" : "deb",
>   "package" : "fsverity-utils",
>   "packageVersion" : "1.3-1"
> }
>
> Found module linux-vdso.so.1 with build-id: aba08e06103f725e26f1d7c178fb6b76a564a35d
> Found module libpthread.so.0 with build-id: e91114987a0147bd050addbd591eb8994b29f4b3
> Found module libdl.so.2 with build-id: d3583c742dd47aaa860c5ae0c0c5bdbcd2d54f61
> Found module ld-linux-x86-64.so.2 with build-id: f25dfd7b95be4ba386fd71080accae8c0732b711
> Found module libcrypto.so.1.1 with build-id: 749142d5ee728a76e7cdc61fd79d2311a77405a2
> Found module libc.so.6 with build-id: 18b9a9a8c523e5cfe5b5d946d605d09242f09798
> Found module fsverity with build-id: ac89bf7175b04d7eec7f6544a923f45be111f0be
> Metadata for module fsverity owned by FDO found: {
>   "packageType" : "deb",
>   "package" : "fsverity-utils",
>   "packageVersion" : "1.3-1"
> }
>
> Stack trace of thread 44522:
> #0 0x00007fe7c8af26f4 __GI___nanosleep (libc.so.6 + 0xc66f4)
> #1 0x00007fe7c8af262a __sleep (libc.so.6 + 0xc662a)
> #2 0x00005608481407dd main (fsverity + 0x27dd)
> #3 0x00007fe7c8a5009b __libc_start_main (libc.so.6 + 0x2409b)
> #4 0x000056084814094a _start (fsverity + 0x294a)
> </pre>
>
> == Feedback ==
> See [https://github.com/systemd/systemd/issues/18433 systemd issue #18433]
> for upstream discussion and implementation proposals.
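
For reference, the size estimate from the Detailed Description can be re-derived with a few lines of Python. This is only a sketch of the arithmetic, using the numbers quoted above (about 200 bytes per ELF object, 33200 files in /usr/s?bin/ and 36600 .so files); the variable names are mine and nothing here measures a real system.

```python
# Numbers taken from the proposal text; names are illustrative only.
BYTES_PER_NOTE = 200      # "about 200 bytes for each ELF object"
bin_files = 33200         # files in /usr/s?bin/ (F35, single architecture)
so_files = 36600          # .so files under /usr/lib64/

total_objects = bin_files + so_files
total_bytes = total_objects * BYTES_PER_NOTE

print(f"ELF objects:        {total_objects}")
print(f"Total overhead:     {total_bytes} bytes")
print(f"  in decimal MB:    {total_bytes / 10**6:.2f}")
print(f"  in binary MiB:    {total_bytes / 2**20:.2f}")

# The "typical installation" figure of 300-400 kB corresponds to this many
# annotated ELF objects at 200 bytes each:
for kilobytes in (300, 400):
    print(f"{kilobytes} kB is about {kilobytes * 1000 // BYTES_PER_NOTE} objects")
```

The total is 13,960,000 bytes, which is about 13.3 MiB, so the quoted "13 MB" looks like a binary-unit figure (in decimal units it would round to 14 MB).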
>
> === Concerns about additional changes to files ===
>
> <pre>
> 17:32:30 <Eighth_Doctor> I think zbyszek underestimates how much of a problem it is to stamp every ELF binary with ''nevra'' data
> 17:32:44 <mhroncok> zbyszek: so, assuming python has ~100 ELF .so files and I change one text file
> 17:33:22 <mhroncok> (ignore for the time being that the .so files often changed because of toolchain updates and assume they are stable)
> </pre>
>
> I tested this with python3.10. So far there are 13 builds of that
> package in F35:
> `python3.10-3.10.0-1.fc35`,
> `python3.10-3.10.0~a6-1.fc35`,
> `python3.10-3.10.0~a6-2.fc35`,
> `python3.10-3.10.0~a7-1.fc35`,
> `python3.10-3.10.0~b1-1.fc35`,
> `python3.10-3.10.0~b2-2.fc35`,
> `python3.10-3.10.0~b2-3.fc35`,
> `python3.10-3.10.0~b3-1.fc35`,
> `python3.10-3.10.0~b4-1.fc35`,
> `python3.10-3.10.0~b4-2.fc35`,
> `python3.10-3.10.0~b4-3.fc35`,
> `python3.10-3.10.0~rc1-1.fc35`,
> `python3.10-3.10.0~rc2-1.fc35`.
> I extracted the builds (for `.x86_64`) and made a list of all `.so`
> files (1368 files), and calculated sha256 hashes for them. No two
> files repeat: there are 1368 distinct hashes. So the files are
> '''already''' different between builds and the additional proposed
> metadata will not make a significant difference.
>
> Note that this range of Python versions encompasses periods when the
> package is under development and undergoes significant changes (alpha
> versions), and when it's only undergoing small changes (rc versions).
>
> The fact that we get different files in each build is not surprising,
> because files embed build-ids which differ between builds. But even if
> we ignore those, binaries generally differ between builds. Even sizes
> tend to vary between builds: there are 636 distinct `.so` file sizes,
> i.e. on average any given size only repeats twice (presumably most
> often for the same file).
> Running `diffoscope` on `.so` files from different builds shows minor
> changes in the assembly which I did not analyze further.
>
> If people have specific questions, for example about overhead in some
> scenario, I'd be happy to answer them. Until now, the issues that were
> raised were very vague, so it's impossible to answer them.
>
> === Why not just use the rpm database? ===
>
> <pre>
> 17:34:33 <dcantrell> The main reason for this appears to be that we need the RPM db locally to resolve build-ids to package names. But since containers wipe /var/lib/rpm, we can't do that. So the solution is to put the ''nevra'' in ELF metadata?
> 17:34:39 <dcantrell> That feels like the wrong approach.
> </pre>
>
> First, there are legitimate reasons to strip packaging metadata from
> images. For example, for an initrd image from rpms, I get 117 MB of
> files (without compression), and out of this `/var/lib/rpm` is 5.9 MB,
> and `/var/lib/dnf` is 4.2 MB. This is an overhead of 9%. This is ''not
> much'', but still too much to keep in the image unless necessary.
> Similar ratios will happen for containers of similar size. Reducing
> image size by one tenth is important. There is no `rpm` or `dnf` in
> the image, so the package database is not even usable without external
> tools.
>
> As discussed on IRC
> (https://meetbot.fedoraproject.org/teams/fesco/fesco.2021-05-11-17.01.log.html),
> the containers ''we'' build don't wipe this metadata, but custom
> Dockerfiles do that.
>
> Second, as described in the Description section above, not everybody and
> everything uses rpm. The Fedora motto is "we make an operating system
> and we make it easy for you to do useful stuff with it" (and yes, this
> is an actual quote from the official docs), and this stuff involves
> reusing our binaries in containers and custom installations and
> whatnot, not just straightforward installations with `dnf`.
> And in the other direction, people will build their own binaries that
> are not packaged as rpms. But it is still important to be able to
> figure out the exact version of a binary, especially after it crashes.
>
> === Why do this in Fedora? ===
>
> <pre>
> 17:36:49 <mhroncok> I don't understand how non-rpm distros and custom built binaries are affected by our rpm-build environment :/
> </pre>
>
> The idea is that we inject this into our build system, and Debian
> injects this into their build system, and so on… As mentioned, this is
> a cross-distro effort. Also, people can use it in their custom build
> systems if they build and distribute binaries internally. The scheme
> would obviously be most useful if used comprehensively, but it's still
> useful when available partially. We hope that Fedora can lead the way.
> (This is similar to build-ids: when initially adopted, they were used
> only by some distros, but were useful even then. Nowadays, with
> comprehensive adoption, they are even more useful.)
>
> https://hpc.guix.info/blog/2021/09/whats-in-a-package/ contains a nice
> description of a pathological case of packaging hacks and binary
> redistribution. When trying to unravel something like this,
> information embedded directly in the binaries would be quite useful.
>
>
> == Benefit to Fedora ==
> A simple and reliable way to gather information about package versions
> of programs is added.
> It enhances, instead of replacing, the existing mechanisms.
> It is particularly useful when reporting crash dumps, but can also be
> used for image introspection and forensics, license checks and
> version scans on containers, etc.
>
> If we adopt this in Fedora, Fedora leads the way on implementing the
> standard. Fedora binaries used in any context can be easily
> recognized. Fedora binaries provide a better basis to build things.
>
> If other distros adopt this, we can introspect and report on those
> binaries easily within the Fedora context.
> For example, when somebody is using a container with some programs
> that originate in the Debian ecosystem, we would be able to identify
> those programs without tools like `apt` or `dpkg-query`. Core dump
> analysis executed in the Fedora host can easily provide useful
> information about programs from foreign builds.
>
> == Implementation in Other Distributions ==
> === Microsoft CBL-Mariner ===
> [https://en.wikipedia.org/wiki/CBL-Mariner CBL-Mariner] is an
> [https://github.com/microsoft/CBL-Mariner open source] Linux
> distribution created by Microsoft, targeted at first-party and
> container workloads on Azure. It is used both as a container runner
> host and a base container image.
> Mariner adopted the ELF stamping packaging metadata spec in
> [https://github.com/microsoft/CBL-Mariner/blob/1.0/SPECS/mariner-rpm-macros/gen-ld-script.sh version 1.0],
> initially to add OS metadata, and package-level metadata
> will be added in a following release.
> === Debian ===
> A package-level proof-of-concept is included in the
> [https://github.com/systemd/package-notes/blob/main/dh_package_notes package-notes]
> repository.
> A [https://salsa.debian.org/bluca/debhelper/-/tree/notes_metadata system-level proof-of-concept]
> that enables ELF stamping by default in
> all builds implicitly will be proposed for adoption in the future.
>
> == Scope ==
> * Proposal owners:
> ** create a specification (First version DONE:
> [https://systemd.io/COREDUMP_PACKAGE_METADATA COREDUMP_PACKAGE_METADATA].
> We might need to make some adjustments
> based on the deployment in Fedora, but no big changes are expected.)
> ** write a script to generate the package note (First version DONE:
> [https://github.com/systemd/package-notes/blob/main/generate-package-notes.py generate-package-notes.py])
> ** provide a patch for `redhat-rpm-config` to insert appropriate
> compilation options
> ** extend systemd's coredumpctl to extract and display this
> information (DONE: [https://github.com/systemd/systemd/pull/19135 PR #19135],
> available in systemd-249)
> ** submit pull request to Packaging Guidelines
>
> * Other developers:
> ** possibly add support in abrt?
>
> * Release engineering: There should be no impact.
>
> * Policies and guidelines:
> The new flags should be mentioned in Packaging Guidelines.
>
> * Trademark approval: N/A (not needed for this Change)
>
> * Alignment with Objectives:
> It might be relevant for Minimization. Even though it increases the
> image size a tiny bit, it makes minimized images work a bit better.
>
> == Upgrade/compatibility impact ==
> No impact.
>
> == How To Test ==
> <pre>
> $ bash -c 'kill -SEGV $$'
> $ coredumpctl
> TIME                        PID    UID  GID  SIG     COREFILE EXE           SIZE  PACKAGE
> Mon 2021-03-01 14:37:22 CET 855151 1000 1000 SIGSEGV present  /usr/bin/bash 51.7K bash-5.1.0-2.fc34.x86_64
> </pre>
>
> == User Experience ==
> `coredumpctl` should display information about package versions.
>
> `readelf --notes` or similar tools can be used on `.so` files and
> compiled programs
> to extract the JSON blurb that describes the originating package.
>
> == Dependencies ==
> None.
>
> == Contingency Plan ==
>
> * Contingency mechanism: Remove the new compilation flags. Rebuild any
> packages that were built with the new flags.
> * Contingency deadline: Beta freeze.
> * Blocks release? No.
>
> == Documentation ==
> * https://systemd.io/COREDUMP_PACKAGE_METADATA/
> * https://github.com/systemd/package-notes
>
> See also [[Changes/DebuginfodByDefault]].
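
On the "`readelf --notes` or similar tools" point: the note layout visible in the objdump example above looks simple enough to decode by hand. Below is a minimal Python sketch of that, not the project's tooling. It assumes a little-endian ELF (as in the x86-64 example), and it takes the owner name "FDO" and the note type 0xcafe1a7e from that hexdump. `build_note` and `parse_note` are hypothetical helper names, and the demo constructs its own note bytes rather than reading a real binary.

```python
import json
import struct

# Values read off the objdump example: "7e1afeca" is 0xcafe1a7e in
# little-endian byte order, and the owner field is "FDO" plus a NUL.
FDO_NOTE_TYPE = 0xCAFE1A7E
FDO_OWNER = b"FDO\0"


def _pad4(n: int) -> int:
    """Round n up to the next multiple of 4 (ELF note field alignment)."""
    return (n + 3) & ~3


def build_note(metadata: dict) -> bytes:
    """Build note bytes laid out like the example hexdump (demo only)."""
    desc = json.dumps(metadata, separators=(",", ":")).encode("utf-8") + b"\0"
    header = struct.pack("<III", len(FDO_OWNER), len(desc), FDO_NOTE_TYPE)
    owner = FDO_OWNER.ljust(_pad4(len(FDO_OWNER)), b"\0")
    return header + owner + desc.ljust(_pad4(len(desc)), b"\0")


def parse_note(section: bytes) -> dict:
    """Decode one note: namesz, descsz, type, owner name, JSON description."""
    namesz, descsz, note_type = struct.unpack_from("<III", section, 0)
    name_start = 12  # three 32-bit header words
    name = section[name_start:name_start + namesz]
    desc_start = name_start + _pad4(namesz)
    desc = section[desc_start:desc_start + descsz]
    if name != FDO_OWNER or note_type != FDO_NOTE_TYPE:
        raise ValueError("not a package metadata note")
    return json.loads(desc.rstrip(b"\0").decode("utf-8"))


example = {
    "type": "rpm",
    "name": "hello",
    "version": "0-1.fc35.x86_64",
    "osCpe": "cpe:/o:fedoraproject:fedora:33",
}
note = build_note(example)
print(len(note), parse_note(note))
```

With the example metadata the description size comes out as 0x63, matching the second header word in the hexdump, which is a reasonable sanity check that the layout is read correctly. For a real file one would first have to locate the `.note.package` section, which this sketch does not attempt.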
>
>
>
> --
> Ben Cotton
> He / Him / His
> Fedora Program Manager
> Red Hat
> TZ=America/Indiana/Indianapolis
> _______________________________________________
> devel mailing list -- devel@lists.fedoraproject.org
> To unsubscribe send an email to devel-le...@lists.fedoraproject.org
> Fedora Code of Conduct:
> https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives:
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
> Do not reply to spam on the list, report it:
> https://pagure.io/fedora-infrastructure
>
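
To put a rough number on my first question (PURL versus separate fields), here is a small Python sketch that compares the compact JSON size of the two example documents from the top of this mail. It is only an illustration: the field values are the ones from my example, `compact_size` is a helper name I made up, and the result ignores the note header and padding that the ELF note adds around the JSON.

```python
import json

# The two example documents from the first question above.
separate_fields = {
    "type": "rpm",
    "os": "fedora",
    "osVersion": "33",
    "name": "coreutils",
    "version": "4711.0815.fc13",
    "architecture": "arm32",
    "osCpe": "cpe:/o:fedoraproject:fedora:33",
    "debugInfoUrl": "https://debuginfod.fedoraproject.org/",
}
with_purl = {
    "purl": "pkg:rpm/fedora/coreutils@4711.0815.fc13?arch=arm32&distro=fedora-13",
    "osCpe": "cpe:/o:fedoraproject:fedora:33",
    "debugInfoUrl": "https://debuginfod.fedoraproject.org/",
}


def compact_size(doc: dict) -> int:
    """Size in bytes of the document serialized without extra whitespace."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


size_fields = compact_size(separate_fields)
size_purl = compact_size(with_purl)
print(f"separate fields: {size_fields} bytes")
print(f"with purl:       {size_purl} bytes")
print(f"difference:      {size_fields - size_purl} bytes")
```

The saving comes from dropping the repeated key names and quoting, so it stays roughly constant per note; whether that is worth the extra parsing step for consumers such as `systemd-coredump` is exactly what I am asking.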