Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-04-04 Thread David A. Wheeler via rb-general



> On Apr 2, 2024, at 1:11 PM, John Gilmore wrote:
> 
> For me, the distinction is that the local storage is under the direct
> control of the person trying to rebuild, while the network and the
> servers elsewhere in the network are not.  If local storage is
> unreliable, you can fix or replace it, and continue with your work.

There are obviously many advantages to local storage.

However, if you locally record cryptographic hashes, and re-download the
bits for (say) a compiler, you could still reproduce the results
*if* the information is still available where you're downloading it from
(or can find an alternative source). The key is that "if" condition.
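
A minimal sketch of that "record the hash locally, re-download the bits later" idea, in Python. The function name and the sample bytes are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import hashlib
import hmac


def matches_recorded_hash(data: bytes, recorded_sha256: str) -> bool:
    """Return True if the bytes hash to the locally recorded SHA-256 hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, recorded_sha256.lower())


# Recorded once, when the compiler tarball was first obtained (sample data).
original = b"pretend this is a compiler tarball"
recorded = hashlib.sha256(original).hexdigest()

# Later: the bits are re-downloaded from wherever they are still available.
print(matches_recorded_hash(original, recorded))         # True
print(matches_recorded_hash(original + b"!", recorded))  # False
```

Only the short hex digest has to be kept locally; the check fails loudly if the re-downloaded bits differ in any way.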

The risk of not having local copies is the risk of loss of availability.
However, many sites are fairly reliable. I'd hate to tell someone they
can't verify reproducible builds just because they don't (currently)
have a local copy of everything. Indeed, you want multiple verifications
of reproducible builds, and they'll have to get their data from somewhere.

It's sometimes much easier to send the source including build instructions,
information on how to download the rest, and the cryptographic hashes for
what is not bundled.

--- David A. Wheeler



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-04-03 Thread Richard Purdie
On Tue, 2024-04-02 at 10:11 -0700, John Gilmore wrote:
> James Addison wrote that local storage can contain errors.  I agree.
> 
> > My guess is that we could get into near-unsolvable philosophical territory
> > along this path, but I think it's worth being skeptical of the notions that
> > local-storage is always trustworthy and that the network should always be
> > avoided.
> 
> For me, the distinction is that the local storage is under the direct
> control of the person trying to rebuild, while the network and the
> servers elsewhere in the network are not.  If local storage is
> unreliable, you can fix or replace it, and continue with your work.
> 
> I am looking for reproducibility that is completely doable by the person
> trying to do it, at any time after when they obtain a limited number of
> key items by any means: the bootable binary of the OS release, and what
> the GPL calls the "Corresponding Source".
> 
> And, I am very happy to be seeing lots of incremental progress along the way!

FWIW Yocto Project/OpenEmbedded is able to do something like this.

The builds are "cross" and sufficiently isolated from the host that the
host OS doesn't influence the output. By that I mean we build a cross
compiler and then use the cross compiler to build the target. 

Whilst the intermediate cross compiler may differ bitwise depending on
the host compiler, the generated target output should always be the
same. I say "should" as there can be theoretical contamination sources
but we test this on our infrastructure with diverse hosts (Debian,
Ubuntu, Fedora, Alma, Rocky and OpenSUSE systems of differing versions)
and check we always get the same output. This is what our reproducible
claim is measuring, that this output doesn't differ between those
systems.

The build system doesn't allow network access outside the initial
"fetch" step and it verifies some form of checksum of every external
source input.

The inputs can be fetched from their upstream location, or from a
mirror. The project maintains a mirror but users can also have a local
one of their own. Since the inputs are checksum verified, it doesn't
really matter where.
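
To illustrate why the location doesn't matter once inputs are checksum verified, here is a rough Python sketch. It is not the actual Yocto/OpenEmbedded fetcher; `fetch_verified`, the `download` callback and the example URLs are all hypothetical.

```python
import hashlib
from typing import Callable, Iterable, Optional


def fetch_verified(
    locations: Iterable[str],
    expected_sha256: str,
    download: Callable[[str], Optional[bytes]],
) -> bytes:
    """Try each location in order; return the first payload whose hash matches."""
    for location in locations:
        data = download(location)
        if data is None:
            continue  # location unavailable, try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data  # verified, so the origin no longer matters
    raise LookupError("no location provided a payload matching the checksum")


# Stand-in for the network: upstream is gone, one mirror serves wrong bytes.
fake_network = {
    "https://tampered.example/foo.tar": b"something else",
    "https://mirror.example/foo.tar": b"source tarball",
}
expected = hashlib.sha256(b"source tarball").hexdigest()
urls = [
    "https://upstream.example/foo.tar",
    "https://tampered.example/foo.tar",
    "https://mirror.example/foo.tar",
]
data = fetch_verified(urls, expected, fake_network.get)
print(len(data))
```

The checksum comes from the metadata, so a missing upstream or a mirror serving the wrong bytes is simply skipped.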

So the things needed to build a given output are:

* the metadata (build instructions)
* the build system itself
* sources or a sources mirror (which is verified against the metadata)
* some kind of host to run the build

For the host to run the build, it can be an off the shelf
ubuntu/debian/fedora/whatever or it can also be one of our own output
images, leading to effective self hosting.

Each of the above is something that anyone can easily archive
and restore without significant issues or special knowledge.

This can therefore all be done by anyone, meaning someone building a
product using embedded linux (our target users) can rebuild their
output incorporating any security fixes needed for example, years from
now.

I'd note this isn't theoretical, there are companies doing this today
using the self hosting images so there isn't a dependency on any other
distro either.

Cheers,

Richard



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-04-02 Thread John Gilmore
James Addison wrote that local storage can contain errors.  I agree.

> My guess is that we could get into near-unsolvable philosophical territory
> along this path, but I think it's worth being skeptical of the notions that
> local-storage is always trustworthy and that the network should always be
> avoided.

For me, the distinction is that the local storage is under the direct
control of the person trying to rebuild, while the network and the
servers elsewhere in the network are not.  If local storage is
unreliable, you can fix or replace it, and continue with your work.

I am looking for reproducibility that is completely doable by the person
trying to do it, at any time after when they obtain a limited number of
key items by any means: the bootable binary of the OS release, and what
the GPL calls the "Corresponding Source".

And, I am very happy to be seeing lots of incremental progress along the way!

John

PS: I have a local archive of the source ISO images and the binary ISO
images of many Ubuntu, Fedora, Debian, BSD, etc releases.  It all fits
easily on a single hard disk drive, and that drive has many backups from
different times.  The images all have checksums that were checked when I
obtained the images.  The checksums are in the backups, so I can see if
my copies were tampered with or merely suffered from storage degradation
over time.

And I can easily copy the whole thing and send you a copy, if you want
one; or put it on the Internet (some of the releases are available from
me now via BitTorrent).  If those distros were reproducible, I could
verify that each of those binary releases was untampered.  Or YOU could,
without my help, after you got a copy from me or from anyone.  And if
you suspected a binary Ken Thompson attack, you could use those releases
locally at your site, as the source material for an arbitrarily intense
diverse double-compilation check.  Without my help, and without the help
of anyone else on the Internet.

In short, making a local archive of reproducible binaries and their
corresponding sources, readily enables all the verifications that we are
trying to make common in the world.



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-04-02 Thread James Addison via rb-general
Hi John,

On Fri, 29 Mar 2024 at 19:29, John Gilmore wrote:
>
> kpcyrd wrote:
> > 1) There's currently no way to tell if a package can be built offline
> > (without trying yourself).
>
> Packages that can't be built offline are not reproducible, by
> definition.  They depend on outside events and circumstances
> in order for a third party to reproduce them successfully.
>
> So, fixing that in each package would be a prerequisite to making a
> reproducible Arch distro (in my opinion).

This perspective is valuable because it is certainly true that unreliable
or unexpected responses from a network adapter could cause software builds to
fail, be delayed, or contain errors.

However, I fail to see why any of those circumstances would not be equally
possible in the case of equivalent responses from physically or locally
attached I/O devices.

A storage device could be considered a node on a local network that no other
host is able to communicate with directly; and to my knowledge it's rarely the
case that traffic to-and-from local storage devices is inspected for integrity
by hardware/software outside of the device that it is connected to (which
isn't necessarily the place that it makes sense to run those checks).

My guess is that we could get into near-unsolvable philosophical territory
along this path, but I think it's worth being skeptical of the notions that
local-storage is always trustworthy and that the network should always be
avoided.

Regards,
James


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-29 Thread HW42
John Gilmore:
> kpcyrd wrote:
>> 1) There's currently no way to tell if a package can be built offline 
>> (without trying yourself).
> 
> Packages that can't be built offline are not reproducible, by
> definition.  They depend on outside events and circumstances
> in order for a third party to reproduce them successfully.
> 
> So, fixing that in each package would be a prerequisite to making a
> reproducible Arch distro (in my opinion).

I don't agree. For example the r-b.o [1] definition doesn't mandate who
needs to archive what. We probably can agree that we mean a "verifiable
path from source to binary code" (and not just repeatability, which is
also sometimes meant by reproducible builds in other contexts), but
beyond that the details and motivations will be different depending on
who you ask.

To be clear, I'm not saying that what you'd like to see is not worthwhile.
Actually, I'm very sympathetic to such archiving goals. But if Arch Linux, as
kpcyrd's mails suggest, right now just wants to verify its builder
output soon-ish after upload, that's fine too and can be called
reproducible, in my opinion.

[1]: https://reproducible-builds.org/docs/definition/

> I don't understand why a "source tree" would store a checksum of a
> source tarball or source file, rather than storing the actual source
> tarball or source file.  You can't compile a checksum.

How distros store their source code differs, due to different
needs, historic circumstances, etc. The approach of just having the
packaging definition and patches and then referring to the "original" source
is common, and I certainly see the advantages.

> kpcyrd wrote:
>> Specifically Gentoo and OpenBSD Ports have solutions for this that I 
>> really like, they store a generated list of URLs along with a 
>> cryptographic checksum in a separate file, which includes crates 
>> referenced in e.g. a project's Cargo.lock.
> 
> I don't know what a crate or a Cargo.lock is,

A crate is a Rust package, and Cargo.lock is how Cargo (Rust's
package/dependency manager) pins specific dependencies, including their
hashes.

> but rather than fix the problem at its source (include the source
> files), you propose to add another complex circumvention alongside the
> existing package building infrastructure?  What is the advantage of
> that over merely doing the "cargo fetch" early rather than late and
> putting all the resulting source files into the Arch source package?

I'm not an Arch developer, but probably because a package source repo
like [2] is much easier for them to handle than committing the
source of all (transitive) dependencies [3] would be.

[2]: https://gitlab.archlinux.org/archlinux/packaging/packages/rage-encryption
[3]: https://github.com/str4d/rage/blob/v0.10.0/Cargo.lock

(Note that Arch made the, for a classic Linux distro currently rather
unusual, decision to build Rust programs with the exact dependencies
upstream has defined and not separately package those libraries.)

>> 3) All of this doesn't take BUILDINFO files into account
> 
> The BUILDINFO files are part of the source distribution needed
> to reproduce the binary distribution.  So they would go on the
> source ISO image.
> 
>> I did some digging and downloaded the buildinfo files for each package 
>> that is present in the archlinux-2024.03.01 iso
> 
> Thank you for doing that digging!
> 
>>   Using plenty of different gcc versions looks 
>> annoying, but is only an issue for bootstrapping, not for reproducible 
>> builds (as long as everything is fully documented).
> 
> I agree that it's annoying.  It compounds the complexity of reproducing
> the build.  Does Arch get some benefit from doing so?
> 
> Ideally, a binary release ISO would be built with a single set of
> compiler tools.  Why is Arch using a dozen compiler versions?  Just to
> avoid rebuilding binary packages once the binary release's engineers
> decide what compiler is going to be this release's gold-standard
> compiler?  (E.g. The one that gets installed when the user runs pacman
> to install gcc.)  Or do the release-engineers never actually standardize
> on a compiler -- perhaps new ones get thrown onto some server whenever
> someone likes, and suddenly all the users who install a compiler just
> start using that one?

If you look at classic Linux distros it's the norm to iteratively add
packages to your repo and build new packages with what is in the
(development) repo at this time. So a single snapshot of a repo will in
nearly all cases not contain all versions to reproduce the packages in
that snapshot.

You will find this in Arch, Debian, Fedora, and so on. Some other distros,
like Yocto, might make different decisions, but those are rather the
exception.
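
As a small illustration of that point, a set difference is enough to show which build-environment packages a single repo snapshot no longer carries. This is only a sketch: the package strings are taken from the gcc list elsewhere in this thread, but which package was built with which compiler is made up here.

```python
def missing_from_snapshot(build_env: set[str], snapshot: set[str]) -> set[str]:
    """Packages a rebuild needs that the given repo snapshot does not contain."""
    return build_env - snapshot


# Hypothetical: a package in today's repo was built a while ago with gcc 12.
build_env = {"gcc-12.2.1-4-x86_64", "rust-1:1.76.0-1-x86_64"}
# Today's snapshot only carries the newest version of each package.
snapshot = {"gcc-13.2.1-5-x86_64", "rust-1:1.76.0-1-x86_64"}

print(sorted(missing_from_snapshot(build_env, snapshot)))
```

Anything reported here would have to come from an archive of older package versions rather than from the snapshot itself.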

> It currently seems that there is no guarantee that on day X, if you
> install gcc on Arch (from the Internet) and on the same day you pull in
> the source code of pacman package Y, that it will even build with the
> Day X version of gcc.  Is that true?

As described above for the 

Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-29 Thread John Gilmore
kpcyrd wrote:
> 1) There's currently no way to tell if a package can be built offline 
> (without trying yourself).

Packages that can't be built offline are not reproducible, by
definition.  They depend on outside events and circumstances
in order for a third party to reproduce them successfully.

So, fixing that in each package would be a prerequisite to making a
reproducible Arch distro (in my opinion).

I don't understand why a "source tree" would store a checksum of a
source tarball or source file, rather than storing the actual source
tarball or source file.  You can't compile a checksum.

kpcyrd wrote:
> Specifically Gentoo and OpenBSD Ports have solutions for this that I 
> really like, they store a generated list of URLs along with a 
> cryptographic checksum in a separate file, which includes crates 
> referenced in e.g. a project's Cargo.lock.

I don't know what a crate or a Cargo.lock is, but rather than fix the
problem at its source (include the source files), you propose to add
another complex circumvention alongside the existing package building
infrastructure?  What is the advantage of that over merely doing the
"cargo fetch" early rather than late and putting all the resulting
source files into the Arch source package?

> 3) All of this doesn't take BUILDINFO files into account

The BUILDINFO files are part of the source distribution needed
to reproduce the binary distribution.  So they would go on the
source ISO image.

> I did some digging and downloaded the buildinfo files for each package 
> that is present in the archlinux-2024.03.01 iso

Thank you for doing that digging!

>   Using plenty of different gcc versions looks 
> annoying, but is only an issue for bootstrapping, not for reproducible 
> builds (as long as everything is fully documented).

I agree that it's annoying.  It compounds the complexity of reproducing
the build.  Does Arch get some benefit from doing so?

Ideally, a binary release ISO would be built with a single set of
compiler tools.  Why is Arch using a dozen compiler versions?  Just to
avoid rebuilding binary packages once the binary release's engineers
decide what compiler is going to be this release's gold-standard
compiler?  (E.g. The one that gets installed when the user runs pacman
to install gcc.)  Or do the release-engineers never actually standardize
on a compiler -- perhaps new ones get thrown onto some server whenever
someone likes, and suddenly all the users who install a compiler just
start using that one?

It currently seems that there is no guarantee that on day X, if you
install gcc on Arch (from the Internet) and on the same day you pull in
the source code of pacman package Y, that it will even build with the
Day X version of gcc.  Is that true?

John



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-29 Thread kpcyrd

On 3/29/24 6:48 AM, John Gilmore wrote:

> John Gilmore wrote:
> Bootstrappable builds are a different thing.  Worthwhile, but not
> what I was asking for.  I just wanted provable reproducibility from two
> ISO images and nothing more.
>
> I was asking that a bare amd64 be able to boot from an Arch Linux
> *binary* ISO image.  And then be fed a matching Arch Linux *source* ISO
> image.  And that the scripts in the source image would be able to
> reproduce the binary image from its source code, running the binaries
> (like the kernel, shell, and compiler) from the binary ISO image to do
> the rebuilds (without Internet access).
>
> This should be much simpler than doing a bootstrap from bare metal
> *without* a binary ISO image.


I think this project would still be somewhat involved:

1) There's currently no way to tell if a package can be built offline 
(without trying yourself). Some distros have `options=(!net)`-like 
settings, but pacman currently doesn't. Needing network access for 
things like `cargo fetch` or `go mod download` is considered acceptable 
in Arch Linux, since these extra inputs are pinned by cryptographic hash 
(the PKGBUILD acts as a merkle-tree root).


Specifically Gentoo and OpenBSD Ports have solutions for this that I 
really like, they store a generated list of URLs along with a 
cryptographic checksum in a separate file, which includes crates 
referenced in e.g. a project's Cargo.lock. When unpacking them to the 
right location the build itself does not need any additional network 
resources and can run fully offline.


This concept currently does not exist in pacman, one would potentially 
need to generate 100+ lines into the source= array of a PKGBUILD (and 
another 200+ lines for checksums if 2 checksum algorithms are used). 
This is currently considered bad style, because the PKGBUILD is supposed 
to be short, simple and easy to read/understand/audit.


2) The official ISO is meant for installation and maintenance, but does 
not contain a compiler, and I'm not sure it should. Many of the other 
base-devel packages are also missing, but since you also need the build 
dependencies of all the packages you're using (recursively?) this should 
likely be its own ISO (at which point you could also include the source 
code however).


3) All of this doesn't take BUILDINFO files into account, you can use 
Arch Linux as a source-based distro, but if you want exact matches with 
the official packages you would need to match the compiler version that 
was used for each respective package.


I did some digging and downloaded the buildinfo files for each package 
that is present in the archlinux-2024.03.01 iso (using the 
archlinux-userland-fs-cmp tool) and in total these gcc versions have 
been used (gcc7 being part of the usb_modeswitch build environment, but 
I didn't bother investigating why):


gcc7-7.4.1+20181207-3-x86_64
gcc-9.2.0-4-x86_64
gcc-9.3.0-1-x86_64
gcc-10.1.0-1-x86_64
gcc-10.1.0-2-x86_64
gcc-10.2.0-3-x86_64
gcc-10.2.0-4-x86_64
gcc-10.2.0-6-x86_64
gcc-11.1.0-1-x86_64
gcc-11.2.0-4-x86_64
gcc-12.1.0-2-x86_64
gcc-12.2.0-1-x86_64
gcc-12.2.1-1-x86_64
gcc-12.2.1-2-x86_64
gcc-12.2.1-4-x86_64
gcc-13.1.1-1-x86_64
gcc-13.1.1-2-x86_64
gcc-13.2.1-3-x86_64
gcc-13.2.1-4-x86_64
gcc-13.2.1-5-x86_64

And these versions of the Rust compiler:

rust-1:1.74.0-1-x86_64
rust-1:1.75.0-2-x86_64
rust-1:1.76.0-1-x86_64

In total the build environment of all packages consists of 3704 
different (pkgname, pkgver) tuples.
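
For anyone who wants to repeat this kind of digging, a minimal Python sketch follows. It assumes the buildinfo file consists of `key = value` lines with one `installed = pkgname-pkgver-pkgrel-arch` line per build-environment package; the sample file content and the helper names are made up.

```python
from collections import defaultdict


def parse_buildinfo(text: str) -> dict[str, list[str]]:
    """Collect 'key = value' lines; keys such as 'installed' repeat."""
    fields = defaultdict(list)
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()].append(value.strip())
    return dict(fields)


def split_installed(entry: str) -> tuple[str, str]:
    """Split 'pkgname-pkgver-pkgrel-arch' into (pkgname, 'pkgver-pkgrel')."""
    # pkgver, pkgrel and arch contain no hyphens, so split from the right.
    name, pkgver, pkgrel, _arch = entry.rsplit("-", 3)
    return name, f"{pkgver}-{pkgrel}"


SAMPLE_BUILDINFO = """\
format = 2
pkgname = example
builddate = 1709251200
installed = gcc7-7.4.1+20181207-3-x86_64
installed = gcc-13.2.1-5-x86_64
installed = rust-1:1.76.0-1-x86_64
"""

fields = parse_buildinfo(SAMPLE_BUILDINFO)
tuples = {split_installed(entry) for entry in fields["installed"]}
compilers = sorted(t for t in tuples if t[0].startswith("gcc"))
print(compilers)
```

Taking the union of `tuples` across all buildinfo files gives the set of distinct build-environment packages.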


If you disregard this, the packages you build with such an ISO wouldn't 
match the official packages, but 2 groups with the same ISO could likely 
produce matching binary packages (assuming they have a way to derive a 
deterministic SOURCE_DATE_EPOCH value from that ISO).


From there on you'd "only" need to bootstrap a path to these binary 
seeds, but that's also why I pointed out this is more relevant to 
bootstrappable builds. Using plenty of different gcc versions looks 
annoying, but is only an issue for bootstrapping, not for reproducible 
builds (as long as everything is fully documented).



> If someday an Electromagnetic Pulse weapon destroys all the running
> computers, we'd like to bootstrap the whole industry up again, without
> breadboarding 8-bit micros and manually toggling in programs.  Instead,
> a chip foundry can take these two ISOs and a bare laptop out of a locked
> fire-safe, reboot the (Arch Linux) world from them, and then use that
> Linux machine to control the chip-making and chip-testing machines that
> can make more high-function chips.  (This would depend on the
> chip-makers keeping good offline fireproof backups of their own
> application software -- but even if they had that, they can't reboot and
> maintain the chip foundry without working source code for their
> controller's OS.)


I'm personally not interested in this scenario, I'm aware Allan McRae is 
looking for funding for pacman development. Maybe somebody could sponsor 
development of a "build without network" feature in pacman, or 

Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-28 Thread John Gilmore
John Gilmore wrote:
> It seems to me that the next step in making the Arch release ISOs
> reproducible is to have the Arch release engineering team create a
> source-code release ISO that matches each binary release ISO.  Then you
> (or anyone) could test the reproducibility of the release by having
> merely those two ISO images and a bare amd64 computer (without even an
> Internet connection).

kpcyrd wrote:
> I think this falls under "bootstrappable builds", a bare amd64 computer 
> still needs something to boot into (a CD with only source code won't do 
> the trick).

Bootstrappable builds are a different thing.  Worthwhile, but not
what I was asking for.  I just wanted provable reproducibility from two
ISO images and nothing more.

I was asking that a bare amd64 be able to boot from an Arch Linux
*binary* ISO image.  And then be fed a matching Arch Linux *source* ISO
image.  And that the scripts in the source image would be able to
reproduce the binary image from its source code, running the binaries
(like the kernel, shell, and compiler) from the binary ISO image to do
the rebuilds (without Internet access).

This should be much simpler than doing a bootstrap from bare metal
*without* a binary ISO image.

And if your source/binary ISO images can do that, it's not just an
academic exercise in reproducibility.  It can also produce a new binary
ISO that is built from that source ISO plus a few patches (e.g. for
fixing security issues).  Or, it can "recompile-the-world" after you (or
any user) makes a small change to a kernel, include file, library, or
compiler -- and show exactly how many programs compile to something
*different* as a result.  Basically, that pair of ISOs becomes a seed
that can carry forward, or fork, the whole distribution.  For anybody
who receives them.  That is the promise of free software, but the
complexity of modern distros plus the convenience of ubiquitous
Internet have inadvertently tended to undermine that promise.  Until
the reproducible builds effort!

If someday an Electromagnetic Pulse weapon destroys all the running
computers, we'd like to bootstrap the whole industry up again, without
breadboarding 8-bit micros and manually toggling in programs.  Instead,
a chip foundry can take these two ISOs and a bare laptop out of a locked
fire-safe, reboot the (Arch Linux) world from them, and then use that
Linux machine to control the chip-making and chip-testing machines that
can make more high-function chips.  (This would depend on the
chip-makers keeping good offline fireproof backups of their own
application software -- but even if they had that, they can't reboot and
maintain the chip foundry without working source code for their
controller's OS.)

John



Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-28 Thread kpcyrd

On 3/26/24 5:03 PM, Michael Schierl via rb-general wrote:

> So we can expect many year/month pairs embedded in manpages that went
> unnoticed since the build mostly happens in the same month? Or have they
> been manually vetted?


The results on reproducible.archlinux.org don't aim to guarantee the 
absence of reproducible builds issues, they instead aim to confirm the 
binary can be built from the given source code and build instructions 
(which is, at least for me, why I'm working on reproducible builds, 
since this means we can take the source code at face value for what's in 
the binaries).


Embedded timestamps are considered bad because they are usually a 
show-stopper for this (and timestamps with second/minute precision still 
are for us). There's a different kind of system that tries to prove the 
absence of reproducible builds issues - I've referred to this as "build 
environment fuzzing" in the past and it's the kind of thing 
tests.reproducible-builds.org does.
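
For context, the usual fix on the software side is for the build tool to honour SOURCE_DATE_EPOCH instead of reading the clock. A minimal Python sketch of that convention follows; the function name and the sample value are illustrative.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=os.environ) -> datetime:
    """Use SOURCE_DATE_EPOCH if set, otherwise fall back to the current time."""
    epoch = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(epoch) if epoch is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# With the variable set, every rebuild embeds the same date.
stamp = build_timestamp({"SOURCE_DATE_EPOCH": "1709251200"})
print(stamp.isoformat())
```

A tool that skips this and only embeds year and month still produces a mismatch once the rebuild happens in a different month.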


These results also still exist for Arch Linux[1] (since 2017), and if 
you're concerned about this you could check over there, but since Arch 
Linux _integrates_ with other eco-systems (instead of re-implementing 
them like Debian tries to), some builds fail to build if the clock is 
too far off, since https certificates would be considered expired. 
There's a lot of `curl -k` going on to work around this, but e.g. cargo 
has no option to "turn off all security", so these packages simply won't 
build on there.


[1]: https://tests.reproducible-builds.org/archlinux/

In late 2019 it turned out to be easier to "do the real thing" instead 
of trying to find more workarounds, and "not having enough 
true-positives" isn't really a problem we're having at the moment. If 
you find a false-negative please shout.


If anybody is bothered by the claims Arch Linux is making they're very 
welcome to run a rebuilder with a clock that is off by 48h (this would 
be interesting to have, but still wouldn't guarantee the absence of 
other reproducible builds issues, like missing Cargo.lock files).



> Apart from Guix pushing bootstrappable builds for quite some time,
> recent builds of Freedesktop SDK (container userland mostly used for
> flatpaks) are fully bootstrapped from stage0 - except for Rust which is
> not bootstrapped via mrustc but built using the binary package from
> upstream.


Is there any public website I could look at for results? According to 
our tests, having reproducible distro tooling isn't enough because 
there's still plenty of opensource software doing silly things in their 
build processes.



> Assuming I wanted to bootstrap some (non-reproducible) Arch setup from
> Freedesktop SDK and then use it to verify the reproducible builds, what
> steps would I have to take?


If you want to bootstrap the 114 packages that are present in 
docker.io/library/archlinux from source, you would need to:


- Build any version of pacman (which is C and shell scripts, but for 
makepkg you might even get away with just the shell scripts)
- Download all 114 buildinfo files for these packages (they are 
contained inside of the package itself)
- Identify all packages and their versions that are referenced in there 
as build dependency
- Build these packages on Freedesktop SDK with `makepkg --nodeps`, this 
disables dependency checks and simply assumes the required 
tools/compilers are going to be in $PATH - the checksums of packages 
built this way are naturally going to be different from the official 
packages but that's ok
- Use the packages you built to set up the build environment that is 
described in each buildinfo file
- Run the build with makepkg and SOURCE_DATE_EPOCH set to the value in 
the buildinfo file
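
The last step above could be sketched roughly like this in Python. This assumes the value meant is the `builddate` field of the buildinfo file, and `rebuild_invocation` is a hypothetical helper that only prepares the command and environment without running anything.

```python
import os


def rebuild_invocation(buildinfo_text: str, base_env=None):
    """Return (command, environment) for a rebuild pinned to the recorded date."""
    builddate = None
    for line in buildinfo_text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep and key.strip() == "builddate":
            builddate = value.strip()
    if builddate is None:
        raise ValueError("buildinfo has no builddate field")
    env = dict(os.environ if base_env is None else base_env)
    env["SOURCE_DATE_EPOCH"] = builddate  # assumption: builddate is the value to use
    return ["makepkg"], env


command, env = rebuild_invocation("format = 2\nbuilddate = 1709251200\n", base_env={})
print(command, env["SOURCE_DATE_EPOCH"])
# To actually build: subprocess.run(command, env=env, check=True)
```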


This should result in exact matches of the official packages, but of 
course there are a few things that could go wrong so I can not make any 
guarantees.


Instead of doing the last two steps you could also remove the signature 
checks in archlinux-repro[2] and populate its download cache folder with 
the packages you built yourself, archlinux-repro then takes care of the 
rest.


[2]: https://github.com/archlinux/archlinux-repro


> Has anything like that been tried for Arch? How many dependency loops
> are there in the build dependencies of the packages mentioned above, and
> can they be broken by using packages from Freedesktop SDK?


I'm not aware of anybody having tried this. There wasn't much point in 
trying without having achieved reproducible builds first.


cheers,
kpcyrd


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-26 Thread Chris Lamb
Hey kpcyrd,

Super excited about the energy in this thread. :)

I'll probably reply to a different part of the conversation
tomorrow, but just to very quickly append something to this bit:

> This kind of [archive] service is crucial for implementing
> reproducible builds (because this is used to setup the build
> environment described in BUILDINFO files), and
> reproducible-builds.org has recently received $350k to implement an
> analogous service for Debian (to be able to catch up with Arch
> Linux).

I think h01ger already talked to you a bit on IRC, but the long and
short of it is that, well, reproducible-builds.org wishes it had those
kind of resources to dedicate towards building such a service! Yes, we
did manage to secure some funding recently, and no doubt some of that
will help kickstart an analogous snapshot service. But the amount,
timeframe and associated deliverables don't quite, alas, match your
summary. Still, the important thing here is that your passion is
infectious. :)


Best wishes,

-- 
  o
⬋   ⬊  Chris Lamb
   o o reproducible-builds.org 
⬊   ⬋
  o


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-24 Thread Bernhard M. Wiedemann via rb-general

On 21/03/2024 21.38, kpcyrd wrote:
> - libjpeg-turbo: this package contains a .jar file that is built by
> CMake and contains timestamps of the buildtime, but there's no way in
> CMake to pass --date to the jar executable to normalize this


You could use strip-nondeterminism for post-processing there.
For some reason it is reproducible in my openSUSE tests without us doing 
any extra steps.

https://ismypackagereproducibleyet.org/?pkg=libjpeg-turbo
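
To show the kind of post-processing meant here, below is a rough Python sketch that rewrites a jar (a zip archive) with a fixed timestamp on every entry. It is not strip-nondeterminism itself, which handles more cases; the function names and the tiny sample jar are illustrative.

```python
import io
import zipfile

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the zip format can store


def normalize_zip_timestamps(data: bytes) -> bytes:
    """Rewrite a zip/jar so every entry carries the same fixed timestamp."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, zipfile.ZipFile(out, "w") as dst:
        for old in src.infolist():  # keep the original entry order
            new = zipfile.ZipInfo(old.filename, date_time=FIXED_TIME)
            new.compress_type = zipfile.ZIP_DEFLATED
            new.external_attr = old.external_attr  # preserve permission bits
            dst.writestr(new, src.read(old))
    return out.getvalue()


def make_jar(timestamp) -> bytes:
    """Build a tiny in-memory jar whose single entry has the given mtime."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        info = zipfile.ZipInfo("META-INF/MANIFEST.MF", date_time=timestamp)
        jar.writestr(info, "Manifest-Version: 1.0\n")
    return buf.getvalue()


jar_a = make_jar((2024, 3, 21, 10, 0, 0))
jar_b = make_jar((2024, 3, 22, 11, 30, 0))
print(jar_a == jar_b)
print(normalize_zip_timestamps(jar_a) == normalize_zip_timestamps(jar_b))
```

The two sample jars differ only in their entry timestamps, so they compare equal once normalized.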


> - librsvg: the 3 rebuilders I've checked produced a .text section that is
> 6 bytes shorter (0x2dda2c vs 0x2dda26), I didn't investigate further yet,
> the diff is quite long because a lot of addresses are mismatching as a
> consequence


My notes have https://gitlab.gnome.org/GNOME/librsvg/-/issues/1015 which 
turned out to be from pango mis-rendering text when font files were absent.


Ciao
Bernhard M.


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-22 Thread John Gilmore
Congratulations on closing in toward Arch Linux reproducibility!!!

kpcyrd wrote:
> Specifically what I mean - given a line like this:
> 
> FROM
> archlinux@sha256:2dbd72d1e5510e047db7f441bf9069e9c53391b87e04e5bee3f379cd03cec060
> 
> I want to reproduce the artifact(s) that are pulled in by this, with
> the packages our Arch Linux rebuilders have reproduced from source
> code. From what I understand this hash points to a json manifest that
> is not contained in the container image itself and was generated by
> the registry (should we archive them?), and this manifest then points
> to the sha256 of the tar containing the filesystem (I'm possibly
> missing an indirection here).
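
A sketch of the digest chain described in the quote, in Python, assuming the registry follows the OCI image format: the pinned digest is the SHA-256 of the exact manifest bytes, and it may first point to an image index (one manifest per platform) before reaching the layer digests. All sample documents below are made up.

```python
import hashlib
import json


def oci_digest(blob: bytes) -> str:
    """Digest string in the 'sha256:<hex>' form used in image references."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


def referenced_digests(document: bytes, pinned_digest: str) -> list[str]:
    """Verify a pinned manifest or index, then list the digests it points to."""
    if oci_digest(document) != pinned_digest:
        raise ValueError("document does not match the pinned digest")
    parsed = json.loads(document)
    if "manifests" in parsed:
        # Image index: one more indirection, one manifest per platform.
        return [entry["digest"] for entry in parsed["manifests"]]
    # Image manifest: points at the (usually compressed) layer tarballs.
    return [layer["digest"] for layer in parsed["layers"]]


layer_blob = b"pretend this is a compressed filesystem tar"
manifest = json.dumps(
    {"schemaVersion": 2, "layers": [{"digest": oci_digest(layer_blob)}]}
).encode()
index = json.dumps(
    {
        "schemaVersion": 2,
        "manifests": [
            {
                "digest": oci_digest(manifest),
                "platform": {"architecture": "amd64", "os": "linux"},
            }
        ],
    }
).encode()

pinned = oci_digest(index)
print(referenced_digests(index, pinned))
print(referenced_digests(manifest, oci_digest(manifest)))
```

Because the digest covers the exact bytes, the verification only works if the manifest is archived byte for byte as served, not re-serialized.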

I have no experience with Arch -- am just reading what's on their
website.  From a quick glance at their docs, the Arch distribution
*only* distributes binary packages.  They only offer URLs for source
code, requiring that users depend on a working Internet connection and
what could be a large, arbitrary set of HTTPS servers that in theory
contain the matching source code.  See:

  https://wiki.archlinux.org/title/Arch_build_system

(I'm not sure how that even meets the requirements of the GPL for
binary distributors to make the matching source code available to
recipients of the binaries.)

It seems to me that the next step in making the Arch release ISOs
reproducible is to have the Arch release engineering team create a
source-code release ISO that matches each binary release ISO.  Then you
(or anyone) could test the reproducibility of the release by having
merely those two ISO images and a bare amd64 computer (without even an
Internet connection).  (Someone other than their releng team could do
this shortly after the binary release, hoping that none of the URLs
becomes inaccessible in the meantime.  But the right time to gather the
full source code for reproducibility is when they themselves pull in the
source code to BUILD those binary packages that they will put in their
release ISO.)

Making users reproduce an ISO full of binary packages by downloading the
sources from all over the Internet seems highly prone to fail -- in the
first few months, let alone five or ten years later.

Even Arch's binary releases are only available from Arch for three
(monthly) release cycles.  Then you're on your own if you want to find a
copy of what they released, like the one that was current last
Christmas.  See:

  https://archlinux.org/releng/releases/

Arch may do great release engineering (I hope they do!), but it's
apparently not *archival* release engineering.

John


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-21 Thread kpcyrd

On 3/20/24 19:21, David A. Wheeler via rb-general wrote:

> But you know what I'm going to ask :-). What steps are left, if any, before the
> "normal" Arch Linux packages that people install are reproducible (at least in
> core Arch Linux)? Has that milestone been achieved? Will it be achieved once
> some package updates are released? Or is there something more, and if so, what
> is it?
>
> Sorry, it wasn't clear to me if this was some sort of special set of "test
> packages" or if they were the normal Arch Linux packages.


hi, thanks for raising this question so I can clarify. :)

This is already the real deal: these are exact matches with the packages 
on our mirrors, as used and installed by users.


For a minimal bootable Arch Linux system (using systemd-boot instead of 
grub) there's only the Linux kernel missing - this is because of 
CONFIG_MODULE_SIG=y being set in our kernel.


I also tried installing a minimal usable graphical system with lightdm, 
i3 and alacritty. On that setup there are only 4 unreproducible packages 
left (according to data from reproducible.archlinux.org):


- cairo: this was a build failure due to network issues, two other 
rebuilders have cleared this package so hopefully it's getting marked as 
reproducible on the next automatic retry
- libjpeg-turbo: this package contains a .jar file that is built by 
CMake and contains timestamps of the buildtime, but there's no way in 
CMake to pass --date to the jar executable to normalize this
- librsvg: the 3 rebuilders I've checked produced a .text section that 
is 6 bytes shorter (0x2dda2c vs 0x2dda26), I didn't investigate further 
yet, the diff is quite long because a lot of addresses are mismatching 
as a consequence
- linux: explained above

For CMake I've opened an issue in their gitlab that could be used to 
track this topic (or work on it): 
https://gitlab.kitware.com/cmake/cmake/-/issues/25804


cheers,
kpcyrd


Re: Arch Linux minimal container userland 100% reproducible - now what?

2024-03-20 Thread David A. Wheeler via rb-general



> On Mar 20, 2024, at 8:42 AM, kpcyrd  wrote:
> 
> hello,
> 
> in last week's email to the reproducible-builds email list[1] about 
> reproducible Arch Linux I mentioned there's only one unreproducible package 
> left in docker.io/library/archlinux.
> 
> [1]: 
> https://lists.reproducible-builds.org/pipermail/rb-general/2024-March/003291.html
> 
> Due to amazing work by dvzrv and Foxboron this package is now also 
> reproducible!

That is fantastic, congratulations!!

But you know what I'm going to ask :-). What steps are left, if any, before the 
"normal" Arch Linux packages that people install are reproducible (at least in 
core Arch Linux)? Has that milestone been achieved? Will it be achieved once 
some package updates are released? Or is there something more, and if so, what 
is it?

Sorry, it wasn't clear to me if this was some sort of special set of "test 
packages" or if they were the normal Arch Linux packages.

--- David A. Wheeler



Arch Linux minimal container userland 100% reproducible - now what?

2024-03-20 Thread kpcyrd

hello,

in last week's email to the reproducible-builds email list[1] about 
reproducible Arch Linux I mentioned there's only one unreproducible 
package left in docker.io/library/archlinux.


[1]: 
https://lists.reproducible-builds.org/pipermail/rb-general/2024-March/003291.html


Due to amazing work by dvzrv and Foxboron this package is now also 
reproducible!


 INFO  arch_repro_status > All packages are reproducible!
 INFO  arch_repro_status > Your system is 100.00% reproducible.

To try for yourself use:

podman run --rm -t archlinux sh -c 'pacman -Suy arch-repro-status --noconfirm && arch-repro-status'


However:

Where do we go from here? It would be cool if the OCI container image 
itself could also be reproduced (bit-for-bit), but I'm not sure if 
there's any prior work (specifically for images listed as 'official' on 
Docker Hub)?


Specifically what I mean - given a line like this:

FROM archlinux@sha256:2dbd72d1e5510e047db7f441bf9069e9c53391b87e04e5bee3f379cd03cec060


I want to reproduce the artifact(s) that are pulled in by this, with the 
packages our Arch Linux rebuilders have reproduced from source code. 
From what I understand this hash points to a json manifest that is not 
contained in the container image itself and was generated by the 
registry (should we archive them?), and this manifest then points to the 
sha256 of the tar containing the filesystem (I'm possibly missing an 
indirection here).
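
As a rough illustration of that chain, here is a small Python sketch. It assumes a simplified, hypothetical manifest in the style of the OCI image format, where the manifest names a config blob and the layer tarball by sha256 (in practice the digest in a FROM line can also refer to an index that lists per-platform manifests). All blobs and values below are placeholders, not real Arch Linux data.

```python
import hashlib
import json


def sha256_digest(data: bytes) -> str:
    """Format a digest the way image references write it: sha256:<hex>."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


# Placeholder bytes standing in for the config JSON and the rootfs tarball.
config_blob = b'{"architecture":"amd64","os":"linux"}'
layer_blob = b"placeholder for the (usually compressed) rootfs tarball"

# Hypothetical, simplified manifest: it refers to config and layer by digest.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": sha256_digest(config_blob),
        "size": len(config_blob),
    },
    "layers": [
        {
            "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
            "digest": sha256_digest(layer_blob),
            "size": len(layer_blob),
        }
    ],
}

# The image digest covers the exact serialized bytes of the manifest, so two
# serializations of the same JSON content hash to different digests.
compact = json.dumps(manifest, separators=(",", ":")).encode()
indented = json.dumps(manifest, indent=3).encode()

print("layer digest:   ", manifest["layers"][0]["digest"])
print("compact digest: ", sha256_digest(compact))
print("indented digest:", sha256_digest(indented))
print("same JSON content:", json.loads(compact) == json.loads(indented))
```

If this picture is right, reproducing the tarball bit-for-bit is not enough on its own: the manifest bytes would have to be reproduced (or archived) as well to arrive at the same image digest.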


Hopefully one of the many SBOM formats can help with this. :P

I know the container image is built from these two repositories but I 
don't have any in-depth knowledge:


- https://github.com/docker-library/official-images/blob/master/library/archlinux

- https://gitlab.archlinux.org/archlinux/archlinux-docker

The only work towards reproducible container images I'm aware of is by 
Akihiro Suda:


https://github.com/reproducible-containers/repro-get#are-container-images-bit-to-bit-reproducible

I suspect the current scripts used by Arch Linux would still be 
prone to mirror changes[2] though, meaning new package uploads would end 
up in our reproduced artifacts (causing mismatches) and the container 
image could only be reproduced for a short amount of time.
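
One way to picture the problem is as drift between the package versions an image was originally built with and what a mirror serves today. The Python sketch below is only an illustration: find_drift is a made-up helper, and the package versions are invented placeholders rather than real Arch Linux versions.

```python
from __future__ import annotations


def find_drift(pinned: dict[str, str],
               mirror: dict[str, str]) -> dict[str, tuple[str, str | None]]:
    """Map each package whose mirror version differs from the pinned one
    to a (pinned_version, mirror_version_or_None) pair."""
    drift = {}
    for name, version in sorted(pinned.items()):
        current = mirror.get(name)  # None if the mirror no longer has it
        if current != version:
            drift[name] = (version, current)
    return drift


# Invented versions: what the image was built with vs. what a mirror has now.
pinned = {"glibc": "1.0-1", "ncurses": "2.0-1", "pacman": "3.0-1"}
mirror_now = {"glibc": "1.0-2", "ncurses": "2.0-1"}

for name, (want, have) in find_drift(pinned, mirror_now).items():
    print(f"{name}: image was built with {want}, mirror now offers {have}")
```

Any non-empty result means a rebuild against the live mirror cannot match the original image, which is why some form of pinned package list or snapshot would be needed.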


[2]: 
https://gitlab.archlinux.org/archlinux/archlinux-docker/-/blob/98cd79111dd530447f491d547d14f3c38e227e46/scripts/make-rootfs.sh#L24-29


I'm also not sure if there's a missing puzzle piece for reproducible 
containers with regard to the manifest json that is generated by the 
registry. The image digest being unpredictable has also been mentioned 
in a cosign github issue[3].


[3]: https://github.com/sigstore/cosign/issues/2516

Input much appreciated!

## Caveats

It's probably worth mentioning that at the time of writing there's no 
consensus across multiple orgs yet: the https://reproducible.archlinux.org 
instance reports this status, but two other rebuilders don't report the 
full 100% yet.


$ arch-repro-status -r https://reproducible.crypto-lab.ch
[...]
 INFO  arch_repro_status > 3/118 packages are not reproducible.
 INFO  arch_repro_status > Your system is 97.46% reproducible.

$ arch-repro-status -r https://wolfpit.net/rebuild
[...]
 INFO  arch_repro_status > 3/118 packages are not reproducible.
 INFO  arch_repro_status > Your system is 97.46% reproducible.
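
For reference, the percentage follows directly from the two counts in the output. A quick Python check, assuming the tool simply rounds to two decimal places (reproducible_percent is a name made up for this example):

```python
def reproducible_percent(total: int, unreproducible: int) -> float:
    """Share of reproducible packages, as a percentage."""
    return 100.0 * (total - unreproducible) / total


# 3 of 118 packages are not reproducible, so 115 of 118 are.
print(f"{reproducible_percent(118, 3):.2f}%")
```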

The packages in question are part of this rebuild todo (specifically 
gcc-libs, glibc, ncurses):


https://archlinux.org/todo/rebuild-core-with-reproducible-pacman/

This means there's currently some luck involved for these 3 packages, 
e.g. using btrfs currently increases your chances of getting an exact 
match (after a few tries). We're obviously trying to get rid of this 
caveat though.


---

If you appreciate this flavor of supply-chain security you may be 
interested in repro-env[4], which I'm currently trying to land[5] in 
Ubuntu 24.04 LTS, but which is blocked by Debian's libnettle[6].


[4]: https://github.com/kpcyrd/repro-env
[5]: https://tracker.debian.org/pkg/rust-repro-env
[6]: https://tracker.debian.org/pkg/nettle

cheers,
kpcyrd