Bruno Haible <br...@clisp.org> writes:

> Hi Simon,
>
> In the other thread [1][2][2a], but see also [3] and [4], you are asking

Hi Bruno -- thanks for attempting to bring some order to this complicated
matter!  I agree with most of what you say, although I have some
comments below.

>> Has this changed, so we should recommend maintainers
>> to 'EXTRA_DIST = bootstrap bootstrap-funclib.sh bootstrap.conf' so this
>> is even possible?
>
> 1) I think changing the contents of a tarball ad-hoc, like this, will not
>    lead to satisfying results, because too many packages will do things
>    differently.

Right, and people tend to jump to the incorrect conclusion that running
autoreconf -fvi or running ./bootstrap from a tarball is a good idea.

Rather than trying to fix that solution, I think we should guide these
people towards using 'git-archive' style tarballs instead.  Then they
will need to do all the work that is actually required to bootstrap a
project, including getting all the dependencies in place.

Some will succeed in that.

Some will give up and realize they wanted the traditional curated
tarball after all, and go back to it, this time hopefully without the
harmful 'autoreconf -fi' dance.

In both cases, I think we are better off than with the current
situation.  Now people take the 'make dist' tarballs and try to reverse
engineer all the required dependencies to regenerate all artifacts, and
do a half-baked job at that, with an end result that is even harder to
audit than what we started with.
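
As an aside, producing such a 'git archive' style tarball is cheap;
roughly something like the following, where the project name, tag and
output file name are just placeholders:

  # create a source-only release tarball straight from the tagged commit
  git archive --format=tar.gz --prefix=libfoo-1.0/ \
      -o libfoo-1.0-src.tar.gz v1.0
  sha256sum libfoo-1.0-src.tar.gz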

>    (Y) Some distros want to be able to verify the tarballs.[9] (I don't agree
>        with this. If you can't trust the release manager who produced the
>        tarballs (C), you cannot trust (A) either. If there is a mechanism
>        for verifying (C) from (A), criminals will commit their malware
>        entirely into (A).)

I have another perspective here.  I don't think people necessarily want
to blindly trust either the git repository source code (A) or the tarball
with generated code and source code (C).  So people will want the
ability to audit and verify everything.  Once people start to work on
auditing, they realize that there is no way around auditing (A).  You
need to audit XZUtils source code to gain trust in XZUtils.  So people
work on doing that.  Then someone realizes that people aren't actually
using the git source code (A) to build the XZUtils binaries -- they are
using (A) plus generated content, that is, the full tarball (C).  However,
auditing (C) is just a waste of human time if there were a way to avoid
using (C) completely and have people use (A) directly.  This isn't all
that complicated; I just did it for Libntlm and will try to do the same
for other packages.

I think you are right that if we succeed with this, criminals will put
their malware directly into git source code repositories.  However that
is addressed by the people working on reviewing the core code of
projects.  There is no longer any need for people to spend time auditing
tarballs with a lot of other stuff in them.  This time can be redirected
towards auditing the code.  Which over the years saves a lot of human
cycles.

Most code audits I've seen focus on what's in git, not what's in the
tarball nor in the binary packages that people use.  Which is how it
should be -- the build environment is better audited on its own rather
than as part of the upstream code audit.

> 6) How could (X) be implemented?
>
>    The main differences between (A) and (C) are [10]:
>      - Tarballs contain source code from other packages.
>      - Tarballs contain generated files.
>      - Tarballs contain localizations.
>
>    I could imagine an intermediate step between (A) and (C):
>
>      (B) is for users with many packages installed and for distros, to apply
>          modifications (even to the set of gnulib modules) and then build
>          binaries of the package for one or more architectures, without
>          needing to fetch anything (other than build prerequisites) from the
>          network.
>
>    This is a different stage than (A), because most developers don't want
>    to commit source code from other packages into (A) — due to size — nor
>    to commit generated files into (A) — due to hassles with branches.
>
>    Going from (A) to (B) means pulling additional sources from the network.
>    It could be implemented
>      - by "git submodule update --init", or
>      - by 'npm' for JavaScript packages, or
>      - by 'cargo' for Rust packages [11]
>    and, for the localizations:
>      - essentially by a 'wget' command that fetches the *.po files.
>
>    The proposed name of a script that does this is 'autopull.sh'.
>    But I am equally open to a declarative YAML file instead of a shell script.

Another point of view is to give up on forcing the autopull part on
users -- instead we can mention the required dependencies in README and
let the user/packager worry about having them available.  At least as an
option.

The reason for giving up here is that different users seem to want to
do this in wildly different ways, and validate or patch the dependencies
in different ways.  Presuming that we know better than the person
building the software leads to some of the conflicting demands that we
see here.

For example, some people build without network connections.  Then 'git
submodule update --init' doesn't work.  So we provided them with an
alternative and added --gnulib-srcdir to ./bootstrap, which added
complexity.  Then someone came up with another method they preferred,
and we added --gnulib-refdir.  Then someone realized yet another way
they preferred, and we added GNULIB_REVISION.  This suggests it is an
evolving space that is getting a bit outside our domain of expertise:
as maintainer of Libntlm I would prefer to focus on Libntlm source code
and document the dependencies people need.
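
To illustrate how many variants we have ended up with, something like
all of the following exist today; the paths and the gnulib branch name
below are just placeholder examples:

  # default: fetch gnulib over the network as a git submodule
  ./bootstrap

  # offline: use an existing local gnulib checkout directly
  ./bootstrap --gnulib-srcdir=/path/to/gnulib

  # use a local checkout only as a git reference to speed up cloning
  ./bootstrap --gnulib-refdir=/path/to/gnulib

  # pin the gnulib revision to check out (usually set in bootstrap.conf;
  # the branch name here is a placeholder)
  GNULIB_REVISION=stable-202501 ./bootstrap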

Another example is wget of *.po files.  I feel uncomfortable with that:
what if someone compromises https://translationproject.org/ and serves
malicious files, possibly on a per-IP basis to selected users?  There is
no checksum checking going on; autopull trusts whatever it gets.  Now
maybe translation files cannot trigger vulnerabilities, and maybe there
aren't any vulnerabilities to begin with, but verifying that isn't
easy.  The files are arbitrary untrusted data.  I'm considering adding
the *.po files into git, like we used to do in the CVS days, and updating
them manually once in a while before a release.  Having a po/SHA256SUMS,
curated by the maintainer, that autopull could verify files against
would be another approach.  But then old versions of the software cannot
be bootstrapped any more, if it is not possible to fetch the *.po files
by SHA256 checksum value.  I'm not sure what the best solution is.
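
A rough sketch of the po/SHA256SUMS idea, where the URL pattern and the
file names are only illustrative:

  # fetch a translation and refuse it unless it matches the curated list
  wget -q -O po/sv.po https://translationproject.org/latest/libntlm/sv.po
  (cd po && sha256sum --check --strict SHA256SUMS)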

> 7) How could (Y) be implemented?
>    Like in (E+), we would define:
>
>      (C+) Like (C), plus:
>           A user with all kinds of special tools can determine whether (C)
>           was built with a published build recipe, without tampering.
>
>    Again, this requires
>      - formalizing the notion of a build environment,
>      - adding this build environment into (C).
>
>    For example, we would need a way to specify a build dependency on a
>    particular version of groff or texinfo or doxygen (for the documentation),
>    a particular version of m4, autoconf, automake (for the configure script
>    and Makefile.ins).
>
>    So far, some people have published their build environment in form of
>    ad-hoc plain text ("This release was bootstrapped with the following
>    tools")
>    inside release announcements. [12] Of course, that's the wrong place to
>    do so, because a user who receives (C) and wants to verify it does not
>    want to search for the release announcement in order to get the build
>    environment.

It is better than nothing, though.

Ideally what you need is SHA256 checksums for all binaries (including
their auxiliary files) that were needed to reproduce the full 'make dist'
tarball, and ideally a way to recreate all those binaries from source,
chaining back to some small hand-curated binary seed.  This is
essentially what Guix is working on, but we are pretty far away from
this being the standard environment.
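
As a crude sketch of the first part, something along these lines could
be recorded at 'make dist' time; the tool list, output file name and
format are made up, and this only covers the main executables, not
their auxiliary files or the chain back to a seed:

  # record version strings and checksums of the release tooling
  for tool in autoconf automake aclocal m4 makeinfo; do
    bin=$(command -v "$tool") || continue
    printf '%s  %s  (%s)\n' \
      "$(sha256sum "$bin" | cut -d' ' -f1)" \
      "$bin" \
      "$($tool --version | head -n1)"
  done > .release-environment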

Alternatively, just offer a SHA256 checksum of your 'git archive' source
code, and let distributors and others worry about assuring the build
process.

>    Some people are suggesting that (Y) could be implemented on top of (X) [9].
>    That is, the distro should start from (B), not (C). However, I think it
>    does not change much of the problem. The user's question "can I trust (C),
>    built by the package's release manager" is replaced with two questions
>      "can I trust (B), built by the package's release manager" and
>      "can I trust (C), built by the distro's build service".

Again, to offer another perspective, consider a user who doesn't
necessarily want to trust anything.  So we should make it easy to audit
what they need to trust, namely:

  1) Upstream source code from 'git archive'.

  2) The distribution build recipes.  That is the debian/ sub-directory
  or RPM spec file, usually fairly short.

  3) The build infrastructure.

Today users have to audit 1), 2) and 3) but ALSO the following:

  4) Upstream's 'make dist' tarball, with added generated content
  coming from other packages (autoconf, automake, gnulib, texinfo etc).

  5) Debian's *.orig.tar.gz tarball, which may be different from
  upstream's tarball, and often also contains generated files.

Auditing these last two artifacts takes human time.  If you can compute
checksums proving 1) == 4) == 5), that time is saved.
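
For example, when upstream's release tarball and Debian's *.orig.tar.gz
are bit-for-bit the 'git archive' output, that proof is a one-line
checksum comparison; the project name, tag and file names below are
placeholders:

  # all three checksums should be identical
  git archive --prefix=libfoo-1.0/ v1.0 | gzip -n | sha256sum
  sha256sum libfoo-1.0-src.tar.gz libfoo_1.0.orig.tar.gz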

/Simon
