Re: On pull request workflows for the GNU toolchain

Carlos O'Donell via Gcc Fri, 20 Sep 2024 11:26:53 -0700

On 9/19/24 11:51 AM, Joseph Myers wrote:
> 1. Introduction

Thanks for writing this up!


> This message expands on my remarks at the Cauldron (especially the
> patch review and maintenance BoF, and the Sourceware infrastructure
> BoF) regarding desired features for a system providing pull request
> functionality (patch submission via creating branches that are then
> proposed using some kind of web interface or API, with a central
> database then tracking the status of each pull request and review
> comments thereon automatically), for use by the GNU toolchain (or one
> or more components thereof - there is no need for each component to
> make the same decision about moving to such software and workflow, and
> indeed we have no mechanism to make such decisions for the toolchain
> as a whole).
> 
> This does not advocate a particular choice of software for such
> functionality (though much of the discussion seemed to suggest Forgejo
> as the most likely starting point), or a particular choice of where to
> host it.  Hosting would of course need to meet appropriate security
> requirements, and to achieve a passing grade on the GNU Ethical
> Repository Criteria, and the software would need to be entirely free
> software.  Where relevant features are not already supported, it's
> important that the software is receptive to the addition of such
> features (including cases where part of the functionality is provided
> by software specific to the GNU toolchain or parts thereof - such as
> for the custom checks currently implemented with git hooks - and the
> underlying software provides appropriate interfaces to allow
> integration of such external pieces).  The list of features here may
> be a good basis for reviewing what particular forge software supports
> and whether other features can be added, directly or through use of
> appropriate APIs.
> 
> Forge software may provide other pieces such as bug tracking or wikis
> that we currently handle separately from git hosting.  In such cases,
> we should be able to disable those pieces and keep using the existing
> bug tracking and wiki software (while having the option to decide
> independently to migrate those if desired).
> 
> I consider the overall benefits of such a move to be having more
> structured data about all changes proposed for inclusion and their
> status (needing review, needing changes from the author, under
> discussion, needing merge from mainline, etc.), to help all people
> involved in the patch submission and review process to track such
> information and to find patches needing review as applicable, along
> with providing a more familiar workflow for many people that avoids
> many of the problems with email (which affect experienced contributors
> working around corporate email systems, not just new contributors).
> It would not of course by itself turn people with no interest in or
> understanding of systems software development into contributors (for
> example, people without knowledge of directories and hierarchical file
> storage, or people who only understand software development as web
> development).  Nor would it prevent the accumulation of large backlogs
> of unreviewed patches, as is evident from many large and active
> projects using PR workflows with large numbers of open PRs.
> 
> As Richard noted in his BoF, email sucks.  As I noted in reply, so do
> the web and web browsers when trying to deal with large amounts of
> patch review state (when one wishes to apply one's own view, not the
> forge's, of what is resolved and what needs attention).  As I also
> noted, in the Sourceware infrastructure BoF, tools such as patchwork
> and b4 are the right answer to the wrong question: trying to get
> structured data about patch submissions when working from the axiom
> that emails on a mailing list should be the primary source of truth
> for everything needing review, rather than starting from more
> structured data and generating emails as one form of output.
> 
> Moving to a pull request system is not expected to change policies
> regarding who can approve a change for inclusion, or the technical
> limits on who can cause a change to enter mainline (e.g. all people
> with commit access would be expected to be able to use a button in a
> web interface to cause to PR to be merged, though policy might limit
> when they should do so).  We can of course choose to change policies,
> either as part of adopting a PR system or later.

Agreed.

> 
> 
> 2. Key features
> 
> (a) Some forges have a design that emphasises the tree you get after a
> proposed contribution, but not the sequence of commits to get there.

This has changed a lot in recent years in gitlab and github where per-commit 
reviews
were emerging as a needed property of the web UI, so the commits (rather than 
the
final state of the tree) were being exposed for review purposes. I see this as a
general maturing of the tooling towards what we already knew in systems design,
that the commits and the clean logical changes were a necessary prerequisite to
managing the complexity of the change.

> For the toolchain, we care about the clean, logical sequence of
> commits on the mainline branch.  (We also use linear history on
> mainline, but that's a secondary matter - though certainly we'd want
> any forge used to support such linear history so that property doesn't
> need to change as part of adopting pull request workflow.)  Having a
> clean sequence of commits has some implications for forge support:
> 
> * Support for reviewing the proposed commit message, not just the
>   diffs, is important (and it should be clear what commit message
>   would result from merging any pull request).
> 
> * Patch series and dependencies between patches are important.  In
>   such cases, fixes for issues from review should go into the
>   appropriate logical commit in the PR (with rebasing as necessary),
>   and it should be possible at all times to see what the sequence of
>   commits on mainline would look like.  (E.g. people use some
>   workarounds on GitHub to manage PR dependencies, but those result in
>   second and subsequent PRs in a series showing the full set of diffs
>   from a PR and those it depends on, rather than just the logical
>   diffs for one PR.)

This area has seen a lot of change recently in the large forges, and warrants a
review of current offerings.

>   I consider patch series and dependencies to be separate but related
>   things: a patch series may not have strictly linear dependencies
>   (and it's often useful to merge the more straightforward patches
>   from a series while still discussing and revising others), while a
>   patch may depend on other patches that it's not logically part of
>   the same series as.  They are, however, closely related, and a
>   sufficient solution for dependencies might also be adequate for many
>   cases of series.

Agreed.

>   Note that series can sometimes have hundreds of patches; any
>   solution for patch series and dependencies needs to scale that far.
> 
>   There is of course the common case of a single-patch submission,
>   where the patch is ready for inclusion after some number of fixes.
>   In those cases, it's probably convenient if it's not necessary to
>   rebase - provided it's clear that a particular PR would be
>   squash-merged, and also what the commit message would be for the
>   final fixed commit.
> 
> * Given the need for rebasing when working with patch series, it's
>   important to have good support for rebasing.  In particular, all
>   revisions of the changes for a PR that was rebased need to remain
>   permanently available (e.g. through appropriate documented refs to
>   fetch to get all revisions of all PRs).
> 
> (b) Various people have built efficient workflows for going through
> all patch submissions and comments (or all in a particular area), even
> when only reviewing a small proportion, and have concerns about
> efficiency of a web interface when working with many patches and
> comments.  It's important to have good API support to allow people to
> build tools supporting their own workflow like this without needing to
> use the browser interface (and following their own logic, not the
> forge's, for what changes are of interest).  Good API support might,
> for example, include a straightforward way to get all changes to PR
> and comment data and metadata since some particular point, as well as
> for actions such as reviewing / commenting / approving a PR.  Such API
> support might be similar to what's needed to ensure people can readily
> get and maintain a local replica of all the key data and its history
> for all PRs.

Agreed.

> Replication like that is also important for reliably ensuring key data
> remains available even if the forge software subsequently ceases to be
> maintained.  Consider, for example, the apparent disappearance of the
> data from www-gnats.gnu.org (we occasionally come across references to
> old bug reports from that system in glibc development, but don't have
> any way to resolve those references).
> 
> Another use of such an API would be to allow maintaining local copies
> of all PR comments etc. in a plain text form that can be searched with
> grep.
> 
> (c) Given that a transition would be from using mailing lists, it's
> important to have good plain text outward email notifications for all
> new merge requests, comments thereon and other significant actions
> (changing metadata, approvals / merges, etc.) - through whatever
> combination of built-in support in a forge and local implementation
> using the API.  Such notifications would go to the existing mailing
> lists (note that the choice of mailing lists for a patch can depend on
> which parts of the tree it changes - for example, the choice between
> binutils and gdb-patches lists, or some GCC patches going to the
> fortran or libstdc++ lists) - as well as to individuals concerned with
> / subscribed to a given PR.  Some very routine changes, such as
> reports of clean CI results, might be omitted by default from
> notifications sent to the mailing lists.
> 
> "good" includes quoting appropriate diff hunks being comments on.
> It's OK to have an HTML multipart as well, but the quality of the
> plain text part matters.  Diffs themselves should be included in
> new-PR emails (and those for changes to a PR) unless very large.
> 
> I do not however suggest trying to take incoming email (unstructured)
> and turn it into more structured comments on particular parts of a PR
> in the database.

Agreed Re: incoming email.

> Similarly, commit emails should continue to go to the existing mailing
> lists for those.
> 
> (d) Rather than the forge owning the mainline branch in the sense of
> every commit having to go through an approved PR, at least initially
> it should be possible for people to push directly to mainline as well,
> so the transition doesn't have to be instantaneous (and in particular,
> if a change was posted before the transition and approved after it, it
> shouldn't be necessary to turn it into a PR).  Longer term, I think it
> would be a good idea to move to everything going through the PR system
> (with people able to self-approve and merge their own PRs immediately
> where they can commit without review at present), so there is better
> structured uniform tracking of all changes and associated CI
> information etc. - however, direct commits to branches other than
> mainline (and maybe release branches) should continue to be OK long
> term (although it should also be possible to make a PR to such a
> branch if desired).
> 
> Beyond putting everything through PRs, eventually I'd hope to have
> merges to mainline normally all go through a CI system that makes sure
> there are no regressions for at least one configuration before merging
> changes.

Agreed.
 
> (e) All existing pre-commit checks from hooks should be kept in some
> form, to maintain existing invariants on both tree and commit contents
> (some hook checks make sure that commits don't have commit messages
> that would cause other automated processes to fall over later, for
> example).

These could all move to pre-commit CI checks that block merging.

> (f) Existing cron jobs that read from or commit to repositories should
> use normal remote git access, not filesystem access to the repository,
> including using APIs to create and self-merge PRs when pushing to the
> repository is involved.
 
Agreed.

Thanks for writing this up!

-- 
Cheers,
Carlos.

Re: On pull request workflows for the GNU toolchain

Reply via email to