Hi,

Thanks Raúl for bringing this up since it's an important topic! I'd like to provide more context for your proposal and share my particular problems with the release process.
On Mon, May 9, 2022 at 2:33 PM Raul Cumplido <r...@voltrondata.com> wrote:
>
> Hi,
>
> I would like to propose a change in our release process.
>
> The rationale for the change is to avoid introducing new issues once a
> Release Candidate has already been cut by only merging specific commits to
> new release candidates.
>
> Currently once a new Release Candidate is required we drop the previous
> version branch and create a new Release Candidate from the master branch on
> the repository [1].

Actually dropping the previous "release-<version>" branch is not a requirement, but it's indeed not clearly documented in the release guidelines.

> This has the problem that we might introduce new bugs
> to the Release creating the need of cutting further release candidates.

We introduced the release branches for this exact scenario, so we can create releases independently from the master branch.

> As an example, for the release 7.0.0, 10 release candidates were required

The reason for the notorious 7.0.0-RC10 is different, more on that later.

> and for the release 8.0.0 there was the need to remove a specific commit that
> introduced some new issues [2]. For the release 8.0.0 we were able to find
> it early but it could have potentially been introduced and created the need
> for further RCs.
>
> I would like to propose the following workflow.
> When creating the initial RC, create both an rc1 branch and the version
> branch from master.
> release-x.0.0 and release-x.0.0.rc1
>
> If a new RC is required, drop the release-x.0.0 (as we do today) and create
> a new RC branch from the previous RC branch (instead of master), then
> cherry pick only the specific commits that have been identified to be part
> of the new release candidate. We can automate the cherrypick process via a
> script specifying the JIRA tickets or the commit hashes that we want to add
> to the new release candidate. Once the new RC branch is ready, create a new
> version branch from it and proceed as today.
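
For illustration, the cherry-pick step in the quoted workflow could be scripted roughly as follows. This is only a sketch under stated assumptions, not an existing Arrow release script: the helper name `plan_rc_branch` is hypothetical, the branch names follow the `release-x.0.0` / `release-x.0.0.rc1` pattern from the proposal, the commit hashes are placeholders, and recording the original hash with `git cherry-pick -x` is my own choice. The function only builds the list of git commands so they can be reviewed before anything is run.

```python
from typing import List


def plan_rc_branch(version: str, prev_rc: int, commits: List[str]) -> List[List[str]]:
    """Return the git commands for cutting the next RC branch (hypothetical helper)."""
    if prev_rc < 1:
        raise ValueError("prev_rc must be >= 1")

    # Branch naming assumed from the proposal: release-x.0.0 and release-x.0.0.rcN
    prev_branch = f"release-{version}.rc{prev_rc}"
    new_branch = f"release-{version}.rc{prev_rc + 1}"
    version_branch = f"release-{version}"

    # Create the new RC branch from the previous RC branch instead of master.
    plan = [["git", "checkout", "-b", new_branch, prev_branch]]

    # Cherry-pick only the selected commits, in the given order.
    # -x appends "(cherry picked from commit ...)" to each commit message.
    for sha in commits:
        plan.append(["git", "cherry-pick", "-x", sha])

    # Recreate the version branch from the new RC branch.
    # -B creates the branch, or resets it if it already exists.
    plan.append(["git", "checkout", "-B", version_branch, new_branch])
    return plan


if __name__ == "__main__":
    # Placeholder hashes, not real Arrow commits.
    for cmd in plan_rc_branch("8.0.0", 1, ["abc1234", "def5678"]):
        print(" ".join(cmd))
```

Keeping the plan as data makes a dry run trivial; a release manager could pass each entry to `subprocess.run(cmd, check=True)` once the list looks right. Resolving JIRA tickets to commit hashes is left out here, since that depends on how the tickets are referenced in commit messages.
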
This is why I manually cherry-picked 4 commits from the master branch to the new release branch [1] excluding that specific patch. Note that there was a single blocker [2], but I still included 3 additional patches: 2 low-risk bug fixes and a patch for the verification.

> The commits to be added to the release once a release candidate has already
> been cut will usually be fixes for the release but could also be features
> if there is community consensus that a feature must be introduced to the
> release.

I'd have also included both the Python UDF [3] and GCS [4] patches since they are really valuable features. In the first case we noticed the broken packaging builds from the nightly report, which is why I had to cherry-pick commits from master rather than cutting RC3 directly from the master branch (there is no other difference). In the second case the PR simply didn't make it for the same reason [5], which we managed to catch before merging the patch.

> This change will allow us to have a more granular control of what goes in
> the release once a release candidate has been cut and speed up the release

Since your proposal is already implemented, the actionable item I see here is to properly document it in the release management guidelines.

> by focusing both the release manager's and the community's efforts and
> potentially reducing the number of RCs to be created and verified.

Regarding the notorious 7.0.0-RC10 release candidate: I developed a habit of running the source verification tasks before calling a vote, while waiting for the packaging builds to finish. If there is an issue, the candidate doesn't reach the VOTE phase. I just took a look, and the 6th release candidate (7.0.0-RC5) was the first one I managed to send out a VOTE email for. Out of the 11 release candidates I created for the 7.0.0 release, only 4 made it to a vote.

Before that release the number of RC verification crossbow tasks kept growing, but without the ability to run them on a nightly basis.
This meant that we were unable to tell whether the verification tasks would pass for a given commit, and only noticed issues after creating a release candidate.

Right after the 7.0.0 release we refactored [6] the source verification scripts and crossbow tasks to support verifying specific git commits, local checkouts and actual release candidates. Since then we have had nightly verification builds, so we get notified about failing builds, and we didn't even try to create the first release candidate while we had failing verification tasks. This was the single reason why we didn't have 10+ release candidates this time.

After spending countless sleepless nights with Arrow releases I'd like to raise awareness of three other problems bothering me:

PROBLEM 1: Rush period before the release

One or two weeks before the release we start to incrementally postpone the issues which are unlikely to make it into the release, but there are features we would still like to squeeze in. There are too many simultaneously moving parts right before the release, possibly introducing new issues. Since we release many implementations at once and there are multiple stakeholders focusing on different features, it's generally hard to "reach consensus" about what to exclude and what to wait for. We're trying our best to include as much value in each release as we can while trying to avoid significant delays in the delivery date.

PROBLEM 2: Decoupled packaging and verification builds

Due to the on-demand nature of the crossbow tasks we often forget to trigger crossbow builds before merging a PR, resulting in nightly failures which we need to fix in follow-up PRs. Ideally, if we were able to run all of our builds on all of the PRs before merging, we could keep the master branch in an always-releasable state.
This is a tradeoff we made to spare CI resources for the apache/arrow repository, but soon enough we will reach the capacity limits of crossbow as well (for example I had to manually stop and restart macOS crossbow builds during the release process to avoid waiting 12 more hours).

PROBLEM 3: Lack of interest in nightly builds despite their importance

We usually let nightly builds fail continuously for days or even weeks, hiding more and more issues over time. This adds up before the release, making the rush period even worse. I'm not sure what the exact reason is, probably a mixture of having just a few subscribers to the builds@ mailing list and the poor readability of the nightly reports (which keeps improving thanks to Raúl).

Thanks, Krisztian

[1]: https://github.com/apache/arrow/commits/release-8.0.0
[2]: https://github.com/apache/arrow/commit/0d30a05212b1448f53233f2ab325924311d76e54
[3]: https://github.com/apache/arrow/pull/12590
[4]: https://github.com/apache/arrow/pull/12763
[5]: https://github.com/apache/arrow/pull/12763#issuecomment-1109022291
[6]: https://github.com/apache/arrow/pull/12320

>
> Thanks,
> Raúl
>
> [1]
> https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide
> [2] https://github.com/apache/arrow/pull/12590#issuecomment-1116144088