Re: [PROPOSAL] Introduce build sheriffs to monitor and

Jason Porter Fri, 02 Aug 2024 13:09:43 -0700

It sounds like the conversation has evolved beyond the original proposal.
Francisco, would you mind writing a new reply summarizing the updates and the
consensus we've reached so far, or at least as you understand it? Thank you.


On 2024/08/01 17:20:22 Francisco Javier Tirado Sarti wrote:
> Hi Tiago,
> I fully agree with that procedure (and I think skipping the issue for small
> fixes, like the one I just opened, is also the way to go).
> 
> On Thu, Aug 1, 2024 at 7:13 PM Tiago Bento <[email protected]> wrote:
> 
> > Paolo, I understand your proposal; I'm just concerned that the build
> > sheriff role would keep rotating among the same people, as not everyone
> > is available to volunteer and/or act as such. I know it's not your
> > intention, but it's something that can happen. We can't expect all
> > contributors to express interest in being the build sheriff for a
> > month, nor that this is something we can keep running sustainably.
> >
> > Francisco, filing a new issue and fixing the problem on a separate PR
> > is indeed the way to go, IMHO. What we usually do on `kie-tools` is:
> > 1. Send a PR with a change.
> > 2. Observe red PR checks, unrelated to the changes introduced.
> > 3. Open an issue and send a separate PR targeting the same branch as
> > the original PR, fixing the problem in the PR checks.
> > 4. Review and merge this second PR, closing the new issue.
> > 5. Retrigger PR checks on the original PR.
> > 6. Observe a green build, then review and merge the original PR normally.
> >
> > * Sometimes we skip opening an issue if the fix is small enough; in
> > that case we use the PR description to provide enough context for
> > reviewers and watchers of the repo.
> >
> > The important thing, IMHO, is that the original PR doesn't get merged
> > before the unrelated issue on PR checks is fixed. Otherwise we open a
> > credit line that allows us to fall into tech debt :) My view has
> > always been that we need to collectively cherish our CI, PR checks and
> > automations, seeing those as the canonical way to build our software.
> > But it's been really hard to cherish something so distant from our
> > day-to-day work, especially when we can all, to some extent, keep
> > operating the way we have for who knows how many years, somewhat
> > ignoring the systems we currently have :/
> >
> > On Thu, Aug 1, 2024 at 11:54 AM Francisco Javier Tirado Sarti
> > <[email protected]> wrote:
> > >
> > > I forgot to mention another topic that is difficult to fix but easy to
> > > discuss: whether the Jenkins machines executing the tests are properly
> > > sized for the tests we are running.
> > > For example, in the previous PR, the timeout for starting up the Keycloak
> > > Quarkus instance in the IT test was increased to 2 minutes, because the
> > > default of 1 minute does not seem to be enough to download and run the
> > > Keycloak image.
> > > The CI is already taking ages.
> > > Either we increase our HW resources for testing, or we start reducing our
> > > test scope.
> > >
> > > On Thu, Aug 1, 2024 at 5:42 PM Francisco Javier Tirado Sarti <
> > > [email protected]> wrote:
> > >
> > > > By the way, I opened
> > > > https://github.com/apache/incubator-kie-kogito-examples/pull/1991 to fix
> > > > https://ci-builds.apache.org/job/KIE/job/kogito/job/10.0.x/job/nightly/job/kogito-examples.build-and-deploy/17/
> > > > So it can be said that I acted as sheriff, but since I'm weak and cannot
> > > > handle the pressure, I pass the torch (or the star) to the next one ;)
> > > >
> > > >
> > > > On Thu, Aug 1, 2024 at 5:37 PM Francisco Javier Tirado Sarti <
> > > > [email protected]> wrote:
> > > >
> > > >> Hi Tiago,
> > > >> About point 2: when the issue blocking the merge is really unrelated,
> > > >> wouldn't it be a better approach to open a separate issue to fix it?
> > > >> I think we agree that this is better for tracking (you do not see an
> > > >> unrelated change in a PR history), and it avoids the undesired situation
> > > >> of two developers trying to fix the same unrelated issue from two
> > > >> simultaneous PRs (one of them eventually has to rebase and realize the
> > > >> broken test is already fixed; still, there is less chance of two people
> > > >> working on the same problem if there is an entry in the issue list).
> > > >>
> > > >>
> > > >> On Thu, Aug 1, 2024 at 4:58 PM Tiago Bento <[email protected]> wrote:
> > > >>
> > > >>> Thanks, Paolo, for starting this conversation. Let me bring a little
> > > >>> bit of my perspective to it.
> > > >>>
> > > >>> Although I agree that having people "dedicated" to the quality and
> > > >>> stability of our CI and other automations would be better than what
> > > >>> we have today, having our builds break so often that we need a system
> > > >>> in place to deal with them is a symptom of other problems, IMHO.
> > > >>>
> > > >>> The complexity of our CI systems and automations discourages most
> > > >>> people from getting involved. Without the system itself changing and
> > > >>> becoming more approachable, having "build sheriffs" will only widen
> > > >>> the separation between "development" and "CI", and we'll be reliant
> > > >>> on a small group of people who'll become solely responsible for
> > > >>> either fixing stuff other people broke, or chasing them to fix it.
> > > >>> When these experts inevitably can't, or simply don't want to,
> > > >>> contribute to this area of the community anymore, we're in big trouble.
> > > >>>
> > > >>> My opinion is that we should concentrate our efforts on lowering
> > > >>> the barrier to entry for maintaining the CI and automations we have,
> > > >>> while putting a system in place that naturally ensures each one of
> > > >>> us knows at least the basics of how the CI and automations work.
> > > >>>
> > > >>> From my experience maintaining `kie-tools`, a few things help in
> > > >>> reaching that point:
> > > >>> 1. Having local builds be as similar as possible to CI builds. No
> > > >>> fancy commands or profiles that only run on CI.
> > > >>> 2. Red PRs can't be merged. Ever. If your PR turns red for "unrelated
> > > >>> reasons", you then become responsible for fixing the "unrelated
> > > >>> issue", so that no one else faces the same problem.
> > > >>> 3. Having a CI system with as few abstractions as possible.
> > > >>> Less CI code == less cognitive load == lower barrier to entry.
> > > >>>
> > > >>> Moving away from Jenkins for PR checks and concentrating on GitHub
> > > >>> Actions is, IMHO, already a great step in that direction.
> > > >>>
> > > >>> I hope I could bring something positive to the discussion.
> > > >>>
> > > >>> Thanks!
> > > >>>
> > > >>> Regards,
> > > >>>
> > > >>> Tiago Bento
> > > >>>
> > > >>> On Thu, Aug 1, 2024 at 10:08 AM Gabriele Cardosi
> > > >>> <[email protected]> wrote:
> > > >>> >
> > > >>> > Thanks for clarification, Paolo!
> > > >>> >
> > > >>> > On Thu, Aug 1, 2024 at 3:46 PM Paolo Bizzarri <[email protected]>
> > > >>> > wrote:
> > > >>> >
> > > >>> > > Hi Gabriele,
> > > >>> > >
> > > >>> > > It is a mix of various things.
> > > >>> > >
> > > >>> > > For example, take the various issues that I reported in the
> > > >>> > > analysis done for the 10.x branch. Most of them apply equally to
> > > >>> > > the main branch.
> > > >>> > >
> > > >>> > > For example:
> > > >>> > >
> > > >>> > > https://ci-builds.apache.org/job/KIE/job/kogito/job/main/job/tools/job/kogito-clean-old-nightly-images/
> > > >>> > >
> > > >>> > > Now, this is probably a build that should just be deleted, but it
> > > >>> > > is still always red, and we need someone who looks at it, decides
> > > >>> > > that yes, we need to get rid of it, creates a corresponding kie
> > > >>> > > issue, and follows up on it.
> > > >>> > >
> > > >>> > > Another example:
> > > >>> > >
> > > >>> > >
> > > >>> > > https://ci-builds.apache.org/job/KIE/job/kogito/job/10.0.x/job/nightly/job/kogito-examples.build-and-deploy/17/
> > > >>> > >
> > > >>> > > This test has been failing almost every day over the last few
> > > >>> > > days. Either we make it more stable, or we get rid of it.
> > > >>> > >
> > > >>> > > And so on.
> > > >>> > >
> > > >>> > > The goal of the sheriff is to keep the top-level folder in good
> > > >>> > > health, which means keeping all the underlying jobs healthy.
> > > >>> > >
> > > >>> > > I hope this clarifies my proposal.
> > > >>> > >
> > > >>> > > Regards
> > > >>> > >
> > > >>> > > Paolo
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Thu, Aug 1, 2024 at 3:18 PM Gabriele Cardosi <
> > > >>> > > [email protected]>
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > Hi Paolo,
> > > >>> > > > could you explain exactly what you mean by "builds are often
> > > >>> > > > broken"? Could you give an example and, in that example, explain
> > > >>> > > > what the "sheriff" should do to manage it? (Sorry, I just need to
> > > >>> > > > understand what you are referring to.)
> > > >>> > > >
> > > >>> > > > Thanks!
> > > >>> > > >
> > > >>> > > > On Thu, Aug 1, 2024 at 3:09 PM Paolo Bizzarri <[email protected]>
> > > >>> > > > wrote:
> > > >>> > > >
> > > >>> > > > > Hello kie mates,
> > > >>> > > > >
> > > >>> > > > > Please find my proposal below.
> > > >>> > > > >
> > > >>> > > > > PROBLEM
> > > >>> > > > > - Builds are often broken, and they stay broken for a long time.
> > > >>> > > > > There seems to be no clear definition of who should take care of
> > > >>> > > > > this.
> > > >>> > > > >
> > > >>> > > > > CONTEXT
> > > >>> > > > > - Fixing builds is slow and annoying, and is typically more a job
> > > >>> > > > > of chasing someone else than fixing things yourself, so it quickly
> > > >>> > > > > becomes wearing.
> > > >>> > > > >
> > > >>> > > > > PROPOSED SOLUTION
> > > >>> > > > > - Identify a number of build sheriffs who look at the various
> > > >>> > > > > builds, open the relevant issues for tracking, and chase other devs
> > > >>> > > > > and contributors to fix the issues themselves. The sheriffs are not
> > > >>> > > > > supposed to fix everything by themselves, but rather to keep other
> > > >>> > > > > developers' attention on the status of the builds.
> > > >>> > > > > I suggest we have three sheriffs, who stay around for one month and
> > > >>> > > > > then pass the token to someone else: one for drools and optaplanner,
> > > >>> > > > > one for kogito, and one for kie-tools.
> > > >>> > > > >
> > > >>> > > > > Let me know your ideas and feedback.
> > > >>> > > > >
> > > >>> > > > > Regards
> > > >>> > > > >
> > > >>> > > > > Paolo
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>>
> > > >>> ---------------------------------------------------------------------
> > > >>> To unsubscribe, e-mail: [email protected]
> > > >>> For additional commands, e-mail: [email protected]
> > > >>>
> > > >>>
> >
> >
> >
> 

