Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Piotr Nowojski Fri, 25 Jun 2021 00:07:49 -0700

+1 for the general idea, however I have concerns about a couple of details.


> I would first try to not introduce the exception for local builds.
> It makes it quite hard for others to verify the build and to make sure
that the right things were executed.

I would counter Till's proposal to ignore local green builds. If committer
is merging and closing a PR, with official azure failure, but there was a
green build before or in local azure it's IMO enough to leave the message:

> Latest build failure is a known issue: FLINK-12345
> Green local build: URL

This should address Till's concern about verification.

On the other hand I have concerns about disabling tests.* It shouldn't be
the PR author/committer that's disabling a test on his own, as that's a
conflict of interests*. I have however no problems with disabling test
instabilities that were marked as "blockers" though, that should work
pretty well. But the important thing here is to correctly judge bumping
priorities of test instabilities based on their frequency and current
general health of the system. I believe that release managers should be
playing a big role here in deciding on the guidelines of what should be a
priority of certain test instabilities.

What I mean by that is two example scenarios:
1. if we have a handful of very frequently failing tests and a handful of
very rarely failing tests (like one reported failure and no another
occurrence in many months, and let's even say that the failure looks like
infrastructure/network timeout), we should focus on the frequently failing
ones, and probably we are safe to ignore for the time being the rare issues
- at least until we deal with the most pressing ones.
2. If we have tons of rarely failing test instabilities, we should probably
start addressing them as blockers.

I'm using my own conscious and my best judgement when I'm
bumping/decreasing priorities of test instabilities (and bugs), but as
individual committer I don't have the full picture. As I wrote above, I
think release managers are in a much better position to keep adjusting
those kind of guidelines.

Best, Piotrek

pt., 25 cze 2021 o 08:10 Yu Li <[email protected]> napisał(a):

> +1 for Xintong's proposal.
>
> For me, resolving problems directly (fixing the infrastructure issue,
> disabling unstable tests and creating blocker JIRAs to track the fix and
> re-enable them asap, etc.) is (in most cases) better than working around
> them (verify locally, manually check and judge the failure as "unrelated",
> etc.), and I believe the proposal could help us pushing those more "real"
> solutions forward.
>
> Best Regards,
> Yu
>
>
> On Fri, 25 Jun 2021 at 10:58, Yangze Guo <[email protected]> wrote:
>
> > Creating a blocker issue for the manually disabled tests sounds good to
> me.
> >
> > Minor: I'm still a bit worried about the commits merged before we fix
> > the unstable tests can also break those tests. Instead of letting the
> > assigners keep a look at all potentially related commits, they can
> > maintain a branch that is periodically synced with the master branch
> > while enabling the unstable test. So that they can catch the breaking
> > changes asap.
> >
> > Best,
> > Yangze Guo
> >
> > On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann <[email protected]>
> > wrote:
> > >
> > > I like the idea of creating a blocker issue for a disabled test. This
> > will
> > > force us to resolve it in a timely manner and it won't fall through the
> > > cracks.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li <[email protected]>
> > wrote:
> > >
> > > > +1 to Xintong's proposal
> > > >
> > > > I also have some concerns about unstable cases.
> > > >
> > > > I think unstable cases can be divided into these types:
> > > >
> > > > - Force majeure: For example, network timeout, sudden environmental
> > > > collapse, they are accidental and can always be solved by triggering
> > azure
> > > > again. Committers should wait for the next green azure.
> > > >
> > > > - Obvious mistakes: For example, some errors caused by obvious
> reasons
> > may
> > > > be repaired quickly. At this time, do we need to wait, or not wait
> and
> > just
> > > > ignore?
> > > >
> > > > - Difficult questions: These problems are very difficult to find.
> There
> > > > will be no solution for a while and a half. We don't even know the
> > reason.
> > > > At this time, we should ignore it. (Maybe it's judged by the author
> of
> > the
> > > > case. But what about the old case whose author can't be found?)
> > > >
> > > > So, the ignored cases should be the block of the next release until
> the
> > > > reason is found or the case is fixed?  We need to ensure that someone
> > will
> > > > take care of these cases, because there is no deepening of failed
> > tests, no
> > > > one may continue to pay attention to these cases.
> > > >
> > > > I think this guideline should consider these situations, and show how
> > to
> > > > solve them.
> > > >
> > > > Best,
> > > > Jingsong
> > > >
> > > > On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <[email protected]> wrote:
> > > >
> > > > > Thanks to Xintong for bringing up this topic, I'm +1 in general.
> > > > >
> > > > > However, I think it's still not very clear how we address the
> > unstable
> > > > > tests.
> > > > > I think this is a very important part of this new guideline.
> > > > >
> > > > > According to the discussion above, if some tests are unstable, we
> can
> > > > > manually disable it.
> > > > > But I have some questions in my mind:
> > > > > 1) Is the instability judged by the committer themselves or by some
> > > > > metrics?
> > > > > 2) Should we log the disable commit in the corresponding issue and
> > > > increase
> > > > > the priority?
> > > > > 3) What if nobody looks into this issue and this becomes some
> > potential
> > > > > bugs released with the new version?
> > > > > 4) If no person is actively working on the issue, who should
> > re-enable
> > > > it?
> > > > > Would it block PRs again?
> > > > >
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > >
> > > > > On Thu, 24 Jun 2021 at 10:04, Xintong Song <[email protected]>
> > > > wrote:
> > > > >
> > > > > > Thanks all for the feedback.
> > > > > >
> > > > > > @Till @Yangze
> > > > > >
> > > > > > I'm also not convinced by the idea of having an exception for
> local
> > > > > builds.
> > > > > > We need to execute the entire build (or at least the failing
> stage)
> > > > > > locally, to make sure subsequent test cases prevented by the
> > failure
> > > > one
> > > > > > are all executed. In that case, it's probably easier to rerun the
> > build
> > > > > on
> > > > > > azure than locally.
> > > > > >
> > > > > > Concerning disabling unstable test cases that regularly block PRs
> > from
> > > > > > merging, maybe we can say that such cases can only be disabled
> when
> > > > > someone
> > > > > > is actively looking into it, likely the person who disabled the
> > case.
> > > > If
> > > > > > this person is no longer actively working on it, he/she should
> > enable
> > > > the
> > > > > > case again no matter if it is fixed or not.
> > > > > >
> > > > > > @Jing
> > > > > >
> > > > > > Thanks for the suggestions.
> > > > > >
> > > > > > +1 to provide guidelines on handling test failures.
> > > > > >
> > > > > > 1. Report the test failures in the JIRA.
> > > > > > >
> > > > > >
> > > > > > +1 on this. Currently, the release managers are monitoring the ci
> > and
> > > > > cron
> > > > > > build instabilities and reporting them on JIRA. We should also
> > > > encourage
> > > > > > other contributors to do that for PRs.
> > > > > >
> > > > > > 2. Set a deadline to find out the root cause and solve the
> failure
> > for
> > > > > the
> > > > > > > new created JIRA  because we could not block other commit
> merges
> > for
> > > > a
> > > > > > long
> > > > > > > time
> > > > > > >
> > > > > > 3. What to do if the JIRA has not made significant progress when
> > > > reached
> > > > > to
> > > > > > > the deadline time?
> > > > > >
> > > > > >
> > > > > > I'm not sure about these two. It feels a bit against the
> voluntary
> > > > nature
> > > > > > of open source projects.
> > > > > >
> > > > > > IMHO, frequent instabilities are more likely to be upgraded to
> the
> > > > > critical
> > > > > > / blocker priority, receive more attention and eventually get
> > fixed.
> > > > > > Release managers are also responsible for looking for assignees
> for
> > > > such
> > > > > > issues. If a case is still not fixed soonish, even with all these
> > > > > efforts,
> > > > > > I'm not sure how setting a deadline can help this.
> > > > > >
> > > > > > 4. If we disable the respective tests temporarily, we also need a
> > > > > mechanism
> > > > > > > to ensure the issue would be continued to be investigated in
> the
> > > > > future.
> > > > > > >
> > > > > >
> > > > > > +1. As mentioned above, we may consider disabling such tests iff
> > > > someone
> > > > > is
> > > > > > actively working on it.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 23, 2021 at 9:56 PM JING ZHANG <[email protected]
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Xintong,
> > > > > > > +1 to the proposal.
> > > > > > > In order to better comply with the rule, it is necessary to
> > describe
> > > > > > what's
> > > > > > > best practice if encountering test failure which seems
> unrelated
> > with
> > > > > the
> > > > > > > current commits.
> > > > > > > How to avoid merging PR with test failures and not blocking
> code
> > > > > merging
> > > > > > > for a long time?
> > > > > > > I tried to think about the possible steps, and found there are
> > some
> > > > > > > detailed problems that need to be discussed in a step further:
> > > > > > > 1. Report the test failures in the JIRA.
> > > > > > > 2. Set a deadline to find out the root cause and solve the
> > failure
> > > > for
> > > > > > the
> > > > > > > new created JIRA  because we could not block other commit
> merges
> > for
> > > > a
> > > > > > long
> > > > > > > time
> > > > > > >     When is a reasonable deadline here?
> > > > > > > 3. What to do if the JIRA has not made significant progress
> when
> > > > > reached
> > > > > > to
> > > > > > > the deadline time?
> > > > > > >     There are several situations as follows, maybe different
> > cases
> > > > need
> > > > > > > different approaches.
> > > > > > >     1. the JIRA is non-assigned yet
> > > > > > >     2. not found the root cause yet
> > > > > > >     3. not found a good solution, but already found the root
> > cause
> > > > > > >     4. found a solution, but it needs more time to be done.
> > > > > > > 4. If we disable the respective tests temporarily, we also
> need a
> > > > > > mechanism
> > > > > > > to ensure the issue would be continued to be investigated in
> the
> > > > > future.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > JING ZHANG
> > > > > > >
> > > > > > > Stephan Ewen <[email protected]> 于2021年6月23日周三 下午8:16写道：
> > > > > > >
> > > > > > > > +1 to Xintong's proposal
> > > > > > > >
> > > > > > > > On Wed, Jun 23, 2021 at 1:53 PM Till Rohrmann <
> > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > I would first try to not introduce the exception for local
> > > > builds.
> > > > > It
> > > > > > > > makes
> > > > > > > > > it quite hard for others to verify the build and to make
> sure
> > > > that
> > > > > > the
> > > > > > > > > right things were executed. If we see that this becomes an
> > issue
> > > > > then
> > > > > > > we
> > > > > > > > > can revisit this idea.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo <
> > [email protected]>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 for appending this to community guidelines for merging
> > PRs.
> > > > > > > > > >
> > > > > > > > > > @Till Rohrmann
> > > > > > > > > > I agree that with this approach unstable tests will not
> > block
> > > > > other
> > > > > > > > > > commit merges. However, it might be hard to prevent
> merging
> > > > > commits
> > > > > > > > > > that are related to those tests and should have been
> passed
> > > > them.
> > > > > > > It's
> > > > > > > > > > true that this judgment can be made by the committers,
> but
> > no
> > > > one
> > > > > > can
> > > > > > > > > > ensure the judgment is always precise and so that we have
> > this
> > > > > > > > > > discussion thread.
> > > > > > > > > >
> > > > > > > > > > Regarding the unstable tests, how about adding another
> > > > exception:
> > > > > > > > > > committers verify it in their local environment and
> > comment in
> > > > > such
> > > > > > > > > > cases?
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Yangze Guo
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 22, 2021 at 8:23 PM 刘建刚 <
> > [email protected]
> > > > >
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > It is a good principle to run all tests successfully
> > with any
> > > > > > > change.
> > > > > > > > > > This
> > > > > > > > > > > means a lot for project's stability and development. I
> > am big
> > > > > +1
> > > > > > > for
> > > > > > > > > this
> > > > > > > > > > > proposal.
> > > > > > > > > > >
> > > > > > > > > > > Best
> > > > > > > > > > > liujiangang
> > > > > > > > > > >
> > > > > > > > > > > Till Rohrmann <[email protected]> 于2021年6月22日周二
> > 下午6:36写道：
> > > > > > > > > > >
> > > > > > > > > > > > One way to address the problem of regularly failing
> > tests
> > > > > that
> > > > > > > > block
> > > > > > > > > > > > merging of PRs is to disable the respective tests for
> > the
> > > > > time
> > > > > > > > being.
> > > > > > > > > > Of
> > > > > > > > > > > > course, the failing test then needs to be fixed. But
> at
> > > > least
> > > > > > > that
> > > > > > > > > way
> > > > > > > > > > we
> > > > > > > > > > > > would not block everyone from making progress.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Till
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise <
> > > > > [email protected]
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > I think this is overall a good idea. So +1 from my
> > side.
> > > > > > > > > > > > > However, I'd like to put a higher priority on
> > > > > infrastructure
> > > > > > > > then,
> > > > > > > > > in
> > > > > > > > > > > > > particular docker image/artifact caches.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann <
> > > > > > > > > [email protected]
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for bringing this topic to our attention
> > > > Xintong.
> > > > > I
> > > > > > > > think
> > > > > > > > > > your
> > > > > > > > > > > > > > proposal makes a lot of sense and we should
> follow
> > it.
> > > > It
> > > > > > > will
> > > > > > > > > > give us
> > > > > > > > > > > > > > confidence that our changes are working and it
> > might
> > > > be a
> > > > > > > good
> > > > > > > > > > > > incentive
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > quickly fix build instabilities. Hence, +1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Till
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song <
> > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the past a couple of weeks, I've observed
> > several
> > > > > > times
> > > > > > > > that
> > > > > > > > > > PRs
> > > > > > > > > > > > are
> > > > > > > > > > > > > > > merged without a green light from the CI tests,
> > where
> > > > > > > failure
> > > > > > > > > > cases
> > > > > > > > > > > > are
> > > > > > > > > > > > > > > considered *unrelated*. This may not always
> cause
> > > > > > problems,
> > > > > > > > but
> > > > > > > > > > would
> > > > > > > > > > > > > > > increase the chance of breaking our code base.
> In
> > > > fact,
> > > > > > it
> > > > > > > > has
> > > > > > > > > > > > occurred
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > me twice in the past few weeks that I had to
> > revert a
> > > > > > > commit
> > > > > > > > > > which
> > > > > > > > > > > > > breaks
> > > > > > > > > > > > > > > the master branch due to this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think it would be nicer to enforce a stricter
> > rule,
> > > > > > that
> > > > > > > no
> > > > > > > > > PRs
> > > > > > > > > > > > > should
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > merged without passing CI.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The problems of merging PRs with "unrelated"
> test
> > > > > > failures
> > > > > > > > are:
> > > > > > > > > > > > > > > - It's not always straightforward to tell
> > whether a
> > > > > test
> > > > > > > > > > failures are
> > > > > > > > > > > > > > > related or not.
> > > > > > > > > > > > > > > - It prevents subsequent test cases from being
> > > > > executed,
> > > > > > > > which
> > > > > > > > > > may
> > > > > > > > > > > > fail
> > > > > > > > > > > > > > > relating to the PR changes.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To make things easier for the committers, the
> > > > following
> > > > > > > > > > exceptions
> > > > > > > > > > > > > might
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > considered acceptable.
> > > > > > > > > > > > > > > - The PR has passed CI in the contributor's
> > personal
> > > > > > > > workspace.
> > > > > > > > > > > > Please
> > > > > > > > > > > > > > post
> > > > > > > > > > > > > > > the link in such cases.
> > > > > > > > > > > > > > > - The CI tests have been triggered multiple
> > times, on
> > > > > the
> > > > > > > > same
> > > > > > > > > > > > commit,
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > each stage has at least passed for once. Please
> > also
> > > > > > > comment
> > > > > > > > in
> > > > > > > > > > such
> > > > > > > > > > > > > > cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If we all agree on this, I'd update the
> community
> > > > > > > guidelines
> > > > > > > > > for
> > > > > > > > > > > > > merging
> > > > > > > > > > > > > > > PRs wrt. this proposal. [1]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Please let me know what do you think.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best, Jingsong Lee
> > > >
> >
>

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Reply via email to