Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Yangze Guo Thu, 24 Jun 2021 19:58:20 -0700

Creating a blocker issue for the manually disabled tests sounds good to me.


Minor: I'm still a bit worried about the commits merged before we fix
the unstable tests can also break those tests. Instead of letting the
assigners keep a look at all potentially related commits, they can
maintain a branch that is periodically synced with the master branch
while enabling the unstable test. So that they can catch the breaking
changes asap.

Best,
Yangze Guo

On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann <[email protected]> wrote:
>
> I like the idea of creating a blocker issue for a disabled test. This will
> force us to resolve it in a timely manner and it won't fall through the
> cracks.
>
> Cheers,
> Till
>
> On Thu, Jun 24, 2021 at 8:06 AM Jingsong Li <[email protected]> wrote:
>
> > +1 to Xintong's proposal
> >
> > I also have some concerns about unstable cases.
> >
> > I think unstable cases can be divided into these types:
> >
> > - Force majeure: For example, network timeout, sudden environmental
> > collapse, they are accidental and can always be solved by triggering azure
> > again. Committers should wait for the next green azure.
> >
> > - Obvious mistakes: For example, some errors caused by obvious reasons may
> > be repaired quickly. At this time, do we need to wait, or not wait and just
> > ignore?
> >
> > - Difficult questions: These problems are very difficult to find. There
> > will be no solution for a while and a half. We don't even know the reason.
> > At this time, we should ignore it. (Maybe it's judged by the author of the
> > case. But what about the old case whose author can't be found?)
> >
> > So, the ignored cases should be the block of the next release until the
> > reason is found or the case is fixed?  We need to ensure that someone will
> > take care of these cases, because there is no deepening of failed tests, no
> > one may continue to pay attention to these cases.
> >
> > I think this guideline should consider these situations, and show how to
> > solve them.
> >
> > Best,
> > Jingsong
> >
> > On Thu, Jun 24, 2021 at 10:57 AM Jark Wu <[email protected]> wrote:
> >
> > > Thanks to Xintong for bringing up this topic, I'm +1 in general.
> > >
> > > However, I think it's still not very clear how we address the unstable
> > > tests.
> > > I think this is a very important part of this new guideline.
> > >
> > > According to the discussion above, if some tests are unstable, we can
> > > manually disable it.
> > > But I have some questions in my mind:
> > > 1) Is the instability judged by the committer themselves or by some
> > > metrics?
> > > 2) Should we log the disable commit in the corresponding issue and
> > increase
> > > the priority?
> > > 3) What if nobody looks into this issue and this becomes some potential
> > > bugs released with the new version?
> > > 4) If no person is actively working on the issue, who should re-enable
> > it?
> > > Would it block PRs again?
> > >
> > >
> > > Best,
> > > Jark
> > >
> > >
> > > On Thu, 24 Jun 2021 at 10:04, Xintong Song <[email protected]>
> > wrote:
> > >
> > > > Thanks all for the feedback.
> > > >
> > > > @Till @Yangze
> > > >
> > > > I'm also not convinced by the idea of having an exception for local
> > > builds.
> > > > We need to execute the entire build (or at least the failing stage)
> > > > locally, to make sure subsequent test cases prevented by the failure
> > one
> > > > are all executed. In that case, it's probably easier to rerun the build
> > > on
> > > > azure than locally.
> > > >
> > > > Concerning disabling unstable test cases that regularly block PRs from
> > > > merging, maybe we can say that such cases can only be disabled when
> > > someone
> > > > is actively looking into it, likely the person who disabled the case.
> > If
> > > > this person is no longer actively working on it, he/she should enable
> > the
> > > > case again no matter if it is fixed or not.
> > > >
> > > > @Jing
> > > >
> > > > Thanks for the suggestions.
> > > >
> > > > +1 to provide guidelines on handling test failures.
> > > >
> > > > 1. Report the test failures in the JIRA.
> > > > >
> > > >
> > > > +1 on this. Currently, the release managers are monitoring the ci and
> > > cron
> > > > build instabilities and reporting them on JIRA. We should also
> > encourage
> > > > other contributors to do that for PRs.
> > > >
> > > > 2. Set a deadline to find out the root cause and solve the failure for
> > > the
> > > > > new created JIRA  because we could not block other commit merges for
> > a
> > > > long
> > > > > time
> > > > >
> > > > 3. What to do if the JIRA has not made significant progress when
> > reached
> > > to
> > > > > the deadline time?
> > > >
> > > >
> > > > I'm not sure about these two. It feels a bit against the voluntary
> > nature
> > > > of open source projects.
> > > >
> > > > IMHO, frequent instabilities are more likely to be upgraded to the
> > > critical
> > > > / blocker priority, receive more attention and eventually get fixed.
> > > > Release managers are also responsible for looking for assignees for
> > such
> > > > issues. If a case is still not fixed soonish, even with all these
> > > efforts,
> > > > I'm not sure how setting a deadline can help this.
> > > >
> > > > 4. If we disable the respective tests temporarily, we also need a
> > > mechanism
> > > > > to ensure the issue would be continued to be investigated in the
> > > future.
> > > > >
> > > >
> > > > +1. As mentioned above, we may consider disabling such tests iff
> > someone
> > > is
> > > > actively working on it.
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Wed, Jun 23, 2021 at 9:56 PM JING ZHANG <[email protected]>
> > wrote:
> > > >
> > > > > Hi Xintong,
> > > > > +1 to the proposal.
> > > > > In order to better comply with the rule, it is necessary to describe
> > > > what's
> > > > > best practice if encountering test failure which seems unrelated with
> > > the
> > > > > current commits.
> > > > > How to avoid merging PR with test failures and not blocking code
> > > merging
> > > > > for a long time?
> > > > > I tried to think about the possible steps, and found there are some
> > > > > detailed problems that need to be discussed in a step further:
> > > > > 1. Report the test failures in the JIRA.
> > > > > 2. Set a deadline to find out the root cause and solve the failure
> > for
> > > > the
> > > > > new created JIRA  because we could not block other commit merges for
> > a
> > > > long
> > > > > time
> > > > >     When is a reasonable deadline here?
> > > > > 3. What to do if the JIRA has not made significant progress when
> > > reached
> > > > to
> > > > > the deadline time?
> > > > >     There are several situations as follows, maybe different cases
> > need
> > > > > different approaches.
> > > > >     1. the JIRA is non-assigned yet
> > > > >     2. not found the root cause yet
> > > > >     3. not found a good solution, but already found the root cause
> > > > >     4. found a solution, but it needs more time to be done.
> > > > > 4. If we disable the respective tests temporarily, we also need a
> > > > mechanism
> > > > > to ensure the issue would be continued to be investigated in the
> > > future.
> > > > >
> > > > > Best regards,
> > > > > JING ZHANG
> > > > >
> > > > > Stephan Ewen <[email protected]> 于2021年6月23日周三 下午8:16写道：
> > > > >
> > > > > > +1 to Xintong's proposal
> > > > > >
> > > > > > On Wed, Jun 23, 2021 at 1:53 PM Till Rohrmann <
> > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > I would first try to not introduce the exception for local
> > builds.
> > > It
> > > > > > makes
> > > > > > > it quite hard for others to verify the build and to make sure
> > that
> > > > the
> > > > > > > right things were executed. If we see that this becomes an issue
> > > then
> > > > > we
> > > > > > > can revisit this idea.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Till
> > > > > > >
> > > > > > > On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo <[email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > > +1 for appending this to community guidelines for merging PRs.
> > > > > > > >
> > > > > > > > @Till Rohrmann
> > > > > > > > I agree that with this approach unstable tests will not block
> > > other
> > > > > > > > commit merges. However, it might be hard to prevent merging
> > > commits
> > > > > > > > that are related to those tests and should have been passed
> > them.
> > > > > It's
> > > > > > > > true that this judgment can be made by the committers, but no
> > one
> > > > can
> > > > > > > > ensure the judgment is always precise and so that we have this
> > > > > > > > discussion thread.
> > > > > > > >
> > > > > > > > Regarding the unstable tests, how about adding another
> > exception:
> > > > > > > > committers verify it in their local environment and comment in
> > > such
> > > > > > > > cases?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yangze Guo
> > > > > > > >
> > > > > > > > On Tue, Jun 22, 2021 at 8:23 PM 刘建刚 <[email protected]
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > It is a good principle to run all tests successfully with any
> > > > > change.
> > > > > > > > This
> > > > > > > > > means a lot for project's stability and development. I am big
> > > +1
> > > > > for
> > > > > > > this
> > > > > > > > > proposal.
> > > > > > > > >
> > > > > > > > > Best
> > > > > > > > > liujiangang
> > > > > > > > >
> > > > > > > > > Till Rohrmann <[email protected]> 于2021年6月22日周二 下午6:36写道：
> > > > > > > > >
> > > > > > > > > > One way to address the problem of regularly failing tests
> > > that
> > > > > > block
> > > > > > > > > > merging of PRs is to disable the respective tests for the
> > > time
> > > > > > being.
> > > > > > > > Of
> > > > > > > > > > course, the failing test then needs to be fixed. But at
> > least
> > > > > that
> > > > > > > way
> > > > > > > > we
> > > > > > > > > > would not block everyone from making progress.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise <
> > > [email protected]
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I think this is overall a good idea. So +1 from my side.
> > > > > > > > > > > However, I'd like to put a higher priority on
> > > infrastructure
> > > > > > then,
> > > > > > > in
> > > > > > > > > > > particular docker image/artifact caches.
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann <
> > > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for bringing this topic to our attention
> > Xintong.
> > > I
> > > > > > think
> > > > > > > > your
> > > > > > > > > > > > proposal makes a lot of sense and we should follow it.
> > It
> > > > > will
> > > > > > > > give us
> > > > > > > > > > > > confidence that our changes are working and it might
> > be a
> > > > > good
> > > > > > > > > > incentive
> > > > > > > > > > > to
> > > > > > > > > > > > quickly fix build instabilities. Hence, +1.
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Till
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song <
> > > > > > > > [email protected]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the past a couple of weeks, I've observed several
> > > > times
> > > > > > that
> > > > > > > > PRs
> > > > > > > > > > are
> > > > > > > > > > > > > merged without a green light from the CI tests, where
> > > > > failure
> > > > > > > > cases
> > > > > > > > > > are
> > > > > > > > > > > > > considered *unrelated*. This may not always cause
> > > > problems,
> > > > > > but
> > > > > > > > would
> > > > > > > > > > > > > increase the chance of breaking our code base. In
> > fact,
> > > > it
> > > > > > has
> > > > > > > > > > occurred
> > > > > > > > > > > > to
> > > > > > > > > > > > > me twice in the past few weeks that I had to revert a
> > > > > commit
> > > > > > > > which
> > > > > > > > > > > breaks
> > > > > > > > > > > > > the master branch due to this.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think it would be nicer to enforce a stricter rule,
> > > > that
> > > > > no
> > > > > > > PRs
> > > > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > merged without passing CI.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The problems of merging PRs with "unrelated" test
> > > > failures
> > > > > > are:
> > > > > > > > > > > > > - It's not always straightforward to tell whether a
> > > test
> > > > > > > > failures are
> > > > > > > > > > > > > related or not.
> > > > > > > > > > > > > - It prevents subsequent test cases from being
> > > executed,
> > > > > > which
> > > > > > > > may
> > > > > > > > > > fail
> > > > > > > > > > > > > relating to the PR changes.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To make things easier for the committers, the
> > following
> > > > > > > > exceptions
> > > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > > considered acceptable.
> > > > > > > > > > > > > - The PR has passed CI in the contributor's personal
> > > > > > workspace.
> > > > > > > > > > Please
> > > > > > > > > > > > post
> > > > > > > > > > > > > the link in such cases.
> > > > > > > > > > > > > - The CI tests have been triggered multiple times, on
> > > the
> > > > > > same
> > > > > > > > > > commit,
> > > > > > > > > > > > and
> > > > > > > > > > > > > each stage has at least passed for once. Please also
> > > > > comment
> > > > > > in
> > > > > > > > such
> > > > > > > > > > > > cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If we all agree on this, I'd update the community
> > > > > guidelines
> > > > > > > for
> > > > > > > > > > > merging
> > > > > > > > > > > > > PRs wrt. this proposal. [1]
> > > > > > > > > > > > >
> > > > > > > > > > > > > Please let me know what do you think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thank you~
> > > > > > > > > > > > >
> > > > > > > > > > > > > Xintong Song
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > [1]
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best, Jingsong Lee
> >

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

Reply via email to