On Thu, Jul 30, 2020 at 6:24 PM Ahmet Altay <al...@google.com> wrote:

> I like:
> *Include ignored or quarantined tests in the release notes*
> *Run flaky tests only in postcommit* (related? *Separate flaky tests into quarantine job*)

The quarantine job would allow them to run in presubmit still; we would just not use it to determine the health of a PR or block submission.

> *Require link to Jira to rerun a test*
>
> I am concerned about:
> *Add Gradle or Jenkins plugin to retry flaky tests* - because it is a convenient place for real bugs to hide.

This concern has come up a few times now, so I feel this is a route we shouldn't pursue further.

> I do not know much about:
> *Consider Gradle Enterprise*
> https://testautonation.com/analyse-test-results-deflake-flaky-tests/

There is a subscription fee for Gradle Enterprise, but it offers a lot of support for flaky tests and other metrics. I have a meeting with them on August 7th about the pricing model for open source projects. From what I understand, last time we spoke with them they did not have a good model for open source projects, and the fee was tied to the number of developers in the project.

> Thank you for putting together this list! I believe that even if we can commit to doing only some of these, we would have a much healthier project. If we can build consensus on implementing them, I will be happy to work on some of them.

On Fri, Jul 24, 2020 at 1:54 PM Kenneth Knowles <k...@apache.org> wrote:

Adding https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to the list, which seems to be a more powerful test history tool.

On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles <k...@apache.org> wrote:

Had some off-list chats to brainstorm and I wanted to bring ideas back to the dev@ list for consideration. A lot can be combined. I would really like to have a section in the release notes. I like the idea of banishing flakes from pre-commit (since you can't easily tell whether a failure was really caused by the PR) and auto-retrying in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).

*Include ignored or quarantined tests in the release notes*
Pro:
- Users are aware of what is not being tested and so may be silently broken
- It forces discussion of ignored tests to be part of our community processes
Con:
- It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)

*Run flaky tests only in postcommit*
Pro:
- isolates the bad signal so pre-commit is not affected
- saves pointless re-runs in pre-commit
- keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
- maybe keeps the flaky tests in a job related to what they are testing
Con:
- we have to really watch post-commit or flakes can turn into failures

*Separate flaky tests into quarantine job*
Pro:
- gains signal for healthy tests, as with disabling or running in post-commit
- also saves pointless re-runs
Con:
- may collect bad tests that we never look at, which is the same as disabling the test
- lots of unrelated tests grouped into one signal, instead of a signal focused on the health of a particular component

*Add Gradle or Jenkins plugin to retry flaky tests*
https://blog.gradle.org/gradle-flaky-test-retry-plugin
https://plugins.jenkins.io/flaky-test-handler/
Pro:
- easier than Jiras with humans pasting links; works with moving flakes to post-commit
- gives a somewhat automated view of flakiness, whether in pre-commit or post-commit
- we don't get stopped by flakiness
Con:
- maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined ones) to the release notes
- sometimes flakes are actual bugs (like concurrency bugs), so treating this as OK is not desirable
- without Jiras, no automated release notes
- Jenkins: retry will only work at the job level, because it needs Maven to retry only the failed tests (I think)
- Jenkins: some of our jobs may have duplicate test names (but this might already be fixed)

*Consider Gradle Enterprise*
Pro:
- gets Gradle scan granularity of flake data (and other stuff)
- also gives module-level health, which we do not have today
Con:
- cost and administrative burden unknown
- we probably have to do some small work to make our jobs compatible with their history tracking

*Require link to Jira to rerun a test*
Instead of saying "Run Java PreCommit", you have to link to the bug relating to the failure.
Pro:
- forces investigation
- helps others find out about issues
Con:
- adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)

Kenn

On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:

> I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.

Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days because of a Jira comment like [1], but that is too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?

That wouldn't help us with P1s that have no assignee, or that are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.

[1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
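
For illustration, here is a minimal sketch of the kind of periodic ping described above, assuming a scheduled job with Jira credentials and the jira Python package; the JQL filter, the 7-day threshold, and the comment text are illustrative, not an existing Beam Jira Bot feature:

```python
# Hypothetical stale-flake reminder; not an existing Beam Jira Bot feature.
from jira import JIRA  # pip install jira

STALE_DAYS = 7  # the "N days" from the discussion above; illustrative only


def ping_stale_flakes(user, password, server="https://issues.apache.org/jira"):
    client = JIRA(server=server, basic_auth=(user, password))
    # Unresolved flake issues with no updates in the last STALE_DAYS days.
    jql = (
        "project = BEAM AND resolution = Unresolved AND labels = flake "
        f"AND updated <= -{STALE_DAYS}d ORDER BY priority DESC"
    )
    for issue in client.search_issues(jql, maxResults=200):
        assignee = issue.fields.assignee
        mention = f"[~{assignee.name}]" if assignee else "(unassigned)"
        client.add_comment(
            issue,
            f"{mention} This flake has had no updates for {STALE_DAYS}+ days. "
            "Please update it, reassign it, or adjust its priority.")
```

The same query, filtered to issues with no assignee, could also feed the dashboard or report mentioned above.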

On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:

Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.

On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:

> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.

Makes sense. I think we will still need a plan to remove retries, similar to re-enabling disabled tests.

> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>
> The test status matrix on the GitHub landing page could show flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.

+1 to a dashboard that will show a "leaderboard" of flaky tests.

> I didn't look for plugins, just dreaming up some options.

On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:

What do other Apache projects do to address this issue?

On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:

I agree with the comments in this thread.
- If we are not re-enabling tests, or do not have a plan to re-enable them, disabling tests only provides us temporary relief until users eventually find the issues instead of the disabled tests.
- I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.

I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.

Ahmet

On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:

I think the original discussion[1] on introducing tenacity might answer that question.

[1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E

On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:

Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or is it a matter of picking the right number of retries to align with the "flakiness" of a test?

-Rui

On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:

We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.

[1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
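
As a rough sketch of that tenacity pattern (the retry settings and the _flaky_dependency helper below are made up for illustration; see the linked fn_runner_test.py for the actual usage):

```python
# Illustrative only; see the linked fn_runner_test.py for how Beam applies this.
import unittest

from tenacity import retry, stop_after_attempt, wait_fixed

_attempts = {"count": 0}


def _flaky_dependency():
    # Stand-in for a call with a known, understood source of flakiness:
    # here it fails on the first two attempts and succeeds on the third.
    _attempts["count"] += 1
    return _attempts["count"] >= 3


class FlakyExampleTest(unittest.TestCase):

    # Retry up to 3 attempts with a short pause; reraise=True propagates the
    # last AssertionError so a persistent failure still fails the suite.
    @retry(reraise=True, stop=stop_after_attempt(3), wait=wait_fixed(0.5))
    def test_known_flaky_interaction(self):
        self.assertTrue(_flaky_dependency())


if __name__ == "__main__":
    unittest.main()
```

As the replies below note, any retry like this needs a plan for removal so that it does not end up hiding a real bug.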

On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:

Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP, though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.

Kenn

On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:

I don't think I have seen tests that were previously disabled become re-enabled.

It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to being a flake.

On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:

There is something called test-retry-gradle-plugin [1]. It retries tests if they fail and has different modes to handle flaky tests. Did we ever try or consider using it?

[1]: https://github.com/gradle/test-retry-gradle-plugin

On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:

I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.

I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.

/Gleb
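
As a rough sketch of that idea (the find_flaky_tests function and the (test name, git SHA, passed) record format are hypothetical, assuming CI results can be exported in some such form):

```python
# Hypothetical sketch: a test is flagged as flaky if, at the same git SHA,
# it has both a passing and a failing run. The record format is illustrative.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_flaky_tests(results: Iterable[Tuple[str, str, bool]]) -> Dict[str, List[str]]:
    """results: (test_name, git_sha, passed) tuples from CI history."""
    outcomes = defaultdict(set)  # (test_name, git_sha) -> set of pass/fail outcomes
    for test_name, git_sha, passed in results:
        outcomes[(test_name, git_sha)].add(passed)

    flaky = defaultdict(list)  # test_name -> SHAs where it both passed and failed
    for (test_name, git_sha), seen in outcomes.items():
        if len(seen) == 2:  # both True and False observed at the same commit
            flaky[test_name].append(git_sha)
    return dict(flaky)


# Example: ParDoLifecycleTest both fails and passes at the same SHA, so it is flaky.
print(find_flaky_tests([
    ("ParDoLifecycleTest", "abc123", False),
    ("ParDoLifecycleTest", "abc123", True),
    ("JdbcDriverTest", "abc123", True),
]))
```

A report like this could also feed the flaky test "leaderboard" discussed earlier in the thread.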

On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:

I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.

Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (whoever added the test or generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.

Ahmet

On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:

The situation is much worse than that, IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.

Summarized on PRs:

- https://github.com/apache/beam/pull/12272#issuecomment-659050891
- https://github.com/apache/beam/pull/12273#issuecomment-659070317
- https://github.com/apache/beam/pull/12225#issuecomment-656973073
- https://github.com/apache/beam/pull/12225#issuecomment-657743373
- https://github.com/apache/beam/pull/12224#issuecomment-657744481
- https://github.com/apache/beam/pull/12216#issuecomment-657735289
- https://github.com/apache/beam/pull/12216#issuecomment-657780781
- https://github.com/apache/beam/pull/12216#issuecomment-657799415

The tickets:

- https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
- https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
- https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
- https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
- https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
- https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
- https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
- https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
- https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest

Here are our P1 test flake bugs:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC

It seems quite a few of them are actively hindering people right now.

Kenn

On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:

We have two test suites that are responsible for a large percentage of our flaky tests, and both have had bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).

Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?

Andrew