Had some off-list chats to brainstorm, and I wanted to bring the ideas back to the dev@ list for consideration. A lot of them can be combined. I would really like to have a section in the release notes listing ignored and quarantined tests. I like the idea of banishing flakes from pre-commit (since you can't easily tell whether a failure was real and caused by the PR) and auto-retrying them in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).
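
As a concrete sketch of the retry-plugin option below: something roughly like this in a module's build.gradle could enable retries only for post-commit jobs. This is untested; the postCommitRetries property name is made up for illustration, and the plugin version should be checked against the Gradle plugin portal.

    // Rough sketch for a single module's build.gradle (Groovy DSL); untested.
    plugins {
        id "org.gradle.test-retry" version "1.1.6"  // check the plugin portal for the current version
    }

    tasks.withType(Test).configureEach {
        retry {
            // Hypothetical flag: pass -PpostCommitRetries only on post-commit jobs,
            // so pre-commit still fails fast and flakes stay visible to PR authors.
            maxRetries = project.hasProperty("postCommitRetries") ? 3 : 0
            // Stop retrying if a run is broadly broken rather than merely flaky.
            maxFailures = 10
            // Leave false in post-commit so a test that passes on retry does not fail
            // the build, but its earlier failed executions still land in the report.
            failOnPassedAfterRetry = false
        }
    }

With failOnPassedAfterRetry left false in post-commit, the retried failures still show up in the test report, which is exactly the flake data we want to collect.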
*Include ignored or quarantined tests in the release notes*

Pro:
 - Users are aware of what is not being tested and so may be silently broken
 - It forces discussion of ignored tests to be part of our community processes

Con:
 - It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)

*Run flaky tests only in post-commit*

Pro:
 - Isolates the bad signal so pre-commit is not affected
 - Saves pointless re-runs in pre-commit
 - Keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
 - Maybe keeps the flaky tests in a job related to what they are testing

Con:
 - We have to really watch post-commit, or flakes can turn into failures

*Separate flaky tests into a quarantine job*

Pro:
 - Gains a clean signal for the healthy tests, as with disabling or running in post-commit
 - Also saves pointless re-runs

Con:
 - May collect bad tests that we never look at, which makes it the same as disabling the tests
 - Lots of unrelated tests grouped into one signal, instead of a signal focused on the health of a particular component

*Add a Gradle or Jenkins plugin to retry flaky tests*

https://blog.gradle.org/gradle-flaky-test-retry-plugin
https://plugins.jenkins.io/flaky-test-handler/

Pro:
 - Easier than Jiras with a human pasting links; works well with moving flakes to post-commit
 - Gives a somewhat automated view of flakiness, whether in pre-commit or post-commit
 - We don't get stopped by flakiness

Con:
 - Maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined ones) to the release notes
 - Sometimes flakes are actual bugs (like concurrency bugs), so treating them as OK is not desirable
 - Without Jiras, no automated release notes
 - Jenkins: retry will only work at the job level, because it needs Maven to retry only the failed tests (I think)
 - Jenkins: some of our jobs may have duplicate test names (but that might already be fixed)

*Consider Gradle Enterprise*

Pro:
 - Gives Gradle-scan granularity of flake data (and other things)
 - Also gives module-level health, which we do not have today

Con:
 - Cost and administrative burden are unknown
 - We probably have to do some small amount of work to make our jobs compatible with their history tracking

*Require a link to a Jira issue to rerun a test*

Instead of saying "Run Java PreCommit" you would have to link to the bug relating to the failure.

Pro:
 - Forces investigation
 - Helps others find out about known issues

Con:
 - Adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)

Kenn

On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:

> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>
> Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days because of a Jira comment like [1], but that's too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?
>
> That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people.
> It seems we'd need some kind of dashboard or report to capture those.
>
> [1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>
>> Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>>
>>> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.
>>
>> Makes sense. I think we will still need to have a plan to remove retries, similar to re-enabling disabled tests.
>>
>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>
>>> The test status matrix on the GitHub landing page could show the flake level, to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> What do other Apache projects do to address this issue?
>>>>
>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> I agree with the comments in this thread.
>>>>> - If we are not re-enabling tests again, or we do not have a plan to re-enable them, disabling tests only provides us temporary relief until eventually users find the issues instead of the disabled tests.
>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>
>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>
>>>>> Ahmet
>>>>>
>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>
>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>>
>>>>>>> Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or might it be a matter of choosing the right number of retries to align with the "flakiness" of a test?
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>
>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to being a flake.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes to handle flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>
>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>
>>>>>>>>>>>> /Gleb
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these issues are unassigned.
>>>>>>>>>>>>> IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and both have bugs open for about a year without being fixed.
>>>>>>>>>>>>>>> These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrew