Adding https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to
the list, which seems to be a more powerful test history tool.

On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles <[email protected]> wrote:

> Had some off-list chats to brainstorm and I wanted to bring ideas back to
> the dev@ list for consideration. A lot can be combined. I would really
> like to have a section in the release notes. I like the idea of banishing
> flakes from pre-commit (since you can't tell easily if it was a real
> failure caused by the PR) and auto-retrying in post-commit (so we can
> gather data on exactly what is flaking without a lot of manual
> investigation).
>
> *Include ignored or quarantined tests in the release notes*
> Pro:
>  - Users are aware of what is not being tested (and thus may be silently broken)
>  - It forces discussion of ignored tests to be part of our community
> processes
> Con:
>  - It may look bad if the list is large (this is actually also a Pro
> because if it looks bad, it is bad)
>
> *Run flaky tests only in postcommit*
> Pro:
>  - isolates the bad signal so pre-commit is not affected
>  - saves pointless re-runs in pre-commit
>  - keeps a signal in post-commit that we can watch, instead of losing it
> completely when we disable a test
>  - maybe keeps the flaky tests in a job related to what they are testing
> Con:
>  - we have to really watch post-commit or flakes can turn into failures
>
> *Separate flaky tests into quarantine job*
> Pro:
>  - gain signal for healthy tests, as with disabling or running in
> post-commit
>  - also saves pointless re-runs
> Con:
>  - may collect bad tests that we never look at, which makes it the same
> as disabling the test
>  - lots of unrelated tests grouped into signal instead of focused on
> health of a particular component
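>
> As a rough sketch of how a quarantine could look on the Python side
> (assuming pytest; the marker name and test are illustrative, and the
> marker would need to be registered in pytest.ini/setup.cfg):
>
>     import pytest
>
>     # Known-flaky test tagged for the quarantine job. The main suite
>     # runs with -m "not quarantined"; a separate quarantine job runs
>     # with -m "quarantined" so the signal is isolated but not lost.
>     @pytest.mark.quarantined
>     def test_occasionally_times_out():
>         ...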
>
> *Add Gradle or Jenkins plugin to retry flaky tests*
> https://blog.gradle.org/gradle-flaky-test-retry-plugin
> https://plugins.jenkins.io/flaky-test-handler/
> Pro:
>  - easier than Jiras with a human pasting links; works with moving flakes to
> post-commit
>  - get a somewhat automated view of flakiness, whether in pre-commit or
> post-commit
>  - don't get stopped by flakiness
> Con:
>  - maybe too easy to ignore flakes; we should add all flakes (not just
> disabled or quarantined) to the release notes
>  - sometimes flakes are actual bugs (like concurrency issues), so treating
> them as OK is not desirable
>  - without Jiras, no automated release notes
>  - Jenkins: retry will only work at the job level because it needs Maven
> to retry only the failed tests (I think)
>  - Jenkins: some of our jobs may have duplicate test names (but might
> already be fixed)
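>
> For the Python suites, a rough equivalent is the `flaky` plugin (a
> sketch; the counts and test name are illustrative):
>
>     from flaky import flaky
>
>     # Rerun up to 3 times and pass if any one run passes; a real
>     # regression still fails all 3 runs and fails the build.
>     @flaky(max_runs=3, min_passes=1)
>     def test_eventually_consistent_read():
>         ...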
>
> *Consider Gradle Enterprise*
> Pro:
>  - get Gradle scan granularity of flake data (and other stuff)
>  - also gives module-level health which we do not have today
> Con:
>  - cost and administrative burden unknown
>  - we probably have to do some small work to make our jobs compatible with
> their history tracking
>
> *Require link to Jira to rerun a test*
> Instead of saying "Run Java PreCommit" you have to link to the bug
> relating to the failure.
> Pro:
>  - forces investigation
>  - helps others find out about issues
> Con:
>  - adds a lot of manual work, or requires automation (which will probably
> be ad hoc and fragile)
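>
> The automation could be as small as a comment check in the trigger bot
> (a sketch; the trigger format and function name are hypothetical):
>
>     import re
>
>     # Accept e.g. 'Run Java PreCommit (BEAM-1234)' and reject the
>     # bare 'Run Java PreCommit', returning the cited Jira id.
>     TRIGGER = re.compile(r"^Run (\w+) PreCommit \((BEAM-\d+)\)$")
>
>     def jira_for_retrigger(comment):
>         match = TRIGGER.match(comment.strip())
>         return match.group(2) if match else None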
>
> Kenn
>
> On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <[email protected]>
> wrote:
>
>> > I think we are missing a way to check that we are making progress
>> on P1 issues. For example, P0 issues block releases and this obviously
>> results in fixing/triaging/addressing P0 issues at least every 6 weeks. We
>> do not have a similar process for flaky tests. I do not know what would be
>> a good policy. One suggestion is to ping (email/slack) assignees of issues.
>> I recently missed a flaky issue that was assigned to me. A ping like that
>> would have reminded me. And if an assignee cannot help/does not have the
>> time, we can try to find a new assignee.
>>
>> Yeah I think this is something we should address. With the new jira
>> automation at least assignees should get an email notification after 30
>> days because of a jira comment like [1], but that's too long to let a test
>> continue to flake. Could Beam Jira Bot ping every N days for P1s that
>> aren't making progress?
>>
>> That wouldn't help us with P1s that have no assignee, or are assigned to
>> overloaded people. It seems we'd need some kind of dashboard or report to
>> capture those.
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
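>>
>> The ping itself could be quite small (a sketch using the jira Python
>> client; the JQL, cadence, and auth handling are illustrative):
>>
>>     from jira import JIRA
>>
>>     JQL = ('project = BEAM AND priority = P1 AND labels = flake '
>>            'AND resolution = Unresolved AND updated <= -7d')
>>
>>     def ping_stale_flakes():
>>         jira = JIRA(server='https://issues.apache.org/jira')  # auth omitted
>>         for issue in jira.search_issues(JQL):
>>             jira.add_comment(
>>                 issue, 'Friendly ping: this P1 flake has had no update '
>>                        'in 7 days. Can it be fixed, or reassigned?')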
>>
>> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <[email protected]> wrote:
>>
>>> Another idea: could we replace our "Retest X" phrases with "Retest X
>>> (Reason)" phrases? With this change a PR author would have to look at the
>>> failed test logs. They could catch new flakiness introduced by their PR,
>>> file a JIRA for flakiness that was not noted before, or ping an existing
>>> JIRA issue / raise its severity. On the downside, this will require PR
>>> authors to do more.
>>>
>>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <[email protected]>
>>> wrote:
>>>
>>>> Adding retries can be beneficial in two ways, unblocking a PR, and
>>>> collecting metrics about the flakes.
>>>>
>>>
>>> Makes sense. I think we will still need a plan to remove retries,
>>> similar to re-enabling disabled tests.
>>>
>>>
>>>>
>>>> If we also had a flaky test leaderboard that showed which tests are the
>>>> most flaky, then we could take action on them. Encouraging someone from the
>>>> community to fix the flaky test is another issue.
>>>>
>>>> The test status matrix on the GitHub landing page could show flakiness
>>>> levels, communicating to users which modules are losing a trustworthy
>>>> test signal. Maybe this shows up as a flake % or a code coverage %
>>>> that decreases due to disabled flaky tests.
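>>>>
>>>> A first cut of such a leaderboard could be boiled down from the JUnit
>>>> XML the CI jobs already produce (a sketch; the glob pattern and cutoff
>>>> are made up):
>>>>
>>>>     import glob
>>>>     import xml.etree.ElementTree as ET
>>>>     from collections import Counter
>>>>
>>>>     def flake_leaderboard(pattern='**/test-results/**/*.xml'):
>>>>         """Count failures per test across many archived runs; tests
>>>>         that fail often but not always float to the top."""
>>>>         failures = Counter()
>>>>         for path in glob.glob(pattern, recursive=True):
>>>>             for case in ET.parse(path).getroot().iter('testcase'):
>>>>                 name = '%s.%s' % (case.get('classname'), case.get('name'))
>>>>                 if (case.find('failure') is not None
>>>>                         or case.find('error') is not None):
>>>>                     failures[name] += 1
>>>>         return failures.most_common(20)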
>>>>
>>>
>>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>>
>>>
>>>>
>>>> I didn't look for plugins, just dreaming up some options.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <[email protected]> wrote:
>>>>
>>>>> What do other Apache projects do to address this issue?
>>>>>
>>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <[email protected]> wrote:
>>>>>
>>>>>> I agree with the comments in this thread.
>>>>>> - If we are not re-enabling disabled tests, or do not have a plan to
>>>>>> re-enable them, disabling tests only gives us temporary relief until
>>>>>> users eventually hit the issues the disabled tests would have caught.
>>>>>> - I feel similarly about retries. It is reasonable to add retries for
>>>>>> reasons we understand, but adding retries just to avoid flakes is
>>>>>> similar to disabling tests: they might hide real issues.
>>>>>>
>>>>>> I think we are missing a way to check that we are making progress
>>>>>> on P1 issues. For example, P0 issues block releases, and this obviously
>>>>>> results in fixing/triaging/addressing P0 issues at least every 6 weeks.
>>>>>> We do not have a similar process for flaky tests. I do not know what
>>>>>> would be a good policy. One suggestion is to ping (email/slack) the
>>>>>> assignees of issues. I recently missed a flaky issue that was assigned
>>>>>> to me. A ping like that would have reminded me. And if an assignee
>>>>>> cannot help or does not have the time, we can try to find a new assignee.
>>>>>>
>>>>>> Ahmet
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I think the original discussion[1] on introducing tenacity might
>>>>>>> answer that question.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <[email protected]> wrote:
>>>>>>>
>>>>>>>> Is there an observation that enabling tenacity improves the
>>>>>>>> development experience on the Python SDK? E.g., less wait time to
>>>>>>>> get a PR passing and merged? Or is it a matter of choosing the right
>>>>>>>> number of retries to match the "flakiness" of a test?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Rui
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> We used tenacity[1] to retry some unit tests for which we
>>>>>>>>> understood the nature of flakiness.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
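>>>>>>>>>
>>>>>>>>> The pattern is roughly this (a sketch; the counts, wait, and test
>>>>>>>>> name are illustrative):
>>>>>>>>>
>>>>>>>>>     from tenacity import retry, stop_after_attempt, wait_fixed
>>>>>>>>>
>>>>>>>>>     # Retry a known-flaky test a bounded number of times,
>>>>>>>>>     # re-raising the last failure so a real regression still
>>>>>>>>>     # fails the build.
>>>>>>>>>     @retry(reraise=True, stop=stop_after_attempt(3), wait=wait_fixed(1))
>>>>>>>>>     def test_known_flaky_path():
>>>>>>>>>         ...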
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Didn't we use something like that flaky retry plugin for Python
>>>>>>>>>> tests at some point? Adding retries may be preferable to disabling 
>>>>>>>>>> the
>>>>>>>>>> test. We need a process to remove the retries ASAP though. As Luke 
>>>>>>>>>> says
>>>>>>>>>> that is not so easy to make happen. Having a way to make P1 bugs more
>>>>>>>>>> visible in an ongoing way may help.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I don't think I have seen tests that were previously disabled
>>>>>>>>>>> become re-enabled.
>>>>>>>>>>>
>>>>>>>>>>> It seems as though we have about 60 disabled tests in Java and
>>>>>>>>>>> ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due
>>>>>>>>>>> to missing features, so they are unrelated to flakiness.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It
>>>>>>>>>>>> retries tests if they fail, and has different modes for handling
>>>>>>>>>>>> flaky tests.
>>>>>>>>>>>> Did we ever try or consider using it?
>>>>>>>>>>>>
>>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <[email protected]>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective:
>>>>>>>>>>>>> recently I had to retrigger a build 6 times due to flaky tests,
>>>>>>>>>>>>> and each retrigger took one hour of waiting time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where
>>>>>>>>>>>>> a test is considered flaky if it both fails and succeeds for the
>>>>>>>>>>>>> same git SHA. Not sure if there is anything we can enable to get
>>>>>>>>>>>>> this automatically.
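>>>>>>>>>>>>>
>>>>>>>>>>>>> The check itself is cheap to express once run results are recorded
>>>>>>>>>>>>> somewhere queryable (a sketch over a hypothetical results feed):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     from collections import defaultdict
>>>>>>>>>>>>>
>>>>>>>>>>>>>     def find_flaky(results):
>>>>>>>>>>>>>         """results: iterable of (git_sha, test_name, passed).
>>>>>>>>>>>>>         A test is flaky if it both passed and failed at the
>>>>>>>>>>>>>         same SHA."""
>>>>>>>>>>>>>         outcomes = defaultdict(set)
>>>>>>>>>>>>>         for sha, test, passed in results:
>>>>>>>>>>>>>             outcomes[(sha, test)].add(bool(passed))
>>>>>>>>>>>>>         return sorted({test for (_, test), seen
>>>>>>>>>>>>>                        in outcomes.items() if len(seen) == 2})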
>>>>>>>>>>>>>
>>>>>>>>>>>>> /Gleb
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <[email protected]>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think it would be reasonable to disable/sickbay any flaky
>>>>>>>>>>>>>> test that is actively blocking people. The collective cost of
>>>>>>>>>>>>>> flaky tests for such a large group of contributors is very
>>>>>>>>>>>>>> significant.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to
>>>>>>>>>>>>>> assign these issues to the most relevant person (who added the 
>>>>>>>>>>>>>> test/who
>>>>>>>>>>>>>> generally maintains those components). Those people can either 
>>>>>>>>>>>>>> fix and
>>>>>>>>>>>>>> re-enable the tests, or remove them if they no longer provide 
>>>>>>>>>>>>>> valuable
>>>>>>>>>>>>>> signals.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of
>>>>>>>>>>>>>>> the last few days is that a large portion of time went to *just 
>>>>>>>>>>>>>>> connecting
>>>>>>>>>>>>>>> failing runs with the corresponding Jira tickets or filing new 
>>>>>>>>>>>>>>> ones*.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>>  -
>>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10460
>>>>>>>>>>>>>>> SparkPortableExecutionTest
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10471
>>>>>>>>>>>>>>> CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10504
>>>>>>>>>>>>>>> ElasticSearchIOTest > testWriteFullAddressing and 
>>>>>>>>>>>>>>> testWriteWithIndexFn
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10470
>>>>>>>>>>>>>>> JdbcDriverTest
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8025
>>>>>>>>>>>>>>> CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8454
>>>>>>>>>>>>>>> FnHarnessTest
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10506
>>>>>>>>>>>>>>> SplunkEventWriterTest
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct
>>>>>>>>>>>>>>> runner ParDoLifecycleTest
>>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-9187
>>>>>>>>>>>>>>> DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people
>>>>>>>>>>>>>>> right now.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> We have two test suites that are responsible for a large
>>>>>>>>>>>>>>>> percentage of our flaky tests, and both have had bugs open
>>>>>>>>>>>>>>>> for about a year without being fixed. These suites are
>>>>>>>>>>>>>>>> ParDoLifecycleTest (
>>>>>>>>>>>>>>>> BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>)
>>>>>>>>>>>>>>>> in Java and BigQueryWriteIntegrationTests in python (py3
>>>>>>>>>>>>>>>> BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>,
>>>>>>>>>>>>>>>> py2 BEAM-9232
>>>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-9232>, old
>>>>>>>>>>>>>>>> duplicate BEAM-8197
>>>>>>>>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What
>>>>>>>>>>>>>>>> can we do to mitigate the flakiness until someone has time to 
>>>>>>>>>>>>>>>> investigate?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
