> I think we are missing a way to check that we are making progress on P1 issues. For example, P0 issues block releases, and this naturally results in P0 issues being fixed/triaged/addressed at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what a good policy would be. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky-test issue that was assigned to me; a ping like that would have reminded me. And if an assignee cannot help or does not have the time, we can try to find a new assignee.
Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days via a Jira comment like [1], but that is too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress? That wouldn't help us with P1s that have no assignee, or that are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.

[1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
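As a rough illustration of the "ping every N days" idea, a periodic job could reuse the flake query shared later in this thread and report stale, unresolved flake issues along with their assignees. The sketch below is hypothetical and is not how Beam Jira Bot works today: the script, the staleness window, and the final print are placeholders, though the `/rest/api/2/search` endpoint and the JQL syntax are standard Jira.

```python
# Hypothetical "ping stale flake issues" job (not the actual Beam Jira Bot).
# It searches Jira for unresolved issues labeled 'flake' that have not been
# updated in N days and reports the assignee (or lack of one).
import requests

JIRA_URL = "https://issues.apache.org/jira"
STALE_DAYS = 7  # the "N days" from the discussion; the value here is arbitrary


def find_stale_flake_issues(stale_days=STALE_DAYS):
    # Same filter as the P1 flake query shared later in the thread, plus a
    # staleness window. A priority clause could be added to narrow this to P1s.
    jql = (
        'project = BEAM AND status in (Open, "In Progress") '
        "AND resolution = Unresolved AND labels = flake "
        f"AND updated <= -{stale_days}d ORDER BY priority DESC"
    )
    resp = requests.get(
        f"{JIRA_URL}/rest/api/2/search",
        params={"jql": jql, "fields": "assignee,summary", "maxResults": 100},
    )
    resp.raise_for_status()
    return resp.json().get("issues", [])


def main():
    for issue in find_stale_flake_issues():
        fields = issue["fields"]
        assignee = fields.get("assignee") or {}
        name = assignee.get("displayName", "UNASSIGNED")
        # Placeholder: a real bot would comment on the issue, email, or Slack-ping.
        print(f"{issue['key']}: {fields['summary']} -> ping {name}")


if __name__ == "__main__":
    main()
```

Unassigned issues would show up as UNASSIGNED, which also gives a crude version of the "dashboard or report" view of flakes that nobody is watching.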
On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:

> Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change a PR author would have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue and raise its severity. On the downside, this will require PR authors to do more.
>
> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>
>> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.
>
> Makes sense. I think we will still need a plan to remove retries, similar to re-enabling disabled tests.
>
>> If we also had a flaky-test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix a flaky test is another issue.
>>
>> The test status matrix on the GitHub landing page could show a flake level, to communicate to users which modules are losing a trustworthy test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>
> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>
>> I didn't look for plugins, just dreaming up some options.
>>
>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>
>>> What do other Apache projects do to address this issue?
>>>
>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> I agree with the comments in this thread.
>>>> - If we are not re-enabling tests, or do not have a plan to re-enable them, disabling tests only provides temporary relief until users eventually find the issues instead of the disabled tests.
>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries just to avoid flakes is similar to disabling tests: they might hide real issues.
>>>>
>>>> I think we are missing a way to check that we are making progress on P1 issues. For example, P0 issues block releases, and this naturally results in P0 issues being fixed/triaged/addressed at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what a good policy would be. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky-test issue that was assigned to me; a ping like that would have reminded me. And if an assignee cannot help or does not have the time, we can try to find a new assignee.
>>>>
>>>> Ahmet
>>>>
>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>
>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>
>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>
>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>
>>>>>> Is there any observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or is it a matter of choosing the right number of retries to match the "flakiness" of a test?
>>>>>>
>>>>>> -Rui
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>
>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.
>>>>>>>
>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
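For readers who have not opened the linked test, the pattern looks roughly like this. It is a minimal sketch, not a copy of fn_runner_test.py: the test class, helper function, and retry parameters are invented, while `retry`, `stop_after_attempt`, and `wait_fixed` are tenacity's real API.

```python
# A minimal sketch of the tenacity pattern discussed above. The test class, the
# helper, and the retry parameters are made up for illustration.
import unittest

from tenacity import retry, stop_after_attempt, wait_fixed


def operation_with_known_flaky_dependency():
    """Stand-in for an operation whose occasional failure mode is understood."""
    return True


class ExampleFlakyTest(unittest.TestCase):

    # Re-run the test up to 3 times, waiting 2 seconds between attempts, and
    # re-raise the last failure so a real breakage still fails the suite.
    @retry(reraise=True, stop=stop_after_attempt(3), wait=wait_fixed(2))
    def test_operation_with_known_flaky_dependency(self):
        self.assertTrue(operation_with_known_flaky_dependency())


if __name__ == "__main__":
    unittest.main()
```

As Kenn notes just below, retries added this way should come with a tracking issue and a plan to remove them once the underlying flake is fixed.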
>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>
>>>>>>>> Didn't we use something like that flaky-retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>
>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>
>>>>>>>>> It seems as though we have about 60 disabled tests in Java and 15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to being a flake.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>
>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes for handling flaky tests. Did we ever try or consider using it?
>>>>>>>>>>
>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>
>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. I'm not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>
>>>>>>>>>>> /Gleb
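Gleb's definition (a test is flaky if the same git SHA has both a passing and a failing run) is straightforward to compute offline if per-commit test results can be exported from CI. A minimal sketch under that assumption follows; the record format and function name are made up for illustration.

```python
# Hypothetical offline detector for this definition of flakiness: a test is
# flagged if, for at least one commit SHA, it has both a passing and a failing run.
from collections import defaultdict
from typing import Iterable, Set, Tuple


def find_flaky_tests(results: Iterable[Tuple[str, str, bool]]) -> Set[str]:
    """`results` holds (git_sha, test_name, passed) records exported from CI."""
    outcomes = defaultdict(set)  # (sha, test_name) -> set of observed pass/fail values
    for sha, test_name, passed in results:
        outcomes[(sha, test_name)].add(passed)
    # Both True and False observed at the same SHA => flaky by this definition.
    return {test for (_, test), seen in outcomes.items() if len(seen) == 2}


# Example with made-up data: test_b fails and then passes on a retrigger of the
# same commit, so it is reported as flaky; test_a is not.
results = [
    ("abc123", "test_a", True),
    ("abc123", "test_b", False),
    ("abc123", "test_b", True),
    ("def456", "test_a", True),
]
print(find_flaky_tests(results))  # {'test_b'}
```

Counting flaky SHAs per test over a time window would also give the flaky-test "leaderboard" discussed earlier in the thread.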
>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>
>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign them to the most relevant person (whoever added the test or generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide a valuable signal.
>>>>>>>>>>>>
>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> The situation is much worse than that, IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>
>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests, and both have had bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Andrew