I'm in favor of a quarantine job whose tests are called out
prominently as "possibly broken" in the release notes. As a follow up,
+1 to exploring better tooling to track at a fine grained level
exactly how flaky these test are (and hopefully detect if/when they go
from flakey to just plain broken).

On Tue, Aug 4, 2020 at 7:25 AM Etienne Chauchot <echauc...@apache.org> wrote:
>
> Hi all,
>
> +1 on pinging the assigned person.
>
> The flakes I know of (ESIO and CassandraIO) are due to the load on the CI
> server. These IOs are tested using real embedded backends because those
> backends are complex and we need relevant tests.
>
> Countermeasures have been taken (retries inside the tests that are sensitive
> to load, accepting ranges of expected values, calling internal backend
> mechanisms to force a refresh in case load prevented the backend from doing
> so ...).

Yes, certain tests with external dependencies should do their own
internal retries. If that is not sufficient, they should probably be
quarantined.
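
To make that concrete, here is a rough Python sketch of the pattern
(illustrative only, not the actual CassandraIO/ElasticsearchIO test code;
the helper, bounds, and timeouts are all made up):

    import time
    import unittest

    # Hypothetical bounds; a real test would derive these from the data it wrote.
    LOWER_BOUND, UPPER_BOUND = 900, 1100


    class EmbeddedBackendSizeTest(unittest.TestCase):

        def read_estimated_size(self):
            # Placeholder for a query against an embedded backend whose
            # statistics can lag behind writes when the CI machine is loaded.
            return 1000

        def test_estimated_size_within_range(self):
            deadline = time.time() + 30
            while True:
                size = self.read_estimated_size()
                # Accept a range of values rather than an exact number.
                if LOWER_BOUND <= size <= UPPER_BOUND:
                    return
                if time.time() > deadline:
                    self.fail("estimated size %d never entered [%d, %d]"
                              % (size, LOWER_BOUND, UPPER_BOUND))
                # Give the backend a chance to refresh, then retry inside the
                # test rather than re-running the whole suite.
                time.sleep(2)

The nice property is that a real regression still fails loudly once the
deadline expires, instead of being silently retried away at the CI level.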

> I recently got pinged by Ahmet (thanks to him!) about a flaky test that I
> had not seen. This seems to me the correct way to go. Systematically retrying
> tests with a CI mechanism or disabling tests seems to me a risky workaround
> that just lets us put the problem out of our minds.
>
> Etienne
>
> On 20/07/2020 20:58, Brian Hulette wrote:
>
> > I think we are missing a way for checking that we are making progress on P1 
> > issues. For example, P0 issues block releases and this obviously results in 
> > fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have 
> > a similar process for flaky tests. I do not know what would be a good 
> > policy. One suggestion is to ping (email/slack) assignees of issues. I 
> > recently missed a flaky issue that was assigned to me. A ping like that 
> > would have reminded me. And if an assignee cannot help/does not have the 
> > time, we can try to find a new assignee.
>
> Yeah, I think this is something we should address. With the new Jira
> automation, assignees should at least get an email notification after 30 days
> via a Jira comment like [1], but that's too long to let a test continue to
> flake. Could Beam Jira Bot ping every N days for P1s that aren't making
> progress?
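
Something along those lines seems scriptable against the Jira REST search
endpoint. A minimal sketch of such a bot, assuming we can agree on the query
(the JQL below is a guess at our label scheme, and the notification step is
just a print):

    import requests

    SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
    # Hypothetical JQL; the exact priority/label values would need to match
    # how flakes are actually triaged in the Beam Jira project.
    JQL = ('project = BEAM AND labels = flake AND resolution = Unresolved '
           'AND updated <= -7d')


    def stale_flake_issues():
        resp = requests.get(SEARCH_URL,
                            params={"jql": JQL, "fields": "assignee,summary"})
        resp.raise_for_status()
        for issue in resp.json().get("issues", []):
            fields = issue["fields"]
            assignee = (fields.get("assignee") or {}).get("displayName",
                                                          "UNASSIGNED")
            yield issue["key"], assignee, fields["summary"]


    if __name__ == "__main__":
        for key, assignee, summary in stale_flake_issues():
            # A real bot would comment on the issue or ping the assignee over
            # email/Slack here; printing is enough for the sketch.
            print("%s (%s): %s" % (key, assignee, summary))

Issues with no assignee would show up as UNASSIGNED, which would also give a
starting point for the report mentioned below.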
>
> That wouldn't help us with P1s that have no assignee, or are assigned to 
> overloaded people. It seems we'd need some kind of dashboard or report to 
> capture those.
>
> [1] 
> https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>>
>> Another idea: could we change our "Retest X" phrases to "Retest X
>> (Reason)" phrases? With this change, a PR author will have to look at the
>> failed test logs. They could catch new flakiness introduced by their PR,
>> file a JIRA for flakiness that was not noted before, or ping an existing
>> JIRA issue/raise its severity. On the downside, this will require PR authors
>> to do more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>>>
>>> Adding retries can be beneficial in two ways, unblocking a PR, and 
>>> collecting metrics about the flakes.
>>
>>
>> Makes sense. I think we will still need to have a plan to remove retries 
>> similar to re-enabling disabled tests.
>>
>>>
>>>
>>> If we also had a flaky test leaderboard that showed which tests are the 
>>> most flaky, then we could take action on them. Encouraging someone from the 
>>> community to fix the flaky test is another issue.
>>>
>>> The test status matrix on the GitHub landing page could show the flake
>>> level to communicate to users which modules are losing a trustworthy test
>>> signal. Maybe this shows up as a flake % or a code coverage % that
>>> decreases due to disabled flaky tests.
>>
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>>
>>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>>
>>>
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>> What do other Apache projects do to address this issue?
>>>>
>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>> I agree with the comments in this thread.
>>>>> - If we are not re-enabling tests, or do not have a plan to re-enable
>>>>> them, disabling tests only provides temporary relief until users
>>>>> eventually find the issues that the disabled tests would have caught.
>>>>> - I feel similarly about retries. It is reasonable to add retries for
>>>>> reasons we understand. Adding retries just to avoid flakes is similar to
>>>>> disabling tests: they might hide real issues.
>>>>>
>>>>> I think we are missing a way for checking that we are making progress on 
>>>>> P1 issues. For example, P0 issues block releases and this obviously 
>>>>> results in fixing/triaging/addressing P0 issues at least every 6 weeks. 
>>>>> We do not have a similar process for flaky tests. I do not know what 
>>>>> would be a good policy. One suggestion is to ping (email/slack) assignees 
>>>>> of issues. I recently missed a flaky issue that was assigned to me. A 
>>>>> ping like that would have reminded me. And if an assignee cannot 
>>>>> help/does not have the time, we can try to find a new assignee.
>>>>>
>>>>> Ahmet
>>>>>
>>>>>
>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev 
>>>>> <valen...@google.com> wrote:
>>>>>>
>>>>>> I think the original discussion[1] on introducing tenacity might answer 
>>>>>> that question.
>>>>>>
>>>>>> [1] 
>>>>>> https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>>>
>>>>>>> Is there an observation that enabling tenacity improves the
>>>>>>> development experience on the Python SDK? E.g. less wait time to get a
>>>>>>> PR passing and merged? Or is it a matter of choosing the right number
>>>>>>> of retries to align with the "flakiness" of a test?
>>>>>>>
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev 
>>>>>>> <valen...@google.com> wrote:
>>>>>>>>
>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood 
>>>>>>>> the nature of flakiness.
>>>>>>>>
>>>>>>>> [1] 
>>>>>>>> https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
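
For anyone who hasn't used it, the pattern is just a decorator on the flaky
test. A minimal sketch (the attempt count and wait are placeholders, not what
the linked test actually uses):

    import unittest

    import tenacity


    class SometimesFlakyTest(unittest.TestCase):

        @tenacity.retry(
            reraise=True,                         # surface the real failure
            stop=tenacity.stop_after_attempt(3),  # bounded number of attempts
            wait=tenacity.wait_fixed(2))          # pause between attempts
        def test_path_with_understood_flakiness(self):
            # Body of the test whose flakiness we understand goes here.
            self.assertTrue(True)

Keeping the retry next to the test at least documents which tests are flaky
and bounds how often they may be retried, unlike a blanket CI-level retry.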
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> 
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests 
>>>>>>>>> at some point? Adding retries may be preferable to disabling the 
>>>>>>>>> test. We need a process to remove the retries ASAP though. As Luke 
>>>>>>>>> says that is not so easy to make happen. Having a way to make P1 bugs 
>>>>>>>>> more visible in an ongoing way may help.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't think I have seen tests that were previously disabled become 
>>>>>>>>>> re-enabled.
>>>>>>>>>>
>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 
>>>>>>>>>> in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to 
>>>>>>>>>> missing features so unrelated to being a flake.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> 
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries 
>>>>>>>>>>> tests if they fail, and has different modes to handle flaky tests. 
>>>>>>>>>>> Did we ever try or consider using it?
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective:
>>>>>>>>>>>> recently I had to retrigger a build 6 times due to flaky tests,
>>>>>>>>>>>> and each retrigger took one hour of waiting time.
>>>>>>>>>>>>
>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a
>>>>>>>>>>>> test is considered flaky if it both fails and succeeds for the
>>>>>>>>>>>> same git SHA. Not sure if there is anything we can enable to get
>>>>>>>>>>>> this automatically.
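
If we ever wanted to roll our own, the heuristic itself is tiny. A sketch,
assuming we could export (git_sha, test_name, passed) tuples from our CI
build records in some form:

    from collections import defaultdict


    def find_flaky_tests(results):
        """results: iterable of (git_sha, test_name, passed) tuples."""
        outcomes = defaultdict(set)
        for sha, test, passed in results:
            outcomes[(sha, test)].add(bool(passed))
        # Flaky == both a pass and a failure observed at the same commit.
        return sorted({test for (sha, test), seen in outcomes.items()
                       if len(seen) == 2})


    # Example: the Cassandra test is flagged because the same SHA both
    # passed and failed it (the SHA and results here are made up).
    print(find_flaky_tests([
        ("abc123", "FooTest.testStable", True),
        ("abc123", "CassandraIOTest.testEstimatedSizeBytes", True),
        ("abc123", "CassandraIOTest.testEstimatedSizeBytes", False),
    ]))
    # -> ['CassandraIOTest.testEstimatedSizeBytes']

As noted, the hard part is getting that data out of CI automatically, not the
grouping itself.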
>>>>>>>>>>>>
>>>>>>>>>>>> /Gleb
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it would be reasonable to disable/sickbay any flaky test
>>>>>>>>>>>>> that is actively blocking people. The collective cost of flaky
>>>>>>>>>>>>> tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to 
>>>>>>>>>>>>> assign these issues to the most relevant person (who added the 
>>>>>>>>>>>>> test/who generally maintains those components). Those people can 
>>>>>>>>>>>>> either fix and re-enable the tests, or remove them if they no 
>>>>>>>>>>>>> longer provide valuable signals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the 
>>>>>>>>>>>>>> last few days is that a large portion of time went to *just 
>>>>>>>>>>>>>> connecting failing runs with the corresponding Jira tickets or 
>>>>>>>>>>>>>> filing new ones*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>  - 
>>>>>>>>>>>>>> https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10460 
>>>>>>>>>>>>>> SparkPortableExecutionTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10471 
>>>>>>>>>>>>>> CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10504 
>>>>>>>>>>>>>> ElasticSearchIOTest > testWriteFullAddressing and 
>>>>>>>>>>>>>> testWriteWithIndexFn
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10470 
>>>>>>>>>>>>>> JdbcDriverTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8025 
>>>>>>>>>>>>>> CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10506 
>>>>>>>>>>>>>> SplunkEventWriterTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct 
>>>>>>>>>>>>>> runner ParDoLifecycleTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-9187 
>>>>>>>>>>>>>> DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are our P1 test flake bugs: 
>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right 
>>>>>>>>>>>>>> now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud 
>>>>>>>>>>>>>> <apill...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two test suites that are responsible for a large
>>>>>>>>>>>>>>> percentage of our flaky tests, and both have had bugs open for
>>>>>>>>>>>>>>> about a year without being fixed. These suites are
>>>>>>>>>>>>>>> ParDoLifecycleTest (BEAM-8101) in Java and
>>>>>>>>>>>>>>> BigQueryWriteIntegrationTests in Python (py3 BEAM-9484, py2
>>>>>>>>>>>>>>> BEAM-9232, old duplicate BEAM-8197).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we 
>>>>>>>>>>>>>>> do to mitigate the flakiness until someone has time to 
>>>>>>>>>>>>>>> investigate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrew
