Had some off-list chats to brainstorm, and I wanted to bring the ideas back to the dev@ list for consideration. A lot of them can be combined. I would really like to have a section in the release notes listing ignored and quarantined tests. I like the idea of banishing flakes from pre-commit (since you can't easily tell whether a failure was real and caused by the PR) and auto-retrying them in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).
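
As a concrete sketch of the retry-plugin option below: something roughly like this in a module's build.gradle could enable retries only for post-commit jobs. This is untested; the postCommitRetries property name is made up for illustration, and the plugin version should be checked against the Gradle plugin portal.

    // Rough sketch for a single module's build.gradle (Groovy DSL); untested.
    plugins {
        id "org.gradle.test-retry" version "1.1.6"  // check the plugin portal for the current version
    }

    tasks.withType(Test).configureEach {
        retry {
            // Hypothetical flag: pass -PpostCommitRetries only on post-commit jobs,
            // so pre-commit still fails fast and flakes stay visible to PR authors.
            maxRetries = project.hasProperty("postCommitRetries") ? 3 : 0
            // Stop retrying if a run is broadly broken rather than merely flaky.
            maxFailures = 10
            // Leave false in post-commit so a test that passes on retry does not fail
            // the build, but its earlier failed executions still land in the report.
            failOnPassedAfterRetry = false
        }
    }

With failOnPassedAfterRetry left false in post-commit, the retried failures still show up in the test report, which is exactly the flake data we want to collect.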
*Include ignored or quarantined tests in the release notes*

Pro:
 - Users are aware of what is not being tested and so may be silently broken
 - It forces discussion of ignored tests to be part of our community processes

Con:
 - It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)

*Run flaky tests only in post-commit*

Pro:
 - Isolates the bad signal so pre-commit is not affected
 - Saves pointless re-runs in pre-commit
 - Keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
 - Maybe keeps the flaky tests in a job related to what they are testing

Con:
 - We have to really watch post-commit, or flakes can turn into failures

*Separate flaky tests into a quarantine job*

Pro:
 - Gains a clean signal for the healthy tests, as with disabling or running in post-commit
 - Also saves pointless re-runs

Con:
 - May collect bad tests that we never look at, which makes it the same as disabling the tests
 - Lots of unrelated tests grouped into one signal, instead of a signal focused on the health of a particular component

*Add a Gradle or Jenkins plugin to retry flaky tests*

https://blog.gradle.org/gradle-flaky-test-retry-plugin
https://plugins.jenkins.io/flaky-test-handler/

Pro:
 - Easier than Jiras with a human pasting links; works well with moving flakes to post-commit
 - Gives a somewhat automated view of flakiness, whether in pre-commit or post-commit
 - We don't get stopped by flakiness

Con:
 - Maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined ones) to the release notes
 - Sometimes flakes are actual bugs (like concurrency bugs), so treating them as OK is not desirable
 - Without Jiras, no automated release notes
 - Jenkins: retry will only work at the job level, because it needs Maven to retry only the failed tests (I think)
 - Jenkins: some of our jobs may have duplicate test names (but that might already be fixed)

*Consider Gradle Enterprise*

Pro:
 - Gives Gradle-scan granularity of flake data (and other things)
 - Also gives module-level health, which we do not have today

Con:
 - Cost and administrative burden are unknown
 - We probably have to do some small amount of work to make our jobs compatible with their history tracking

*Require a link to a Jira issue to rerun a test*

Instead of saying "Run Java PreCommit" you would have to link to the bug relating to the failure.

Pro:
 - Forces investigation
 - Helps others find out about known issues

Con:
 - Adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)

Kenn

On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:

> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>
> Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days because of a Jira comment like [1], but that's too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?
>
> That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people.
> It seems we'd need some kind of dashboard or report to capture those.
>
> [1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>
>> Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
>>
>>> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.
>>
>> Makes sense. I think we will still need to have a plan to remove retries, similar to re-enabling disabled tests.
>>
>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>
>>> The test status matrix on the GitHub landing page could show the flake level, to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> What do other Apache projects do to address this issue?
>>>>
>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> I agree with the comments in this thread.
>>>>> - If we are not re-enabling tests again, or we do not have a plan to re-enable them, disabling tests only provides us temporary relief until eventually users find the issues instead of the disabled tests.
>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>
>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>
>>>>> Ahmet
>>>>>
>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>
>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>>>>>
>>>>>>> Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or might it be a matter of choosing the right number of retries to align with the "flakiness" of a test?
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>>>>>
>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>
>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to being a flake.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes to handle flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>
>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>
>>>>>>>>>>>> /Gleb
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these issues are unassigned.
>>>>>>>>>>>>> IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and both have bugs open for about a year without being fixed.
>>>>>>>>>>>>>>> These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrew