On Thu, Jul 30, 2020 at 6:24 PM Ahmet Altay <al...@google.com> wrote:

> I like:
> *Include ignored or quarantined tests in the release notes*
> *Run flaky tests only in postcommit* (related? *Separate flaky tests into quarantine job*)

The quarantine job would allow them to run in presubmit still; we would just not use it to determine the health of a PR or block submission.

> *Require link to Jira to rerun a test*
>
> I am concerned about:
> *Add Gradle or Jenkins plugin to retry flaky tests* - because it is a convenient place for real bugs to hide.

This concern has come up a few times now, so I feel this is a route we shouldn't pursue further.

> I do not know much about:
> *Consider Gradle Enterprise*
> https://testautonation.com/analyse-test-results-deflake-flaky-tests/

There is a subscription fee for Gradle Enterprise, but it offers a lot of support for flaky tests and other metrics. I have a meeting with them on August 7th about the pricing model for open source projects. From what I understand, last time we spoke with them they did not have a good model for open source projects, and the fee was tied to the number of developers in the project.

> Thank you for putting together this list! I believe that even if we can commit to doing only some of these, we would have a much healthier project. If we can build consensus on implementing them, I will be happy to work on some of them.

On Fri, Jul 24, 2020 at 1:54 PM Kenneth Knowles <k...@apache.org> wrote:

Adding https://testautonation.com/analyse-test-results-deflake-flaky-tests/ to the list, which seems to be a more powerful test history tool.

On Fri, Jul 24, 2020 at 1:51 PM Kenneth Knowles <k...@apache.org> wrote:

Had some off-list chats to brainstorm and I wanted to bring ideas back to the dev@ list for consideration. A lot can be combined. I would really like to have a section in the release notes. I like the idea of banishing flakes from pre-commit (since you can't easily tell whether a failure was really caused by the PR) and auto-retrying in post-commit (so we can gather data on exactly what is flaking without a lot of manual investigation).

*Include ignored or quarantined tests in the release notes*
Pro:
- Users are aware of what is not being tested and so may be silently broken
- It forces discussion of ignored tests to be part of our community processes
Con:
- It may look bad if the list is large (this is actually also a Pro, because if it looks bad, it is bad)

*Run flaky tests only in postcommit*
Pro:
- isolates the bad signal so pre-commit is not affected
- saves pointless re-runs in pre-commit
- keeps a signal in post-commit that we can watch, instead of losing it completely when we disable a test
- maybe keeps the flaky tests in a job related to what they are testing
Con:
- we have to really watch post-commit or flakes can turn into failures

*Separate flaky tests into quarantine job*
Pro:
- gains signal for healthy tests, as with disabling or running in post-commit
- also saves pointless re-runs
Con:
- may collect bad tests that we never look at, which is the same as disabling the test
- lots of unrelated tests grouped into one signal, instead of a signal focused on the health of a particular component

*Add Gradle or Jenkins plugin to retry flaky tests*
https://blog.gradle.org/gradle-flaky-test-retry-plugin
https://plugins.jenkins.io/flaky-test-handler/
Pro:
- easier than Jiras with humans pasting links; works with moving flakes to post-commit
- gives a somewhat automated view of flakiness, whether in pre-commit or post-commit
- we don't get stopped by flakiness
Con:
- maybe too easy to ignore flakes; we should add all flakes (not just disabled or quarantined ones) to the release notes
- sometimes flakes are actual bugs (like concurrency bugs), so treating this as OK is not desirable
- without Jiras, no automated release notes
- Jenkins: retry will only work at the job level, because it needs Maven to retry only the failed tests (I think)
- Jenkins: some of our jobs may have duplicate test names (but this might already be fixed)

*Consider Gradle Enterprise*
Pro:
- gets Gradle scan granularity of flake data (and other stuff)
- also gives module-level health, which we do not have today
Con:
- cost and administrative burden unknown
- we probably have to do some small work to make our jobs compatible with their history tracking

*Require link to Jira to rerun a test*
Instead of saying "Run Java PreCommit", you have to link to the bug relating to the failure.
Pro:
- forces investigation
- helps others find out about issues
Con:
- adds a lot of manual work, or requires automation (which will probably be ad hoc and fragile)

Kenn

On Mon, Jul 20, 2020 at 11:59 AM Brian Hulette <bhule...@google.com> wrote:

> I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.

Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days because of a Jira comment like [1], but that is too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?

That wouldn't help us with P1s that have no assignee, or that are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.

[1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
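
For illustration, here is a minimal sketch of the kind of periodic ping described above, assuming a scheduled job with Jira credentials and the jira Python package; the JQL filter, the 7-day threshold, and the comment text are illustrative, not an existing Beam Jira Bot feature:

```python
# Hypothetical stale-flake reminder; not an existing Beam Jira Bot feature.
from jira import JIRA  # pip install jira

STALE_DAYS = 7  # the "N days" from the discussion above; illustrative only


def ping_stale_flakes(user, password, server="https://issues.apache.org/jira"):
    client = JIRA(server=server, basic_auth=(user, password))
    # Unresolved flake issues with no updates in the last STALE_DAYS days.
    jql = (
        "project = BEAM AND resolution = Unresolved AND labels = flake "
        f"AND updated <= -{STALE_DAYS}d ORDER BY priority DESC"
    )
    for issue in client.search_issues(jql, maxResults=200):
        assignee = issue.fields.assignee
        mention = f"[~{assignee.name}]" if assignee else "(unassigned)"
        client.add_comment(
            issue,
            f"{mention} This flake has had no updates for {STALE_DAYS}+ days. "
            "Please update it, reassign it, or adjust its priority.")
```

The same query, filtered to issues with no assignee, could also feed the dashboard or report mentioned above.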

On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:

Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change a PR author will have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside, this will require PR authors to do more.

On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:

> Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.

Makes sense. I think we will still need a plan to remove retries, similar to re-enabling disabled tests.

> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>
> The test status matrix on the GitHub landing page could show flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.

+1 to a dashboard that will show a "leaderboard" of flaky tests.

> I didn't look for plugins, just dreaming up some options.

On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:

What do other Apache projects do to address this issue?

On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:

I agree with the comments in this thread.
- If we are not re-enabling tests, or do not have a plan to re-enable them, disabling tests only provides us temporary relief until users eventually find the issues instead of the disabled tests.
- I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.

I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) the assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.

Ahmet

On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:

I think the original discussion[1] on introducing tenacity might answer that question.

[1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E

On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:

Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or is it a matter of picking the right number of retries to align with the "flakiness" of a test?

-Rui

On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:

We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.

[1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
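
As a rough sketch of that tenacity pattern (the retry settings and the _flaky_dependency helper below are made up for illustration; see the linked fn_runner_test.py for the actual usage):

```python
# Illustrative only; see the linked fn_runner_test.py for how Beam applies this.
import unittest

from tenacity import retry, stop_after_attempt, wait_fixed

_attempts = {"count": 0}


def _flaky_dependency():
    # Stand-in for a call with a known, understood source of flakiness:
    # here it fails on the first two attempts and succeeds on the third.
    _attempts["count"] += 1
    return _attempts["count"] >= 3


class FlakyExampleTest(unittest.TestCase):

    # Retry up to 3 attempts with a short pause; reraise=True propagates the
    # last AssertionError so a persistent failure still fails the suite.
    @retry(reraise=True, stop=stop_after_attempt(3), wait=wait_fixed(0.5))
    def test_known_flaky_interaction(self):
        self.assertTrue(_flaky_dependency())


if __name__ == "__main__":
    unittest.main()
```

As the replies below note, any retry like this needs a plan for removal so that it does not end up hiding a real bug.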

On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:

Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP, though. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.

Kenn

On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:

I don't think I have seen tests that were previously disabled become re-enabled.

It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to being a flake.

On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:

There is something called test-retry-gradle-plugin [1]. It retries tests if they fail and has different modes to handle flaky tests. Did we ever try or consider using it?

[1]: https://github.com/gradle/test-retry-gradle-plugin

On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:

I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.

I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.

/Gleb
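
As a rough sketch of that idea (the find_flaky_tests function and the (test name, git SHA, passed) record format are hypothetical, assuming CI results can be exported in some such form):

```python
# Hypothetical sketch: a test is flagged as flaky if, at the same git SHA,
# it has both a passing and a failing run. The record format is illustrative.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_flaky_tests(results: Iterable[Tuple[str, str, bool]]) -> Dict[str, List[str]]:
    """results: (test_name, git_sha, passed) tuples from CI history."""
    outcomes = defaultdict(set)  # (test_name, git_sha) -> set of pass/fail outcomes
    for test_name, git_sha, passed in results:
        outcomes[(test_name, git_sha)].add(passed)

    flaky = defaultdict(list)  # test_name -> SHAs where it both passed and failed
    for (test_name, git_sha), seen in outcomes.items():
        if len(seen) == 2:  # both True and False observed at the same commit
            flaky[test_name].append(git_sha)
    return dict(flaky)


# Example: ParDoLifecycleTest both fails and passes at the same SHA, so it is flaky.
print(find_flaky_tests([
    ("ParDoLifecycleTest", "abc123", False),
    ("ParDoLifecycleTest", "abc123", True),
    ("JdbcDriverTest", "abc123", True),
]))
```

A report like this could also feed the flaky test "leaderboard" discussed earlier in the thread.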

On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:

I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.

Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (whoever added the test or generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.

Ahmet

On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:

The situation is much worse than that, IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.

Summarized on PRs:

- https://github.com/apache/beam/pull/12272#issuecomment-659050891
- https://github.com/apache/beam/pull/12273#issuecomment-659070317
- https://github.com/apache/beam/pull/12225#issuecomment-656973073
- https://github.com/apache/beam/pull/12225#issuecomment-657743373
- https://github.com/apache/beam/pull/12224#issuecomment-657744481
- https://github.com/apache/beam/pull/12216#issuecomment-657735289
- https://github.com/apache/beam/pull/12216#issuecomment-657780781
- https://github.com/apache/beam/pull/12216#issuecomment-657799415

The tickets:

- https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
- https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
- https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
- https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
- https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
- https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
- https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
- https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
- https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest

Here are our P1 test flake bugs:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC

It seems quite a few of them are actively hindering people right now.

Kenn

On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:

We have two test suites that are responsible for a large percentage of our flaky tests, and both have had bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).

Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?

Andrew