We used tenacity[1] to retry some unit tests for which we understood the nature of flakiness.
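For anyone unfamiliar with the pattern, here is a minimal sketch of what retrying a known-flaky unit test looks like. To stay runnable without extra dependencies it uses a small hand-rolled decorator rather than tenacity itself; with tenacity the decorator line would be roughly `@retry(reraise=True, stop=stop_after_attempt(3))`. The test class, the call counter, and `run_example` are made up for illustration and are not taken from the Beam test linked in [1].

```python
import functools
import unittest


def retry(max_attempts=3, exceptions=(AssertionError,)):
    """Re-run the wrapped callable until it passes or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the real failure
        return wrapper
    return decorator


class FlakyExampleTest(unittest.TestCase):
    # Hypothetical counter standing in for a nondeterministic dependency.
    calls = 0

    @retry(max_attempts=3)
    def test_eventually_passes(self):
        type(self).calls += 1
        # Fails on the first two attempts, passes on the third.
        self.assertGreaterEqual(type(self).calls, 3)


def run_example():
    """Run the example test case and report (passed, number_of_attempts)."""
    FlakyExampleTest.calls = 0
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FlakyExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful(), FlakyExampleTest.calls
```

Note that only the test method body is re-run; `setUp` and `tearDown` are not repeated between attempts, so this only suits tests whose flakiness does not depend on per-test fixtures.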
[1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156

On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <[email protected]> wrote:

> Didn't we use something like that flaky retry plugin for Python tests at
> some point? Adding retries may be preferable to disabling the test. We need
> a process to remove the retries ASAP though. As Luke says that is not so
> easy to make happen. Having a way to make P1 bugs more visible in an
> ongoing way may help.
>
> Kenn
>
> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <[email protected]> wrote:
>
>> I don't think I have seen tests that were previously disabled become
>> re-enabled.
>>
>> It seems as though we have about ~60 disabled tests in Java and ~15 in
>> Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing
>> features so unrelated to being a flake.
>>
>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <[email protected]> wrote:
>>
>>> There is something called test-retry-gradle-plugin [1]. It retries tests
>>> if they fail, and have different modes to handle flaky tests. Did we ever
>>> try or consider using it?
>>>
>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>
>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <[email protected]> wrote:
>>>
>>>> I agree with what Ahmet is saying. I can share my perspective, recently
>>>> I had to retrigger build 6 times due to flaky tests, and each retrigger
>>>> took one hour of waiting time.
>>>>
>>>> I've seen examples of automatic tracking of flaky tests, where a test
>>>> is considered flaky if both fails and succeeds for the same git SHA. Not
>>>> sure if there is anything we can enable to get this automatically.
>>>>
>>>> /Gleb
>>>>
>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <[email protected]> wrote:
>>>>
>>>>> I think it will be reasonable to disable/sickbay any flaky test that
>>>>> is actively blocking people. Collective cost of flaky tests for such a
>>>>> large group of contributors is very significant.
>>>>>
>>>>> Most of these issues are unassigned. IMO, it makes sense to assign
>>>>> these issues to the most relevant person (who added the test/who generally
>>>>> maintains those components). Those people can either fix and re-enable the
>>>>> tests, or remove them if they no longer provide valuable signals.
>>>>>
>>>>> Ahmet
>>>>>
>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> The situation is much worse than that IMO. My experience of the last
>>>>>> few days is that a large portion of time went to *just connecting failing
>>>>>> runs with the corresponding Jira tickets or filing new ones*.
>>>>>>
>>>>>> Summarized on PRs:
>>>>>>
>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>
>>>>>> The tickets:
>>>>>>
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>
>>>>>> Here are our P1 test flake bugs:
>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>
>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> We have two test suites that are responsible for a large percentage
>>>>>>> of our flaky tests and both have bugs open for about a year without
>>>>>>> being fixed. These suites are ParDoLifecycleTest (BEAM-8101
>>>>>>> <https://issues.apache.org/jira/browse/BEAM-8101>) in Java
>>>>>>> and BigQueryWriteIntegrationTests in python (py3 BEAM-9484
>>>>>>> <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232
>>>>>>> <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate
>>>>>>> BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>
>>>>>>> Are there any volunteers to look into these issues? What can we do
>>>>>>> to mitigate the flakiness until someone has time to investigate?
>>>>>>>
>>>>>>> Andrew
