I think the original discussion[1] on introducing tenacity might answer that question.
[1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E

On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:

> Is there an observation that enabling tenacity improves the
> development experience on Python SDK? E.g. less wait time to get PR pass
> and merged? Or it might be a matter of a right number of retry to align
> with the "flakiness" of a test?
>
> -Rui
>
> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com>
> wrote:
>
>> We used tenacity[1] to retry some unit tests for which we understood the
>> nature of flakiness.
>>
>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
>>
>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>
>>> Didn't we use something like that flaky retry plugin for Python tests at
>>> some point? Adding retries may be preferable to disabling the test. We need
>>> a process to remove the retries ASAP though. As Luke says that is not so
>>> easy to make happen. Having a way to make P1 bugs more visible in an
>>> ongoing way may help.
>>>
>>> Kenn
>>>
>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>
>>>> I don't think I have seen tests that were previously disabled become
>>>> re-enabled.
>>>>
>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in
>>>> Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing
>>>> features so unrelated to being a flake.
>>>>
>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>
>>>>> There is something called test-retry-gradle-plugin [1]. It retries
>>>>> tests if they fail, and have different modes to handle flaky tests. Did we
>>>>> ever try or consider using it?
>>>>>
>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>
>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com>
>>>>> wrote:
>>>>>
>>>>>> I agree with what Ahmet is saying. I can share my perspective,
>>>>>> recently I had to retrigger build 6 times due to flaky tests, and each
>>>>>> retrigger took one hour of waiting time.
>>>>>>
>>>>>> I've seen examples of automatic tracking of flaky tests, where a test
>>>>>> is considered flaky if both fails and succeeds for the same git SHA. Not
>>>>>> sure if there is anything we can enable to get this automatically.
>>>>>>
>>>>>> /Gleb
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that
>>>>>>> is actively blocking people. Collective cost of flaky tests for such a
>>>>>>> large group of contributors is very significant.
>>>>>>>
>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign
>>>>>>> these issues to the most relevant person (who added the test/who
>>>>>>> generally maintains those components). Those people can either fix and
>>>>>>> re-enable the tests, or remove them if they no longer provide valuable
>>>>>>> signals.
>>>>>>>
>>>>>>> Ahmet
>>>>>>>
>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> The situation is much worse than that IMO. My experience of the
>>>>>>>> last few days is that a large portion of time went to *just connecting
>>>>>>>> failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>
>>>>>>>> Summarized on PRs:
>>>>>>>>
>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>
>>>>>>>> The tickets:
>>>>>>>>
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>
>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>
>>>>>>>> It seems quite a few of them are actively hindering people right
>>>>>>>> now.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We have two test suites that are responsible for a large
>>>>>>>>> percentage of our flaky tests and both have bugs open for about a
>>>>>>>>> year without being fixed. These suites are ParDoLifecycleTest
>>>>>>>>> (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in
>>>>>>>>> Java and BigQueryWriteIntegrationTests in python (py3 BEAM-9484
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232
>>>>>>>>> <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate
>>>>>>>>> BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>
>>>>>>>>> Are there any volunteers to look into these issues? What can we do
>>>>>>>>> to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>
>>>>>>>>> Andrew
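
For reference, the rule Gleb mentions in the thread (a test counts as flaky if it both fails and succeeds for the same git SHA) could be sketched roughly as below. This is only an illustration, not something Beam's tooling is known to do: the `RunResult` record, the `find_flaky_tests` helper, and the sample data are all hypothetical, and it assumes CI results can be exported as (test name, git SHA, passed) records.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class RunResult(NamedTuple):
    """One CI result for a single test (hypothetical record shape)."""
    test_name: str
    git_sha: str
    passed: bool


def find_flaky_tests(runs: Iterable[RunResult]) -> Set[str]:
    """Return names of tests that both passed and failed on the same SHA."""
    # Map (test, sha) -> set of outcomes observed for that exact commit.
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run.test_name, run.git_sha)].add(run.passed)
    # Both True and False for one commit means the code did not change
    # between runs, so the differing outcome is attributed to flakiness.
    return {test for (test, _sha), seen in outcomes.items() if len(seen) == 2}


# Made-up sample data, for illustration only.
sample_runs = [
    RunResult("test_a", "abc123", False),
    RunResult("test_a", "abc123", True),   # same SHA, different outcome
    RunResult("test_b", "abc123", False),
    RunResult("test_b", "def456", True),   # outcome changed, but so did the SHA
    RunResult("test_c", "abc123", True),
]
flaky = find_flaky_tests(sample_runs)
```

With this sample data only `test_a` is reported. `test_b` failed and later passed, but on different commits, so a genuine fix cannot be ruled out and it is not counted as flaky.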