What do other Apache projects do to address this issue?

On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
> I agree with the comments in this thread.
> - If we are not re-enabling tests again, or we do not have a plan to re-enable them, disabling tests only provides us temporary relief until eventually users find the issues instead of the disabled tests.
> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.
>
> I think we are missing a way to check that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>
> Ahmet
>
> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>
>> I think the original discussion[1] on introducing tenacity might answer that question.
>>
>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>
>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
>>
>>> Is there an observation that enabling tenacity improves the development experience on Python SDK? E.g. less wait time to get a PR passing and merged? Or it might be a matter of choosing the right number of retries to align with the "flakiness" of a test?
>>>
>>> -Rui
>>>
>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
>>>
>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of flakiness.
>>>>
>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
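[For reference, the tenacity-based retry pattern referenced above looks roughly like the following. This is a minimal, self-contained sketch, not the actual code at the link: the test, the simulated flaky call, the exception type, and the retry settings are all illustrative.]

import random
import unittest

from tenacity import retry, retry_if_exception_type, stop_after_attempt


def _flaky_operation():
  # Stand-in for an operation with an understood, transient failure mode.
  if random.random() < 0.3:
    raise ConnectionError('transient failure')
  return 42


class FlakyTest(unittest.TestCase):

  # Retry up to 3 attempts, but only for the exception we believe is
  # transient; any other failure surfaces immediately, and the last
  # error is re-raised if every attempt fails.
  @retry(
      reraise=True,
      stop=stop_after_attempt(3),
      retry=retry_if_exception_type(ConnectionError))
  def test_flaky_operation(self):
    self.assertEqual(_flaky_operation(), 42)


if __name__ == '__main__':
  unittest.main()

[The key point of the pattern is that the retry is scoped to a single, understood failure mode rather than blanket-retrying every assertion error.]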
>>>>
>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
>>>>
>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>
>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>
>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so unrelated to being a flake.
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>
>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and has different modes to handle flaky tests. Did we ever try or consider using it?
>>>>>>>
>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
>>>>>>>
>>>>>>>> I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>
>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.
>>>>>>>>
>>>>>>>> /Gleb
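[For reference, the per-commit heuristic described above can be sketched in a few lines. The input format is hypothetical, (test name, git SHA, outcome) tuples; a real version would pull this history from Jenkins or whatever system stores the CI results.]

from collections import defaultdict

# Hypothetical export of CI results as (test name, git SHA, outcome) tuples.
results = [
    ('ParDoLifecycleTest', 'abc123', 'PASSED'),
    ('ParDoLifecycleTest', 'abc123', 'FAILED'),
    ('JdbcDriverTest', 'abc123', 'PASSED'),
    ('JdbcDriverTest', 'def456', 'PASSED'),
]

# Collect every outcome observed for each (test, SHA) pair.
outcomes = defaultdict(set)
for test, sha, outcome in results:
  outcomes[(test, sha)].add(outcome)

# A test is flagged as flaky if any single SHA produced both outcomes.
flaky = sorted({test for (test, _), seen in outcomes.items()
                if {'PASSED', 'FAILED'} <= seen})
print(flaky)  # ['ParDoLifecycleTest']

[A test that both passed and failed at the same SHA changed outcome without a code change, which is exactly the signal that it is flaky rather than broken.]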
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>
>>>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>
>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>
>>>>>>>>> Ahmet
>>>>>>>>>
>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>
>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>
>>>>>>>>>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>
>>>>>>>>>> The tickets:
>>>>>>>>>>
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>
>>>>>>>>>> Here are our P1 test flake bugs:
>>>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>
>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests, and both have bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>
>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>
>>>>>>>>>>> Andrew