I agree with what Ahmet is saying. To share my own experience: I recently had to retrigger a build 6 times because of flaky tests, and each retrigger meant an hour of waiting.
I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. I'm not sure whether there is anything we can enable to get this automatically.

/Gleb

On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:

> I think it will be reasonable to disable/sickbay any flaky test that is
> actively blocking people. Collective cost of flaky tests for such a large
> group of contributors is very significant.
>
> Most of these issues are unassigned. IMO, it makes sense to assign these
> issues to the most relevant person (who added the test/who generally
> maintains those components). Those people can either fix and re-enable the
> tests, or remove them if they no longer provide valuable signals.
>
> Ahmet
>
> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> The situation is much worse than that IMO. My experience of the last few
>> days is that a large portion of time went to *just connecting failing runs
>> with the corresponding Jira tickets or filing new ones*.
>>
>> Summarized on PRs:
>>
>> - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>> - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>> - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>> - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>> - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>> - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>> - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>> - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>
>> The tickets:
>>
>> - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>> - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>> - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>> - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>> - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>> - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>> - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>> - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>> - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>
>> Here are our P1 test flake bugs:
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>
>> It seems quite a few of them are actively hindering people right now.
>>
>> Kenn
>>
>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
>>
>>> We have two test suites that are responsible for a large percentage of
>>> our flaky tests and both have bugs open for about a year without being
>>> fixed. These suites are ParDoLifecycleTest (BEAM-8101
>>> <https://issues.apache.org/jira/browse/BEAM-8101>) in Java
>>> and BigQueryWriteIntegrationTests in python (py3 BEAM-9484
>>> <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232
>>> <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate
>>> BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>
>>> Are there any volunteers to look into these issues? What can we do to
>>> mitigate the flakiness until someone has time to investigate?
>>>
>>> Andrew
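
P.S. To make the same-SHA idea from the top of this message concrete, here is a rough sketch of the rule: a test counts as flaky when it has at least one pass and one fail recorded against the same git SHA. The record shape (`RunRecord`) and the function name are made up for illustration, and the SHAs are placeholders. This is not an existing Beam or Jenkins feature, just the heuristic written out.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class RunRecord(NamedTuple):
    """One recorded test result (hypothetical record shape)."""
    test_name: str
    git_sha: str
    passed: bool


def find_flaky_tests(runs: Iterable[RunRecord]) -> Set[str]:
    """Return names of tests that both passed and failed on the same SHA."""
    # (test_name, git_sha) -> set of outcomes observed for that pair
    outcomes = defaultdict(set)
    for run in runs:
        outcomes[(run.test_name, run.git_sha)].add(run.passed)
    # Flaky means both True and False were seen for one (test, SHA) pair.
    return {name for (name, _sha), seen in outcomes.items() if seen == {True, False}}


# Placeholder data: the SHAs are invented for the example.
runs = [
    RunRecord("ParDoLifecycleTest", "abc123", False),
    RunRecord("ParDoLifecycleTest", "abc123", True),   # same SHA, other outcome
    RunRecord("JdbcDriverTest", "abc123", False),
    RunRecord("JdbcDriverTest", "def456", True),       # different SHA, not counted
]

print(find_flaky_tests(runs))  # prints {'ParDoLifecycleTest'}
```

Note that a test which fails on one commit and passes on a later one is deliberately not flagged, since that may just be a real fix. Only retriggers of the same commit contribute evidence.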