Hi all,
+1 on pinging the assigned person.
The flakes I know of (ESIO and CassandraIO) are due to the load on the CI
server. These IOs are tested against real embedded backends because those
backends are complex and we need relevant tests. Counter measures have been
taken (retries inside the tests that are sensitive to load, acceptable ranges
of values instead of exact numbers, calls to internal backend mechanisms to
force a refresh when load prevented the backend from doing it itself, ...).
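For illustration, the pattern looks roughly like this (a made-up Python sketch
only; the real counter measures live in the Java ESIO/CassandraIO tests, and
all names and numbers below are hypothetical):

import time

EXPECTED = 1000       # documents the test writes (hypothetical)
TOLERANCE = 0.10      # accept +/-10% because results vary under load
MAX_ATTEMPTS = 5

def count_with_retries(backend):
    """Retry the read when CI load delays the backend, forcing a refresh each time."""
    for attempt in range(MAX_ATTEMPTS):
        backend.force_refresh()      # e.g. trigger the index refresh the backend skipped under load
        count = backend.count_documents()
        if abs(count - EXPECTED) <= EXPECTED * TOLERANCE:
            break
        time.sleep(2 ** attempt)     # back off, the CI worker may be overloaded
    return count

def test_write_then_count(backend):
    count = count_with_retries(backend)
    assert EXPECTED * (1 - TOLERANCE) <= count <= EXPECTED * (1 + TOLERANCE)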
I recently got pinged by Ahmet (thanks to him!) about a flakiness that I had
not seen. That seems to me the correct way to go. Systematically retrying
tests with a CI mechanism, or disabling tests, seems to me a risky workaround
that just gets the problem off our minds.
Etienne
On 20/07/2020 20:58, Brian Hulette wrote:
> I think we are missing a way of checking that we are making progress on
> P1 issues. For example, P0 issues block releases, and this obviously
> results in fixing/triaging/addressing P0 issues at least every 6 weeks.
> We do not have a similar process for flaky tests. I do not know what
> would be a good policy. One suggestion is to ping (email/slack)
> assignees of issues. I recently missed a flaky issue that was assigned
> to me; a ping like that would have reminded me. And if an assignee
> cannot help or does not have the time, we can try to find a new assignee.
Yeah, I think this is something we should address. With the new Jira
automation, assignees should at least get an email notification after 30 days,
via a Jira comment like [1], but that's too long to let a test continue to
flake. Could Beam Jira Bot ping every N days for P1s that aren't making
progress? (Rough sketch of what I mean below.)
That wouldn't help us with P1s that have no assignee, or are assigned
to overloaded people. It seems we'd need some kind of dashboard or
report to capture those.
[1]
https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
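For example, a small scheduled script along these lines could do the nagging
(a rough, untested sketch using the Python jira client; the JQL mirrors Kenn's
P1 flake query later in this thread, and the 14-day window and bot credentials
are placeholders):

from jira import JIRA

# JQL mirrors the flake query later in this thread; the staleness window and
# priority handling are placeholders and would need adjusting to our Jira setup.
STALE_FLAKES_JQL = (
    'project = BEAM AND status in (Open, "In Progress") '
    'AND resolution = Unresolved AND labels = flake AND updated <= -14d'
)

def nag_stale_flakes():
    jira = JIRA(server='https://issues.apache.org/jira',
                basic_auth=('beam-jira-bot', '<token>'))  # placeholder credentials
    for issue in jira.search_issues(STALE_FLAKES_JQL, maxResults=200):
        assignee = issue.fields.assignee
        who = assignee.displayName if assignee else 'currently unassigned'
        jira.add_comment(
            issue,
            'This flake issue (%s) has had no updates for 14+ days. '
            'Is it still flaky? Please update, reassign, or close.' % who)

if __name__ == '__main__':
    nag_stale_flakes()

That would also surface the unassigned ones each time it comments.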
On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
Another idea: could we change our "Retest X" phrases to "Retest X (Reason)"
phrases? With this change a PR author would have to look at the failed test
logs. They could catch new flakiness introduced by their PR, file a JIRA for
flakiness that was not noted before, or ping an existing JIRA issue / raise
its severity. On the downside, this will require PR authors to do more.
On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:
Adding retries can be beneficial in two ways: unblocking a PR, and
collecting metrics about the flakes.
Makes sense. I think we will still need to have a plan to remove
retries similar to re-enabling disabled tests.
If we also had a flaky test leaderboard that showed which
tests are the most flaky, then we could take action on them.
Encouraging someone from the community to fix the flaky test
is another issue.
The test status matrix on the GitHub landing page could show a flake
level, to communicate to users which modules are losing a trustworthy
test signal. Maybe this shows up as a flake % or a code coverage %
that decreases due to disabled flaky tests.
+1 to a dashboard that will show a "leaderboard" of flaky tests.
I didn't look for plugins, just dreaming up some options.
On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
What do other Apache projects do to address this issue?
On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
I agree with the comments in this thread.
- If we are not re-enabling tests, or do not have a plan to re-enable
them, disabling tests only provides us temporary relief until
eventually users, rather than the disabled tests, find the issues.
- I feel similarly about retries. It is reasonable to add retries for
reasons we understand. Adding retries just to avoid flakes is similar
to disabling tests: they might hide real issues.
I think we are missing a way of checking that we are making progress
on P1 issues. For example, P0 issues block releases, and this
obviously results in fixing/triaging/addressing P0 issues at least
every 6 weeks. We do not have a similar process for flaky tests. I do
not know what would be a good policy. One suggestion is to ping
(email/slack) assignees of issues. I recently missed a flaky issue
that was assigned to me; a ping like that would have reminded me. And
if an assignee cannot help or does not have the time, we can try to
find a new assignee.
Ahmet
On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:
I think the original discussion[1] on introducing tenacity might answer
that question.
[1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:
Is there an observation that enabling tenacity improves the development
experience on the Python SDK? E.g. less wait time to get a PR passing and
merged? Or is it a matter of choosing the right number of retries to align
with the "flakiness" of a test?
-Rui
On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:
We used tenacity[1] to retry some unit tests for which we understood the
nature of the flakiness.
[1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
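For reference, the usage looks roughly like this (a minimal sketch, not the
exact decorator from fn_runner_test.py; the attempt count and test body are
made up):

import unittest
import tenacity

def flaky_read():
  # Placeholder for the operation whose flakiness we understand.
  return True

class RetriedTest(unittest.TestCase):

  @tenacity.retry(reraise=True, stop=tenacity.stop_after_attempt(3))
  def test_eventually_consistent_read(self):
    # tenacity re-runs the test body up to 3 times before the last
    # failure is reported to the test runner.
    self.assertTrue(flaky_read())

if __name__ == '__main__':
  unittest.main()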
On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:
Didn't we use something like that flaky retry plugin for Python tests at
some point? Adding retries may be preferable to disabling the test. We need
a process to remove the retries ASAP though. As Luke says, that is not so
easy to make happen. Having a way to make P1 bugs more visible in an ongoing
way may help.
Kenn
On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
I don't think I have seen tests that were previously disabled become
re-enabled.
It seems as though we have about ~60 disabled tests in Java and ~15 in
Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing
features, so they are unrelated to being a flake.
On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:
There is something called test-retry-gradle-plugin [1]. It retries tests if
they fail, and it has different modes to handle flaky tests. Did we ever try
or consider using it?
[1]: https://github.com/gradle/test-retry-gradle-plugin
On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:
I agree with what Ahmet is saying. I can share my perspective: recently I
had to retrigger a build 6 times due to flaky tests, and each retrigger took
one hour of waiting time.
I've seen examples of automatic tracking of flaky tests, where a test is
considered flaky if it both fails and succeeds for the same git SHA. Not
sure if there is anything we can enable to get this automatically.
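Even without a plugin, a small script over exported CI results could flag
them, something like this (rough sketch; the (git_sha, test_name, outcome)
CSV input is made up):

import csv
from collections import defaultdict

def find_flaky_tests(results_csv):
    # Map (git SHA, test name) -> set of outcomes seen for that exact commit.
    outcomes = defaultdict(set)
    with open(results_csv) as f:
        for row in csv.DictReader(f):
            outcomes[(row['git_sha'], row['test_name'])].add(row['outcome'])
    # Flaky: the same SHA produced both a pass and a failure for the test.
    return sorted({test for (_, test), seen in outcomes.items()
                   if {'PASSED', 'FAILED'} <= seen})

if __name__ == '__main__':
    for test in find_flaky_tests('ci_results.csv'):  # placeholder export path
        print(test)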
/Gleb
On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
I think it would be reasonable to disable/sickbay any flaky test that is
actively blocking people. The collective cost of flaky tests for such a
large group of contributors is very significant.
Most of these issues are unassigned. IMO, it makes sense to assign these
issues to the most relevant person (whoever added the test or generally
maintains those components). Those people can either fix and re-enable the
tests, or remove them if they no longer provide valuable signals.
Ahmet
On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:
The situation is much worse than that IMO. My experience of the last few
days is that a large portion of time went to *just connecting failing runs
with the corresponding Jira tickets or filing new ones*.
Summarized on PRs:
- https://github.com/apache/beam/pull/12272#issuecomment-659050891
- https://github.com/apache/beam/pull/12273#issuecomment-659070317
- https://github.com/apache/beam/pull/12225#issuecomment-656973073
- https://github.com/apache/beam/pull/12225#issuecomment-657743373
- https://github.com/apache/beam/pull/12224#issuecomment-657744481
- https://github.com/apache/beam/pull/12216#issuecomment-657735289
- https://github.com/apache/beam/pull/12216#issuecomment-657780781
- https://github.com/apache/beam/pull/12216#issuecomment-657799415
The tickets:
- https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
- https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
- https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
- https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
- https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
- https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
- https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
- https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
- https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
Here are our P1 test flake bugs:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
It seems quite a few of them are actively hindering people right now.
Kenn
On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:
We have two test suites that are responsible for a large percentage of our
flaky tests, and both have had bugs open for about a year without being
fixed. These suites are ParDoLifecycleTest (BEAM-8101
<https://issues.apache.org/jira/browse/BEAM-8101>) in Java and
BigQueryWriteIntegrationTests in Python (py3 BEAM-9484
<https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232
<https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197
<https://issues.apache.org/jira/browse/BEAM-8197>).
Are there any volunteers to look into these issues? What can we do to
mitigate the flakiness until someone has time to investigate?
Andrew