> If changes to core are causing Dataflow precommits to fail but not
> local precommits, that suggests we lack test coverage?
What is the difference between a "Dataflow precommit" and a "local
precommit" (besides that the latter can be run without GCP)? If the
"local precommit" should catch _all_ regressions, what would be the
reason to have any other precommits? My intuition is that precommit
checks (those run as part of CI on pull requests) should ideally be
runnable by virtually anyone locally. Any checks that require a
specific environment should be run optionally (e.g. like the validates
runner suites).
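(A minimal sketch of what "optional" could look like in practice,
assuming JUnit 4, a hypothetical test class name, and that GCP
credentials arrive via GOOGLE_APPLICATION_CREDENTIALS: a cloud-dependent
test could skip itself rather than fail when no credentials are
configured, so the same suite stays runnable locally.)

import static org.junit.Assume.assumeTrue;

import org.junit.Before;
import org.junit.Test;

// Hypothetical integration test that needs GCP; name and body are illustrative only.
public class ExampleDataflowIT {

  @Before
  public void requireGcpCredentials() {
    // Turn "no credentials" into a skipped test instead of a failure,
    // so contributors without GCP access can still run the suite.
    assumeTrue(
        "GOOGLE_APPLICATION_CREDENTIALS not set; skipping Dataflow test",
        System.getenv("GOOGLE_APPLICATION_CREDENTIALS") != null);
  }

  @Test
  public void pipelineRunsOnDataflow() {
    // ... submit a pipeline to Dataflow and assert on the result ...
  }
}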
On 8/16/21 7:11 PM, Andrew Pilloud wrote:
I can confirm the tests are passing now. Thank you.
If changes to core are causing Dataflow precommits to fail but not
local precommits, that suggests we lack test coverage? I'm not
suggesting we remove the Dataflow tests entirely, just that we
consider removing them from the precommits where there is overlapping
test coverage.
I would be +1 in favor of a flag as it would allow us to easily
disable Dataflow tests in precommits should we have another outage.
On Mon, Aug 16, 2021 at 9:52 AM Jan Lukavský <je...@seznam.cz> wrote:
The issue is with pull requests. IIRC, I didn't encounter this
problem myself, but I can imagine that a change in core could make a
Dataflow precommit fail. And it would be complicated to fix
this without GCP credentials.
So, to answer the question: no, I think it would not help, as
long as the flag is not also used in CI.
On 8/16/21 6:47 PM, Luke Cwik wrote:
Jan, it would be possible to add a flag that says to skip any IT
tests that require a cloud service of any kind. Would that work
for you?
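(Purely as an illustration of how such a flag might be backed, and not
an existing Beam API: a hypothetical JUnit category marking tests that
need a cloud service, which the build could exclude when the flag is
set.)

import org.junit.Test;
import org.junit.experimental.categories.Category;

// Hypothetical marker for tests that need any external cloud service.
interface UsesCloudService {}

@Category(UsesCloudService.class)
public class ExampleCloudServiceIT {
  @Test
  public void readsFromCloudService() {
    // ... test body that talks to a real cloud service ...
  }
}

A build property (for example a hypothetical -PskipCloudTests) could
then exclude the UsesCloudService category from the test task, both
locally and, if desired, in CI.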
It turns out that the fix was rolled out and finished about 45
mins ago so my prior e-mail was already out of date when I sent
it. If you had a test that failed on your PR, please feel free to
restart the test using the github trigger phrase associated with it.
I reran one of the suites that were perma-red
https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Cron/4059
and it passed.
On Mon, Aug 16, 2021 at 9:29 AM Jan Lukavský <je...@seznam.cz> wrote:
Not directly related to the 'flakiness' discussion of this
thread, but I think it would be good if pre-commit checks
could be run locally without GCP credentials.
On 8/16/21 6:24 PM, Luke Cwik wrote:
The fix was inadvertently run in dry-run mode, so it didn't make
any changes. Since the fix was taking a couple of hours or
so and it was getting late on Friday, people didn't want to
start it again until today (after the weekend).
I don't think removing the few tests that run an unbounded
pipeline on Dataflow for the long term is a good idea. Sure,
we can disable them and re-enable them when there is an
issue that is blocking folks.
On Mon, Aug 16, 2021 at 9:19 AM Andrew Pilloud <apill...@google.com> wrote:
The estimated two hours to a fix has long passed and we
are now at 18 days since the last successful run. What
is the latest estimate?
It sounds like these tests are primarily testing
Dataflow, not Beam. They seem like good candidates to
remove from the precommit (or limit to Dataflow runner
changes) even after they are fixed.
On Fri, Aug 13, 2021 at 6:48 PM Luke Cwik <lc...@google.com> wrote:
The failure is due to data associated with the
apache-beam-testing project, which is impacting all the
Dataflow streaming tests.
Yes, disabling the tests should have happened weeks ago if:
1) The fix seemed like it was going to take a long time
(unknown at the time)
2) We had confidence in test coverage minus Dataflow
streaming test coverage (which I believe we did)
On Fri, Aug 13, 2021 at 6:27 PM Andrew Pilloud <apill...@google.com> wrote:
Or if a rollback won't fix this, can we disable
the broken tests?
On Fri, Aug 13, 2021 at 6:25 PM Andrew Pilloud <apill...@google.com> wrote:
So you can roll back in two hours. Beam has
been broken for two weeks. Why isn't a
rollback appropriate?
On Fri, Aug 13, 2021 at 6:06 PM Luke Cwik <lc...@google.com> wrote:
From the test failures that I have seen, they have been because of
BEAM-12676 [1], which is due to a bug impacting Dataflow streaming
pipelines for the apache-beam-testing project. The fix is rolling out
now from my understanding and should take another 2hrs or so. Rolling
back master doesn't seem like what we should be doing at the moment.
1: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-12676
On Fri, Aug 13, 2021 at 5:51 PM Andrew Pilloud <apill...@google.com> wrote:
Both Java and Python precommits are reporting the last successful run
as being in July (for both Cron and Commit), so it looks like changes
are being submitted without successful test runs. We probably
shouldn't be doing that?
https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/
https://ci-beam.apache.org/job/beam_PreCommit_Python_Commit/
https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Cron/
https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Commit/
Is there a plan to get this fixed? Should we roll master back to July?
On Tue, Aug 3, 2021 at 12:24 PM Tyson Hamilton <tyso...@google.com> wrote:
I only realized after sending that I used the IP for the link; that
was by accident. Here is the proper domain link:
http://metrics.beam.apache.org/d/D81lW0pmk/post-commit-test-reliability?orgId=1
On Tue, Aug 3, 2021 at 3:22 PM Tyson Hamilton <tyso...@google.com> wrote:
The way I've investigated precommit flake stability is by looking at
the 'Post-commit Test Reliability' [1] dashboard (hah!). There is a
cron job that runs the precommits and, confusingly, those results are
tracked in the post-commit dashboard. This week, Java is about 50%
green for the pre-commit cron job, which is not great.
The plugin we installed for tracking the most flaky tests in a job
doesn't cope well with the number of tests present in the precommit
cron job. This could be an area of improvement to help add granularity
and visibility into the flakiest tests over some period of time.
[1]: http://104.154.241.245/d/D81lW0pmk/post-commit-test-reliability?orgId=1
(look for "PreCommit_Java_Cron")
On Tue, Aug 3, 2021 at 2:24 PM Andrew Pilloud <apill...@google.com> wrote:
Our metrics show that Java is nearly free from flakes, that Go has
significant flakes, and that Python is effectively broken. It appears
the metrics may be missing coverage on the Java side. The dashboard is
here:
http://104.154.241.245/d/McTAiu0ik/stability-critical-jobs-status?orgId=1
I agree that this is important to address. I haven't submitted any
code recently, but I spent a significant amount of time on the 2.31.0
release investigating flakes in the release validation tests.
Andrew
On Tue, Aug 3, 2021 at 10:43 AM Reuven Lax <re...@google.com> wrote:
I've noticed recently that our precommit tests are getting flakier and
flakier. Recently I had to run Java PreCommit 5 times before I was
able to get a clean run. This is frustrating for us as developers, but
it also is extremely wasteful of our compute resources.
I started making a list of the flaky tests I've seen. Here are some of
the ones I've dealt with in just the past few days; this is not nearly
an exhaustive list - I've seen many others before I started recording
them. Of the below, failures in ElasticsearchIOTest are by far the
most common!
We need to try and make these tests not flaky. Barring that, I think
the extremely flaky tests need to be excluded from our presubmit until
they can be fixed. Rerunning the precommit over and over again till
green is not a good testing strategy.
* org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: false]
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3901/testReport/junit/org.apache.beam.runners.flink/ReadSourcePortableTest/testExecution_streaming__false_/>
* org.apache.beam.sdk.io.jms.JmsIOTest.testCheckpointMarkSafety
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/18485/testReport/junit/org.apache.beam.sdk.io.jms/JmsIOTest/testCheckpointMarkSafety/>
* org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInFinishBundleStateful
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3903/testReport/junit/org.apache.beam.sdk.transforms/ParDoLifecycleTest/testTeardownCalledAfterExceptionInFinishBundleStateful/>
* org.apache.beam.sdk.io.elasticsearch.ElasticsearchIOTest.testSplit
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Phrase/3903/testReport/junit/org.apache.beam.sdk.io.elasticsearch/ElasticsearchIOTest/testSplit/>
* org.apache.beam.sdk.io.gcp.datastore.RampupThrottlingFnTest.testRampupThrottler
  <https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/18501/testReport/junit/org.apache.beam.sdk.io.gcp.datastore/RampupThrottlingFnTest/testRampupThrottler/>
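(A minimal sketch of the "exclude until fixed" approach described
above, assuming JUnit 4; the class name, test name, and BEAM-NNNNN
issue ID are placeholders, not entries from the list above.)

import org.junit.Ignore;
import org.junit.Test;

// Hypothetical flaky test; the name and issue ID below are placeholders.
public class ExampleFlakyTest {

  // Temporarily drop a known-flaky test from precommit, keeping a tracking
  // issue in the reason string so re-enabling it is not forgotten.
  @Ignore("Flaky in precommit; tracked in BEAM-NNNNN, re-enable once fixed")
  @Test
  public void flakyScenario() {
    // ... original test body stays in place so it can be re-enabled easily ...
  }
}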