Hi all,

+1 on pinging the assigned person.

The flakes I know of (ESIO and CassandraIO) are due to load on the CI server. These IOs are tested using real embedded backends because those backends are complex and we need relevant tests.

Countermeasures have been taken (retries inside the tests that are sensitive to load, asserting on ranges of acceptable values rather than exact numbers, calling internal backend mechanisms to force a refresh when load prevented the backend from doing so on its own, ...).
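To make those countermeasures concrete, here is a minimal JUnit sketch of the pattern; the client, test, and bound names are all made up (this is not the actual ESIO/CassandraIO test code):

    import static org.hamcrest.MatcherAssert.assertThat;
    import static org.hamcrest.Matchers.allOf;
    import static org.hamcrest.Matchers.greaterThanOrEqualTo;
    import static org.hamcrest.Matchers.lessThanOrEqualTo;

    import org.junit.Test;

    public class BackendSizeEstimateTest {

      // Accept a range instead of an exact value: estimates drift when the
      // CI server is under load. These bounds are illustrative only.
      private static final long MIN_EXPECTED_BYTES = 900L;
      private static final long MAX_EXPECTED_BYTES = 1100L;

      @Test
      public void testEstimatedSizeBytes() {
        FakeBackendClient client = new FakeBackendClient();

        // Ask the embedded backend to refresh its internal state, in case
        // CI load prevented it from doing so on its own.
        client.forceRefresh();

        assertThat(
            client.estimatedSizeBytes(),
            allOf(
                greaterThanOrEqualTo(MIN_EXPECTED_BYTES),
                lessThanOrEqualTo(MAX_EXPECTED_BYTES)));
      }

      /** Stand-in for the embedded backend client used by the real tests. */
      static class FakeBackendClient {
        void forceRefresh() { /* the real client would flush/refresh here */ }

        long estimatedSizeBytes() {
          return 1000L; // The real backend computes this from its segments.
        }
      }
    }

The point is that the assertion tolerates load-induced variance, while the forced refresh removes one known source of it.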

I recently got pinged by Ahmet (thanks to him!) about a flaky test that I had not seen. This seems to me the correct way to go. Systematically retrying tests with a CI mechanism, or disabling tests, seems to me a risky workaround that just gets the problem off our minds.

Etienne

On 20/07/2020 20:58, Brian Hulette wrote:
> I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what a good policy would be. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky-test issue that was assigned to me; a ping like that would have reminded me. And if an assignee cannot help or does not have the time, we can try to find a new assignee.

Yeah, I think this is something we should address. With the new Jira automation, assignees should at least get an email notification after 30 days, via a Jira comment like [1], but 30 days is too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?

That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.

[1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918

On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:

    Another idea: could we change our "Retest X" phrases to "Retest X (Reason)" phrases? With this change, a PR author would have to look at the failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for flakiness that was not noted before, or ping an existing JIRA issue / raise its severity. On the downside, this will require PR authors to do more.

    On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tyso...@google.com> wrote:

        Adding retries can be beneficial in two ways: unblocking a PR, and collecting metrics about the flakes.


    Makes sense. I think we will still need to have a plan to remove retries, similar to re-enabling disabled tests.


        If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.

        The test status matrix on the GitHub landing page could show the flake level, to communicate to users which modules are losing a trustworthy test signal. Maybe this shows up as a flake %, or a code coverage % that decreases due to disabled flaky tests.


    +1 to a dashboard that will show a "leaderboard" of flaky tests.


        I didn't look for plugins, just dreaming up some options.




        On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:

            What do other Apache projects do to address this issue?

            On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:

                I agree with the comments in this thread.
                - If we are not re-enabling disabled tests, and do not have a plan to re-enable them, disabling tests only provides temporary relief until users eventually find the issues the disabled tests would have caught.
                - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries just to avoid flakes is similar to disabling tests: they might hide real issues.

                I think we are missing a way of checking that we are making progress on P1 issues. For example, P0 issues block releases, and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what a good policy would be. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky-test issue that was assigned to me; a ping like that would have reminded me. And if an assignee cannot help or does not have the time, we can try to find a new assignee.

                Ahmet


                On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valen...@google.com> wrote:

                    I think the original discussion[1] on introducing tenacity might answer that question.

                    [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E

                    On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruw...@google.com> wrote:

                        Is there an observation that enabling tenacity improves the development experience on the Python SDK? E.g. less wait time to get a PR passing and merged? Or is it a matter of picking the right number of retries to match the "flakiness" of a test?

                        -Rui

                        On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valen...@google.com> wrote:

                            We used tenacity[1] to retry some unit tests for which we understood the nature of the flakiness.

                            [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156

                            On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <k...@apache.org> wrote:

                                Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test, but we need a process to remove the retries ASAP. As Luke says, that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help. For the Java suites, one way to keep a retry visible and easy to remove is to make it explicit in the test code; see the sketch below.
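                                A minimal sketch of such a retry, assuming JUnit 4 (a hypothetical class, not an existing Beam utility):

    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    /**
     * Retries a flaky test a fixed number of times. Sketch only: each usage
     * should cite the tracking Jira so the retry is removed once it is fixed.
     */
    public class RetryFlakyRule implements TestRule {
      private final int maxAttempts;

      public RetryFlakyRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
      }

      @Override
      public Statement apply(final Statement base, final Description description) {
        return new Statement() {
          @Override
          public void evaluate() throws Throwable {
            Throwable lastFailure = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
              try {
                base.evaluate();
                return; // Passed; stop retrying.
              } catch (Throwable t) {
                lastFailure = t;
                System.err.println(
                    description.getDisplayName() + ": attempt " + attempt + " failed");
              }
            }
            throw lastFailure; // Retries exhausted; surface the real failure.
          }
        };
      }
    }

                                A test class would then declare @Rule public RetryFlakyRule retry = new RetryFlakyRule(3); next to a comment naming the Jira that tracks removing it.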

                                Kenn

                                On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:

                                    I don't think I have seen tests that were previously disabled become re-enabled.

                                    It seems as though we have ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features, so they are unrelated to flakiness.

                                    On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <g...@spotify.com> wrote:

                                        There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and it has different modes for handling flaky tests. Did we ever try or consider using it?

                                        [1]: https://github.com/gradle/test-retry-gradle-plugin

                                        On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <g...@spotify.com> wrote:

                                            I agree with what Ahmet is saying. I can share my perspective: recently I had to retrigger a build 6 times due to flaky tests, and each retrigger took one hour of waiting time.

                                            I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if it both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically; the idea is sketched below.
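                                            If per-run results (git SHA, test name, pass/fail) were exported somewhere queryable, the detection itself would be simple. A Java sketch, with a made-up input format:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeSet;

    /** Flags a test as flaky if it both passed and failed at the same git SHA. */
    public class FlakeDetector {

      /** One test outcome from one CI run (hypothetical exported record). */
      public static class Outcome {
        final String sha;
        final String testName;
        final boolean passed;

        public Outcome(String sha, String testName, boolean passed) {
          this.sha = sha;
          this.testName = testName;
          this.passed = passed;
        }
      }

      public static Set<String> flakyTests(List<Outcome> outcomes) {
        // Collect the distinct pass/fail results seen per (sha, test) pair.
        Map<String, Set<Boolean>> seen = new HashMap<>();
        for (Outcome o : outcomes) {
          seen.computeIfAbsent(o.sha + "|" + o.testName, k -> new HashSet<>()).add(o.passed);
        }
        // Any pair with both a pass and a failure marks the test as flaky.
        Set<String> flaky = new TreeSet<>();
        for (Map.Entry<String, Set<Boolean>> e : seen.entrySet()) {
          if (e.getValue().size() == 2) {
            flaky.add(e.getKey().split("\\|", 2)[1]);
          }
        }
        return flaky;
      }
    }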

                                            /Gleb

                                            On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:

                                                I think it would be reasonable to disable/sickbay any flaky test that is actively blocking people. The collective cost of flaky tests for such a large group of contributors is very significant.

                                                Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (whoever added the test, or whoever generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide a valuable signal.

                                                Ahmet

                                                On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <k...@apache.org> wrote:

                                                    The situation is much worse than that, IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.

                                                    Summarized on PRs:

                                                     - https://github.com/apache/beam/pull/12272#issuecomment-659050891
                                                     - https://github.com/apache/beam/pull/12273#issuecomment-659070317
                                                     - https://github.com/apache/beam/pull/12225#issuecomment-656973073
                                                     - https://github.com/apache/beam/pull/12225#issuecomment-657743373
                                                     - https://github.com/apache/beam/pull/12224#issuecomment-657744481
                                                     - https://github.com/apache/beam/pull/12216#issuecomment-657735289
                                                     - https://github.com/apache/beam/pull/12216#issuecomment-657780781
                                                     - https://github.com/apache/beam/pull/12216#issuecomment-657799415

                                                    The tickets:

                                                     - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
                                                     - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
                                                     - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
                                                     - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
                                                     - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
                                                     - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
                                                     - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
                                                     - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
                                                     - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest

                                                    Here are our P1 test flake bugs (the query: project = BEAM AND status in (Open, "In Progress") AND resolution = Unresolved AND labels = flake ORDER BY priority DESC, updated DESC):
                                                    https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC

                                                    It seems quite a few of them are actively hindering people right now.

                                                    Kenn

                                                    On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apill...@google.com> wrote:

                                                        We have two test suites that are responsible for a large percentage of our flaky tests, and both have had bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in Python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).

                                                        Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?

                                                        Andrew
