Dear Pulsar community,

After inspecting many of the flaky tests in Pulsar and performing some
research on the issue, I discovered that there are some patterns that will
allow us to remedy many of our flaky tests. I bring your attention to this
publication:

Luo, Q., Hariri, F., Eloussi, L., & Marinov, D. (2014, November). An
empirical analysis of flaky tests. In *Proceedings of the 22nd ACM SIGSOFT
International Symposium on Foundations of Software Engineering* (pp.
643-653). Available at:
http://mir.cs.illinois.edu/~eloussi2/publications/fse14.pdf


In this study, Luo et al. discovered that the top three categories of flaky
tests are:

   - ASYNC WAIT
   - CONCURRENCY
   - TEST ORDER DEPENDENCY

From my anecdotal observations of our flaky Pulsar tests, it appears that
the *vast majority of our flaky tests are in the ASYNC WAIT category*. In
the resource-constrained test runner, these flaky tests are testing that
behavior is correct when certain timing rules (such as timeouts) are
involved. The problem is that the resource-constrained test runner
environment creates additional pressure for frequent stop-the-world garbage
collection events, which are non-deterministic and break the timing
assumptions made in these ASYNC WAIT tests. Compounding the matter, large
volumes of disposable objects are constructed by our tests, resulting in
long garbage collection pauses. Although sharing more dependencies between
tests could reduce the accumulation of disposable objects, it increases the
risk of encountering CONCURRENCY and TEST ORDER DEPENDENCY flaky tests,
which usually fail due to shared state.

Luo et al. define ASYNC WAIT flaky tests like this:

"We classify a commit into the Async Wait category when the test execution
makes an asynchronous call and does not properly wait for the result of the
call to become available before using it. For example, a test (or the [code
under test] CUT) can spawn a separate service, e.g., a remote server or
another thread, and there is no proper synchronization to wait for that
service to be available before proceeding with the execution. Based on
whether the result becomes available before (or after) it is used, the test
can non-deterministically pass (or fail)."

Regarding the solution, they state, "The key to fixing Async Wait flaky
tests is to address the order violation between different threads or
processes," and they found that 57% of the ASYNC WAIT flaky tests they
studied could be fixed by adding a call to a waitFor operation that blocks
the current thread until a certain condition is reached.

In Pulsar, due to the asynchronous nature of the framework, solving ASYNC
WAIT flaky tests may require evaluating how we can replace our dependence
on sleep operations in tests with a mechanism that lets us detect more
deterministically that a condition has occurred. This raises an
architectural question: perhaps we need an additional mechanism for
locating a message in a particular flow so that we can determine whether
the condition has occurred. (Repeatedly checking with a long timeout is
better than assuming the condition will or won't occur within a particular
sleep.)
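To make the idea concrete, here is a minimal sketch of what such a polling
waitFor helper could look like. The names (WaitFor, waitFor) and the
timeout/poll values are illustrative only, not an existing Pulsar API; in
practice a library such as Awaitility provides this same pattern with more
features.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WaitFor {
    /**
     * Polls {@code condition} every {@code pollMillis} until it returns true
     * or {@code timeoutMillis} elapses. Returns true if the condition held.
     * The timeout can be generous: a passing test returns as soon as the
     * condition holds, so a long timeout only costs time when the test fails.
     */
    public static boolean waitFor(BooleanSupplier condition,
                                  long timeoutMillis,
                                  long pollMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(pollMillis);
        }
        // One final check at the deadline before giving up.
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical async work: another thread flips a flag after ~100 ms.
        final boolean[] done = {false};
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ignored) {
            }
            done[0] = true;
        });
        worker.start();

        // Instead of Thread.sleep(500) and hoping the timing holds under GC
        // pressure, poll the condition with a long timeout.
        boolean observed = waitFor(() -> done[0], 5000, 10);
        System.out.println(observed ? "condition reached" : "timed out");
        worker.join();
    }
}
```

The key contrast with a fixed sleep is that the test's pass/fail outcome no
longer depends on how long a stop-the-world pause happens to last; it only
depends on whether the condition is eventually reached within a deliberately
generous deadline.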

Regardless, the paper also identifies many other common causes of flaky
tests and their remedies. I think this paper may be a helpful reference
when working on flaky tests in Pulsar.

Devin G. Bost
