Re: [DISCUSS] Drop Jepsen tests

Chesnay Schepler Wed, 09 Feb 2022 08:43:56 -0800

b) and c) are part of the same test:

 * We have a job running,
 * trigger a network partition (failing the job),
 * then crash HDFS (preventing checkpoints and access to the HA
   storageDir),
 * then the partition is resolved and HDFS is started again.

Conceptually I would think we can replicate this by nuking half the cluster, crashing HDFS/ZK, and restarting everything.
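To make the ordering of the steps above concrete, here is a toy sketch of the fault sequence. It is purely illustrative: the `Cluster` class and all of its methods are hypothetical stand-ins that only flip in-memory flags, and it does not talk to Flink, HDFS, ZooKeeper, or Jepsen. The assumption that recovery needs both a healed network and a reachable HDFS is taken from the description in this mail.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy in-memory model (hypothetical); not a real Flink/HDFS/ZK client."""
    partitioned: bool = False
    hdfs_up: bool = True
    job_running: bool = True
    log: list = field(default_factory=list)

    def partition_network(self):
        self.partitioned = True
        self.job_running = False  # the partition fails the job
        self.log.append("partition")

    def crash_hdfs(self):
        self.hdfs_up = False  # no checkpoints, no access to the HA storageDir
        self.log.append("crash_hdfs")

    def heal_network(self):
        self.partitioned = False
        self.log.append("heal")

    def start_hdfs(self):
        self.hdfs_up = True
        self.log.append("start_hdfs")

    def try_recover(self):
        # Assumption: recovery needs both connectivity and HA storage.
        if not self.partitioned and self.hdfs_up:
            self.job_running = True
        return self.job_running


def run_scenario(cluster):
    """Apply the faults in the order listed above, then resolve them."""
    cluster.partition_network()
    cluster.crash_hdfs()
    recovered_early = cluster.try_recover()  # expected to fail at this point
    cluster.heal_network()
    cluster.start_hdfs()
    return recovered_early, cluster.try_recover()
```

A real replacement test would swap these flag flips for actual process kills and restarts; the sketch only pins down the sequence and the condition being checked at the end.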


On 09/02/2022 17:39, Chesnay Schepler wrote:

The jepsen tests cover 3 cases:
a) JM/TM crashes
b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
c) network partitions
a) can be (and probably is) reasonably covered by existing ITCases and e2e tests
b) We could probably figure this out ourselves if we wanted to.
c) is the difficult part.
Note that the tests also only cover yarn (per-job/session) and standalone (session) deployments.

On 09/02/2022 17:11, Konstantin Knauf wrote:

Thank you for raising this issue. What risks do you see if we drop it? Do
you see any cheaper alternative to (partially) mitigate those risks?

On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler <[email protected]> wrote:

For a few years now we have had a set of Jepsen tests that verify the
correctness of Flink's coordination layer in the case of process crashes.
In the past it has indeed found issues and thus provided value to the
project, and in general the core idea of it (and Jepsen for that matter)
is very sound.

However, so far we have neither attempted to make further use of Jepsen
(limiting ourselves to very basic tests) nor familiarized ourselves
with the tests/Jepsen at all.
As a result these tests are difficult to maintain. They (and Jepsen) are
written in Clojure, which makes debugging, changes and upstreaming
contributions very difficult.
Additionally, the tests also make use of a very complicated
(Ververica-internal) terraform+ansible setup to spin up and tear down
AWS machines. While it works (and is actually pretty cool), it's
difficult to adjust because the people who wrote it have left the company.

The reason I'm raising this now (and not earlier) is that so far keeping the
tests running wasn't much of a problem; bump a few dependencies here and
there and we're good to go.

However, this has changed with the recent upgrade to Zookeeper 3.5,
which isn't supported by Jepsen out-of-the-box, completely breaking the
tests. We'd now have to write a new Zookeeper 3.5+ integration for
Jepsen (again, in Clojure). While I started working on that and could
likely finish it, I started to wonder whether it even makes sense to do
so, and whether we couldn't invest this time elsewhere.

Let me know what you think.
