Thank you for raising this issue. What risks do you see if we drop it? Do
you see any cheaper alternative to (partially) mitigate those risks?

On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler <ches...@apache.org> wrote:

> For a few years now we have had a set of Jepsen tests that verify the
> correctness of Flink's coordination layer in the case of process crashes.
> In the past they have indeed found issues and thus provided value to the
> project, and in general the core idea behind them (and Jepsen for that
> matter) is very sound.
>
> However, so far we have neither tried to make further use of Jepsen
> (limiting ourselves to very basic tests) nor familiarized ourselves
> with tests/jepsen at all.
> As a result these tests are difficult to maintain. They (and Jepsen) are
> written in Clojure, which makes debugging, changes and upstreaming
> contributions very difficult.
> Additionally, the tests also make use of a very complicated
> (Ververica-internal) terraform+ansible setup to spin up and tear down
> AWS machines. While it works (and is actually pretty cool), it's
> difficult to adjust because the people who wrote it have left the company.
>
> I'm raising this now (and not earlier) because so far keeping the
> tests running wasn't much of a problem; bump a few dependencies here and
> there and we're good to go.
>
> However, this has changed with the recent upgrade to ZooKeeper 3.5,
> which isn't supported by Jepsen out-of-the-box, completely breaking the
> tests. We'd now have to write a new ZooKeeper 3.5+ integration for
> Jepsen (again, in Clojure). While I started working on that and could
> likely finish it, I started to wonder whether it even makes sense to do
> so, and whether we couldn't invest this time elsewhere.
>
> Let me know what you think.
>
>

-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
