Re: [DISCUSS] Drop Jepsen tests

Yang Wang Wed, 09 Feb 2022 18:33:38 -0800

@Austin
We already have some e2e tests[1] that guard k8s deployments (both session
and application mode, with or without HA).
And I agree with you that network partitions could be simulated with a K8s
network policy.
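
To make the network-policy idea concrete, here is a minimal sketch (not taken
from the Flink test suite) that builds such a policy in Python. It assumes the
pods carry `app=<cluster-id>` and `component=taskmanager` /
`component=jobmanager` labels; the function name, policy name and cluster id
are hypothetical. The policy restricts ingress on TaskManager pods to traffic
coming from JobManager pods, so a TM can still talk to the JM but not to other
TMs. It only has an effect if the cluster's network plugin enforces
NetworkPolicy.

```python
import json


def tm_isolation_policy(cluster_id, namespace="default"):
    """Build a NetworkPolicy manifest (as a dict) for an asymmetric partition.

    Assumption: pods are labelled app=<cluster_id> and
    component=taskmanager / component=jobmanager.
    """
    tm_labels = {"app": cluster_id, "component": "taskmanager"}
    jm_labels = {"app": cluster_id, "component": "jobmanager"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {
            # Hypothetical name, chosen only for this sketch.
            "name": f"{cluster_id}-tm-partition",
            "namespace": namespace,
        },
        "spec": {
            # The policy applies to TaskManager pods only.
            "podSelector": {"matchLabels": tm_labels},
            # Only ingress is restricted; TM egress (e.g. to the JM) stays open.
            "policyTypes": ["Ingress"],
            # Allow-list: TaskManagers accept traffic from JobManager pods only,
            # so TM-to-TM connections are dropped.
            "ingress": [
                {"from": [{"podSelector": {"matchLabels": jm_labels}}]}
            ],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests, so the output can be piped into it.
    print(json.dumps(tm_isolation_policy("my-flink-cluster"), indent=2))
```

The output could be applied with `kubectl apply -f -` to trigger the partition
and removed with `kubectl delete networkpolicy <name>` to heal it.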



[1].
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_kubernetes_application_ha.sh

Best,
Yang

Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Feb 10, 2022 at 05:12:

> Are there e2e tests that run on Kubernetes? Perhaps k8s network policies[1]
> would be a more approachable way to simulate asymmetric network partitions,
> without modifying iptables?
>
> Austin
>
> [1]:
> https://kubernetes.io/docs/concepts/services-networking/network-policies/
>
>
> On Wed, Feb 9, 2022 at 12:40 PM David Morávek <d...@apache.org> wrote:
>
> > Network partitions are trickier than simply crashing a process. For
> > example, they can be asymmetric: as a TM you're still able to talk to the
> > JM, but you're not able to talk to other TMs.
> >
> > In general this could be achieved by manipulating iptables on the host
> > machine (considering we spawn all the processes locally), but not sure if
> > that will solve the "make it less complicated for others to contribute"
> > part :/ Also this kind of test would be executable on *nix systems only.
> >
> > I assume that Jepsen uses the same approach under the hood.
> >
> > D.
> >
> > On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> > > b) and c) are part of the same test.
> > >
> > >   * We have a job running,
> > >   * trigger a network partition (failing the job),
> > >   * then crash HDFS (preventing checkpoints and access to the HA
> > >     storageDir),
> > >   * then the partition is resolved and HDFS is started again.
> > >
> > > Conceptually I would think we can replicate this by nuking half the
> > > cluster, crashing HDFS/ZK, and restarting everything.
> > >
> > > On 09/02/2022 17:39, Chesnay Schepler wrote:
> > > > The jepsen tests cover 3 cases:
> > > > a) JM/TM crashes
> > > > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > > > c) network partitions
> > > >
> > > > > a) can be (and probably is) reasonably covered by existing ITCases
> > > > > and e2e tests
> > > > b) We could probably figure this out ourselves if we wanted to.
> > > > c) is the difficult part.
> > > >
> > > > Note that the tests also only cover yarn (per-job/session) and
> > > > standalone (session) deployments.
> > > >
> > > > On 09/02/2022 17:11, Konstantin Knauf wrote:
> > > > >> Thank you for raising this issue. What risks do you see if we drop
> > > > >> it? Do you see any cheaper alternative to (partially) mitigate
> > > > >> those risks?
> > > >>
> > > > >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler <ches...@apache.org>
> > > > >> wrote:
> > > >>
> > > > >>> For a few years now we have had a set of Jepsen tests that verify
> > > > >>> the correctness of Flink's coordination layer in the case of process
> > > > >>> crashes.
> > > > >>> In the past they have indeed found issues and thus provided value to
> > > > >>> the project, and in general the core idea behind them (and Jepsen
> > > > >>> for that matter) is very sound.
> > > >>>
> > > > >>> However, so far we have neither attempted to make further use of
> > > > >>> Jepsen (limiting ourselves to very basic tests) nor familiarized
> > > > >>> ourselves with the tests or Jepsen at all.
> > > > >>> As a result these tests are difficult to maintain. They (and Jepsen)
> > > > >>> are written in Clojure, which makes debugging, changes and
> > > > >>> upstreaming contributions very difficult.
> > > > >>> Additionally, the tests make use of a very complicated
> > > > >>> (Ververica-internal) terraform+ansible setup to spin up and tear
> > > > >>> down AWS machines. While it works (and is actually pretty cool),
> > > > >>> it's difficult to adjust because the people who wrote it have left
> > > > >>> the company.
> > > >>>
> > > > >>> I'm raising this now (and not earlier) because so far keeping the
> > > > >>> tests running wasn't much of a problem; bump a few dependencies here
> > > > >>> and there and we were good to go.
> > > >>>
> > > > >>> However, this has changed with the recent upgrade to ZooKeeper 3.5,
> > > > >>> which isn't supported by Jepsen out of the box, completely breaking
> > > > >>> the tests. We'd now have to write a new ZooKeeper 3.5+ integration
> > > > >>> for Jepsen (again, in Clojure). While I started working on that and
> > > > >>> could likely finish it, I started to wonder whether it even makes
> > > > >>> sense to do so, and whether we couldn't invest this time elsewhere.
> > > >>>
> > > >>> Let me know what you think.
> > > >>>
> > > >>>
> > > >
> > >
> >
>
