Are there e2e tests that run on Kubernetes? Perhaps k8s network policies[1]
would be a more approachable option for simulating asymmetric network
partitions without modifying iptables?
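
Something along these lines, maybe (just a sketch using the official
kubernetes Python client; the namespace and the component=jobmanager /
component=taskmanager labels are made up, and it only takes effect if the
cluster's CNI plugin enforces NetworkPolicies):

from kubernetes import client, config

config.load_kube_config()
networking = client.NetworkingV1Api()

# Select all TM pods and only allow ingress from the JM: TM<->JM traffic
# keeps flowing, while TM->TM connections are dropped, which is roughly the
# asymmetric partition David describes below.
policy = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="partition-taskmanagers"),
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(
            match_labels={"component": "taskmanager"}),
        policy_types=["Ingress"],
        ingress=[client.V1NetworkPolicyIngressRule(
            _from=[client.V1NetworkPolicyPeer(
                pod_selector=client.V1LabelSelector(
                    match_labels={"component": "jobmanager"}))])],
    ),
)
networking.create_namespaced_network_policy(namespace="flink", body=policy)

# Healing the partition is just deleting the policy again:
# networking.delete_namespaced_network_policy("partition-taskmanagers", "flink")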

Austin

[1]:
https://kubernetes.io/docs/concepts/services-networking/network-policies/


On Wed, Feb 9, 2022 at 12:40 PM David Morávek <d...@apache.org> wrote:

> Network partitions are trickier than simply crashing a process. For example,
> they can be asymmetric -> as a TM you're still able to talk to the JM, but
> not to other TMs.
>
> In general this could be achieved by manipulating iptables on the host
> machine (considering we spawn all the processes locally), but I'm not sure if
> that will solve the "make it less complicated for others to contribute"
> part :/ Also this kind of test would only be executable on *nix systems.
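> 
> Roughly what I have in mind (just a sketch; the port is a placeholder, it
> needs root on a *nix host, and it assumes the TMs exchange data over a port
> that is separate from the RPC port they use to reach the JM):
> 
> import subprocess
> 
> # Hypothetical data port of TM2 (the port other TMs connect to).
> TM2_DATA_PORT = 6121
> 
> def block_incoming(dst_port: int) -> None:
>     """Drop inbound TCP packets to dst_port; connections *towards* that
>     port fail, while traffic in the other direction is untouched."""
>     subprocess.run(
>         ["iptables", "-A", "INPUT", "-p", "tcp",
>          "--dport", str(dst_port), "-j", "DROP"],
>         check=True)
> 
> def heal(dst_port: int) -> None:
>     """Remove the DROP rule again."""
>     subprocess.run(
>         ["iptables", "-D", "INPUT", "-p", "tcp",
>          "--dport", str(dst_port), "-j", "DROP"],
>         check=True)
> 
> # Other TMs can no longer reach TM2's data port, but TM2 can still open
> # connections to them and the RPC link to the JM keeps working.
> block_incoming(TM2_DATA_PORT)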
>
> I assume that Jepsen uses the same approach under the hood.
>
> D.
>
> On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > b) and c) are part of the same test.
> >
> >   * We have a job running,
> >   * trigger a network partition (failing the job),
> >   * then crash HDFS (preventing checkpoints and access to the HA
> >     storageDir),
> >   * then the partition is resolved and HDFS is started again.
> >
> > Conceptually I would think we can replicate this by nuking half the
> > cluster, crashing HDFS/ZK, and restarting everything.
> >
> > On 09/02/2022 17:39, Chesnay Schepler wrote:
> > > The jepsen tests cover 3 cases:
> > > a) JM/TM crashes
> > > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > > c) network partitions
> > >
> > > a) can be (and probably is) reasonably covered by existing ITCases and
> > > e2e tests
> > > b) We could probably figure this out ourselves if we wanted to.
> > > c) is the difficult part.
> > >
> > > Note that the tests also only cover YARN (per-job/session) and
> > > standalone (session) deployments.
> > >
> > > On 09/02/2022 17:11, Konstantin Knauf wrote:
> > >> Thank you for raising this issue. What risks do you see if we drop
> > >> it? Do
> > >> you see any cheaper alternative to (partially) mitigate those risks?
> > >>
> > >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler <ches...@apache.org>
> > >> wrote:
> > >>
> > >>> For a few years now we have had a set of Jepsen tests that verify the
> > >>> correctness of Flink's coordination layer in the case of process
> > >>> crashes.
> > >>> In the past they have indeed found issues and thus provided value to the
> > >>> project, and in general the core idea behind them (and Jepsen for that
> > >>> matter) is very sound.
> > >>>
> > >>> However, so far we have neither made attempts to make further use of
> > >>> Jepsen (limiting ourselves to very basic tests) nor familiarized
> > >>> ourselves with the tests/Jepsen at all.
> > >>> As a result these tests are difficult to maintain. They (and Jepsen)
> > >>> are written in Clojure, which makes debugging, changes and upstreaming
> > >>> contributions very difficult.
> > >>> Additionally, the tests make use of a very complicated
> > >>> (Ververica-internal) terraform+ansible setup to spin up and tear down
> > >>> AWS machines. While it works (and is actually pretty cool), it's
> > >>> difficult to adjust because the people who wrote it have left the
> > >>> company.
> > >>>
> > >>> The reason I'm raising this now (and not earlier) is that so far
> > >>> keeping the tests running wasn't much of a problem: bump a few
> > >>> dependencies here and there and we're good to go.
> > >>>
> > >>> However, this has changed with the recent upgrade to ZooKeeper 3.5,
> > >>> which isn't supported by Jepsen out-of-the-box, completely breaking the
> > >>> tests. We'd now have to write a new ZooKeeper 3.5+ integration for
> > >>> Jepsen (again, in Clojure). While I started working on that and could
> > >>> likely finish it, I started to wonder whether it even makes sense to do
> > >>> so, and whether we couldn't invest this time elsewhere.
> > >>>
> > >>> Let me know what you think.
> > >>>
> > >>>
> > >
> >
>
