Are there e2e tests that run on Kubernetes? Perhaps k8s network policies [1] would be a more approachable option for simulating asymmetric network partitions than modifying iptables?
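As a rough (untested) sketch, a policy like the following could cut TaskManager-to-TaskManager traffic while leaving JobManager-to-TaskManager traffic intact; the `component: taskmanager` / `component: jobmanager` labels are made up here and would have to match whatever labels our deployment actually uses:

```yaml
# Hypothetical sketch: isolate TaskManagers from each other while still
# allowing JobManager <-> TaskManager traffic. Pod label names are
# assumptions, not what Flink's k8s setup necessarily uses.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: partition-tm-from-tm
spec:
  podSelector:
    matchLabels:
      component: taskmanager
  policyTypes:
    - Ingress
  ingress:
    # Only the JobManager may reach a TaskManager; ingress from other
    # TaskManagers is dropped, which should look like an asymmetric
    # partition from the TMs' point of view.
    - from:
        - podSelector:
            matchLabels:
              component: jobmanager
```

Applying and deleting such a policy from the test harness would correspond to triggering and healing the partition, without touching the host's iptables directly (though it does require a CNI plugin that enforces network policies).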
Austin

[1]: https://kubernetes.io/docs/concepts/services-networking/network-policies/

On Wed, Feb 9, 2022 at 12:40 PM David Morávek <d...@apache.org> wrote:

> Network partitions are trickier than simply crashing a process. For
> example, these can be asymmetric: as a TM you're still able to talk to the
> JM, but you're not able to talk to other TMs.
>
> In general this could be achieved by manipulating iptables on the host
> machine (considering we spawn all the processes locally), but I'm not sure
> that will solve the "make it less complicated for others to contribute"
> part :/ Also, this kind of test would be executable on *nix systems only.
>
> I assume that Jepsen uses the same approach under the hood.
>
> D.
>
> On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler <ches...@apache.org>
> wrote:
>
> > b/c are part of the same test:
> >
> > * We have a job running,
> > * trigger a network partition (failing the job),
> > * then crash HDFS (preventing checkpoints and access to the HA
> >   storageDir),
> > * then the partition is resolved and HDFS is started again.
> >
> > Conceptually I would think we can replicate this by nuking half the
> > cluster, crashing HDFS/ZK, and restarting everything.
> >
> > On 09/02/2022 17:39, Chesnay Schepler wrote:
> > > The Jepsen tests cover 3 cases:
> > > a) JM/TM crashes
> > > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > > c) network partitions
> > >
> > > a) can be (and probably is) reasonably covered by existing ITCases
> > > and e2e tests
> > > b) We could probably figure this out ourselves if we wanted to.
> > > c) is the difficult part.
> > >
> > > Note that the tests also only cover YARN (per-job/session) and
> > > standalone (session) deployments.
> > >
> > > On 09/02/2022 17:11, Konstantin Knauf wrote:
> > >> Thank you for raising this issue. What risks do you see if we drop
> > >> it? Do you see any cheaper alternative to (partially) mitigate those
> > >> risks?
> > >>
> > >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler
> > >> <ches...@apache.org> wrote:
> > >>
> > >>> For a few years now we have had a set of Jepsen tests that verify
> > >>> the correctness of Flink's coordination layer in the case of
> > >>> process crashes.
> > >>> In the past they have indeed found issues and thus provided value
> > >>> to the project, and in general the core idea behind them (and
> > >>> Jepsen for that matter) is very sound.
> > >>>
> > >>> However, so far we have neither made attempts to make further use
> > >>> of Jepsen (we limited ourselves to very basic tests) nor to
> > >>> familiarize ourselves with the tests/Jepsen at all.
> > >>> As a result these tests are difficult to maintain. They (and
> > >>> Jepsen) are written in Clojure, which makes debugging, changes, and
> > >>> upstreaming contributions very difficult.
> > >>> Additionally, the tests also make use of a very complicated
> > >>> (Ververica-internal) terraform+ansible setup to spin up and tear
> > >>> down AWS machines. While it works (and is actually pretty cool),
> > >>> it's difficult to adjust because the people who wrote it have left
> > >>> the company.
> > >>>
> > >>> The reason I'm raising this now (and not earlier) is that so far
> > >>> keeping the tests running wasn't much of a problem; bump a few
> > >>> dependencies here and there and we're good to go.
> > >>>
> > >>> However, this has changed with the recent upgrade to ZooKeeper 3.5,
> > >>> which isn't supported by Jepsen out-of-the-box, completely breaking
> > >>> the tests. We'd now have to write a new ZooKeeper 3.5+ integration
> > >>> for Jepsen (again, in Clojure). While I started working on that and
> > >>> could likely finish it, I started to wonder whether it even makes
> > >>> sense to do so, and whether we couldn't invest this time elsewhere.
> > >>>
> > >>> Let me know what you think.