@Austin We already have some e2e tests[1] that guard K8s deployments (both session and application, with or without HA). And I agree with you that network partitions could be simulated with K8s network policies.
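
To make the network policy idea concrete, here is a minimal sketch of an asymmetric partition in which TaskManagers can still talk to the JobManager but not to each other. It only builds the manifest and prints it as JSON. The function name, the policy name, and the `app`/`component` pod labels are assumptions for illustration and would need to match whatever labels the deployment under test actually sets. Enforcement also depends on the cluster running a network plugin that implements NetworkPolicy.

```python
import json


def taskmanager_isolation_policy(cluster_id: str, namespace: str = "default") -> dict:
    """Build a NetworkPolicy that only lets JobManager pods reach TaskManager pods.

    TM -> TM connections are dropped (ingress to a TM from another TM is not
    allowed), while TM -> JM and JM -> TM traffic keeps working.
    """
    # Assumed labels: adjust to the labels present on the pods under test.
    tm_labels = {"app": cluster_id, "component": "taskmanager"}
    jm_labels = {"app": cluster_id, "component": "jobmanager"}
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{cluster_id}-tm-partition", "namespace": namespace},
        "spec": {
            # The policy applies to the TaskManager pods only.
            "podSelector": {"matchLabels": tm_labels},
            # Only ingress is restricted; egress from the TMs stays open.
            "policyTypes": ["Ingress"],
            # Allow-list: traffic from JobManager pods is the only ingress permitted.
            "ingress": [{"from": [{"podSelector": {"matchLabels": jm_labels}}]}],
        },
    }


if __name__ == "__main__":
    # "my-flink-cluster" is a placeholder cluster id.
    print(json.dumps(taskmanager_isolation_policy("my-flink-cluster"), indent=2))
```

The printed JSON could be piped into `kubectl apply -f -` to start the partition, and deleting the policy again would heal it.
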

[1]. https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/test_kubernetes_application_ha.sh

Best,
Yang

Austin Cawley-Edwards <austin.caw...@gmail.com> wrote on Thu, Feb 10, 2022 at 05:12:

> Are there e2e tests that run on kubernetes? Perhaps k8s network policies[1]
> would be an option to simulate asymmetric network partitions without
> modifying iptables in a more approachable way?
>
> Austin
>
> [1]: https://kubernetes.io/docs/concepts/services-networking/network-policies/
>
> On Wed, Feb 9, 2022 at 12:40 PM David Morávek <d...@apache.org> wrote:
>
> > Network partitions are trickier than simply crashing a process. For
> > example these can be asymmetric -> as a TM you're still able to talk to
> > the JM, but you're not able to talk to other TMs.
> >
> > In general this could be achieved by manipulating iptables on the host
> > machine (considering we spawn all the processes locally), but not sure if
> > that will solve the "make it less complicated for others to contribute"
> > part :/ Also this kind of test would be executable on *nix systems only.
> >
> > I assume that jepsen uses the same approach under the hood.
> >
> > D.
> >
> > On Wed, Feb 9, 2022 at 5:43 PM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> > > b/c are part of the same test.
> > >
> > > * We have a job running,
> > > * trigger a network partition (failing the job),
> > > * then crash HDFS (preventing checkpoints and access to the HA
> > >   storageDir),
> > > * then the partition is resolved and HDFS is started again.
> > >
> > > Conceptually I would think we can replicate this by nuking half the
> > > cluster, crashing HDFS/ZK, and restarting everything.
> > >
> > > On 09/02/2022 17:39, Chesnay Schepler wrote:
> > > > The jepsen tests cover 3 cases:
> > > > a) JM/TM crashes
> > > > b) HDFS namenode crash (aka, can't checkpoint because HDFS is down)
> > > > c) network partitions
> > > >
> > > > a) can (and probably is) reasonably covered by existing ITCases and
> > > > e2e tests
> > > > b) We could probably figure this out ourselves if we wanted to.
> > > > c) is the difficult part.
> > > >
> > > > Note that the tests also only cover yarn (per-job/session) and
> > > > standalone (session) deployments.
> > > >
> > > > On 09/02/2022 17:11, Konstantin Knauf wrote:
> > > >> Thank you for raising this issue. What risks do you see if we drop
> > > >> it? Do you see any cheaper alternative to (partially) mitigate
> > > >> those risks?
> > > >>
> > > >> On Wed, Feb 9, 2022 at 12:40 PM Chesnay Schepler
> > > >> <ches...@apache.org> wrote:
> > > >>
> > > >>> For a few years by now we had a set of Jepsen tests that verify the
> > > >>> correctness of Flink's coordination layer in the case of process
> > > >>> crashes. In the past it has indeed found issues and thus provided
> > > >>> value to the project, and in general the core idea of it (and
> > > >>> Jepsen for that matter) is very sound.
> > > >>>
> > > >>> However, so far we neither made attempts to make further use of
> > > >>> Jepsen (and limited ourselves to very basic tests) nor to
> > > >>> familiarize ourselves with the tests/jepsen at all.
> > > >>> As a result these tests are difficult to maintain. They (and
> > > >>> Jepsen) are written in Clojure, which makes debugging, changes and
> > > >>> upstreaming contributions very difficult.
> > > >>> Additionally, the tests also make use of a very complicated
> > > >>> (Ververica-internal) terraform+ansible setup to spin up and tear
> > > >>> down AWS machines. While it works (and is actually pretty cool),
> > > >>> it's difficult to adjust because the people who wrote it have left
> > > >>> the company.
> > > >>>
> > > >>> Why I'm raising this now (and not earlier) is because so far
> > > >>> keeping the tests running wasn't much of a problem; bump a few
> > > >>> dependencies here and there and we're good to go.
> > > >>>
> > > >>> However, this has changed with the recent upgrade to Zookeeper 3.5,
> > > >>> which isn't supported by Jepsen out-of-the-box, completely breaking
> > > >>> the tests. We'd now have to write a new Zookeeper 3.5+ integration
> > > >>> for Jepsen (again, in Clojure). While I started working on that and
> > > >>> could likely finish it, I started to wonder whether it even makes
> > > >>> sense to do so, and whether we couldn't invest this time elsewhere.
> > > >>>
> > > >>> Let me know what you think.