Here's a link to the doc I mentioned that discusses implementing tests for Beam IO transforms - https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing This doc is definitely not finished (the benchmark section is notably missing, and other sections are definitely still draft :), but I thought it would be useful to talk about the high-level goals there.
High level points from the doc ======================= I think the most important part of this doc for now is the goals of testing IO transforms that I propose, and how I propose that we cover them: 1. IO Transform is correct - corner cases incl. DWR, thread-safety (cover with unit tests) 2. IO Transform works w/ real instance of data store (cover with integration tests) 3. Runners correctly run IO Transforms (integration tests) 4. IO Transform/Runner pairs scale - to "medium data" [1] (perf test) 5. IO Transform/Runner pairs - basic correctness at scale (basic output validation on perf test) [1] medium data = 5 data store instances and/or NNN GB data? TBD I go into more detail about the goals in a table in the section "Test Strategy for Beam IO" I also discuss setting up mocks/fakes in unit tests, as well as a strategy for testing network failures/retries (as discussed in my last email.) Stephen On Tue, Nov 22, 2016 at 5:00 PM Stephen Sisk <s...@google.com> wrote: Hi, I'm excited we're getting lots of discussion going. There are many threads of conversation here, we may choose to split some of them off into a different email thread. I'm also betting I missed some of the questions in this thread, so apologies ahead of time for that. Also apologies for the amount of text, I provided some quick summaries at the top of each section. Amit - thanks for your thoughts. I've responded in detail below. Ismael - thanks for offering to help. There's plenty of work here to go around. I'll try and think about how we can divide up some next steps (probably in a separate thread.) The main next step I see is deciding between kubernetes/mesos+marathon/docker swarm - I'm working on that, but having lots of different thoughts on what the advantages/disadvantages of those are would be helpful (I'm not entirely sure of the protocol for collaborating on sub-projects like this.) These issues are all related to what kind of tests we want to write. I think a kubernetes/mesos/swarm cluster could support all the use cases we've discussed here (and thus should not block moving forward with this), but understanding what we want to test will help us understand how the cluster will be used. I'm working on a proposed user guide for testing IO Transforms, and I'm going to send out a link to that + a short summary to the list shortly so folks can get a better sense of where I'm coming from. Here's my thinking on the questions we've raised here - Embedded versions of data stores for testing -------------------- Summary: yes! But we still need real data stores to test against. I am a gigantic fan of using embedded versions of the various data stores. I think we should test everything we possibly can using them, and do the majority of our correctness testing using embedded versions + the direct runner. However, it's also important to have at least one test that actually connects to an actual instance, so we can get coverage for things like credentials, real connection strings, etc... The key point is that embedded versions definitely can't cover the performance tests, so we need to host instances if we want to test that. I consider the integration tests/performance benchmarks to be costly things that we do only for the IO transforms with large amounts of community support/usage. A random IO transform used by a few users doesn't necessarily need integration & perf tests, but for heavily used IO transforms, there's a lot of community value in these tests. The maintenance proposal below scales with the amount of community support for a particular IO transform. Reusing data stores ("use the data stores across executions.") ------------------ Summary: I favor a hybrid approach: some frequently used, very small instances that we keep up all the time + larger multi-container data store instances that we spin up for perf tests. I don't think we need to have a strong answer to this question, but I think we do need to know what range of capabilities we need, and use that to inform our requirements on the hosting infrastructure. I think kubernetes/mesos + docker can support all the scenarios I discuss below. I had been thinking of a hybrid approach - reuse some instances and don't reuse others. Some tests require isolation from other tests (eg. performance benchmarking), while others can easily re-use the same database/data store instance over time, provided they are written in the correct manner (eg. a simple read or write correctness integration tests) To me, the question of whether to use one instance over time for a test vs spin up an instance for each test comes down to a trade off between these factors: 1. Flakiness of spin-up of an instance - if it's super flaky, we'll want to keep more instances up and running rather than bring them up/down. (this may also vary by the data store in question) 2. Frequency of testing - if we are running tests every 5 minutes, it may be wasteful to bring machines up/down every time. If we run tests once a day or week, it seems wasteful to keep the machines up the whole time. 3. Isolation requirements - If tests must be isolated, it means we either have to bring up the instances for each test, or we have to have some sort of signaling mechanism to indicate that a given instance is in use. I strongly favor bringing up an instance per test. 4. Number/size of containers - if we need a large number of machines for a particular test, keeping them running all the time will use more resources. The major unknown to me is how flaky it'll be to spin these up. I'm hopeful/assuming they'll be pretty stable to bring up, but I think the best way to test that is to start doing it. I suspect the sweet spot is the following: have a set of very small data store instances that stay up to support small-data-size post-commit end to end tests (post-commits run frequently and the data size means the instances would not use many resources), combined with the ability to spin up larger instances for once a day/week performance benchmarks (these use up more resources and are used less frequently.) That's the mix I'll propose in my docs on testing IO transforms. If spinning up new instances is cheap/non-flaky, I'd be fine with the idea of spinning up instances for each test. Management ("what's the overhead of managing such a deployment") -------------------- Summary: I propose that anyone can contribute scripts for setting up data store instances + integration/perf tests, but if the community doesn't maintain a particular data store's tests, we disable the tests and turn off the data store instances. Management of these instances is a crucial question. First, let's break down what tasks we'll need to do on a recurring basis: 1. Ongoing maintenance (update to new versions, both instance & dependencies) - we don't want to have a lot of old versions vulnerable to attacks/buggy 2. Investigate breakages/regressions (I'm betting there will be more things we'll discover - let me know if you have suggestions) There's a couple goals I see: 1. We should only do sys admin work for things that give us a lot of benefit. (ie, don't build IT/perf/data store set up scripts for data stores without a large community) 2. We should do as much as possible of testing via in-memory/embedded testing (as you brought up). 3. Reduce the amount of manual administration overhead As I discussed above, I think that integration tests/performance benchmarks are costly things that we should do only for the IO transforms with large amounts of community support/usage. Thus, I propose that we limit the IO transforms that get integration tests & performance benchmarks to those that have community support for maintaining the data store instances. We can enforce this organically using some simple rules: 1. Investigating breakages/regressions: if a given integration/perf test starts failing and no one investigates it within a set period of time (a week?), we disable the tests and shut off the data store instances if we have instances running. When someone wants to step up and support it again, they can fix the test, check it in, and re-enable the test. 2. Ongoing maintenance: every N months, file a jira issue that is just "is the IO Transform X data store up to date?" - if the jira is not resolved in a set period of time (1 month?), the perf/integration tests are disabled, and the data store instances shut off. This is pretty flexible - * If a particular person or organization wants to support an IO transform, they can. If a group of people all organically organize to keep the tests running, they can. * It can be mostly automated - there's not a lot of central organizing work that needs to be done. Exposing the information about what IO transforms currently have running IT/perf benchmarks on the website will let users know what IO transforms are well supported. I like this solution, but I also recognize this is a tricky problem. This is something the community needs to be supportive of, so I'm open to other thoughts. Simulating failures in real nodes ("programmatic tests to simulate failure") ----------------- Summary: 1) Focus our testing on the code in Beam 2) We should encourage a design pattern separating out network/retry logic from the main IO transform logic We *could* create instance failure in any container management software - we can use their programmatic APIs to determine which containers are running the instances, and ask them to kill the container in question. A slow node would be trickier, but I'm sure we could figure it out - for example, add a network proxy that would delay responses. However, I would argue that this type of testing doesn't gain us a lot, and is complicated to set up. I think it will be easier to test network errors and retry behavior in unit tests for the IO transforms. Part of the way to handle this is to separate out the read code from the network code (eg. bigtable has BigtableService). If you put the "handle errors/retry logic" code in a separate MySourceService class, you can test MySourceService on the wide variety of networks errors/data store problems, and then your main IO transform tests focus on the read behavior and handling the small set of errors the MySourceService class will return. I also think we should focus on testing the IO Transform, not the data store - if we kill a node in a data store, it's that data store's problem, not beam's problem. As you were pointing out, there are a *large* number of possible ways that a particular data store can fail, and we would like to support many different data stores. Rather than try to test that each data store behaves well, we should ensure that we handle generic/expected errors in a graceful manner. Ismaeal had a couple other quick comments/questions, I'll answer here - We can use this to test other runners running on multiple machines - I agree. This is also necessary for a good performance benchmark test. "providing the test machines to mount the cluster" - we can discuss this further, but one possible option is that google may be willing to donate something to support this. "IO Consistency" - let's follow up on those questions in another thread. That's as much about the public interface we provide to users as anything else. I agree with your sentiment that a user should be able to expect predictable behavior from the different IO transforms. Thanks for everyone's questions/comments - I really am excited to see that people care about this :) Stephen On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com> wrote: Hello, @Stephen Thanks for your proposal, it is really interesting, I would really like to help with this. I have never played with Kubernetes but this seems a really nice chance to do something useful with it. We (at Talend) are testing most of the IOs using simple container images and in some particular cases ‘clusters’ of containers using docker-compose (a little bit like Amit’s (2) proposal). It would be really nice to have this at the Beam level, in particular to try to test more complex semantics, I don’t know how programmable kubernetes is to achieve this for example: Let’s think we have a cluster of Cassandra or Kafka nodes, I would like to have programmatic tests to simulate failure (e.g. kill a node), or simulate a really slow node, to ensure that the IO behaves as expected in the Beam pipeline for the given runner. Another related idea is to improve IO consistency: Today the different IOs have small differences in their failure behavior, I really would like to be able to predict with more precision what will happen in case of errors, e.g. what is the correct behavior if I am writing to a Kafka node and there is a network partition, does the Kafka sink retries or no ? and what if it is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or do we guarantee exactly once writes somehow?, today I am not sure about what happens (or if the expected behavior depends on the runner), but well maybe it is just that I don’t know and we have tests to ensure this. Of course both are really hard problems, but I think with your proposal we can try to tackle them, as well as the performance ones. And apart of the data stores, I think it will be also really nice to be able to test the runners in a distributed manner. So what is the next step? How do you imagine such integration tests? ? Who can provide the test machines so we can mount the cluster? Maybe my ideas are a bit too far away for an initial setup, but it will be really nice to start working on this. Ismael On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com> wrote: > Hi Stephen, > > I was wondering about how we plan to use the data stores across executions. > > Clearly, it's best to setup a new instance (container) for every test, > running a "standalone" store (say HBase/Cassandra for example), and once > the test is done, teardown the instance. It should also be agnostic to the > runtime environment (e.g., Docker on Kubernetes). > I'm wondering though what's the overhead of managing such a deployment > which could become heavy and complicated as more IOs are supported and more > test cases introduced. > > Another way to go would be to have small clusters of different data stores > and run against new "namespaces" (while lazily evicting old ones), but I > think this is less likely as maintaining a distributed instance (even a > small one) for each data store sounds even more complex. > > A third approach would be to to simply have an "embedded" in-memory > instance of a data store as part of a test that runs against it (such as an > embedded Kafka, though not a data store). > This is probably the simplest solution in terms of orchestration, but it > depends on having a proper "embedded" implementation for an IO. > > Does this make sense to you ? have you considered it ? > > Thanks, > Amit > > On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > > > Hi Stephen, > > > > as already discussed a bit together, it sounds great ! Especially I like > > it as a both integration test platform and good coverage for IOs. > > > > I'm very late on this but, as said, I will share with you my Marathon > > JSON and Mesos docker images. > > > > By the way, I started to experiment a bit kubernetes and swamp but it's > > not yet complete. I will share what I have on the same github repo. > > > > Thanks ! > > Regards > > JB > > > > On 11/16/2016 11:36 PM, Stephen Sisk wrote: > > > Hi everyone! > > > > > > Currently we have a good set of unit tests for our IO Transforms - > those > > > tend to run against in-memory versions of the data stores. However, > we'd > > > like to further increase our test coverage to include running them > > against > > > real instances of the data stores that the IO Transforms work against > > (e.g. > > > cassandra, mongodb, kafka, etc…), which means we'll need to have real > > > instances of various data stores. > > > > > > Additionally, if we want to do performance regression detection, it's > > > important to have instances of the services that behave realistically, > > > which isn't true of in-memory or dev versions of the services. > > > > > > > > > Proposed solution > > > ------------------------- > > > If we accept this proposal, we would create an infrastructure for > running > > > real instances of data stores inside of containers, using container > > > management software like mesos/marathon, kubernetes, docker swarm, etc… > > to > > > manage the instances. > > > > > > This would enable us to build integration tests that run against those > > real > > > instances and performance tests that run against those real instances > > (like > > > those that Jason Kuster is proposing elsewhere.) > > > > > > > > > Why do we need one centralized set of instances vs just having various > > > people host their own instances? > > > ------------------------- > > > Reducing flakiness of tests is key. By not having dependencies from the > > > core project on external services/instances of data stores we have > > > guaranteed access to the services and the group can fix issues that > > arise. > > > > > > An exception would be something that has an ops team supporting it (eg, > > > AWS, Google Cloud or other professionally managed service) - those we > > trust > > > will be stable. > > > > > > > > > There may be a lot of different data stores needed - how will we > maintain > > > them? > > > ------------------------- > > > It will take work above and beyond that of a normal set of unit tests > to > > > build and maintain integration/performance tests & their data store > > > instances. > > > > > > Setup & maintenance of the data store containers and data store > instances > > > on it must be automated. It also has to be as simple of a setup as > > > possible, and we should avoid hand tweaking the containers - expecting > > > checked in scripts/dockerfiles is key. > > > > > > Aligned with the community ownership approach of Apache, as members of > > the > > > community are excited to contribute & maintain those tests and the > > > integration/performance tests, people will be able to step up and do > > that. > > > If there is no longer support for maintaining a particular set of > > > integration & performance tests and their data store instances, then we > > can > > > disable those tests. We may document on the website what IO Transforms > > have > > > current integration/performance tests so users know what level of > testing > > > the various IO Transforms have. > > > > > > > > > What about requirements for the container management software itself? > > > ------------------------- > > > * We should have the data store instances themselves in Docker. Docker > > > allows new instances to be spun up in a quick, reproducible way and is > > > fairly platform independent. It has wide support from a variety of > > > different container management services. > > > * As little admin work required as possible. Crashing instances should > be > > > restarted, setup should be simple, everything possible should be > > > scripted/scriptable. > > > * Logs and test output should be on a publicly available website, > without > > > needing to log into test execution machine. Centralized capture of > > > monitoring info/logs from instances running in the containers would > > support > > > this. Ideally, this would just be supported by the container software > out > > > of the box. > > > * It'd be useful to have good persistent volume in the container > > management > > > software so that databases don't have to reload large data sets every > > time. > > > * The containers may be a place to execute runners themselves if we > need > > > larger runner instances, so it should play well with Spark, Flink, etc… > > > > > > As I discussed earlier on the mailing list, it looks like hosting > docker > > > containers on kubernetes, docker swarm or mesos+marathon would be a > good > > > solution. > > > > > > Thanks, > > > Stephen Sisk > > > > > > > -- > > Jean-Baptiste Onofré > > jbono...@apache.org > > http://blog.nanthrax.net > > Talend - http://www.talend.com > > >