Hi Stephen,

I really like your proposal! I don't have any comments because this seems very well "researched" already.
I'm hoping others will have a look at this as well, because "real" integration testing provides a new level of confidence in the code, IMHO.

Cheers,
Aljoscha

On Wed, 16 Nov 2016 at 23:36 Stephen Sisk <s...@google.com.invalid> wrote:

> Hi everyone!
>
> Currently we have a good set of unit tests for our IO Transforms - those
> tend to run against in-memory versions of the data stores. However, we'd
> like to further increase our test coverage to include running them against
> real instances of the data stores that the IO Transforms work against (e.g.
> cassandra, mongodb, kafka, etc…), which means we'll need to have real
> instances of various data stores.
>
> Additionally, if we want to do performance regression detection, it's
> important to have instances of the services that behave realistically,
> which isn't true of in-memory or dev versions of the services.
>
>
> Proposed solution
> -------------------------
> If we accept this proposal, we would create an infrastructure for running
> real instances of data stores inside of containers, using container
> management software like mesos/marathon, kubernetes, docker swarm, etc… to
> manage the instances.
>
> This would enable us to build integration tests that run against those real
> instances and performance tests that run against those real instances (like
> those that Jason Kuster is proposing elsewhere.)
>
>
> Why do we need one centralized set of instances vs just having various
> people host their own instances?
> -------------------------
> Reducing flakiness of tests is key. By not having dependencies from the
> core project on external services/instances of data stores we have
> guaranteed access to the services and the group can fix issues that arise.
>
> An exception would be something that has an ops team supporting it (e.g.,
> AWS, Google Cloud or other professionally managed service) - those we trust
> will be stable.
>
>
> There may be a lot of different data stores needed - how will we maintain
> them?
> -------------------------
> It will take work above and beyond that of a normal set of unit tests to
> build and maintain integration/performance tests & their data store
> instances.
>
> Setup & maintenance of the data store containers and data store instances
> on it must be automated. It also has to be as simple a setup as
> possible, and we should avoid hand tweaking the containers - expecting
> checked in scripts/dockerfiles is key.
>
> Aligned with the community ownership approach of Apache, as members of the
> community are excited to contribute & maintain those tests and the
> integration/performance tests, people will be able to step up and do that.
> If there is no longer support for maintaining a particular set of
> integration & performance tests and their data store instances, then we can
> disable those tests. We may document on the website what IO Transforms have
> current integration/performance tests so users know what level of testing
> the various IO Transforms have.
>
>
> What about requirements for the container management software itself?
> -------------------------
> * We should have the data store instances themselves in Docker. Docker
> allows new instances to be spun up in a quick, reproducible way and is
> fairly platform independent. It has wide support from a variety of
> different container management services.
> * As little admin work required as possible. Crashing instances should be
> restarted, setup should be simple, everything possible should be
> scripted/scriptable.
> * Logs and test output should be on a publicly available website, without
> needing to log into test execution machine. Centralized capture of
> monitoring info/logs from instances running in the containers would support
> this. Ideally, this would just be supported by the container software out
> of the box.
> * It'd be useful to have good persistent volume support in the container
> management software so that databases don't have to reload large data sets
> every time.
> * The containers may be a place to execute runners themselves if we need
> larger runner instances, so it should play well with Spark, Flink, etc…
>
> As I discussed earlier on the mailing list, it looks like hosting docker
> containers on kubernetes, docker swarm or mesos+marathon would be a good
> solution.
>
> Thanks,
> Stephen Sisk
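
To make the "tests run against real instances" idea a bit more concrete, here is a rough sketch of how an integration test could locate a containerized data store and wait for it to accept connections before running. Everything in it is an assumption for illustration only: the environment variable naming scheme (e.g. CASSANDRA_HOST / CASSANDRA_PORT) and both helper names are hypothetical and are not part of the proposal, and an open TCP port only shows the instance is listening, not that it is fully ready.

```python
import os
import socket
import time


def data_store_address(name, environ=os.environ):
    """Return (host, port) for a data store, or None if it is not configured.

    The <NAME>_HOST / <NAME>_PORT variable names are hypothetical; the
    proposal does not define how tests would discover instances.
    """
    host = environ.get(f"{name.upper()}_HOST")
    port = environ.get(f"{name.upper()}_PORT")
    if not host or not port:
        return None
    return host, int(port)


def wait_until_reachable(host, port, timeout_s=30.0, interval_s=0.5):
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves that something is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A test could call data_store_address("cassandra") and skip itself when the result is None, so the same test suite still passes on machines that have no access to the shared instances.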