Hi Stephen,

I was wondering how we plan to use the data stores across executions.
Clearly, it's best to set up a new instance (container) for every test, running a "standalone" store (say, HBase or Cassandra), and to tear down the instance once the test is done. It should also be agnostic to the runtime environment (e.g., Docker on Kubernetes). I wonder, though, about the overhead of managing such a deployment, which could become heavy and complicated as more IOs are supported and more test cases are introduced.

Another way to go would be to have small clusters of different data stores and run against new "namespaces" (while lazily evicting old ones), but I think this is less likely, as maintaining a distributed instance (even a small one) for each data store sounds even more complex.

A third approach would be to simply have an "embedded" in-memory instance of a data store as part of a test that runs against it (such as an embedded Kafka, though that is not a data store). This is probably the simplest solution in terms of orchestration, but it depends on having a proper "embedded" implementation for an IO.

Does this make sense to you? Have you considered it?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Stephen,
>
> as already discussed a bit together, it sounds great! I especially like
> it as both an integration test platform and good coverage for IOs.
>
> I'm very late on this but, as said, I will share with you my Marathon
> JSON and Mesos docker images.
>
> By the way, I started to experiment a bit with kubernetes and swarm but
> it's not yet complete. I will share what I have on the same github repo.
>
> Thanks!
> Regards
> JB
>
> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > Hi everyone!
> >
> > Currently we have a good set of unit tests for our IO Transforms - those
> > tend to run against in-memory versions of the data stores.
> > However, we'd like to further increase our test coverage to include
> > running them against real instances of the data stores that the IO
> > Transforms work against (e.g. cassandra, mongodb, kafka, etc…), which
> > means we'll need to have real instances of various data stores.
> >
> > Additionally, if we want to do performance regression detection, it's
> > important to have instances of the services that behave realistically,
> > which isn't true of in-memory or dev versions of the services.
> >
> >
> > Proposed solution
> > -------------------------
> > If we accept this proposal, we would create an infrastructure for running
> > real instances of data stores inside of containers, using container
> > management software like mesos/marathon, kubernetes, docker swarm, etc…
> > to manage the instances.
> >
> > This would enable us to build integration tests that run against those
> > real instances and performance tests that run against those real
> > instances (like those that Jason Kuster is proposing elsewhere.)
> >
> >
> > Why do we need one centralized set of instances vs just having various
> > people host their own instances?
> > -------------------------
> > Reducing flakiness of tests is key. By not having dependencies from the
> > core project on external services/instances of data stores we have
> > guaranteed access to the services and the group can fix issues that
> > arise.
> >
> > An exception would be something that has an ops team supporting it (eg,
> > AWS, Google Cloud or other professionally managed service) - those we
> > trust will be stable.
> >
> >
> > There may be a lot of different data stores needed - how will we maintain
> > them?
> > -------------------------
> > It will take work above and beyond that of a normal set of unit tests to
> > build and maintain integration/performance tests & their data store
> > instances.
> >
> > Setup & maintenance of the data store containers and the data store
> > instances on them must be automated. It also has to be as simple a setup
> > as possible, and we should avoid hand tweaking the containers - expecting
> > checked in scripts/dockerfiles is key.
> >
> > Aligned with the community ownership approach of Apache, as members of
> > the community are excited to contribute to & maintain those
> > integration/performance tests, people will be able to step up and do
> > that. If there is no longer support for maintaining a particular set of
> > integration & performance tests and their data store instances, then we
> > can disable those tests. We may document on the website what IO
> > Transforms have current integration/performance tests so users know what
> > level of testing the various IO Transforms have.
> >
> >
> > What about requirements for the container management software itself?
> > -------------------------
> > * We should have the data store instances themselves in Docker. Docker
> > allows new instances to be spun up in a quick, reproducible way and is
> > fairly platform independent. It has wide support from a variety of
> > different container management services.
> > * As little admin work required as possible. Crashing instances should be
> > restarted, setup should be simple, everything possible should be
> > scripted/scriptable.
> > * Logs and test output should be on a publicly available website, without
> > needing to log into the test execution machine. Centralized capture of
> > monitoring info/logs from instances running in the containers would
> > support this. Ideally, this would just be supported by the container
> > software out of the box.
> > * It'd be useful to have good persistent volume support in the container
> > management software so that databases don't have to reload large data
> > sets every time.
> > * The containers may be a place to execute runners themselves if we need
> > larger runner instances, so it should play well with Spark, Flink, etc…
> >
> > As I discussed earlier on the mailing list, it looks like hosting docker
> > containers on kubernetes, docker swarm or mesos+marathon would be a good
> > solution.
> >
> > Thanks,
> > Stephen Sisk
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>