Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
Cassandra, MongoDB, Kafka, etc…), which means we'll need real instances of
various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-------------------------
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like mesos/marathon, kubernetes, docker swarm, etc… to
manage the instances.

This would enable us to build integration tests and performance tests that
run against those real instances (like those that Jason Kuster is proposing
elsewhere.)


Why do we need one centralized set of instances vs just having various
people host their own instances?
-------------------------
Reducing flakiness of tests is key. By hosting the instances ourselves
rather than making the core project depend on externally run
services/instances of data stores, we have guaranteed access to the
services, and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (e.g.
AWS, Google Cloud or another professionally managed service) - we trust
those to be stable.


There may be a lot of different data stores needed - how will we maintain
them?
-------------------------
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and the data store
instances on them must be automated. The setup also has to be as simple as
possible, and we should avoid hand-tweaking the containers - expecting
checked-in scripts/Dockerfiles is key.
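As a rough sketch of what a checked-in Dockerfile might look like (the
image tag and init script name here are purely illustrative assumptions,
not anything we've built), the point is that the exact data store version
and test schema live in source control rather than in a hand-tweaked
container:

```dockerfile
# Hypothetical checked-in Dockerfile for an integration-test data store.
# Pinning the tag means every test run gets the same data store version.
FROM postgres:9.6

# The official postgres image runs any *.sql files found in this directory
# on first startup, so every container begins in a known, scripted state.
COPY init-test-schema.sql /docker-entrypoint-initdb.d/
```

Rebuilding the image from scratch should be all that's ever needed to
recreate an instance.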

In line with Apache's community ownership approach, community members who
are excited about a particular IO Transform can step up to contribute and
maintain its integration/performance tests and data store instances. If
there is no longer support for maintaining a particular set of integration
& performance tests and their data store instances, then we can disable
those tests. We may document on the website which IO Transforms have
current integration/performance tests so users know what level of testing
the various IO Transforms have.


What about requirements for the container management software itself?
-------------------------
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible. Crashing instances should be
restarted, setup should be simple, everything possible should be
scripted/scriptable.
* Logs and test output should be on a publicly available website, without
needing to log into the test execution machines. Centralized capture of
monitoring info/logs from instances running in the containers would support
this. Ideally, this would just be supported by the container software out
of the box.
* It'd be useful to have good persistent volume support in the container
management software so that databases don't have to reload large data sets
every time.
* The containers may be a place to execute runners themselves if we need
larger runner instances, so it should play well with Spark, Flink, etc…
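To make the requirements above concrete, here's a hypothetical sketch of
what one data store instance could look like under Kubernetes (the names,
image, and storage details are all illustrative assumptions; Marathon and
Docker Swarm have rough equivalents):

```yaml
# Hypothetical Kubernetes Deployment for one integration-test data store.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: it-postgres
spec:
  replicas: 1                  # Kubernetes restarts the pod if it crashes,
  selector:                    # satisfying the "little admin work" goal
    matchLabels:
      app: it-postgres
  template:
    metadata:
      labels:
        app: it-postgres
    spec:
      containers:
      - name: postgres
        image: postgres:9.6
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:   # persistent volume: large data sets
          claimName: it-postgres-data   # survive container restarts
```

Everything here is declarative and would be checked in, so recreating the
whole environment is a matter of re-applying the configs.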

As I discussed earlier on the mailing list, it looks like hosting docker
containers on kubernetes, docker swarm or mesos+marathon would be a good
solution.

Thanks,
Stephen Sisk
