Hi Stephen,
I really like your proposal! I don't have any comments because this seems
very well "researched" already.

I'm hoping others will have a look at this as well, because "real"
integration testing provides a new level of confidence in the code, IMHO.

Cheers,
Aljoscha


On Wed, 16 Nov 2016 at 23:36 Stephen Sisk <s...@google.com.invalid> wrote:

> Hi everyone!
>
> Currently we have a good set of unit tests for our IO Transforms - those
> tend to run against in-memory versions of the data stores. However, we'd
> like to extend our test coverage to run them against real instances of the
> data stores that the IO Transforms work with (e.g. Cassandra, MongoDB,
> Kafka), which means we'll need to host real instances of those data
> stores.
>
> Additionally, if we want to do performance regression detection, it's
> important to have instances of the services that behave realistically,
> which isn't true of in-memory or dev versions of the services.
>
>
> Proposed solution
> -------------------------
> If we accept this proposal, we would create infrastructure for running
> real instances of data stores inside of containers, using container
> management software like Mesos/Marathon, Kubernetes, Docker Swarm, etc. to
> manage the instances.
>
> This would enable us to build both integration tests and performance tests
> (like those that Jason Kuster is proposing elsewhere) that run against
> those real instances.
>
>
> Why do we need one centralized set of instances vs just having various
> people host their own instances?
> -------------------------
> Reducing flakiness of tests is key. If the core project does not depend on
> externally hosted services/instances of data stores, we have guaranteed
> access to the services and the group can fix issues that arise.
>
> An exception would be something that has an ops team supporting it (e.g.
> AWS, Google Cloud or another professionally managed service) - those we
> trust to be stable.
>
>
> There may be a lot of different data stores needed - how will we maintain
> them?
> -------------------------
> It will take work above and beyond that of a normal set of unit tests to
> build and maintain integration/performance tests & their data store
> instances.
>
> Setup & maintenance of the data store containers and the data store
> instances on them must be automated. The setup also has to be as simple as
> possible, and we should avoid hand-tweaking the containers - requiring
> checked-in scripts/Dockerfiles is key.
>
> In line with the community ownership approach of Apache, members of the
> community who are excited to contribute & maintain the
> integration/performance tests and their data store instances will be able
> to step up and do that.
> If there is no longer support for maintaining a particular set of
> integration & performance tests and their data store instances, then we can
> disable those tests. We may document on the website which IO Transforms have
> current integration/performance tests so users know what level of testing
> the various IO Transforms have.
>
>
> What about requirements for the container management software itself?
> -------------------------
> * We should have the data store instances themselves in Docker. Docker
> allows new instances to be spun up in a quick, reproducible way and is
> fairly platform independent. It has wide support from a variety of
> different container management services.
> * As little admin work as possible should be required. Crashed instances
> should be restarted automatically, setup should be simple, everything
> possible should be scripted/scriptable.
> * Logs and test output should be on a publicly available website, without
> needing to log into the test execution machine. Centralized capture of
> monitoring info/logs from instances running in the containers would support
> this. Ideally, this would just be supported by the container software out
> of the box.
> * It'd be useful to have good persistent volume support in the container
> management software so that databases don't have to reload large data sets
> every time.
> * The containers may be a place to execute runners themselves if we need
> larger runner instances, so the container management software should play
> well with Spark, Flink, etc.
>
> As I discussed earlier on the mailing list, it looks like hosting Docker
> containers on Kubernetes, Docker Swarm or Mesos+Marathon would be a good
> solution.
>
> Thanks,
> Stephen Sisk
>
