Since Docker containers can run a script on startup, could we embed the initial data set into that script/container build so that the same Docker container and initial data set can be used across multiple ITs? For example, if Python and Java both have a JdbcIO, it would be nice if they could leverage the same Docker container with the same data set, to ensure the same pipeline produces the same results.
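As a concrete illustration of the idea (a hedged sketch: the official postgres image's init-script mechanism is real, but the file name and data set here are hypothetical), the data could be baked into the image at build time:

```dockerfile
# Hypothetical shared test image: the initial data set is part of the
# image build, so Java and Python ITs run against identical data.
FROM postgres:9.6

# The official postgres image executes any *.sql / *.sh files placed in
# /docker-entrypoint-initdb.d when the container is first started.
COPY init-test-data.sql /docker-entrypoint-initdb.d/
```

where `init-test-data.sql` (hypothetical) would create and populate the shared test tables, so no per-IT load step is needed at all.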
This would be different from embedding the data in each specific IT implementation, and would also create a coupling between ITs from potentially multiple languages.

On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <[email protected]> wrote:

> Hi all!
>
> As I've discussed previously on this list[1], ensuring that we have high
> quality IO Transforms is important to Beam. We want to do this without
> adding too much burden on developers wanting to contribute. Below I have a
> concrete proposal for what an IO integration test would look like, and an
> example integration test[4] that meets those requirements.
>
> Proposal: we should require that an IO transform includes a passing
> integration test showing the IO can connect to a real instance of the data
> store. We still want/expect comprehensive unit tests on an IO transform,
> but we would allow check-ins with just some unit tests in the presence of
> an IT.
>
> To support that, we'll require the following pieces associated with an IT:
>
> 1. A Dockerfile that can be used to create a running instance of the data
> store. We've previously discussed on this list that we would use docker
> images running inside kubernetes or mesos[2], and I'd prefer having a
> kubernetes/mesos script to start a given data store, but for a
> single-instance data store, we can take a Dockerfile and use it to create
> a simple kubernetes/mesos app. If you have questions about how maintaining
> the containers long term would work, see [2], where I discussed a detailed
> plan.
>
> 2. Code to load test data into the data store created by #1. This needs to
> be self-contained. For now, the easiest way to do this would be to have
> the code inside the IT.
>
> 3. The IT itself. I propose keeping this inside the same module as the IO
> transform, since having all the IO transform ITs in one module would mean
> there may be conflicts between different data stores' dependencies.
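One way to keep the data in point #2 self-contained yet identical across runs (and across SDK languages) is to generate it deterministically rather than randomly. A minimal sketch; the class name, word list, and row layout are hypothetical, not taken from the linked branch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: generates the rows an IT loads into the data store.
// Because row i depends only on i (no random state), the setup code and the
// result-verification code can independently recompute the expected data.
public class TestRowGenerator {
    // A fixed word list keeps the data set reproducible run to run.
    private static final String[] NAMES =
        {"Einstein", "Darwin", "Copernicus", "Pasteur", "Curie", "Faraday"};

    public static List<String> generateRows(int count) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            // CSV-style "id,name" row, fully determined by the index.
            rows.add(i + "," + NAMES[i % NAMES.length]);
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : generateRows(5)) {
            System.out.println(row);
        }
    }
}
```

The IT's setup phase would insert these rows, and its verification phase would call the same generator to build the expected output, so nothing about the data set lives outside the test code.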
> Integration tests will need connection information pointing to the data
> store they are testing. As discussed previously on this list[3], they
> should receive that connection information via TestPipelineOptions.
>
> I'd like to get something up and running soon so people checking in new IO
> transforms can start taking advantage of an IT framework. Thus, there are
> a couple of simplifying assumptions in this plan. Pieces of the plan that
> I anticipate will evolve:
>
> 1. The test data load script - we would like to write these in a uniform
> way, and especially to ensure that the test data is cleaned up after the
> tests run.
>
> 2. Spinning up/down instances - for now, we'd likely need to do this
> manually. It'd be good to get an automated process for this. That's
> especially critical for performance tests with multiple nodes - there's no
> need to keep instances running for those.
>
> Integrating more closely with PKB would be a good way to do both of these
> things, but first let's focus on getting some basic ITs running.
>
> As a concrete example of this proposal, I've written JDBC IO IT [4].
> JdbcIOTest already did a lot of test setup, so I heavily re-used it. The
> key pieces:
>
> * The integration test is in JdbcIOIT.
>
> * JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions.
> We may move the TestOptions files into a common place so they can be
> shared between tests.
>
> * Test data is created/cleaned up inside the IT.
>
> * kubernetes/mesos scripts - I have provided examples of both under the
> "jdbc/src/test/resources" directory, but I'd like us to decide as a
> project which container orchestration service we want to use - I'll send
> mail about that shortly.
>
> thanks!
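The PostgresTestOptions approach described above follows Beam's standard getter/setter PipelineOptions pattern. A sketch of what such an interface could look like; the property names and defaults here are illustrative (see the branch in [4] for the real file), and it compiles only with the Beam SDK on the classpath:

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.testing.TestPipelineOptions;

// Hypothetical options interface: Beam generates the implementation from
// the getter/setter pairs, so the IT only declares the properties it needs.
public interface PostgresTestOptions extends TestPipelineOptions {
  @Description("Server name for the postgres instance under test")
  @Default.String("localhost")
  String getPostgresServerName();
  void setPostgresServerName(String value);

  @Description("Database name on the postgres instance under test")
  @Default.String("beam_test")
  String getPostgresDatabaseName();
  void setPostgresDatabaseName(String value);
}
```

At test time the IT would read these values to build its connection, with the concrete values supplied from outside the test (e.g. via Beam's `beamTestPipelineOptions` system property) rather than hard-coded.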
> Stephen
>
> [1] Integration Testing Sources
> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>
> [2] Container Orchestration software for hosting data stores
> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>
> [3] Some Thoughts on IO Integration Tests
> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>
> [4] JDBC IO IT using postgres
> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - this has
> not been reviewed yet, so it may contain code errors, but it does run &
> pass :)
