I definitely agree that sharing resources between tests is more efficient. Etienne - do you think it's necessary to separate the IT from the data loading script? The postgres/JdbcIOIT can use the natural namespacing of tables, and I feel pretty comfortable that it will work well over time. You haven't explicitly mentioned it, but I'm assuming that Elasticsearch doesn't allow such namespacing, and that's why you're having to do the separation? I'm not trying to discourage separating the data load from the IT, just wondering whether it's truly necessary.
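[For illustration, namespacing along these lines would let read and write tests share one postgres instance without interfering; the naming scheme here is hypothetical, not what JdbcIOIT actually ships:]

    import java.util.UUID;

    // Table names that namespace tests within a shared postgres instance.
    class TestTableNames {
      // Stable, pre-loaded table that read tests share and never modify.
      static final String READ_TABLE = "beamtest_read";

      // Fresh table per run for write tests, dropped after the test, so
      // repeated or concurrent runs don't collide.
      static String newWriteTable() {
        return "beamtest_write_" + UUID.randomUUID().toString().replace('-', '_');
      }
    }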
I was trying to consolidate what we've discussed down to a few guidelines. I think those are that IO ITs:

1. Can assume the data load script has been run before the test (unless the data load script is run by the test itself).

2. Must work if they are run multiple times without the data load script being run in between (i.e., they should clean up after themselves or use namespacing so that tests don't interfere with one another).

3. Will attempt to clean up after themselves if they generate large amounts of data (i.e., if you just write 100 rows, don't worry about it; if you write 5 GB of data, you'd need to clean up). We will not assume this cleanup will always succeed; my assumption is that if a particular data store gets into a bad state, we'll just destroy/recreate that particular data store. (A sketch of a test that follows guidelines 2 and 3 is at the end of this mail.)

If the tests follow those assumptions, then that should support all the scenarios I can think of: running the data store create + data load scripts occasionally (say, once a week or month) all the way up to running them once per test run (if we decided to go that far).

S
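[For illustration, a write IT that follows guidelines 2 and 3 might look roughly like this -- untested, with hypothetical table/connection details that a real IT would read from TestPipelineOptions instead:]

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.UUID;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;

    public class ExampleWriteIT {
      // A unique table per run namespaces this run's data away from other
      // runs, so reruns without a data load in between still work (guideline 2).
      private static final String TABLE =
          "beamtest_write_" + UUID.randomUUID().toString().replace('-', '_');

      // Hypothetical connection details; a real IT gets these from options.
      private static final String URL = "jdbc:postgresql://localhost/beamtest";

      private static Connection connection;

      @BeforeClass
      public static void createTable() throws SQLException {
        connection = DriverManager.getConnection(URL, "beam", "beam");
        try (Statement stmt = connection.createStatement()) {
          stmt.executeUpdate(
              "CREATE TABLE " + TABLE + " (id INT, name VARCHAR(255))");
        }
      }

      @Test
      public void testWrite() {
        // Run the write pipeline against TABLE, then assert on its contents.
      }

      @AfterClass
      public static void dropTable() throws SQLException {
        // Best-effort cleanup (guideline 3); if it ever fails badly, we just
        // destroy and recreate the data store.
        try (Statement stmt = connection.createStatement()) {
          stmt.executeUpdate("DROP TABLE IF EXISTS " + TABLE);
        } finally {
          connection.close();
        }
      }
    }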
On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <[email protected]> wrote:

Hi,

Yes, thanks all for these clarifications about the testing architecture. I agree that points 1 and 2 should be shared between tests as much as possible. In particular, sharing data loading between tests is more time- and resource-effective: tests that need data (testRead, testSplit, ...) save the loading time, the wait for asynchronous indexing, and the cleaning time.

Just a small comment: if we share the data loading between tests, then tests that expect an empty dataset (testWrite, ...) obviously cannot clear the shared dataset. So they will need to write to a dedicated place (separate from the read tests) and clean it up afterwards.

I will update the Elasticsearch read IT (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT) to not do data loading/cleaning, and the write IT to use a different location than the read IT.

Etienne

On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
> Hi guys,
>
> First, great e-mail Stephen: a complete and detailed proposal.
>
> Lukasz raised a good point: it makes sense to be able to leverage the same "bootstrap" script.
>
> We discussed providing the following in each IO:
> 1. code to load data (java, script, whatever)
> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
> 3. actual integration tests
>
> Only 3 is specific to the IO: 1 and 2 can be the same whether we run integration tests for the Python SDK or the Java SDK.
>
> However, 3 may depend on 1 and 2 (the integration tests perform some assertions based on the loaded data, for instance).
>
> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand or by Jenkins using a "description" of where the code and script are located.
>
> So, I think that we can put 1 and 2 in the IO and use a "descriptor" to do the bootstrapping.
>
> Regards
> JB
>
> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>> Since docker containers can run a script on startup, can we embed the initial data set into that script/container build, so that the same docker container and initial data set can be used across multiple ITs? For example, if Python and Java both have JdbcIO, it would be nice if they could leverage the same docker container with the same data set, to ensure the same pipeline produces the same results.
>>
>> This would be different from embedding the data in the specific IT implementation, and would also create a coupling between ITs from potentially multiple languages.
>>
>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <[email protected]> wrote:
>>
>>> Hi all!
>>>
>>> As I've discussed previously on this list [1], ensuring that we have high-quality IO transforms is important to Beam. We want to do this without adding too much burden on developers wanting to contribute. Below I have a concrete proposal for what an IO integration test would look like, and an example integration test [4] that meets those requirements.
>>>
>>> Proposal: we should require that an IO transform include a passing integration test showing the IO can connect to a real instance of the data store. We still want/expect comprehensive unit tests on an IO transform, but we would allow check-ins with just some unit tests in the presence of an IT.
>>>
>>> To support that, we'll require the following pieces associated with an IT:
>>>
>>> 1. A Dockerfile that can be used to create a running instance of the data store. We've previously discussed on this list that we would use docker images running inside kubernetes or mesos [2], and I'd prefer having a kubernetes/mesos script to start a given data store, but for a single-instance data store we can take a dockerfile and use it to create a simple kubernetes/mesos app. If you have questions about how maintaining the containers long term would work, see [2], where I discussed a detailed plan.
>>>
>>> 2. Code to load test data on the data store created by #1. It needs to be self-contained; for now, the easiest way to do this would be to have the code inside of the IT.
>>>
>>> 3. The IT itself. I propose keeping this inside the same module as the IO transform, since having all the IO transform ITs in one module would mean there could be conflicts between different data stores' dependencies. Integration tests will need connection information pointing to the data store they are testing. As discussed previously on this list [3], they should receive that connection information via TestPipelineOptions (a sketch follows below).
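[For illustration, the TestPipelineOptions piece might look roughly like this; the option names here are hypothetical:]

    import org.apache.beam.sdk.options.Default;
    import org.apache.beam.sdk.options.Description;
    import org.apache.beam.sdk.testing.TestPipelineOptions;

    /** Connection info for the postgres instance under test. */
    public interface PostgresTestOptions extends TestPipelineOptions {
      @Description("Server name for the postgres instance")
      @Default.String("localhost")
      String getPostgresServerName();
      void setPostgresServerName(String value);

      @Description("Database name")
      @Default.String("beamtest")
      String getPostgresDatabaseName();
      void setPostgresDatabaseName(String value);
    }

[The IT would then register the interface with PipelineOptionsFactory and read it via TestPipeline.testingPipelineOptions().as(PostgresTestOptions.class), so the same test binary can point at different data store instances.]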
>>> I'd like to get something up and running soon so people checking in new IO transforms can start taking advantage of an IT framework. Thus, there are a couple of simplifying assumptions in this plan. Pieces of the plan that I anticipate will evolve:
>>>
>>> 1. The test data load script - we would like to write these in a uniform way, and especially to ensure that the test data is cleaned up after the tests run.
>>>
>>> 2. Spinning up/down instances - for now, we'd likely need to do this manually. It'd be good to get an automated process for this. That's especially critical for performance tests with multiple nodes - there's no need to keep instances running for those.
>>>
>>> Integrating closer with PKB would be a good way to do both of these things, but first let's focus on getting some basic ITs running.
>>>
>>> As a concrete example of this proposal, I've written the JDBC IO IT [4]. JdbcIOTest already did a lot of test setup, so I heavily re-used it. The key pieces:
>>>
>>> * The integration test is in JdbcIOIT.
>>>
>>> * JdbcIOIT reads the TestPipelineOptions defined in PostgresTestOptions. We may move the TestOptions files into a common place so they can be shared between tests.
>>>
>>> * Test data is created/cleaned up inside of the IT (a sketch follows at the end of this thread).
>>>
>>> * kubernetes/mesos scripts - I have provided examples of both under the "jdbc/src/test/resources" directory, but I'd like us to decide as a project which container orchestration service we want to use - I'll send mail about that shortly.
>>>
>>> thanks!
>>> Stephen
>>>
>>> [1] Integration Testing Sources
>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>
>>> [2] Container Orchestration software for hosting data stores
>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>
>>> [3] Some Thoughts on IO Integration Tests
>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>
>>> [4] JDBC IO IT using postgres
>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc - hasn't been reviewed yet, so may contain code errors, but it does run & pass :)
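[For illustration, the "test data is created/cleaned up inside of the IT" piece from the key-pieces list above might look roughly like this with plain JDBC; the table and column names are hypothetical:]

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    /** Loads a small, deterministic data set for the read tests. */
    class TestDataLoader {
      static void loadRows(Connection connection, String table, int rowCount)
          throws SQLException {
        String sql = "INSERT INTO " + table + " (id, name) VALUES (?, ?)";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
          // Batch the inserts so loading stays fast even for larger data sets.
          for (int i = 0; i < rowCount; i++) {
            stmt.setInt(1, i);
            stmt.setString(2, "name_" + i);
            stmt.addBatch();
          }
          stmt.executeBatch();
        }
      }
    }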
