Hi Stephen,

Yup, it sounds good. My proposal is simply to document the best practices for IO.
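For instance, the service interface pattern Etienne describes below could be
documented with a rough sketch along these lines (method names are hypothetical,
just to illustrate the idea):

    /** Abstraction over the backing data store used by tests. */
    public interface IOService {
      /** Start (or connect to) the backing data store. */
      void start() throws Exception;
      /** Connection string the IO transform under test should be given. */
      String getConnectionString();
      /** Tear down or disconnect. */
      void stop() throws Exception;
    }

    /** For unit tests: an embedded, in-process backend. */
    class EmbeddedIOService implements IOService {
      @Override public void start() { /* spin up the embedded instance */ }
      @Override public String getConnectionString() { return "localhost:0"; }
      @Override public void stop() { /* shut the embedded instance down */ }
    }

    /** For integration tests: a real instance provisioned outside the test. */
    class RealIOService implements IOService {
      private final String connectionString;
      RealIOService(String connectionString) { this.connectionString = connectionString; }
      @Override public void start() { /* instance already exists, nothing to do */ }
      @Override public String getConnectionString() { return connectionString; }
      @Override public void stop() { /* leave the shared instance running */ }
    }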
Thanks!

Regards
JB

On Jan 26, 2017, at 02:25, Stephen Sisk <[email protected]> wrote:
> hi JB!
>
> "IO Writing Guide" sounds like BEAM-1025 (User guide - "How to create Beam
> IO Transforms") that I've been working on. Let me pull together the stuff
> I've been working on into a draft that folks can take a look at. I had an
> earlier draft that was more focused on sources/sinks, but since we're moving
> away from those, I started a re-write. I'll aim for the end of the week to
> share a draft.
>
> There's also a section about fakes in the testing doc:
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.cykbne9o4iv
>
> Sorry the testing doc and the "how to create" user guide have sat in draft
> form for a while; I've wanted to finish up the integration testing
> environment for IOs first.
>
> S
>
> On Wed, Jan 25, 2017 at 8:52 AM Jean-Baptiste Onofré <[email protected]> wrote:
>
> Hi
>
> It's what I mentioned in a previous email, yup. It should refer to an
> "IO Writing Guide" describing the purpose of the service interface,
> fakes/mocks, ...
>
> I will tackle that in a PR.
>
> Regards
> JB
>
> On Jan 25, 2017, at 09:54, Etienne Chauchot <[email protected]> wrote:
>> Hey Stephen,
>>
>> That seems perfect!
>>
>> Another thing, more about software design: maybe you could add to the
>> guide what has been discussed on the ML about standardizing the use of:
>>
>> - an IOService interface in UT and IT,
>>
>> - implementations EmbeddedIOService and MockIOService for UT,
>>
>> - an implementation RealIOService for IT (name proposal)
>>
>> if we all agree on these points. Maybe it requires some more discussion
>> (methods in the interface, whether the almost-passthrough implementations
>> - EmbeddedIOService, RealIOService - are needed, ...)
>>
>> Etienne
>>
>>
>> On 24/01/2017 at 06:47, Stephen Sisk wrote:
>>> hey,
>>>
>>> thanks - these are good questions/thoughts.
>>>
>>>> I am more reserved on that one regarding flakiness. IMHO, it is
>>>> better to clean in all cases.
>>>
>>> I strongly agree that we should attempt to clean in each case, and the
>>> system should support that. I should have stated that more firmly. As
>>> I think about it more, you're also right that we should just not try
>>> to do the data loading inside of the test. I amended the guidelines
>>> based on your comments and put them in the draft "Testing IO transforms
>>> in Apache Beam" doc that I've been working on [1].
>>>
>>> Here's that snippet:
>>> """
>>> For both types of tests (integration and performance), you'll need to
>>> have scripts that set up your test data - they will be run
>>> independently of the tests themselves.
>>>
>>> The Integration and Perf Tests themselves:
>>>
>>> 1. Can assume the data load script has been run before the test
>>>
>>> 2. Must work if they are run multiple times without the data load
>>> script being run in between (i.e., they should clean up after
>>> themselves or use namespacing so that tests don't interfere with one
>>> another)
>>>
>>> 3. Read tests must not load data or clean data
>>>
>>> 4. Write tests must use a different storage location than read tests
>>> (using namespaces/table names/etc., for example) and, if possible,
>>> clean it after each test.
>>> """
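>>>
>>> As a rough illustration of guidelines 2 and 4, a write IT could look
>>> something like this (sketch only - the class name, table name, and
>>> JDBC URL are hypothetical placeholders):
>>>
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>     import java.sql.Statement;
>>>     import java.util.UUID;
>>>     import org.junit.After;
>>>     import org.junit.Test;
>>>
>>>     public class FooIOWriteIT {
>>>       // Unique per run, so repeated/concurrent runs never collide (guideline 2).
>>>       private final String tableName =
>>>           "write_it_" + UUID.randomUUID().toString().replace("-", "");
>>>
>>>       @Test
>>>       public void testWrite() throws Exception {
>>>         // ... run the write pipeline against tableName, then verify row counts ...
>>>       }
>>>
>>>       @After
>>>       public void tearDown() throws Exception {
>>>         // Drop the write-only table so residue never accumulates (guideline 4).
>>>         try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db");
>>>              Statement stmt = conn.createStatement()) {
>>>           stmt.executeUpdate("DROP TABLE IF EXISTS " + tableName);
>>>         }
>>>       }
>>>     }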
>>> Any other comments?
>>>
>>> Stephen
>>>
>>> [1]
>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>>
>>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <[email protected]> wrote:
>>>
>>> Hi Stephen,
>>>
>>> My comments are inline.
>>>
>>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>>> I definitely agree that sharing resources between tests is more
>>>> efficient.
>>>>
>>>> Etienne - do you think it's necessary to separate the IT from the data
>>>> loading script?
>>> Actually, I see the separation between the IT and the loading script
>>> more as an improvement (time- and resource-effective) than as a
>>> necessity. Indeed, for now, loading in the ES IT, for example, is done
>>> within the IT (see
>>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT).
>>>
>>>> The postgres/JdbcIOIT can use the natural namespacing of tables and I
>>>> feel pretty comfortable that will work well over time.
>>> You mean using the same table name in different namespaces? But IMHO,
>>> it is still the "using another place" that I mentioned: read IT and
>>> write IT could use the same table name in different namespaces.
>>>> You haven't explicitly mentioned it, but I'm assuming that
>>>> elasticsearch doesn't allow such namespacing, so that's why you're
>>>> having to do the separation?
>>> Actually, in ES there is no namespace notion, but there is the index
>>> name. The index is the document-storing entity that gets split, and the
>>> document type is more like a class definition for the document. So
>>> basically, we could have the read IT using readIndex.docType and the
>>> write IT using writeIndex.docType.
>>>> I'm not trying to discourage separating data load from IT, just
>>>> wondering whether it's truly necessary.
>>> IMHO, it is more of an optimization, like I mentioned.
>>>> I was trying to consolidate what we've discussed down to a few
>>>> guidelines. I think those are that IO ITs:
>>>> 1. Can assume the data load script has been run before the test
>>>> (unless the data load script is run by the test itself)
>>> I agree.
>>>> 2. Must work if they are run multiple times without the data load
>>>> script being run in between (ie, they should clean up after themselves
>>>> or use namespacing such that tests don't interfere with one another)
>>> Yes, sure.
>>>> 3. Tests that generate large amounts of data will attempt to clean up
>>>> after themselves. (ie, if you just write 100 rows, don't worry about
>>>> it - if you write 5 gb of data, you'd need to clean up.) We will not
>>>> assume this will always succeed in cleaning up, but my assumption is
>>>> that if a particular data store gets into a bad state, we'll just
>>>> destroy/recreate that particular data store.
>>> I am more reserved on that one regarding flakiness. IMHO, it is better
>>> to clean in all cases. I mentioned in a thread that sharding in the
>>> datastore might change depending on data volume (it is not the case for
>>> ES because the sharding is defined by configuration), or a
>>> shard/partition in the datastore can become so big that it will be
>>> split further by the IO. Imagine that a test that writes 100 rows does
>>> not do cleanup and is run 1,000 times: the storage entity becomes
>>> bigger and bigger, and it might then be split into more bundles than
>>> asserted in split tests (either by decision of the datastore or because
>>> desiredBundleSize is small).
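>>> For example, this is the kind of split assertion I have in mind (a
>>> sketch only, assuming a BoundedSource-style split API; the class and
>>> helper names are hypothetical):
>>>
>>>     import static org.junit.Assert.assertEquals;
>>>     import java.util.List;
>>>     import org.apache.beam.sdk.io.BoundedSource;
>>>     import org.apache.beam.sdk.options.PipelineOptions;
>>>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>     import org.junit.Test;
>>>
>>>     public abstract class FooIOSplitIT {
>>>       /** Builds the IO's source over the pre-loaded test dataset (hypothetical). */
>>>       protected abstract BoundedSource<String> createSourceOverTestDataset();
>>>
>>>       @Test
>>>       public void testSplit() throws Exception {
>>>         PipelineOptions options = PipelineOptionsFactory.create();
>>>         List<? extends BoundedSource<String>> bundles =
>>>             createSourceOverTestDataset().split(64 * 1024 * 1024, options);
>>>         // Holds only while the dataset stays at its loaded size; residue
>>>         // left by earlier runs eventually pushes the bundle count higher.
>>>         assertEquals(10, bundles.size());
>>>       }
>>>     }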
>>>> If the tests follow those assumptions, then that should support all
>>>> the scenarios I can think of: running data store create + data load
>>>> script occasionally (say, once a week or month) all the way up to
>>>> running them once per test run (if we decided to go that far.)
>>> Yes, but do we choose to enforce a standard way of coding integration
>>> tests, such as:
>>> - loading data is done by an exterior loading script
>>> - read tests: do not load data, do not clean data
>>> - write tests: use another storage place than read tests (using a
>>> namespace, for example) and clean it after each test?
>>>
>>> Etienne
>>>> S
>>>>
>>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <[email protected]>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Yes, thanks all for these clarifications about the testing
>>>> architecture.
>>>>
>>>> I agree that points 1 and 2 should be shared between tests as much as
>>>> possible. Especially, sharing data loading between tests is more
>>>> time-effective and resource-effective: tests that need data (testRead,
>>>> testSplit, ...) will save the loading time, the wait for asynchronous
>>>> indexation, and the cleaning time. Just a small comment:
>>>>
>>>> If we share the data loading between tests, then tests that expect an
>>>> empty dataset (testWrite, ...) obviously cannot clear the shared
>>>> dataset. So they will need to write to a dedicated place (other than
>>>> read tests) and clean it afterwards.
>>>>
>>>> I will update the Elasticsearch read IT
>>>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>>> to not do data loading/cleaning, and the write IT to use another
>>>> location than the read IT.
>>>>
>>>> Etienne
>>>>
>>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>>> Hi guys,
>>>>>
>>>>> First, great e-mail Stephen: a complete and detailed proposal.
>>>>>
>>>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>>>> same "bootstrap" script.
>>>>>
>>>>> We discussed providing the following in each IO:
>>>>> 1. code to load data (java, script, whatever)
>>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>>>> 3. actual integration tests
>>>>>
>>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>>> integration tests for Python or integration tests for the Java SDK.
>>>>>
>>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>>> assertions based on the loaded data, for instance).
>>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>>> or by Jenkins using a "description" of where the code and script are
>>>>> located.
>>>>>
>>>>> So, I think that we can put 1 and 2 in the IO and use a "descriptor"
>>>>> to do the bootstrapping.
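>>>>>
>>>>> To make 1 concrete, the loader could be a tiny self-contained program,
>>>>> runnable by hand or by Jenkins (a sketch only - the table, columns,
>>>>> and row count are hypothetical):
>>>>>
>>>>>     import java.sql.Connection;
>>>>>     import java.sql.DriverManager;
>>>>>     import java.sql.PreparedStatement;
>>>>>
>>>>>     /** Loads test data into the backend bootstrapped by 2. */
>>>>>     public class LoadTestData {
>>>>>       public static void main(String[] args) throws Exception {
>>>>>         // args: JDBC url, user, password - supplied by the "descriptor".
>>>>>         try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2]);
>>>>>              PreparedStatement ps = conn.prepareStatement(
>>>>>                  "INSERT INTO test_table (id, name) VALUES (?, ?)")) {
>>>>>           for (int i = 0; i < 1000; i++) {
>>>>>             ps.setInt(1, i);
>>>>>             ps.setString(2, "name_" + i);
>>>>>             ps.addBatch();
>>>>>           }
>>>>>           ps.executeBatch();
>>>>>         }
>>>>>       }
>>>>>     }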
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>>> Since docker containers can run a script on startup, can we embed
>>>>>> the initial data set into that script/container build so that the
>>>>>> same docker container and initial data set can be used across
>>>>>> multiple ITs? For example, if Python and Java both have JdbcIO, it
>>>>>> would be nice if they could leverage the same docker container with
>>>>>> the same data set to ensure the same pipeline produces the same
>>>>>> results.
>>>>>>
>>>>>> This would be different from embedding the data in the specific IT
>>>>>> implementation and would also create a coupling between ITs from
>>>>>> potentially multiple languages.
>>>>>>
>>>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> As I've discussed previously on this list [1], ensuring that we
>>>>>>> have high quality IO transforms is important to Beam. We want to do
>>>>>>> this without adding too much burden on developers wanting to
>>>>>>> contribute. Below I have a concrete proposal for what an IO
>>>>>>> integration test would look like and an example integration test
>>>>>>> [4] that meets those requirements.
>>>>>>>
>>>>>>> Proposal: we should require that an IO transform includes a passing
>>>>>>> integration test showing the IO can connect to a real instance of
>>>>>>> the data store. We still want/expect comprehensive unit tests on an
>>>>>>> IO transform, but we would allow check-ins with just some unit
>>>>>>> tests in the presence of an IT.
>>>>>>>
>>>>>>> To support that, we'll require the following pieces associated with
>>>>>>> an IT:
>>>>>>>
>>>>>>> 1. A Dockerfile that can be used to create a running instance of
>>>>>>> the data store. We've previously discussed on this list that we
>>>>>>> would use docker images running inside kubernetes or mesos [2], and
>>>>>>> I'd prefer having a kubernetes/mesos script to start a given data
>>>>>>> store, but for a single-instance data store, we can take a
>>>>>>> dockerfile and use it to create a simple kubernetes/mesos app. If
>>>>>>> you have questions about how maintaining the containers long term
>>>>>>> would work, check [2], where I discussed a detailed plan.
>>>>>>>
>>>>>>> 2. Code to load test data on the data store created by #1. It needs
>>>>>>> to be self-contained. For now, the easiest way to do this would be
>>>>>>> to have the code inside of the IT.
>>>>>>>
>>>>>>> 3. The IT itself. I propose keeping this inside the same module as
>>>>>>> the IO transform, since having all the IO transform ITs in one
>>>>>>> module would mean there may be conflicts between different data
>>>>>>> stores' dependencies. Integration tests will need connection
>>>>>>> information pointing to the data store being tested. As discussed
>>>>>>> previously on this list [3], they should receive that connection
>>>>>>> information via TestPipelineOptions.
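>>>>>>>
>>>>>>> Roughly, per-datastore options extending TestPipelineOptions could
>>>>>>> look like this (a sketch in the spirit of PostgresTestOptions; the
>>>>>>> "Foo" names and defaults are hypothetical):
>>>>>>>
>>>>>>>     import org.apache.beam.sdk.options.Default;
>>>>>>>     import org.apache.beam.sdk.options.Description;
>>>>>>>     import org.apache.beam.sdk.testing.TestPipelineOptions;
>>>>>>>
>>>>>>>     public interface FooTestOptions extends TestPipelineOptions {
>>>>>>>       @Description("Host of the data store instance under test")
>>>>>>>       @Default.String("localhost")
>>>>>>>       String getFooHost();
>>>>>>>       void setFooHost(String host);
>>>>>>>
>>>>>>>       @Description("Port of the data store instance under test")
>>>>>>>       @Default.Integer(5432)
>>>>>>>       Integer getFooPort();
>>>>>>>       void setFooPort(Integer port);
>>>>>>>     }
>>>>>>>
>>>>>>> and the IT would read them with something like
>>>>>>> PipelineOptionsFactory.register(FooTestOptions.class) followed by
>>>>>>> TestPipeline.testingPipelineOptions().as(FooTestOptions.class).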
>>>>>>>
>>>>>>> I'd like to get something up and running soon so people checking in
>>>>>>> new IO transforms can start taking advantage of an IT framework.
>>>>>>> Thus, there are a couple of simplifying assumptions in this plan.
>>>>>>> Pieces of the plan that I anticipate will evolve:
>>>>>>>
>>>>>>> 1. The test data load script - we would like to write these in a
>>>>>>> uniform way and especially ensure that the test data is cleaned up
>>>>>>> after the tests run.
>>>>>>>
>>>>>>> 2. Spinning up/down instances - for now, we'd likely need to do
>>>>>>> this manually. It'd be good to get an automated process for this.
>>>>>>> That's especially critical for performance tests with multiple
>>>>>>> nodes - there's no need to keep instances running for those.
>>>>>>>
>>>>>>> Integrating closer with PKB would be a good way to do both of these
>>>>>>> things, but first let's focus on getting some basic ITs running.
>>>>>>>
>>>>>>> As a concrete example of this proposal, I've written the JDBC IO IT
>>>>>>> [4]. JdbcIOTest already did a lot of test setup, so I heavily
>>>>>>> re-used it. The key pieces:
>>>>>>>
>>>>>>> * The integration test is in JdbcIOIT.
>>>>>>>
>>>>>>> * JdbcIOIT reads the TestPipelineOptions defined in
>>>>>>> PostgresTestOptions. We may move the TestOptions files into a
>>>>>>> common place so they can be shared between tests.
>>>>>>>
>>>>>>> * Test data is created/cleaned up inside of the IT.
>>>>>>>
>>>>>>> * kubernetes/mesos scripts - I have provided examples of both under
>>>>>>> the "jdbc/src/test/resources" directory, but I'd like us to decide
>>>>>>> as a project which container orchestration service we want to use -
>>>>>>> I'll send mail about that shortly.
>>>>>>>
>>>>>>> thanks!
>>>>>>> Stephen
>>>>>>>
>>>>>>> [1] Integration Testing Sources
>>>>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [2] Container Orchestration software for hosting data stores
>>>>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [3] Some Thoughts on IO Integration Tests
>>>>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [4] JDBC IO IT using postgres
>>>>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc -
>>>>>>> hasn't been reviewed yet, so it may contain code errors, but it
>>>>>>> does run & pass :)
