Hi Stephen,

Yup, it sounds good. My proposal is simply to document the best practices for IO.
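For instance, the service interface pattern Etienne describes below could be
documented with a rough sketch along these lines (method names are hypothetical,
just to illustrate the idea):

    /** Abstraction over the backing data store used by tests. */
    public interface IOService {
      /** Start (or connect to) the backing data store. */
      void start() throws Exception;
      /** Connection string the IO transform under test should be given. */
      String getConnectionString();
      /** Tear down or disconnect. */
      void stop() throws Exception;
    }

    /** For unit tests: an embedded, in-process backend. */
    class EmbeddedIOService implements IOService {
      @Override public void start() { /* spin up the embedded instance */ }
      @Override public String getConnectionString() { return "localhost:0"; }
      @Override public void stop() { /* shut the embedded instance down */ }
    }

    /** For integration tests: a real instance provisioned outside the test. */
    class RealIOService implements IOService {
      private final String connectionString;
      RealIOService(String connectionString) { this.connectionString = connectionString; }
      @Override public void start() { /* instance already exists, nothing to do */ }
      @Override public String getConnectionString() { return connectionString; }
      @Override public void stop() { /* leave the shared instance running */ }
    }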
Thanks!

Regards
JB

On Jan 26, 2017, at 02:25, Stephen Sisk <[email protected]> wrote:
> hi JB!
>
> "IO Writing Guide" sounds like BEAM-1025 (User guide - "How to create Beam
> IO Transforms") that I've been working on. Let me pull together the stuff
> I've been working on into a draft that folks can take a look at. I had an
> earlier draft that was more focused on sources/sinks, but since we're moving
> away from those, I started a re-write. I'll aim for the end of the week to
> share a draft.
>
> There's also a section about fakes in the testing doc:
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.cykbne9o4iv
>
> Sorry the testing doc and the "how to create" user guide have sat in draft
> form for a while; I've wanted to finish up the integration testing
> environment for IOs first.
>
> S
>
> On Wed, Jan 25, 2017 at 8:52 AM Jean-Baptiste Onofré <[email protected]> wrote:
>
> Hi
>
> It's what I mentioned in a previous email, yup. It should refer to an
> "IO Writing Guide" describing the purpose of the service interface,
> fakes/mocks, ...
>
> I will tackle that in a PR.
>
> Regards
> JB
>
> On Jan 25, 2017, at 09:54, Etienne Chauchot <[email protected]> wrote:
>> Hey Stephen,
>>
>> That seems perfect!
>>
>> Another thing, more about software design: maybe you could add to the
>> guide what has been discussed on the ML about standardizing the use of:
>>
>> - an IOService interface in UT and IT,
>>
>> - implementations EmbeddedIOService and MockIOService for UT,
>>
>> - an implementation RealIOService for IT (name proposal)
>>
>> if we all agree on these points. Maybe it requires some more discussion
>> (methods in the interface, whether the almost-passthrough implementations
>> - EmbeddedIOService, RealIOService - are needed, ...)
>>
>> Etienne
>>
>>
>> On 24/01/2017 at 06:47, Stephen Sisk wrote:
>>> hey,
>>>
>>> thanks - these are good questions/thoughts.
>>>
>>>> I am more reserved on that one regarding flakiness. IMHO, it is
>>>> better to clean in all cases.
>>>
>>> I strongly agree that we should attempt to clean in each case, and the
>>> system should support that. I should have stated that more firmly. As
>>> I think about it more, you're also right that we should just not try
>>> to do the data loading inside of the test. I amended the guidelines
>>> based on your comments and put them in the draft "Testing IO transforms
>>> in Apache Beam" doc that I've been working on [1].
>>>
>>> Here's that snippet:
>>> """
>>> For both types of tests (integration and performance), you'll need to
>>> have scripts that set up your test data - they will be run
>>> independently of the tests themselves.
>>>
>>> The Integration and Perf Tests themselves:
>>>
>>> 1. Can assume the data load script has been run before the test
>>>
>>> 2. Must work if they are run multiple times without the data load
>>> script being run in between (i.e., they should clean up after
>>> themselves or use namespacing so that tests don't interfere with one
>>> another)
>>>
>>> 3. Read tests must not load data or clean data
>>>
>>> 4. Write tests must use a different storage location than read tests
>>> (using namespaces/table names/etc., for example) and, if possible,
>>> clean it after each test.
>>> """
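>>>
>>> As a rough illustration of guidelines 2 and 4, a write IT could look
>>> something like this (sketch only - the class name, table name, and
>>> JDBC URL are hypothetical placeholders):
>>>
>>>     import java.sql.Connection;
>>>     import java.sql.DriverManager;
>>>     import java.sql.Statement;
>>>     import java.util.UUID;
>>>     import org.junit.After;
>>>     import org.junit.Test;
>>>
>>>     public class FooIOWriteIT {
>>>       // Unique per run, so repeated/concurrent runs never collide (guideline 2).
>>>       private final String tableName =
>>>           "write_it_" + UUID.randomUUID().toString().replace("-", "");
>>>
>>>       @Test
>>>       public void testWrite() throws Exception {
>>>         // ... run the write pipeline against tableName, then verify row counts ...
>>>       }
>>>
>>>       @After
>>>       public void tearDown() throws Exception {
>>>         // Drop the write-only table so residue never accumulates (guideline 4).
>>>         try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db");
>>>              Statement stmt = conn.createStatement()) {
>>>           stmt.executeUpdate("DROP TABLE IF EXISTS " + tableName);
>>>         }
>>>       }
>>>     }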
>>> Any other comments?
>>>
>>> Stephen
>>>
>>> [1]
>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
>>>
>>> On Mon, Jan 23, 2017 at 5:19 AM Etienne Chauchot <[email protected]> wrote:
>>>
>>> Hi Stephen,
>>>
>>> My comments are inline.
>>>
>>> On 19/01/2017 at 20:32, Stephen Sisk wrote:
>>>> I definitely agree that sharing resources between tests is more
>>>> efficient.
>>>>
>>>> Etienne - do you think it's necessary to separate the IT from the data
>>>> loading script?
>>> Actually, I see the separation between the IT and the loading script
>>> more as an improvement (time- and resource-effective) than as a
>>> necessity. Indeed, for now, loading in the ES IT, for example, is done
>>> within the IT (see
>>> https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT).
>>>
>>>> The postgres/JdbcIOIT can use the natural namespacing of tables and I
>>>> feel pretty comfortable that will work well over time.
>>> You mean using the same table name in different namespaces? But IMHO,
>>> it is still the "using another place" that I mentioned: read IT and
>>> write IT could use the same table name in different namespaces.
>>>> You haven't explicitly mentioned it, but I'm assuming that
>>>> elasticsearch doesn't allow such namespacing, so that's why you're
>>>> having to do the separation?
>>> Actually, in ES there is no namespace notion, but there is the index
>>> name. The index is the document-storing entity that gets split, and the
>>> document type is more like a class definition for the document. So
>>> basically, we could have the read IT using readIndex.docType and the
>>> write IT using writeIndex.docType.
>>>> I'm not trying to discourage separating data load from IT, just
>>>> wondering whether it's truly necessary.
>>> IMHO, it is more of an optimization, like I mentioned.
>>>> I was trying to consolidate what we've discussed down to a few
>>>> guidelines. I think those are that IO ITs:
>>>> 1. Can assume the data load script has been run before the test
>>>> (unless the data load script is run by the test itself)
>>> I agree.
>>>> 2. Must work if they are run multiple times without the data load
>>>> script being run in between (ie, they should clean up after themselves
>>>> or use namespacing such that tests don't interfere with one another)
>>> Yes, sure.
>>>> 3. Tests that generate large amounts of data will attempt to clean up
>>>> after themselves. (ie, if you just write 100 rows, don't worry about
>>>> it - if you write 5 gb of data, you'd need to clean up.) We will not
>>>> assume this will always succeed in cleaning up, but my assumption is
>>>> that if a particular data store gets into a bad state, we'll just
>>>> destroy/recreate that particular data store.
>>> I am more reserved on that one regarding flakiness. IMHO, it is better
>>> to clean in all cases. I mentioned in a thread that sharding in the
>>> datastore might change depending on data volume (it is not the case for
>>> ES because the sharding is defined by configuration), or a
>>> shard/partition in the datastore can become so big that it will be
>>> split further by the IO. Imagine that a test that writes 100 rows does
>>> not do cleanup and is run 1,000 times: the storage entity becomes
>>> bigger and bigger, and it might then be split into more bundles than
>>> asserted in split tests (either by decision of the datastore or because
>>> desiredBundleSize is small).
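>>> For example, this is the kind of split assertion I have in mind (a
>>> sketch only, assuming a BoundedSource-style split API; the class and
>>> helper names are hypothetical):
>>>
>>>     import static org.junit.Assert.assertEquals;
>>>     import java.util.List;
>>>     import org.apache.beam.sdk.io.BoundedSource;
>>>     import org.apache.beam.sdk.options.PipelineOptions;
>>>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
>>>     import org.junit.Test;
>>>
>>>     public abstract class FooIOSplitIT {
>>>       /** Builds the IO's source over the pre-loaded test dataset (hypothetical). */
>>>       protected abstract BoundedSource<String> createSourceOverTestDataset();
>>>
>>>       @Test
>>>       public void testSplit() throws Exception {
>>>         PipelineOptions options = PipelineOptionsFactory.create();
>>>         List<? extends BoundedSource<String>> bundles =
>>>             createSourceOverTestDataset().split(64 * 1024 * 1024, options);
>>>         // Holds only while the dataset stays at its loaded size; residue
>>>         // left by earlier runs eventually pushes the bundle count higher.
>>>         assertEquals(10, bundles.size());
>>>       }
>>>     }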
>>>> If the tests follow those assumptions, then that should support all
>>>> the scenarios I can think of: running data store create + data load
>>>> script occasionally (say, once a week or month) all the way up to
>>>> running them once per test run (if we decided to go that far.)
>>> Yes, but do we choose to enforce a standard way of coding integration
>>> tests, such as:
>>> - loading data is done by an exterior loading script
>>> - read tests: do not load data, do not clean data
>>> - write tests: use another storage place than read tests (using a
>>> namespace, for example) and clean it after each test?
>>>
>>> Etienne
>>>> S
>>>>
>>>> On Wed, Jan 18, 2017 at 7:57 AM Etienne Chauchot <[email protected]>
>>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Yes, thanks all for these clarifications about the testing
>>>> architecture.
>>>>
>>>> I agree that points 1 and 2 should be shared between tests as much as
>>>> possible. Especially, sharing data loading between tests is more
>>>> time-effective and resource-effective: tests that need data (testRead,
>>>> testSplit, ...) will save the loading time, the wait for asynchronous
>>>> indexation, and the cleaning time. Just a small comment:
>>>>
>>>> If we share the data loading between tests, then tests that expect an
>>>> empty dataset (testWrite, ...) obviously cannot clear the shared
>>>> dataset. So they will need to write to a dedicated place (other than
>>>> read tests) and clean it afterwards.
>>>>
>>>> I will update the Elasticsearch read IT
>>>> (https://github.com/echauchot/beam/tree/BEAM-1184-ELASTICSEARCH-IO-IT)
>>>> to not do data loading/cleaning, and the write IT to use another
>>>> location than the read IT.
>>>>
>>>> Etienne
>>>>
>>>> On 18/01/2017 at 13:47, Jean-Baptiste Onofré wrote:
>>>>> Hi guys,
>>>>>
>>>>> First, great e-mail Stephen: a complete and detailed proposal.
>>>>>
>>>>> Lukasz raised a good point: it makes sense to be able to leverage the
>>>>> same "bootstrap" script.
>>>>>
>>>>> We discussed providing the following in each IO:
>>>>> 1. code to load data (java, script, whatever)
>>>>> 2. script to bootstrap the backend (dockerfile, kubernetes script, ...)
>>>>> 3. actual integration tests
>>>>>
>>>>> Only 3 is specific to the IO: 1 and 2 can be the same whether we run
>>>>> integration tests for Python or integration tests for the Java SDK.
>>>>>
>>>>> However, 3 may depend on 1 and 2 (the integration tests perform some
>>>>> assertions based on the loaded data, for instance).
>>>>> Today, correct me if I'm wrong, but 1 and 2 will be executed by hand
>>>>> or by Jenkins using a "description" of where the code and script are
>>>>> located.
>>>>>
>>>>> So, I think that we can put 1 and 2 in the IO and use a "descriptor"
>>>>> to do the bootstrapping.
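>>>>>
>>>>> To make 1 concrete, the loader could be a tiny self-contained program,
>>>>> runnable by hand or by Jenkins (a sketch only - the table, columns,
>>>>> and row count are hypothetical):
>>>>>
>>>>>     import java.sql.Connection;
>>>>>     import java.sql.DriverManager;
>>>>>     import java.sql.PreparedStatement;
>>>>>
>>>>>     /** Loads test data into the backend bootstrapped by 2. */
>>>>>     public class LoadTestData {
>>>>>       public static void main(String[] args) throws Exception {
>>>>>         // args: JDBC url, user, password - supplied by the "descriptor".
>>>>>         try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2]);
>>>>>              PreparedStatement ps = conn.prepareStatement(
>>>>>                  "INSERT INTO test_table (id, name) VALUES (?, ?)")) {
>>>>>           for (int i = 0; i < 1000; i++) {
>>>>>             ps.setInt(1, i);
>>>>>             ps.setString(2, "name_" + i);
>>>>>             ps.addBatch();
>>>>>           }
>>>>>           ps.executeBatch();
>>>>>         }
>>>>>       }
>>>>>     }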
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 01/17/2017 04:37 PM, Lukasz Cwik wrote:
>>>>>> Since docker containers can run a script on startup, can we embed
>>>>>> the initial data set into that script/container build so that the
>>>>>> same docker container and initial data set can be used across
>>>>>> multiple ITs? For example, if Python and Java both have JdbcIO, it
>>>>>> would be nice if they could leverage the same docker container with
>>>>>> the same data set to ensure the same pipeline produces the same
>>>>>> results.
>>>>>>
>>>>>> This would be different from embedding the data in the specific IT
>>>>>> implementation and would also create a coupling between ITs from
>>>>>> potentially multiple languages.
>>>>>>
>>>>>> On Tue, Jan 17, 2017 at 4:27 PM, Stephen Sisk <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all!
>>>>>>>
>>>>>>> As I've discussed previously on this list [1], ensuring that we
>>>>>>> have high quality IO transforms is important to Beam. We want to do
>>>>>>> this without adding too much burden on developers wanting to
>>>>>>> contribute. Below I have a concrete proposal for what an IO
>>>>>>> integration test would look like and an example integration test
>>>>>>> [4] that meets those requirements.
>>>>>>>
>>>>>>> Proposal: we should require that an IO transform includes a passing
>>>>>>> integration test showing the IO can connect to a real instance of
>>>>>>> the data store. We still want/expect comprehensive unit tests on an
>>>>>>> IO transform, but we would allow check-ins with just some unit
>>>>>>> tests in the presence of an IT.
>>>>>>>
>>>>>>> To support that, we'll require the following pieces associated with
>>>>>>> an IT:
>>>>>>>
>>>>>>> 1. A Dockerfile that can be used to create a running instance of
>>>>>>> the data store. We've previously discussed on this list that we
>>>>>>> would use docker images running inside kubernetes or mesos [2], and
>>>>>>> I'd prefer having a kubernetes/mesos script to start a given data
>>>>>>> store, but for a single-instance data store, we can take a
>>>>>>> dockerfile and use it to create a simple kubernetes/mesos app. If
>>>>>>> you have questions about how maintaining the containers long term
>>>>>>> would work, check [2], where I discussed a detailed plan.
>>>>>>>
>>>>>>> 2. Code to load test data on the data store created by #1. It needs
>>>>>>> to be self-contained. For now, the easiest way to do this would be
>>>>>>> to have the code inside of the IT.
>>>>>>>
>>>>>>> 3. The IT itself. I propose keeping this inside the same module as
>>>>>>> the IO transform, since having all the IO transform ITs in one
>>>>>>> module would mean there may be conflicts between different data
>>>>>>> stores' dependencies. Integration tests will need connection
>>>>>>> information pointing to the data store being tested. As discussed
>>>>>>> previously on this list [3], they should receive that connection
>>>>>>> information via TestPipelineOptions.
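>>>>>>>
>>>>>>> Roughly, per-datastore options extending TestPipelineOptions could
>>>>>>> look like this (a sketch in the spirit of PostgresTestOptions; the
>>>>>>> "Foo" names and defaults are hypothetical):
>>>>>>>
>>>>>>>     import org.apache.beam.sdk.options.Default;
>>>>>>>     import org.apache.beam.sdk.options.Description;
>>>>>>>     import org.apache.beam.sdk.testing.TestPipelineOptions;
>>>>>>>
>>>>>>>     public interface FooTestOptions extends TestPipelineOptions {
>>>>>>>       @Description("Host of the data store instance under test")
>>>>>>>       @Default.String("localhost")
>>>>>>>       String getFooHost();
>>>>>>>       void setFooHost(String host);
>>>>>>>
>>>>>>>       @Description("Port of the data store instance under test")
>>>>>>>       @Default.Integer(5432)
>>>>>>>       Integer getFooPort();
>>>>>>>       void setFooPort(Integer port);
>>>>>>>     }
>>>>>>>
>>>>>>> and the IT would read them with something like
>>>>>>> PipelineOptionsFactory.register(FooTestOptions.class) followed by
>>>>>>> TestPipeline.testingPipelineOptions().as(FooTestOptions.class).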
>>>>>>>
>>>>>>> I'd like to get something up and running soon so people checking in
>>>>>>> new IO transforms can start taking advantage of an IT framework.
>>>>>>> Thus, there are a couple of simplifying assumptions in this plan.
>>>>>>> Pieces of the plan that I anticipate will evolve:
>>>>>>>
>>>>>>> 1. The test data load script - we would like to write these in a
>>>>>>> uniform way and especially ensure that the test data is cleaned up
>>>>>>> after the tests run.
>>>>>>>
>>>>>>> 2. Spinning up/down instances - for now, we'd likely need to do
>>>>>>> this manually. It'd be good to get an automated process for this.
>>>>>>> That's especially critical for performance tests with multiple
>>>>>>> nodes - there's no need to keep instances running for those.
>>>>>>>
>>>>>>> Integrating closer with PKB would be a good way to do both of these
>>>>>>> things, but first let's focus on getting some basic ITs running.
>>>>>>>
>>>>>>> As a concrete example of this proposal, I've written the JDBC IO IT
>>>>>>> [4]. JdbcIOTest already did a lot of test setup, so I heavily
>>>>>>> re-used it. The key pieces:
>>>>>>>
>>>>>>> * The integration test is in JdbcIOIT.
>>>>>>>
>>>>>>> * JdbcIOIT reads the TestPipelineOptions defined in
>>>>>>> PostgresTestOptions. We may move the TestOptions files into a
>>>>>>> common place so they can be shared between tests.
>>>>>>>
>>>>>>> * Test data is created/cleaned up inside of the IT.
>>>>>>>
>>>>>>> * kubernetes/mesos scripts - I have provided examples of both under
>>>>>>> the "jdbc/src/test/resources" directory, but I'd like us to decide
>>>>>>> as a project which container orchestration service we want to use -
>>>>>>> I'll send mail about that shortly.
>>>>>>>
>>>>>>> thanks!
>>>>>>> Stephen
>>>>>>>
>>>>>>> [1] Integration Testing Sources
>>>>>>> https://lists.apache.org/thread.html/518d78478ae9b6a56d6a690033071aa6e3b817546499c4f0f18d247d@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [2] Container Orchestration software for hosting data stores
>>>>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [3] Some Thoughts on IO Integration Tests
>>>>>>> https://lists.apache.org/thread.html/637803ccae9c9efc0f4ed01499f1a0658fa73e761ab6ff4e8fa7b469@%3Cdev.beam.apache.org%3E
>>>>>>>
>>>>>>> [4] JDBC IO IT using postgres
>>>>>>> https://github.com/ssisk/beam/tree/io-testing/sdks/java/io/jdbc -
>>>>>>> hasn't been reviewed yet, so it may contain code errors, but it
>>>>>>> does run & pass :)
