Re: Hosting data stores for IO Transform testing

Stephen Sisk Fri, 20 Jan 2017 09:15:12 -0800

hey folks! I wanted to gather any last thoughts that people might have. I'd
like to get started setting this up - anyone else have input?


S

On Thu, Jan 19, 2017 at 11:41 AM Stephen Sisk <[email protected]> wrote:

> Glad to hear you support kubernetes (although to be clear, I'm rooting for
> the right solution for us in the long run - if anyone has a strong reason
> for dcos, I'm excited to hear it.)
>
> I agree with you that testing IO in failure scenarios seems like a
> fruitful area for future work, but I don't want to tackle it just yet
> (and I'm not hearing that we think it affects our current decision - if
> someone does, I'd like to hear about it.) I am going to split off a thread
> for that discussion because I think the discussion informs how we write our
> unit tests currently, and want to clarify it.
>
> On Wed, Jan 18, 2017 at 1:42 PM Ismaël Mejía <[email protected]> wrote:
>
> Hello again,
>
> Stephen, I agree with you the real question is what is the scope of the
> tests, maybe the discussion so far has been more about testing a ‘real’
> data store and finding infra/performance issues (and future regressions),
> but having a modern cluster manager opens the door to create more
> interesting integration tests like the ones I mentioned, in particular my
> idea is more oriented towards the validation of the ‘correct’ expected
> behavior of the IOs and runners. But this is quite ambitious for a first
> goal, maybe we should first get things working and leave this for later (if
> there is still interest).
>
> I am not sure that unit tests are enough to test distribution issues
> because they are harder to simulate in particular if we add the fact that
> we can have too many moving pieces. For example, imagine that we run a Beam
> pipeline deployed via Spark on a YARN cluster (where some nodes can fail)
> that reads from Kafka (with some slow partition) and writes to Cassandra
> (with a partition that goes down). You see, this is quite a complex
> combination of pieces (and possible issues), but it is not a totally
> artificial scenario, in fact this is a common architecture, and this can
> (at least in theory) be simulated with a cluster manager, but I don’t see
> how I can easily reproduce this with a unit test.
>
> Anyway, this scenario makes me think that the boundaries of what we want to
> test are really important. Complexity can be huge.
>
> About the Mesos package question, I was indeed referring to Mesos Universe
> (the repo you linked), and what you said is sadly true, it is not easy to
> find multi-node instance packages, which are the most interesting ones for
> our tests (in either k8s or mesos). I agree with your decision of using
> Kubernetes, I just wanted to mention that in some cases we will need to
> produce these multi-node packages to have interesting tests.
>
> Ismaël
>
>
> On Wed, Jan 18, 2017 at 10:09 PM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Yes, for both DCOS (Mesos+Marathon) and Kubernetes, I think we may find
> > single node config but not sure for multi-node setup. Anyway, I'm not sure
> > that, if we find a multi-node configuration, it would cover our needs.
> >
> > Regards
> > JB
> >
> > On 01/18/2017 12:52 PM, Stephen Sisk wrote:
> >
> >> ah! I looked around a bit more and found the dcos package repo -
> >> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
> >>
> >> poking around a bit, I can find a lot of packages for single node
> >> instances, but not many packages for multi-node instances. Single node
> >> instance packages are kind of useful, but I don't think they're *too*
> >> helpful.
> >> The multi-node instance packages that run the data store's high
> >> availability mode are where the real work is, and it seems like both
> >> kubernetes helm and dcos' package universe don't have a lot of those.
> >>
> >> S
> >>
> >> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <[email protected]> wrote:
> >>
> >> Hi Ismaël,
> >>>
> >>> these are good questions, thanks for raising them.
> >>>
> >>> Ability to modify network/compute resources to simulate failures
> >>> =================================================
> >>> I see two real questions here:
> >>> 1. Is this something we want to do?
> >>> 2. Is it possible with both/either?
> >>>
> >>> So far, the test strategy I've been advocating is that we test problems
> >>> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
> >>> it's hard to re-create the same conditions.
> >>>
> >>> I can investigate whether it's possible, but I want to clarify whether
> >>> this is something that we care about. I know both support killing
> >>> individual nodes. I haven't seen a lot of network control in either, but
> >>> haven't tried to look for it.
> >>>
> >>> Availability of ready to play packages
> >>> ============================
> >>> I did look at this, and as far as I could tell, mesos didn't have any
> >>> pre-built packages for multi-node clusters of data stores. If there's a
> >>> good repository of them that we trust, that would definitely save us time.
> >>> Can you point me at the mesos repository?
> >>>
> >>> S
> >>>
> >>>
> >>>
> >>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <[email protected]>
> >>> wrote:
> >>>
> >>> Hi Ismael
> >>>
> >>> Stephen will reply with details but I know he did a comparison and
> >>> evaluated different options.
> >>>
> >>> He tested with the JDBC IO itests.
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <[email protected]>
> >>> wrote:
> >>>
> >>>> Thanks for your analysis Stephen, good arguments / references.
> >>>>
> >>>> One quick question. Have you checked the APIs of both (Mesos/Kubernetes)
> >>>> to see if we can programmatically do more complex tests (I suppose so,
> >>>> but you don't mention how easy or if those are possible), for example to
> >>>> simulate a slow networking slave (to test stragglers), or to arbitrarily
> >>>> kill one slave (e.g. if I want to test the correct behavior of a
> >>>> runner/IO that is reading from it)?
> >>>>
> >>>> Another missing point in the review is the availability of ready to play
> >>>> packages. I think in this area mesos/dcos seems more advanced, no? I
> >>>> haven't looked recently but at least 6 months ago there were not many
> >>>> helm packages ready, for example to test kafka or the hadoop ecosystem
> >>>> stuff (hdfs, hbase, etc). Has this been improved? Preparing this is
> >>>> also a considerable amount of work; on the other hand, this could also
> >>>> be a chance to contribute to kubernetes.
> >>>>
> >>>> Regards,
> >>>> Ismaël
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <[email protected]>
> >>>> wrote:
> >>>>
> >>>> hi!
> >>>>>
> >>>>> I've been continuing this investigation, and have some more info to
> >>>>> report, and hopefully we can start making some decisions.
> >>>>>
> >>>>> To support performance testing, I've been investigating mesos+marathon
> >>>>> and kubernetes for running data stores in their high availability mode.
> >>>>> I have been examining features that kubernetes/mesos+marathon use to
> >>>>> support this.
> >>>>>
> >>>>> Setting up a multi-node cluster in a high availability mode tends to be
> >>>>> more expensive time-wise than the single node instances I've played
> >>>>> around with in the past. Rather than do a full build out with both
> >>>>> kubernetes and mesos, I'd like to pick one of the two options to build
> >>>>> the prototype cluster with. If the prototype doesn't go well, we could
> >>>>> still go back to the other option, but I'd like to change us from a mode
> >>>>> of "let's look at all the options" to one of "here's the favorite, let's
> >>>>> prove that works for us".
> >>>>>
> >>>>> Below are the features that I've seen are important to multi-node
> >>>>> instances of data stores. I'm sure other folks on the list have done
> >>>>> this before, so feel free to pipe up if I'm missing a good solution to a
> >>>>> problem.
> >>>>>
> >>>>> DNS/Discovery
> >>>>> --------------------
> >>>>>
> >>>>> Necessary for talking between nodes (eg, cassandra nodes all need to be
> >>>>> able to talk to a set of seed nodes.)
> >>>>>
> >>>>> * Kubernetes has built-in DNS/discovery between nodes.
> >>>>>
> >>>>> * Mesos supports this via mesos-dns, which isn't a part of core mesos,
> >>>>> but is in dcos, which is the mesos distribution I've been using and that
> >>>>> I would expect us to use.
> >>>>>
> >>>>> Instances properly distributed across nodes
> >>>>> ------------------------------------------------------------
> >>>>>
> >>>>> If multiple instances of a data source end up on the same underlying VM,
> >>>>> we may not get good performance out of those instances since the
> >>>>> underlying VM may be more taxed than other VMs.
> >>>>>
> >>>>> * Kubernetes has a beta feature, StatefulSets[1], which allows
> >>>>> containers to be distributed so that there's one container per
> >>>>> underlying machine (as well as a lot of other useful features like easy
> >>>>> stable dns names.)
> >>>>>
> >>>>> * Mesos can support this via the built in UNIQUE constraint [2]
> >>>>>
> >>>>> Load balancing
> >>>>> --------------------
> >>>>>
> >>>>> Incoming requests from users need to be distributed to the various
> >>>>> machines - this is important for many data stores' high availability
> >>>>> modes.
> >>>>>
> >>>>> * Kubernetes supports easily hooking up to an external load balancer
> >>>>> when on a cloud (and can be configured to work with a built-in load
> >>>>> balancer if not)
> >>>>>
> >>>>> * Mesos supports this via marathon-lb [3], which is an install-able
> >>>>> package in DC/OS
> >>>>>
> >>>>> Persistent Volumes tied to specific instances
> >>>>> ------------------------------------------------------------
> >>>>>
> >>>>> Databases often need persistent state (for example to store the data :),
> >>>>> so it's an important part of running our service.
> >>>>>
> >>>>> * Kubernetes StatefulSets support this
> >>>>>
> >>>>> * Mesos+marathon apps with persistent volumes support this [4] [5]
> >>>>>
> >>>>> As I mentioned above, I'd like to focus on either kubernetes or mesos
> >>>>> for my investigation, and as I go further along, I'm seeing kubernetes
> >>>>> as better suited to our needs.
> >>>>>
> >>>>> (1) It supports more of the features we want out of the box and with
> >>>>> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
> >>>>> requires marathon-lb to be installed and mesos-dns to be configured.
> >>>>>
> >>>>> (2) I'm also finding that there seem to be more examples of using
> >>>>> kubernetes to solve the types of problems we're working on. This is
> >>>>> somewhat subjective, but in my experience as I've tried to learn both
> >>>>> kubernetes and mesos, I personally found it generally easier to get
> >>>>> kubernetes running than mesos due to the tutorials/examples available
> >>>>> for kubernetes.
> >>>>>
> >>>>> (3) Lower cost of initial setup - as I discussed in a previous mail[6],
> >>>>> kubernetes was far easier to get set up even when I knew the exact
> >>>>> steps. Mesos took me around 27 steps [7], which involved a lot of config
> >>>>> that was easy to get wrong (it took me about 5 tries to get all the
> >>>>> steps correct in one go.) Kubernetes took me around 8 steps and very
> >>>>> little config.
> >>>>>
> >>>>> Given that, I'd like to focus my investigation/prototyping on
> >>>>> Kubernetes. To be clear, it's fairly close and I think both Mesos and
> >>>>> Kubernetes could support what we need, so if we run into issues with
> >>>>> kubernetes, Mesos still seems like a viable option that we could fall
> >>>>> back to.
> >>>>>
> >>>>> Thanks,
> >>>>> Stephen
> >>>>>
> >>>>>
> >>>>> [1] Kubernetes StatefulSets -
> >>>>> https://kubernetes.io/docs/concepts/abstractions/controllers/statefulsets/
> >>>>>
> >>>>> [2] mesos unique constraint -
> >>>>> https://mesosphere.github.io/marathon/docs/constraints.html
> >>>>>
> >>>>> [3]
> >>>>> https://mesosphere.github.io/marathon/docs/service-discovery-load-balancing.html
> >>>>> and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
> >>>>>
> >>>>> [4] https://mesosphere.github.io/marathon/docs/persistent-volumes.html
> >>>>>
> >>>>> [5] https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
> >>>>>
> >>>>> [6] Container Orchestration software for hosting data stores -
> >>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
> >>>>>
> >>>>> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
> >>>>>
> >>>>>
> >>>>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <[email protected]> wrote:
> >>>>>
> >>>>>> Just a quick drive-by comment: how tests are laid out has non-trivial
> >>>>>> tradeoffs on how/where continuous integration runs, and how results are
> >>>>>> integrated into the tooling. The current state is certainly not ideal
> >>>>>> (e.g., due to multiple test executions some links in Jenkins point where
> >>>>>> they shouldn't), but most other alternatives had even bigger drawbacks
> >>>>>> at the time. If someone has great ideas that don't explode the number of
> >>>>>> modules, please share ;-)
> >>>>>>
> >>>>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi Stephen,
> >>>>>>>
> >>>>>>> Thanks for taking the time to comment.
> >>>>>>>
> >>>>>>> My comments are below in the email:
> >>>>>>>
> >>>>>>>
> >>>>>>> On 24/12/2016 at 00:07, Stephen Sisk wrote:
> >>>>>>>
> >>>>>>>> hey Etienne -
> >>>>>>>>
> >>>>>>>> thanks for your thoughts and thanks for sharing your experiences. I
> >>>>>>>> generally agree with what you're saying. Quick comments below:
> >>>>>>>>
> >>>>>>>>> IT are stored alongside with UT in src/test directory of the IO but
> >>>>>>>>> they might go to dedicated module, waiting for a consensus
> >>>>>>>>
> >>>>>>>> I don't have a strong opinion or feel that I've worked enough with
> >>>>>>>> maven to understand all the consequences - I'd love for someone with
> >>>>>>>> more maven experience to weigh in. If this becomes blocking, I'd say
> >>>>>>>> check it in, and we can refactor later if it proves problematic.
> >>>>>>>
> >>>>>>> Sure, not a blocking point, it could be refactored afterwards. Just as
> >>>>>>> a reminder, JB mentioned that storing IT in a separate module allows
> >>>>>>> more coherence between all IT (same behavior) and cross IO
> >>>>>>> integration tests. JB, have you experienced some long term drawbacks
> >>>>>>> of storing IT in a separate module, like, for example, more difficult
> >>>>>>> maintenance due to "distance" with production code?
> >>>>>>>
> >>>>>>>
> >>>>>>>>> Also IMHO, it is better that tests load/clean data than doing some
> >>>>>>>>> assumptions about the running order of the tests.
> >>>>>>>>
> >>>>>>>> I definitely agree that we don't want to make assumptions about the
> >>>>>>>> running order of the tests - that way lies pain. :) It will be
> >>>>>>>> interesting to see how the performance tests work out since they will
> >>>>>>>> need more data (and thus loading data can take much longer.)
> >>>>>>>
> >>>>>>> Yes, performance testing might push in the direction of data loading
> >>>>>>> from outside the tests due to loading time.
> >>>>>>>
> >>>>>>>> This should also be an easier problem for read tests than for write
> >>>>>>>> tests - if we have long running instances, read tests don't really
> >>>>>>>> need cleanup. And if write tests only write a small amount of data,
> >>>>>>>> as long as we are sure we're writing to uniquely identifiable
> >>>>>>>> locations (ie, new table per test or something similar), we can clean
> >>>>>>>> up the write test data on a slower schedule.
> >>>>>>>
> >>>>>>> I agree
> >>>>>>>
> >>>>>>>>> this will tend to go to the direction of long running data store
> >>>>>>>>> instances rather than data store instances started (and optionally
> >>>>>>>>> loaded) before tests.
> >>>>>>>>
> >>>>>>>> It may be easiest to start with a "data stores stay running"
> >>>>>>>> implementation, and then if we see issues with that move towards
> >>>>>>>> tests that start/stop the data stores on each run. One thing I'd like
> >>>>>>>> to make sure is that we're not manually tweaking the configurations
> >>>>>>>> for data stores. One way we could do that is to destroy/recreate the
> >>>>>>>> data stores on a slower schedule - maybe once per week. That way if
> >>>>>>>> the script is changed or the data store instances are changed, we'd
> >>>>>>>> be able to detect it relatively soon while still removing the need
> >>>>>>>> for the tests to manage the data stores.
> >>>>>>>
> >>>>>>> I agree. In addition to manual configuration tweaking, there might be
> >>>>>>> cases in which a data store re-partitions data during a test or after
> >>>>>>> some tests while the dataset changes. The IO must be tolerant to that
> >>>>>>> but the asserts (number of bundles for example) in tests must not fail
> >>>>>>> in that case.
> >>>>>>> I would also prefer if possible that the tests do not manage data
> >>>>>>> stores (not set them up, not start them, not stop them)
> >>>>>>>
> >>>>>>>
> >>>>>>>> as a general note, I suspect many of the folks in the states will be
> >>>>>>>> on holiday until Jan 2nd/3rd.
> >>>>>>>>
> >>>>>>>> S
> >>>>>>>>
> >>>>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Recently we had a discussion about integration tests of IOs. I'm
> >>>>>>>>> preparing a PR for integration tests of the elasticSearch IO
> >>>>>>>>> (
> >>>>>>>>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
> >>>>>>>>> as a first shot) which are very important IMHO because they helped
> >>>>>>>>> catch some bugs that UT could not (volume, data store instance
> >>>>>>>>> sharing, real data store instance ...)
> >>>>>>>>>
> >>>>>>>>> I would like to have your thoughts/remarks about the points below.
> >>>>>>>>> Some of these points are also discussed here
> >>>>>>>>>
> >>>>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
> >>>>>>>>> :
> >>>>>>>>>
> >>>>>>>>> - UT and IT have a similar architecture, but while UT focus on
> >>>>>>>>> testing the correct behavior of the code including corner cases and
> >>>>>>>>> use an embedded in-memory data store, IT assume that the behavior is
> >>>>>>>>> correct (strong UT) and focus on higher volume testing and testing
> >>>>>>>>> against real data store instance(s)
> >>>>>>>>>
> >>>>>>>>> - For now, IT are stored alongside UT in the src/test directory of
> >>>>>>>>> the IO but they might go to a dedicated module, waiting for a
> >>>>>>>>> consensus. Maven is not configured to run them automatically because
> >>>>>>>>> a data store is not available on the jenkins server yet
> >>>>>>>>>
> >>>>>>>>> - For now, they only use DirectRunner, but they will be run against
> >>>>>>>>> each runner.
> >>>>>>>>>
> >>>>>>>>> - IT do not set up a data store instance (as stated in the above
> >>>>>>>>> document); they assume that one is already running (hardcoded
> >>>>>>>>> configuration in test for now, waiting for a common solution to pass
> >>>>>>>>> configuration to IT). A docker container script is provided in the
> >>>>>>>>> contrib directory as a starting point to whatever orchestration
> >>>>>>>>> software will be chosen.
> >>>>>>>>>
> >>>>>>>>> - IT load and clean test data before and after each test if needed.
> >>>>>>>>> It is simpler to do so because some tests need an empty data store
> >>>>>>>>> (write test) and because, as discussed in the document, tests might
> >>>>>>>>> not be the only users of the data store. Also IMHO, it is better
> >>>>>>>>> that tests load/clean data than making assumptions about the running
> >>>>>>>>> order of the tests.
> >>>>>>>>>
> >>>>>>>>> If we generalize this pattern to all IT tests, this will tend to go
> >>>>>>>>> in the direction of long running data store instances rather than
> >>>>>>>>> data store instances started (and optionally loaded) before tests.
> >>>>>>>>>
> >>>>>>>>> Besides, if we were to change our minds and load data from outside
> >>>>>>>>> the tests, a logstash script is provided.
> >>>>>>>>>
> >>>>>>>>> If you have any thoughts or remarks I'm all ears :)
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>>
> >>>>>>>>> Etienne
> >>>>>>>>>
> >>>>>>>>> On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Stephen,
> >>>>>>>>>>
> >>>>>>>>>> the purpose of having them in a specific module is to share
> >>>>>>>>>> resources and apply the same behavior from an IT perspective, and
> >>>>>>>>>> to be able to have IT "cross" IO (for instance, reading from JMS
> >>>>>>>>>> and sending to Kafka, I think that's the key idea for integration
> >>>>>>>>>> tests).
> >>>>>>>>>>
> >>>>>>>>>> For instance, in Karaf, we have:
> >>>>>>>>>> - utest in each module
> >>>>>>>>>> - itest module containing itests for all modules all together
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>> JB
> >>>>>>>>>>
> >>>>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Etienne,
> >>>>>>>>>>>
> >>>>>>>>>>> thanks for following up and answering my questions.
> >>>>>>>>>>>
> >>>>>>>>>>> re: where to store integration tests - having them all in a
> >>>>>>>>>>> separate module is an interesting idea. I couldn't find JB's
> >>>>>>>>>>> comments about moving them into a separate module in the PR - can
> >>>>>>>>>>> you share the reasons for doing so?
> >>>>>>>>>>> The IO integration/perf tests so it does seem like they'll need to
> >>>>>>>>>>> be treated in a special manner, but given that there is already an
> >>>>>>>>>>> IO specific module, it may just be that we need to treat all the
> >>>>>>>>>>> ITs in the IO module the same way. I don't have strong opinions
> >>>>>>>>>>> either way right now.
> >>>>>
> >>>>>>>>>>> S
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi guys,
> >>>>>>>>>>>
> >>>>>>>>>>> @Stephen: I addressed all your comments directly in the PR,
> >>>>>>>>>>> thanks! I just wanted to comment here about the docker image I
> >>>>>>>>>>> used: the only official Elastic image contains only ElasticSearch.
> >>>>>>>>>>> But for testing I needed logstash (for ingestion) and kibana (not
> >>>>>>>>>>> for integration tests, but to easily test REST requests to ES
> >>>>>>>>>>> using sense). This is why I use an ELK
> >>>>>>>>>>> (Elasticsearch+Logstash+Kibana) image. This one is released under
> >>>>>>>>>>> the apache 2 license.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Besides, there is also a point about where to store integration
> >>>>>>>>>>> tests: JB proposed in the PR to store integration tests in a
> >>>>>>>>>>> dedicated module rather than directly in the IO module (like I
> >>>>>>>>>>> did).
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Etienne
> >>>>>>>>>>>
> >>>>>>>>>>> On 01/12/2016 at 20:14, Stephen Sisk wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> hey!
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks for sending this. I'm very excited to see this change. I
> >>>>>>>>>>>> added some detail-oriented code review comments in addition to
> >>>>>>>>>>>> what I've discussed here.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The general goal is to allow for re-usable instantiation of
> >>>>>>>>>>>> particular data store instances and this seems like a good start.
> >>>>>>>>>>>> Looks like you also have a script to generate test data for your
> >>>>>>>>>>>> tests - that's great.
> >>>>>
> >>>>>>>>>>>> The next steps (definitely not blocking your work) will be to
> >>>>>>>>>>>> have ways to create instances from the docker images you have
> >>>>>>>>>>>> here, and use them in the tests. We'll need support in the test
> >>>>>>>>>>>> framework for that since it'll be different on developer machines
> >>>>>>>>>>>> and in the beam jenkins cluster, but your scripts here allow
> >>>>>>>>>>>> someone running these tests locally to not have to worry about
> >>>>>>>>>>>> getting the instance set up and can manually adjust, so this is a
> >>>>>>>>>>>> good incremental step.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts (that I
> >>>>>>>>>>>> didn't have previously, so we are learning this together):
> >>>>>>>>>>>> * It may be useful to try and document why we chose a particular
> >>>>>>>>>>>> docker image as the base (ie, "this is the official supported
> >>>>>>>>>>>> elastic search docker image" or "this image has several data
> >>>>>>>>>>>> stores together that can be used for a couple different tests") -
> >>>>>>>>>>>> I'm curious as to whether the community thinks that is important
> >>>>>>>>>>>>
> >>>>>>>>>>>> One thing that I called out in the comment that's worth
> >>>>>>>>>>>> mentioning on the larger list - if you want to specify which
> >>>>>>>>>>>> specific runners a test uses, that can be controlled in the pom
> >>>>>>>>>>>> for the module. I updated the testing doc mentioned previously in
> >>>>>>>>>>>> this thread with a TODO to talk about this more. I think we
> >>>>>>>>>>>> should also make it so that IO modules have that automatically,
> >>>>>>>>>>>> so developers don't have to worry about it.
> >>>>>>>>>>>>
> >>>>>>>>>>>> S
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot
> >>>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Stephen,
> >>>>>>>>>>>>
> >>>>>>>>>>>> As discussed, I added an injection script, docker container
> >>>>>>>>>>>> scripts and integration tests to the
> >>>>>>>>>>>> sdks/java/io/elasticsearch/contrib
> >>>>>>>>>>>> <https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
> >>>>>>>>>>>> directory in that PR:
> >>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439.
> >>>>>>>>>>>>
> >>>>>>>>>>>> These work well but they are a first shot. Do you have any
> >>>>>>>>>>>> comments about those?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Besides, I am not very sure that these files should be in the IO
> >>>>>>>>>>>> itself (even in the contrib directory, out of maven source
> >>>>>>>>>>>> directories). Any thoughts?
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Etienne
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 23/11/2016 at 19:03, Stephen Sisk wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> It's great to hear more experiences.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm also glad to hear that people see real value in the high
> >>>>>>>>>>>>> volume/performance benchmark tests. I tried to capture that in
> >>>>>>>>>>>>> the Testing doc I shared, under "Reasons for Beam Test
> >>>>>>>>>>>>> Strategy". [1]
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas of
> >>>>>>>>>>>>> discussion
> >>>>>>>>>>>>
> >>>>>>>>>>>>
>
>
