Hello again,

Stephen, I agree with you that the real question is the scope of the
tests. The discussion so far has been mostly about testing against a ‘real’
data store to find infra/performance issues (and future regressions), but
having a modern cluster manager opens the door to more interesting
integration tests like the ones I mentioned. In particular, my idea is
oriented towards validating the expected (‘correct’) behavior of the IOs
and runners. But this is quite ambitious for a first goal; maybe we should
first get things working and leave this for later (if there is still
interest).

I am not sure that unit tests are enough to test distribution issues,
because those issues are hard to simulate once there are many moving
pieces. For example, imagine that we run a Beam pipeline deployed via Spark
on a YARN cluster (where some nodes can fail) that reads from Kafka (with
one slow partition) and writes to Cassandra (with a partition that goes
down). This is quite a complex combination of pieces (and possible failure
modes), but it is not an artificial scenario; in fact it is a common
architecture. It can (at least in theory) be simulated with a cluster
manager, but I don't see how I could easily reproduce it with a unit test.
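
To make the scenario more concrete, below is a rough sketch of the kind of
pipeline I have in mind. The KafkaIO calls follow today's API shape but
take them as approximate, and the Cassandra sink is assumed (we don't have
a final CassandraIO yet), so this only shows the shape of such a test, it
is not a working implementation:

    import java.util.Arrays;
    import java.util.Collections;

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class KafkaToCassandraScenarioIT {
      public static void main(String[] args) {
        // Run with --runner=SparkRunner against a Spark master deployed on
        // YARN; the cluster manager (not the test) is what would kill nodes
        // or slow down partitions while the pipeline is running.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            .apply("ReadFromKafka", KafkaIO.<String, String>read()
                .withBootstrapServers("kafka-0:9092,kafka-1:9092")
                .withTopics(Collections.singletonList("events")) // one partition made slow
                .withKeyDeserializer(StringDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())
            .apply(Values.<String>create())
            // Assumed Cassandra sink (method names illustrative only); any IO
            // backed by a multi-node HA cluster (HBase, Elasticsearch, ...)
            // would fit the same shape.
            .apply("WriteToCassandra", CassandraIO.<String>write()
                .withHosts(Arrays.asList("cassandra-0", "cassandra-1", "cassandra-2"))
                .withKeyspace("itests")); // one replica killed mid-run

        pipeline.run().waitUntilFinish();
      }
    }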

Anyway, this scenario makes me think that defining the boundaries of what
we want to test is really important; the complexity can become huge.

About the Mesos package question: I was indeed referring to Mesos Universe
(the repo you linked), and what you said is sadly true, it is not easy to
find multi-node packages, which are the most interesting ones for our tests
(in both Kubernetes and Mesos). I agree with your decision to use
Kubernetes; I just wanted to mention that in some cases we will need to
produce these multi-node packages ourselves to have interesting tests.

Ismaël


On Wed, Jan 18, 2017 at 10:09 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Yes, for both DCOS (Mesos+Marathon) and Kubernetes, I think we may find
> single node configs but not sure for multi-node setups. Anyway, even if we
> find a multi-node configuration, I'm not sure it would cover our needs.
>
> Regards
> JB
>
> On 01/18/2017 12:52 PM, Stephen Sisk wrote:
>
>> ah! I looked around a bit more and found the dcos package repo -
>> https://github.com/mesosphere/universe/tree/version-3.x/repo/packages
>>
>> poking around a bit, I can find a lot of packages for single node
>> instances, but not many packages for multi-node instances. Single node
>> instance packages are kind of useful, but I don't think they're *too*
>> helpful.
>> The multi-node instance packages that run the data store's high
>> availability mode are where the real work is, and it seems like both
>> kubernetes helm and dcos' package universe don't have a lot of those.
>>
>> S
>>
>> On Wed, Jan 18, 2017 at 9:56 AM Stephen Sisk <s...@google.com> wrote:
>>
>> Hi Ismaël,
>>>
>>> these are good questions, thanks for raising them.
>>>
>>> Ability to modify network/compute resources to simulate failures
>>> =================================================
>>> I see two real questions here:
>>> 1. Is this something we want to do?
>>> 2. Is it possible with both/either?
>>>
>>> So far, the test strategy I've been advocating is that we test problems
>>> like this in unit tests rather than do this in ITs/Perf tests. Otherwise,
>>> it's hard to re-create the same conditions.
>>>
>>> I can investigate whether it's possible, but I want to clarify whether
>>> this is something that we care about. I know both support killing
>>> individual nodes. I haven't seen a lot of network control in either, but
>>> haven't tried to look for it.
>>>
>>> Availability of ready to play packages
>>> ============================
>>> I did look at this, and as far as I could tell, mesos didn't have any
>>> pre-built packages for multi-node clusters of data stores. If there's a
>>> good repository of them that we trust, that would definitely save us
>>> time.
>>> Can you point me at the mesos repository?
>>>
>>> S
>>>
>>>
>>>
>>> On Wed, Jan 18, 2017 at 8:37 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>> Hi Ismaël,
>>>
>>> Stephen will reply with details but I know he did a comparison and
>>> evaluated different options.
>>>
>>> He tested with the jdbc Io itests.
>>>
>>> Regards
>>> JB
>>>
>>> On Jan 18, 2017, 08:26, at 08:26, "Ismaël Mejía" <ieme...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for your analysis Stephen, good arguments / references.
>>>>
>>>> One quick question. Have you checked the APIs of both
>>>> (Mesos/Kubernetes) to
>>>> see
>>>> if we can programmatically do more complex tests (I suppose so, but
>>>> you
>>>> don't mention how easy or if those are possible), for example to
>>>> simulate a
>>>> slow networking slave (to test stragglers), or to arbitrarily kill one
>>>> slave (e.g. if I want to test the correct behavior of a runner/IO that
>>>> is
>>>> reading from it) ?
>>>>
>>>> Other missing point in the review is the availability of ready to play
>>>> packages,
>>>> I think in this area mesos/dcos seems more advanced no? I haven't
>>>> looked
>>>> recently but at least 6 months ago there were not many helm packages
>>>> ready
>>>> for
>>>> example to test kafka or the hadoop ecosystem stuff (hdfs, hbase,
>>>> etc). Has
>>>> this been improved ? because preparing this also is a considerable
>>>> amount of
>>>> work on the other hand this could be also a chance to contribute to
>>>> kubernetes.
>>>>
>>>> Regards,
>>>> Ismaël
>>>>
>>>>
>>>>
>>>> On Wed, Jan 18, 2017 at 2:36 AM, Stephen Sisk <s...@google.com.invalid>
>>>> wrote:
>>>>
>>>> hi!
>>>>>
>>>>> I've been continuing this investigation, and have some more info to
>>>>>
>>>> report,
>>>>
>>>>> and hopefully we can start making some decisions.
>>>>>
>>>>> To support performance testing, I've been investigating
>>>>>
>>>> mesos+marathon and
>>>>
>>>>> kubernetes for running data stores in their high availability mode. I
>>>>>
>>>> have
>>>>
>>>>> been examining features that kubernetes/mesos+marathon use to support
>>>>>
>>>> this.
>>>>
>>>>>
>>>>> Setting up a multi-node cluster in a high availability mode tends to
>>>>>
>>>> be
>>>>
>>>>> more expensive time-wise than the single node instances I've played
>>>>>
>>>> around
>>>>
>>>>> with in the past. Rather than do a full build out with both
>>>>>
>>>> kubernetes and
>>>>
>>>>> mesos, I'd like to pick one of the two options to build the prototype
>>>>> cluster with. If the prototype doesn't go well, we could still go
>>>>>
>>>> back to
>>>>
>>>>> the other option, but I'd like to change us from a mode of "let's
>>>>>
>>>> look at
>>>>
>>>>> all the options" to one of "here's the favorite, let's prove that
>>>>>
>>>> works for
>>>>
>>>>> us".
>>>>>
>>>>> Below are the features that I've seen are important to multi-node
>>>>>
>>>> instances
>>>>
>>>>> of data stores. I'm sure other folks on the list have done this
>>>>>
>>>> before, so
>>>>
>>>>> feel free to pipe up if I'm missing a good solution to a problem.
>>>>>
>>>>> DNS/Discovery
>>>>>
>>>>> --------------------
>>>>>
>>>>> Necessary for talking between nodes (eg, cassandra nodes all need to
>>>>>
>>>> be
>>>>
>>>>> able to talk to a set of seed nodes.)
>>>>>
>>>>> * Kubernetes has built-in DNS/discovery between nodes.
>>>>>
>>>>> * Mesos supports this via mesos-dns, which isn't a part of core
>>>>>
>>>> mesos,
>>>>
>>>>> but is in dcos, which is the mesos distribution I've been using and
>>>>>
>>>> that I
>>>>
>>>>> would expect us to use.
>>>>>
>>>>> Instances properly distributed across nodes
>>>>>
>>>>> ------------------------------------------------------------
>>>>>
>>>>> If multiple instances of a data source end up on the same underlying
>>>>>
>>>> VM, we
>>>>
>>>>> may not get good performance out of those instances since the
>>>>>
>>>> underlying VM
>>>>
>>>>> may be more taxed than other VMs.
>>>>>
>>>>> * Kubernetes has a beta feature StatefulSets[1] which allow for
>>>>>
>>>> containers
>>>>
>>>>> distributed so that there's one container per underlying machine (as
>>>>>
>>>> well
>>>>
>>>>> as a lot of other useful features like easy stable dns names.)
>>>>>
>>>>> * Mesos can support this via the built in UNIQUE constraint [2]
>>>>>
>>>>> Load balancing
>>>>>
>>>>> --------------------
>>>>>
>>>>> Incoming requests from users need to be distributed to the various
>>>>>
>>>> machines
>>>>
>>>>> - this is important for many data stores' high availability modes.
>>>>>
>>>>> * Kubernetes supports easily hooking up to an external load balancer
>>>>>
>>>> when
>>>>
>>>>> on a cloud (and can be configured to work with a built-in load
>>>>>
>>>> balancer if
>>>>
>>>>> not)
>>>>>
>>>>> * Mesos supports this via marathon-lb [3], which is an install-able
>>>>>
>>>> package
>>>>
>>>>> in DC/OS
>>>>>
>>>>> Persistent Volumes tied to specific instances
>>>>>
>>>>> ------------------------------------------------------------
>>>>>
>>>>> Databases often need persistent state (for example to store the data
>>>>>
>>>> :), so
>>>>
>>>>> it's an important part of running our service.
>>>>>
>>>>> * Kubernetes StatefulSets supports this
>>>>>
>>>>> * Mesos+marathon apps with persistent volumes supports this [4] [5]
>>>>>
>>>>> As I mentioned above, I'd like to focus on either kubernetes or mesos
>>>>>
>>>> for
>>>>
>>>>> my investigation, and as I go further along, I'm seeing kubernetes as
>>>>> better suited to our needs.
>>>>>
>>>>> (1) It supports more of the features we want out of the box and with
>>>>> StatefulSets, Kubernetes handles them all together neatly - eg. DC/OS
>>>>> requires marathon-lb to be installed and mesos-dns to be configured.
>>>>>
>>>>> (2) I'm also finding that there seem to be more examples of using
>>>>> kubernetes to solve the types of problems we're working on. This is
>>>>> somewhat subjective, but in my experience as I've tried to learn both
>>>>> kubernetes and mesos, I personally found it generally easier to get
>>>>> kubernetes running than mesos due to the tutorials/examples available
>>>>>
>>>> for
>>>>
>>>>> kubernetes.
>>>>>
>>>>> (3) Lower cost of initial setup - as I discussed in a previous
>>>>>
>>>> mail[6],
>>>>
>>>>> kubernetes was far easier to get set up even when I knew the exact
>>>>>
>>>> steps.
>>>>
>>>>> Mesos took me around 27 steps [7], which involved a lot of config
>>>>>
>>>> that was
>>>>
>>>>> easy to get wrong (it took me about 5 tries to get all the steps
>>>>>
>>>> correct in
>>>>
>>>>> one go.) Kubernetes took me around 8 steps and very little config.
>>>>>
>>>>> Given that, I'd like to focus my investigation/prototyping on
>>>>>
>>>> Kubernetes.
>>>>
>>>>> To
>>>>> be clear, it's fairly close and I think both Mesos and Kubernetes
>>>>>
>>>> could
>>>>
>>>>> support what we need, so if we run into issues with kubernetes, Mesos
>>>>>
>>>> still
>>>>
>>>>> seems like a viable option that we could fall back to.
>>>>>
>>>>> Thanks,
>>>>> Stephen
>>>>>
>>>>>
>>>>> [1] Kubernetes StatefulSets
>>>>>
>>>>>
>>>> https://kubernetes.io/docs/concepts/abstractions/controllers
>>> /statefulsets/
>>>
>>>>
>>>>> [2] mesos unique constraint -
>>>>> https://mesosphere.github.io/marathon/docs/constraints.html
>>>>>
>>>>> [3]
>>>>> https://mesosphere.github.io/marathon/docs/service-
>>>>> discovery-load-balancing.html
>>>>>  and https://mesosphere.com/blog/2015/12/04/dcos-marathon-lb/
>>>>>
>>>>> [4]
>>>>>
>>>> https://mesosphere.github.io/marathon/docs/persistent-volumes.html
>>>>
>>>>>
>>>>> [5]
>>>>>
>>>> https://dcos.io/docs/1.7/usage/tutorials/marathon/stateful-services/
>>>>
>>>>>
>>>>> [6] Container Orchestration software for hosting data stores
>>>>> https://lists.apache.org/thread.html/5825b35b895839d0b33b6c726c1de0
>>>>> e76bdb9653d1e913b1207c6c4d@%3Cdev.beam.apache.org%3E
>>>>>
>>>>> [7] https://github.com/ssisk/beam/blob/support/support/mesos/setup.md
>>>>>
>>>>>
>>>>> On Thu, Dec 29, 2016 at 5:44 PM Davor Bonaci <da...@apache.org>
>>>>>
>>>> wrote:
>>>>
>>>>>
>>>>> Just a quick drive-by comment: how tests are laid out has
>>>>>>
>>>>> non-trivial
>>>>
>>>>> tradeoffs on how/where continuous integration runs, and how results
>>>>>>
>>>>> are
>>>>
>>>>> integrated into the tooling. The current state is certainly not
>>>>>>
>>>>> ideal
>>>>
>>>>> (e.g., due to multiple test executions some links in Jenkins point
>>>>>>
>>>>> where
>>>>
>>>>> they shouldn't), but most other alternatives had even bigger
>>>>>>
>>>>> drawbacks at
>>>>
>>>>> the time. If someone has great ideas that don't explode the number
>>>>>>
>>>>> of
>>>>
>>>>> modules, please share ;-)
>>>>>>
>>>>>> On Mon, Dec 26, 2016 at 6:30 AM, Etienne Chauchot
>>>>>>
>>>>> <echauc...@gmail.com>
>>>>
>>>>> wrote:
>>>>>>
>>>>>> Hi Stephen,
>>>>>>>
>>>>>>> Thanks for taking the time to comment.
>>>>>>>
>>>>>>> My comments are below in the email:
>>>>>>>
>>>>>>>
>>>>>>> Le 24/12/2016 à 00:07, Stephen Sisk a écrit :
>>>>>>>
>>>>>>> hey Etienne -
>>>>>>>>
>>>>>>>> thanks for your thoughts and thanks for sharing your
>>>>>>>>
>>>>>>> experiences. I
>>>>
>>>>> generally agree with what you're saying. Quick comments below:
>>>>>>>>
>>>>>>>> IT are stored alongside with UT in src/test directory of the IO
>>>>>>>>
>>>>>>> but
>>>>
>>>>> they
>>>>>
>>>>>>
>>>>>>>>> might go to dedicated module, waiting for a consensus
>>>>>>>> I don't have a strong opinion or feel that I've worked enough
>>>>>>>>
>>>>>>> with
>>>>
>>>>> maven
>>>>>
>>>>>> to
>>>>>>>> understand all the consequences - I'd love for someone with more
>>>>>>>>
>>>>>>> maven
>>>>
>>>>> experience to weigh in. If this becomes blocking, I'd say check
>>>>>>>>
>>>>>>> it in,
>>>>
>>>>> and
>>>>>>
>>>>>>> we can refactor later if it proves problematic.
>>>>>>>>
>>>>>>>> Sure, not a blocking point, it could be refactored afterwards.
>>>>>>>
>>>>>> Just as
>>>>
>>>>> a
>>>>>
>>>>>> reminder, JB mentioned that storing IT in separate module allows
>>>>>>>
>>>>>> to
>>>>
>>>>> have
>>>>>
>>>>>> more coherence between all IT (same behavior) and to do cross IO
>>>>>>> integration tests. JB, have you experienced some long term
>>>>>>>
>>>>>> drawbacks of
>>>>
>>>>> storing IT in a separate module, like, for example, more
>>>>>>>
>>>>>> difficult
>>>>
>>>>> maintenance due to "distance" with production code?
>>>>>>>
>>>>>>>
>>>>>>>   Also IMHO, it is better that tests load/clean data than doing
>>>>>>>>
>>>>>>> some
>>>>
>>>>>
>>>>>>>>> assumptions about the running order of the tests.
>>>>>>>> I definitely agree that we don't want to make assumptions about
>>>>>>>>
>>>>>>> the
>>>>
>>>>> running
>>>>>>>> order of the tests - that way lies pain. :) It will be
>>>>>>>>
>>>>>>> interesting to
>>>>
>>>>> see
>>>>>>
>>>>>>> how the performance tests work out since they will need more
>>>>>>>>
>>>>>>> data (and
>>>>
>>>>> thus
>>>>>>>> loading data can take much longer.)
>>>>>>>>
>>>>>>>> Yes, performance testing might push in the direction of data
>>>>>>>
>>>>>> loading
>>>>
>>>>> from
>>>>>
>>>>>> outside the tests due to loading time.
>>>>>>>
>>>>>>>   This should also be an easier problem
>>>>>>>> for read tests than for write tests - if we have long running
>>>>>>>>
>>>>>>> instances,
>>>>>
>>>>>> read tests don't really need cleanup. And if write tests only
>>>>>>>>
>>>>>>> write a
>>>>
>>>>> small
>>>>>>>> amount of data, as long as we are sure we're writing to uniquely
>>>>>>>> identifiable locations (ie, new table per test or something
>>>>>>>>
>>>>>>> similar),
>>>>
>>>>> we
>>>>>
>>>>>> can clean up the write test data on a slower schedule.
>>>>>>>>
>>>>>>>> I agree
>>>>>>>
>>>>>>>
>>>>>>>> this will tend to go to the direction of long running data store
>>>>>>>>
>>>>>>>>>
>>>>>>>>> instances rather than data store instances started (and
>>>>>>>>
>>>>>>> optionally
>>>>
>>>>> loaded)
>>>>>>
>>>>>>> before tests.
>>>>>>>> It may be easiest to start with a "data stores stay running"
>>>>>>>> implementation, and then if we see issues with that move towards
>>>>>>>>
>>>>>>> tests
>>>>
>>>>> that
>>>>>>>> start/stop the data stores on each run. One thing I'd like to
>>>>>>>>
>>>>>>> make
>>>>
>>>>> sure
>>>>>
>>>>>> is
>>>>>>
>>>>>>> that we're not manually tweaking the configurations for data
>>>>>>>>
>>>>>>> stores.
>>>>
>>>>> One
>>>>>
>>>>>> way we could do that is to destroy/recreate the data stores on a
>>>>>>>>
>>>>>>> slower
>>>>>
>>>>>> schedule - maybe once per week. That way if the script is
>>>>>>>>
>>>>>>> changed or
>>>>
>>>>> the
>>>>>
>>>>>> data store instances are changed, we'd be able to detect it
>>>>>>>>
>>>>>>> relatively
>>>>
>>>>> soon
>>>>>>>> while still removing the need for the tests to manage the data
>>>>>>>>
>>>>>>> stores.
>>>>
>>>>>
>>>>>>>> I agree. In addition to configuration manual tweaking, there
>>>>>>>
>>>>>> might be
>>>>
>>>>> cases in which a data store re-partition data during a test or
>>>>>>>
>>>>>> after
>>>>
>>>>> some
>>>>>
>>>>>> tests while the dataset changes. The IO must be tolerant to that
>>>>>>>
>>>>>> but
>>>>
>>>>> the
>>>>>
>>>>>> asserts (number of bundles for example) in test must not fail in
>>>>>>>
>>>>>> that
>>>>
>>>>> case.
>>>>>>
>>>>>>> I would also prefer if possible that the tests do not manage data
>>>>>>>
>>>>>> stores
>>>>>
>>>>>> (not setup them, not start them, not stop them)
>>>>>>>
>>>>>>>
>>>>>>> as a general note, I suspect many of the folks in the states
>>>>>>>>
>>>>>>> will be
>>>>
>>>>> on
>>>>>
>>>>>> holiday until Jan 2nd/3rd.
>>>>>>>>
>>>>>>>> S
>>>>>>>>
>>>>>>>> On Fri, Dec 23, 2016 at 7:48 AM Etienne Chauchot
>>>>>>>>
>>>>>>> <echauc...@gmail.com
>>>>
>>>>>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Recently we had a discussion about integration tests of IOs.
>>>>>>>>>
>>>>>>>> I'm
>>>>
>>>>> preparing a PR for integration tests of the elasticSearch IO
>>>>>>>>> (
>>>>>>>>> https://github.com/echauchot/incubator-beam/tree/BEAM-1184-E
>>>>>>>>> LASTICSEARCH-IO
>>>>>>>>> as a first shot) which are very important IMHO because they
>>>>>>>>>
>>>>>>>> helped
>>>>
>>>>> catch
>>>>>>
>>>>>>> some bugs that UT could not (volume, data store instance
>>>>>>>>>
>>>>>>>> sharing,
>>>>
>>>>> real
>>>>>
>>>>>> data store instance ...)
>>>>>>>>>
>>>>>>>>> I would like to have your thoughts/remarks about the points below.
>>>>>>>>>
>>>>>>>> Some
>>>>
>>>>> of
>>>>>
>>>>>> these points are also discussed here
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-Np
>>>>>>>>> rQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
>>>>>>>>> :
>>>>>>>>>
>>>>>>>>> - UT and IT have a similar architecture, but while UT focus on
>>>>>>>>>
>>>>>>>> testing
>>>>>
>>>>>> the correct behavior of the code including corner cases and use
>>>>>>>>>
>>>>>>>> embedded
>>>>>>
>>>>>>> in memory data store, IT assume that the behavior is correct
>>>>>>>>>
>>>>>>>> (strong
>>>>
>>>>> UT)
>>>>>>
>>>>>>> and focus on higher volume testing and testing against real
>>>>>>>>>
>>>>>>>> data
>>>>
>>>>> store
>>>>>
>>>>>> instance(s)
>>>>>>>>>
>>>>>>>>> - For now, IT are stored alongside with UT in src/test
>>>>>>>>>
>>>>>>>> directory of
>>>>
>>>>> the
>>>>>
>>>>>> IO but they might go to dedicated module, waiting for a
>>>>>>>>>
>>>>>>>> consensus.
>>>>
>>>>> Maven
>>>>>>
>>>>>>> is not configured to run them automatically because data store
>>>>>>>>>
>>>>>>>> is not
>>>>
>>>>> available on jenkins server yet
>>>>>>>>>
>>>>>>>>> - For now, they only use DirectRunner, but they will  be run
>>>>>>>>>
>>>>>>>> against
>>>>
>>>>> each runner.
>>>>>>>>>
>>>>>>>>> - IT do not setup data store instance (like stated in the above
>>>>>>>>> document) they assume that one is already running (hardcoded
>>>>>>>>> configuration in test for now, waiting for a common solution to
>>>>>>>>>
>>>>>>>> pass
>>>>
>>>>> configuration to IT). A docker container script is provided in
>>>>>>>>>
>>>>>>>> the
>>>>
>>>>> contrib directory as a starting point to whatever orchestration
>>>>>>>>>
>>>>>>>> software
>>>>>>
>>>>>>> will be chosen.
>>>>>>>>>
>>>>>>>>> - IT load and clean test data before and after each test if
>>>>>>>>>
>>>>>>>> needed.
>>>>
>>>>> It
>>>>>
>>>>>> is simpler to do so because some tests need empty data store
>>>>>>>>>
>>>>>>>> (write
>>>>
>>>>> test) and because, as discussed in the document, tests might
>>>>>>>>>
>>>>>>>> not be
>>>>
>>>>> the
>>>>>
>>>>>> only users of the data store. Also IMHO, it is better that
>>>>>>>>>
>>>>>>>> tests
>>>>
>>>>> load/clean data than doing some assumptions about the running
>>>>>>>>>
>>>>>>>> order
>>>>
>>>>> of
>>>>>
>>>>>> the tests.
>>>>>>>>>
>>>>>>>>> If we generalize this pattern to all IT tests, this will tend
>>>>>>>>>
>>>>>>>> to go
>>>>
>>>>> to
>>>>>
>>>>>> the direction of long running data store instances rather than
>>>>>>>>>
>>>>>>>> data
>>>>
>>>>> store instances started (and optionally loaded) before tests.
>>>>>>>>>
>>>>>>>>> Besides if we where to change our minds and load data from
>>>>>>>>>
>>>>>>>> outside
>>>>
>>>>> the
>>>>>
>>>>>> tests, a logstash script is provided.
>>>>>>>>>
>>>>>>>>> If you have any thoughts or remarks I'm all ears :)
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Etienne
>>>>>>>>>
>>>>>>>>> Le 14/12/2016 à 17:07, Jean-Baptiste Onofré a écrit :
>>>>>>>>>
>>>>>>>>> Hi Stephen,
>>>>>>>>>>
>>>>>>>>>> the purpose of having in a specific module is to share
>>>>>>>>>>
>>>>>>>>> resources and
>>>>
>>>>> apply the same behavior from IT perspective and be able to
>>>>>>>>>>
>>>>>>>>> have IT
>>>>
>>>>> "cross" IO (for instance, reading from JMS and sending to
>>>>>>>>>>
>>>>>>>>> Kafka, I
>>>>
>>>>> think that's the key idea for integration tests).
>>>>>>>>>>
>>>>>>>>>> For instance, in Karaf, we have:
>>>>>>>>>> - utest in each module
>>>>>>>>>> - itest module containing itests for all modules all together
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> JB
>>>>>>>>>>
>>>>>>>>>> On 12/14/2016 04:59 PM, Stephen Sisk wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Etienne,
>>>>>>>>>>>
>>>>>>>>>>> thanks for following up and answering my questions.
>>>>>>>>>>>
>>>>>>>>>>> re: where to store integration tests - having them all in a
>>>>>>>>>>>
>>>>>>>>>> separate
>>>>>
>>>>>> module
>>>>>>>>>>> is an interesting idea. I couldn't find JB's comments about
>>>>>>>>>>>
>>>>>>>>>> moving
>>>>
>>>>> them
>>>>>>
>>>>>>> into a separate module in the PR - can you share the reasons
>>>>>>>>>>>
>>>>>>>>>> for
>>>>
>>>>> doing so?
>>>>>>>>>>> The IO integration/perf tests so it does seem like they'll
>>>>>>>>>>>
>>>>>>>>>> need to
>>>>
>>>>> be
>>>>>
>>>>>> treated in a special manner, but given that there is already
>>>>>>>>>>>
>>>>>>>>>> an IO
>>>>
>>>>> specific
>>>>>>>>>>> module, it may just be that we need to treat all the ITs in
>>>>>>>>>>>
>>>>>>>>>> the IO
>>>>
>>>>> module
>>>>>>>>>>> the same way. I don't have strong opinions either way right
>>>>>>>>>>>
>>>>>>>>>> now.
>>>>
>>>>>
>>>>>>>>>>> S
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <
>>>>>>>>>>>
>>>>>>>>>> echauc...@gmail.com>
>>>>>>
>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi guys,
>>>>>>>>>>>
>>>>>>>>>>> @Stephen: I addressed all your comments directly in the PR,
>>>>>>>>>>>
>>>>>>>>>> thanks!
>>>>
>>>>> I just wanted to comment here about the docker image I used:
>>>>>>>>>>>
>>>>>>>>>> the
>>>>
>>>>> only
>>>>>
>>>>>> official Elastic image contains only ElasticSearch. But for
>>>>>>>>>>>
>>>>>>>>>> testing I
>>>>>
>>>>>> needed logstash (for ingestion) and kibana (not for
>>>>>>>>>>>
>>>>>>>>>> integration
>>>>
>>>>> tests,
>>>>>>
>>>>>>> but to easily test REST requests to ES using sense). This is
>>>>>>>>>>>
>>>>>>>>>> why I
>>>>
>>>>> use
>>>>>>
>>>>>>> an ELK (Elasticsearch+Logstash+Kibana) image. This one
>>>>>>>>>>>
>>>>> is released
>>>>
>>>>> under
>>>>>>>>>>> the Apache 2 license.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Besides, there is also a point about where to store
>>>>>>>>>>>
>>>>>>>>>> integration
>>>>
>>>>> tests:
>>>>>>
>>>>>>> JB proposed in the PR to store integration tests to dedicated
>>>>>>>>>>>
>>>>>>>>>> module
>>>>>
>>>>>> rather than directly in the IO module (like I did).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Etienne
>>>>>>>>>>>
>>>>>>>>>>> Le 01/12/2016 à 20:14, Stephen Sisk a écrit :
>>>>>>>>>>>
>>>>>>>>>>> hey!
>>>>>>>>>>>>
>>>>>>>>>>>> thanks for sending this. I'm very excited to see this
>>>>>>>>>>>>
>>>>>>>>>>> change. I
>>>>
>>>>> added some
>>>>>>>>>>>> detail-oriented code review comments in addition to what
>>>>>>>>>>>>
>>>>>>>>>>> I've
>>>>
>>>>> discussed
>>>>>>>>>>>> here.
>>>>>>>>>>>>
>>>>>>>>>>>> The general goal is to allow for re-usable instantiation of
>>>>>>>>>>>>
>>>>>>>>>>> particular
>>>>>>
>>>>>>>
>>>>>>>>>>>> data
>>>>>>>>>>>
>>>>>>>>>>> store instances and this seems like a good start. Looks like
>>>>>>>>>>>>
>>>>>>>>>>> you
>>>>
>>>>> also have
>>>>>>>>>>>> a script to generate test data for your tests - that's
>>>>>>>>>>>>
>>>>>>>>>>> great.
>>>>
>>>>>
>>>>>>>>>>>> The next steps (definitely not blocking your work) will be
>>>>>>>>>>>>
>>>>>>>>>>> to have
>>>>
>>>>> ways to
>>>>>>>>>>>> create instances from the docker images you have here, and
>>>>>>>>>>>>
>>>>>>>>>>> use
>>>>
>>>>> them
>>>>>
>>>>>> in the
>>>>>>>>>>>> tests. We'll need support in the test framework for that
>>>>>>>>>>>>
>>>>>>>>>>> since
>>>>
>>>>> it'll
>>>>>
>>>>>> be
>>>>>>>>>>>> different on developer machines and in the beam jenkins
>>>>>>>>>>>>
>>>>>>>>>>> cluster,
>>>>
>>>>> but
>>>>>
>>>>>> your
>>>>>>>>>>>> scripts here allow someone running these tests locally to
>>>>>>>>>>>>
>>>>>>>>>>> not have
>>>>
>>>>> to
>>>>>>
>>>>>>>
>>>>>>>>>>>> worry
>>>>>>>>>>>
>>>>>>>>>>> about getting the instance set up and can manually adjust,
>>>>>>>>>>>>
>>>>>>>>>>> so this
>>>>
>>>>> is
>>>>>>
>>>>>>> a
>>>>>>>>>>>> good incremental step.
>>>>>>>>>>>>
>>>>>>>>>>>> I have some thoughts now that I'm reviewing your scripts
>>>>>>>>>>>>
>>>>>>>>>>> (that I
>>>>
>>>>> didn't
>>>>>>>>>>>> have previously, so we are learning this together):
>>>>>>>>>>>> * It may be useful to try and document why we chose a
>>>>>>>>>>>>
>>>>>>>>>>> particular
>>>>
>>>>> docker
>>>>>>>>>>>> image as the base (ie, "this is the official supported
>>>>>>>>>>>>
>>>>>>>>>>> elastic
>>>>
>>>>> search
>>>>>>
>>>>>>> docker image" or "this image has several data stores
>>>>>>>>>>>>
>>>>>>>>>>> together that
>>>>
>>>>> can be
>>>>>>>>>>>> used for a couple different tests")  - I'm curious as to
>>>>>>>>>>>>
>>>>>>>>>>> whether
>>>>
>>>>> the
>>>>>
>>>>>> community thinks that is important
>>>>>>>>>>>>
>>>>>>>>>>>> One thing that I called out in the comment that's worth
>>>>>>>>>>>>
>>>>>>>>>>> mentioning
>>>>
>>>>> on the
>>>>>>>>>>>> larger list - if you want to specify which specific runners
>>>>>>>>>>>>
>>>>>>>>>>> a test
>>>>
>>>>> uses,
>>>>>>>>>>>> that can be controlled in the pom for the module. I updated
>>>>>>>>>>>>
>>>>>>>>>>> the
>>>>
>>>>> testing
>>>>>>>>>>>>
>>>>>>>>>>>> doc
>>>>>>>>>>>
>>>>>>>>>>> mentioned previously in this thread with a TODO to talk
>>>>>>>>>>>>
>>>>>>>>>>> about this
>>>>
>>>>> more. I
>>>>>>>>>>>> think we should also make it so that IO modules have that
>>>>>>>>>>>> automatically,
>>>>>>>>>>>>
>>>>>>>>>>>> so
>>>>>>>>>>>
>>>>>>>>>>> developers don't have to worry about it.
>>>>>>>>>>>>
>>>>>>>>>>>> S
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <
>>>>>>>>>>>>
>>>>>>>>>>> echauc...@gmail.com>
>>>>>>
>>>>>>>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Stephen,
>>>>>>>>>>>>
>>>>>>>>>>>> As discussed, I added injection script, docker containers
>>>>>>>>>>>>
>>>>>>>>>>> scripts
>>>>
>>>>> and
>>>>>>
>>>>>>> integration tests to the sdks/java/io/elasticsearch/contrib
>>>>>>>>>>>> <
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439/files/1e7
>>>>>>>>>>>>
>>>>>>>>>>> e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7
>>>>>>>>> d824cefcb3ed0b9
>>>>>>>>>
>>>>>>>>> directory in that PR:
>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/incubator-beam/pull/1439.
>>>>>>>>>>>>
>>>>>>>>>>>> These work well but they are first shot. Do you have any
>>>>>>>>>>>>
>>>>>>>>>>> comments
>>>>
>>>>> about
>>>>>>>>>>>> those?
>>>>>>>>>>>>
>>>>>>>>>>>> Besides I am not very sure that these files should be in the
>>>>>>>>>>>>
>>>>>>>>>>> IO
>>>>
>>>>> itself
>>>>>>
>>>>>>> (even in contrib directory, out of maven source
>>>>>>>>>>>>
>>>>>>>>>>> directories). Any
>>>>
>>>>>
>>>>>>>>>>>> thoughts?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Etienne
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Le 23/11/2016 à 19:03, Stephen Sisk a écrit :
>>>>>>>>>>>>
>>>>>>>>>>>> It's great to hear more experiences.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm also glad to hear that people see real value in the
>>>>>>>>>>>>>
>>>>>>>>>>>> high
>>>>
>>>>> volume/performance benchmark tests. I tried to capture that
>>>>>>>>>>>>>
>>>>>>>>>>>> in
>>>>
>>>>> the
>>>>>
>>>>>>
>>>>>>>>>>>>> Testing
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> doc I shared, under "Reasons for Beam Test Strategy". [1]
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It does generally sound like we're in agreement here. Areas
>>>>>>>>>>>>>
>>>>>>>>>>>> of
>>>>
>>>>> discussion
>>>>>>>>>>>>
>>>>>>>>>>>>
