JB,

Thanks for the link!

My comments are below, inline in the email.


On 24/12/2016 at 08:14, Jean-Baptiste Onofré wrote:
Hi Etienne,

Thanks for sharing !

For the itest module, I'm in favor of a dedicated module too.

It's what I did in Karaf:

https://github.com/apache/karaf/tree/master/itests

Itests contains all the integration tests for Karaf, covering all modules/features.

What we should add:
- multi runners execution (as we do in Karaf/Felix using Pax Exam)
Yes, like I said, multi-runner is planned. It is a key point of integration tests. We will definitely do this.
- bootstrapping of required resources (as we do in Karaf, starting ActiveMQ for instance).
I'm hesitant; maybe bootstrapping could be done outside the tests on a weekly basis (as Stephen suggested, to discover configuration tweaks ASAP), with loading/cleaning done by the tests themselves. What are, in your opinion, the major benefits/drawbacks of bootstrapping data stores in tests?

I will help on this.
Thanks!

Thanks
Regards
JB

On 12/23/2016 04:48 PM, Etienne Chauchot wrote:
Hi,

Recently we had a discussion about integration tests of IOs. I'm
preparing a PR for integration tests of the Elasticsearch IO
(https://github.com/echauchot/incubator-beam/tree/BEAM-1184-ELASTICSEARCH-IO
as a first shot), which are very important IMHO because they helped catch
some bugs that UTs could not (volume, data store instance sharing, real
data store instance, ...).

I would like to have your thoughts/remarks about the points below. Some of
these points are also discussed here
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.7ly6e7beup8a
:

- UTs and ITs have a similar architecture, but while UTs focus on testing
the correct behavior of the code (including corner cases) and use an
embedded in-memory data store, ITs assume that the behavior is correct
(strong UTs) and focus on higher-volume testing and testing against real
data store instance(s).

- For now, ITs are stored alongside UTs in the src/test directory of the
IO, but they might go to a dedicated module once there is a consensus.
Maven is not configured to run them automatically because the data store
is not available on the Jenkins server yet.

- For now, they only use the DirectRunner, but they will be run against
each runner.

- ITs do not set up the data store instance (as stated in the above
document); they assume that one is already running (configuration is
hardcoded in the test for now, waiting for a common solution to pass
configuration to ITs). A Docker container script is provided in the
contrib directory as a starting point for whatever orchestration software
is chosen.

- ITs load and clean test data before and after each test if needed (see
the sketch below). It is simpler to do so because some tests need an
empty data store (write tests) and because, as discussed in the document,
tests might not be the only users of the data store. Also, IMHO it is
better that tests load/clean data than make assumptions about the running
order of the tests.

If we generalize this pattern to all ITs, this will tend to go in the
direction of long-running data store instances rather than data store
instances started (and optionally loaded) before the tests.
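
To make the hardcoded-configuration and per-test load/clean points concrete, here is a minimal JUnit sketch (not the actual code of the PR; it assumes the Elasticsearch low-level REST client, and the host, port, index and class names are purely illustrative):

  import java.io.IOException;

  import org.apache.http.HttpHost;
  import org.elasticsearch.client.RestClient;
  import org.junit.After;
  import org.junit.Before;
  import org.junit.Test;

  public class ElasticsearchIOIT {

    // Hardcoded for now, waiting for a common way to pass configuration to ITs.
    private static final String ES_HOST = "localhost";
    private static final int ES_PORT = 9200;
    private static final String TEST_INDEX = "beam-it";

    private RestClient restClient;

    @Before
    public void setup() throws IOException {
      restClient = RestClient.builder(new HttpHost(ES_HOST, ES_PORT, "http")).build();
      // Start from a clean index so write tests are deterministic.
      deleteIndexIfPresent();
    }

    @After
    public void teardown() throws IOException {
      // Clean up after ourselves: other tests (and other users of a
      // long-running instance) should not see our data.
      deleteIndexIfPresent();
      restClient.close();
    }

    private void deleteIndexIfPresent() {
      try {
        restClient.performRequest("DELETE", "/" + TEST_INDEX);
      } catch (IOException e) {
        // The index may not exist yet; ignore.
      }
    }

    @Test
    public void testWrite() throws IOException {
      // ... exercise the IO against the real, already-running instance here ...
    }
  }

Cleaning in both @Before and @After keeps the tests deterministic even if a previous run crashed and left data behind.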

Besides, if we were to change our minds and load data from outside the
tests, a logstash script is provided.

If you have any thoughts or remarks I'm all ears :)

Regards,

Etienne

On 14/12/2016 at 17:07, Jean-Baptiste Onofré wrote:
Hi Stephen,

the purpose of having them in a specific module is to share resources,
apply the same behavior from an IT perspective, and be able to have ITs
that "cross" IOs (for instance, reading from JMS and sending to Kafka; I
think that's the key idea for integration tests).

For instance, in Karaf, we have:
- utest in each module
- itest module containing itests for all modules all together

Regards
JB

On 12/14/2016 04:59 PM, Stephen Sisk wrote:
Hi Etienne,

thanks for following up and answering my questions.

re: where to store integration tests - having them all in a separate
module
is an interesting idea. I couldn't find JB's comments about moving them
into a separate module in the PR - can you share the reasons for
doing so?
The IO integration/perf tests do seem like they'll need to be treated in
a special manner, but given that there is already an IO-specific module,
it may just be that we need to treat all the ITs in the IO module the
same way. I don't have strong opinions either way right now.

S

On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot <echauc...@gmail.com>
wrote:

Hi guys,

@Stephen: I addressed all your comments directly in the PR, thanks!
I just wanted to comment here about the Docker image I used: the only
official Elastic image contains only Elasticsearch. But for testing I
needed Logstash (for ingestion) and Kibana (not for integration tests,
but to easily send REST requests to ES using Sense). This is why I use
an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under
the Apache 2 license.


Besides, there is also a point about where to store integration tests:
JB proposed in the PR to store the integration tests in a dedicated
module rather than directly in the IO module (like I did).



Etienne

On 01/12/2016 at 20:14, Stephen Sisk wrote:
hey!

thanks for sending this. I'm very excited to see this change. I
added some
detail-oriented code review comments in addition to what I've discussed
here.

The general goal is to allow for re-usable instantiation of particular
data
store instances and this seems like a good start. Looks like you
also have
a script to generate test data for your tests - that's great.

The next steps (definitely not blocking your work) will be to have
ways to
create instances from the docker images you have here, and use them
in the
tests. We'll need support in the test framework for that since it'll be
different on developer machines and in the Beam Jenkins cluster, but your
scripts here let someone running these tests locally avoid worrying about
getting the instance set up (and manually adjust it if needed), so this
is a good incremental step.

I have some thoughts now that I'm reviewing your scripts (that I didn't
have previously, so we are learning this together):
* It may be useful to try and document why we chose a particular Docker
image as the base (i.e., "this is the officially supported Elasticsearch
Docker image" or "this image has several data stores together that can be
used for a couple of different tests") - I'm curious whether the
community thinks that is important

One thing that I called out in the comment that's worth mentioning
on the
larger list - if you want to specify which specific runners a test
uses,
that can be controlled in the pom for the module. I updated the testing
doc
mentioned previously in this thread with a TODO to talk about this
more. I
think we should also make it so that IO modules have that
automatically,
so
developers don't have to worry about it.
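
To illustrate that from the test side (class and method names here are illustrative, and this is only a sketch against the Beam SDK of that time): TestPipeline picks up its options, including the runner, from the beamTestPipelineOptions system property, which the pom (or a profile) can set per runner, so the IT itself stays runner-agnostic.

  import org.apache.beam.sdk.testing.PAssert;
  import org.apache.beam.sdk.testing.TestPipeline;
  import org.apache.beam.sdk.transforms.Create;
  import org.apache.beam.sdk.values.PCollection;
  import org.junit.Test;

  public class ExampleIOIT {

    @Test
    public void testReadWithWhateverRunnerIsConfigured() {
      // The runner is not hardcoded here: TestPipeline reads the
      // externally supplied options, so the same IT can be executed by
      // the DirectRunner locally and by other runners on Jenkins.
      TestPipeline pipeline = TestPipeline.create();

      PCollection<String> output = pipeline.apply(Create.of("a", "b", "c"));
      PAssert.that(output).containsInAnyOrder("a", "b", "c");

      pipeline.run();
    }
  }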

S

On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot <echauc...@gmail.com>
wrote:

Stephen,

As discussed, I added an injection script, Docker container scripts, and
integration tests to the sdks/java/io/elasticsearch/contrib
<

https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9

directory in that PR:
https://github.com/apache/incubator-beam/pull/1439.

These work well but they are a first shot. Do you have any comments about
them?

Besides, I am not very sure that these files should be in the IO itself
(even in the contrib directory, outside the Maven source directories).
Any thoughts?

Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:
It's great to hear more experiences.

I'm also glad to hear that people see real value in the high
volume/performance benchmark tests. I tried to capture that in the
Testing
doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas of
discussion
I
see:
1. People like the idea of bringing up fresh instances for each test
rather than keeping instances running all the time, since that ensures no
contamination between tests. That seems reasonable to me. If we see
flakiness in the tests or we note that setting up/tearing down instances
is taking a lot of time, we can revisit that choice.
2. Deciding on cluster management software/orchestration software - I
want
to make sure we land on the right tool here since choosing the
wrong tool
could result in administration of the instances taking more work. I
suspect
that's a good place for a follow up discussion, so I'll start a
separate
thread on that. I'm happy with whatever tool we choose, but I want to
make
sure we take a moment to consider different options and have a
reason for
choosing one.

Etienne - thanks for being willing to port your creation/other scripts
over. You might be a good early tester of whether this system works
well
for everyone.

Stephen

[1]  Reasons for Beam Test Strategy -


https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#



On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré
<j...@nanthrax.net>
wrote:

I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the most
valuable tests we did were integration tests with ES on Docker and high
volume.

I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they should cover
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO
but executed by the integration-test plugin or a specific profile); they
deal with a "real" backend and high volumes. The resources required by
the itests can be bootstrapped by Jenkins (for instance using
Mesos/Marathon and Docker images as already discussed; it's what I'm
doing on my own "server").

It's basically what Stephen described.

We must not rely only on itests: utests are very important and they
validate the core behavior.

My $0.01 ;)

Regards
JB

On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
Hi Stephen,

I like your proposition very much and I also agree that docker +
some
orchestration software would be great !

For the ElasticsearchIO (PR to be created this week), there are Docker
container creation scripts and a Logstash data ingestion script for the
IT environment, available in the contrib directory alongside the
integration tests themselves. I'll be happy to make them compliant with
the new IT environment.

What you say below about the need for an external IT environment is
particularly true. As an example with ES, what came out in the first
implementation was that there were problems starting at some high volume
of data (timeouts, ES windowing overflow...) that could not have been
seen on the embedded ES version. Also, there were some particularities of
the external instance, like secondary (replica) shards, that were not
visible on the embedded instance.

Besides, I also favor bringing up instances before the tests because it
allows (amongst other things) being sure to start on a fresh dataset, so
that the tests are deterministic.

Etienne


On 23/11/2016 at 02:00, Stephen Sisk wrote:
Hi,

I'm excited we're getting lots of discussion going. There are many
threads
of conversation here, we may choose to split some of them off
into a
different email thread. I'm also betting I missed some of the
questions in
this thread, so apologies ahead of time for that. Also apologies
for
the
amount of text, I provided some quick summaries at the top of each
section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work
here to
go
around. I'll try and think about how we can divide up some next
steps
(probably in a separate thread.) The main next step I see is
deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on
that,
but
having lots of different thoughts on what the
advantages/disadvantages
of
those are would be helpful (I'm not entirely sure of the
protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to
write. I
think a kubernetes/mesos/swarm cluster could support all the use
cases
we've discussed here (and thus should not block moving forward with
this),
but understanding what we want to test will help us understand
how the
cluster will be used. I'm working on a proposed user guide for
testing
IO
Transforms, and I'm going to send out a link to that + a short
summary
to
the list shortly so folks can get a better sense of where I'm
coming
from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing
--------------------
Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data
stores.
I think we should test everything we possibly can using them,
and do
the
majority of our correctness testing using embedded versions + the
direct
runner. However, it's also important to have at least one test that actually connects to an actual instance, so we can get coverage for
things
like credentials, real connection strings, etc...

The key point is that embedded versions definitely can't cover the performance tests, so we need to host instances if we want to test
that.
I consider the integration tests/performance benchmarks to be
costly
things
that we do only for the IO transforms with large amounts of
community
support/usage. A random IO transform used by a few users doesn't
necessarily need integration & perf tests, but for heavily used IO
transforms, there's a lot of community value in these tests. The
maintenance proposal below scales with the amount of community
support
for
a particular IO transform.



Reusing data stores ("use the data stores across executions.")
------------------
Summary: I favor a hybrid approach: some frequently used, very
small
instances that we keep up all the time + larger multi-container
data
store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this question,
but I
think
we do need to know what range of capabilities we need, and use
that to
inform our requirements on the hosting infrastructure. I think
kubernetes/mesos + docker can support all the scenarios I discuss
below.
I had been thinking of a hybrid approach - reuse some instances and
don't
reuse others. Some tests require isolation from other tests (eg.
performance benchmarking), while others can easily re-use the same
database/data store instance over time, provided they are
written in
the
correct manner (eg. a simple read or write correctness integration
tests)
To me, the question of whether to use one instance over time for a
test vs
spin up an instance for each test comes down to a trade off between
these
factors:
1. Flakiness of spin-up of an instance - if it's super flaky, we'll
want to
keep more instances up and running rather than bring them up/down.
(this
may also vary by the data store in question)
2. Frequency of testing - if we are running tests every 5
minutes, it
may
be wasteful to bring machines up/down every time. If we run
tests once
a
day or week, it seems wasteful to keep the machines up the whole
time.
3. Isolation requirements - If tests must be isolated, it means we
either
have to bring up the instances for each test, or we have to have
some
sort
of signaling mechanism to indicate that a given instance is in
use. I
strongly favor bringing up an instance per test.
4. Number/size of containers - if we need a large number of
machines
for a
particular test, keeping them running all the time will use more
resources.


The major unknown to me is how flaky it'll be to spin these up. I'm
hopeful/assuming they'll be pretty stable to bring up, but I
think the
best
way to test that is to start doing it.

I suspect the sweet spot is the following: have a set of very small
data
store instances that stay up to support small-data-size post-commit
end to
end tests (post-commits run frequently and the data size means the
instances would not use many resources), combined with the
ability to
spin
up larger instances for once a day/week performance benchmarks
(these
use
up more resources and are used less frequently.) That's the mix
I'll
propose in my docs on testing IO transforms.  If spinning up new
instances
is cheap/non-flaky, I'd be fine with the idea of spinning up
instances
for
each test.



Management ("what's the overhead of managing such a deployment")
--------------------
Summary: I propose that anyone can contribute scripts for
setting up
data
store instances + integration/perf tests, but if the community
doesn't
maintain a particular data store's tests, we disable the tests and
turn off
the data store instances.

Management of these instances is a crucial question. First, let's
break
down what tasks we'll need to do on a recurring basis:
1. Ongoing maintenance (update to new versions, both instance &
dependencies) - we don't want to have a lot of old versions
vulnerable
to
attacks/buggy
2. Investigate breakages/regressions
(I'm betting there will be more things we'll discover - let me
know if
you
have suggestions)

There's a couple goals I see:
1. We should only do sys admin work for things that give us a
lot of
benefit. (ie, don't build IT/perf/data store set up scripts for
data
stores
without a large community)
2. We should do as much testing as possible via in-memory/embedded
testing (as you brought up).
3. Reduce the amount of manual administration overhead

As I discussed above, I think that integration tests/performance
benchmarks
are costly things that we should do only for the IO transforms with
large
amounts of community support/usage. Thus, I propose that we
limit the
IO
transforms that get integration tests & performance benchmarks to
those
that have community support for maintaining the data store
instances.

We can enforce this organically using some simple rules:
1. Investigating breakages/regressions: if a given integration/perf
test
starts failing and no one investigates it within a set period of
time
(a
week?), we disable the tests and shut off the data store
instances if
we
have instances running. When someone wants to step up and
support it
again,
they can fix the test, check it in, and re-enable the test.
2. Ongoing maintenance: every N months, file a jira issue that
is just
"is
the IO Transform X data store up to date?" - if the jira is not
resolved in
a set period of time (1 month?), the perf/integration tests are
disabled,
and the data store instances shut off.

This is pretty flexible -
* If a particular person or organization wants to support an IO
transform,
they can. If a group of people all organically organize to keep the
tests
running, they can.
* It can be mostly automated - there's not a lot of central
organizing
work
that needs to be done.

Exposing the information about what IO transforms currently have
running
IT/perf benchmarks on the website will let users know what IO
transforms
are well supported.

I like this solution, but I also recognize this is a tricky
problem.
This
is something the community needs to be supportive of, so I'm
open to
other
thoughts.


Simulating failures in real nodes ("programmatic tests to simulate
failure")
-----------------
Summary: 1) Focus our testing on the code in Beam 2) We should
encourage a
design pattern separating out network/retry logic from the main IO
transform logic

We *could* create instance failure in any container management
software
-
we can use their programmatic APIs to determine which containers
are
running the instances, and ask them to kill the container in
question.
A
slow node would be trickier, but I'm sure we could figure it out
- for
example, add a network proxy that would delay responses.

However, I would argue that this type of testing doesn't gain us a
lot, and
is complicated to set up. I think it will be easier to test network
errors
and retry behavior in unit tests for the IO transforms.

Part of the way to handle this is to separate out the read code
from
the
network code (eg. bigtable has BigtableService). If you put the
"handle
errors/retry logic" code in a separate MySourceService class,
you can
test
MySourceService against a wide variety of network errors/data store
problems,
and then your main IO transform tests focus on the read behavior
and
handling the small set of errors the MySourceService class will
return.
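
To make that separation concrete, here is a minimal sketch (all class names are hypothetical, following the MySourceService example above): the transform is written against a small service interface, the real implementation owns the network/retry code, and unit tests inject a fake that simulates failures.

  import java.io.IOException;
  import java.io.Serializable;
  import java.util.Collections;
  import java.util.List;

  /** Hypothetical service interface: all network, error-handling and retry logic lives behind it. */
  interface MySourceService extends Serializable {
    List<String> read(String query) throws IOException;
  }

  /** Real implementation: owns the client and translates data-store-specific failures. */
  class RealMySourceService implements MySourceService {
    @Override
    public List<String> read(String query) throws IOException {
      // Open the connection, retry transient errors, surface everything
      // else as IOException for the transform to handle.
      throw new UnsupportedOperationException("real client code goes here");
    }
  }

  /** Fake used in unit tests: simulates network errors without any real data store. */
  class FlakyMySourceService implements MySourceService {
    private int remainingFailures;

    FlakyMySourceService(int failuresBeforeSuccess) {
      this.remainingFailures = failuresBeforeSuccess;
    }

    @Override
    public List<String> read(String query) throws IOException {
      if (remainingFailures-- > 0) {
        throw new IOException("simulated transient failure");
      }
      return Collections.singletonList("record");
    }
  }

The IO transform's own tests then only need to cover how it reacts to the small set of exceptions the service interface can surface.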

I also think we should focus on testing the IO Transform, not
the data
store - if we kill a node in a data store, it's that data store's
problem,
not beam's problem. As you were pointing out, there are a *large*
number of
possible ways that a particular data store can fail, and we
would like
to
support many different data stores. Rather than try to test that
each
data
store behaves well, we should ensure that we handle
generic/expected
errors
in a graceful manner.






Ismaël had a couple other quick comments/questions; I'll answer here -
We can use this to test other runners running on multiple
machines - I
agree. This is also necessary for a good performance benchmark
test.

"providing the test machines to mount the cluster" - we can discuss
this
further, but one possible option is that google may be willing to
donate
something to support this.

"IO Consistency" - let's follow up on those questions in another
thread.
That's as much about the public interface we provide to users as
anything
else. I agree with your sentiment that a user should be able to
expect
predictable behavior from the different IO transforms.

Thanks for everyone's questions/comments - I really am excited
to see
that
people care about this :)

Stephen

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía <ieme...@gmail.com>
wrote:

Hello,

@Stephen Thanks for your proposal, it is really interesting, I
would
really
like to help with this. I have never played with Kubernetes but
this
seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container
images
and in some particular cases ‘clusters’ of containers using
docker-compose
(a little bit like Amit’s (2) proposal). It would be really
nice to
have
this at the Beam level, in particular to try to test more complex
semantics. I don't know how programmable Kubernetes is to achieve this;
for example:

Let’s say we have a cluster of Cassandra or Kafka nodes; I would like to
have programmatic tests to simulate failure (e.g. kill a node) or
simulate a really slow node, to ensure that the IO behaves as expected in
the Beam pipeline for the given runner.

Another related idea is to improve IO consistency: today the different
IOs have small differences in their failure behavior, and I really would
like to be able to predict with more precision what will happen in case
of errors. E.g. what is the correct behavior if I am writing to a Kafka
node and there is a network partition: does the Kafka sink retry or not?
And what if it is the JdbcIO, will it work the same, e.g. assuming
checkpointing? Or do we guarantee exactly-once writes somehow? Today I am
not sure what happens (or whether the expected behavior depends on the
runner), but maybe it is just that I don’t know and we already have tests
to ensure this.

Of course both are really hard problems, but I think with your proposal
we can try to tackle them, as well as the performance ones. And apart
from the data stores, I think it will also be really nice to be able to
test the runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can set up the cluster?

Maybe my ideas are a bit too far out for an initial setup, but it would
be really nice to start working on this.

Ismaël


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela <amitsel...@gmail.com>
wrote:

Hi Stephen,

I was wondering how we plan to use the data stores across executions.
Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, tear down the instance. It should also be agnostic to
the runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a
deployment
which could become heavy and complicated as more IOs are
supported
and
more
test cases introduced.

Another way to go would be to have small clusters of different
data
stores
and run against new "namespaces" (while lazily evicting old
ones),
but I
think this is less likely as maintaining a distributed instance
(even
a
small one) for each data store sounds even more complex.

A third approach would be to simply have an "embedded" in-memory
in-memory
instance of a data store as part of a test that runs against it
(such as
an
embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration,
but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi Stephen,

as already discussed a bit together, it sounds great! Especially, I like
it both as an integration test platform and as good coverage for the IOs.

I'm very late on this but, as said, I will share with you my
Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit with Kubernetes and Swarm, but
it's not yet complete. I will share what I have in the same GitHub repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:
Hi everyone!

Currently we have a good set of unit tests for our IO
Transforms -
those
tend to run against in-memory versions of the data stores.
However,
we'd
like to further increase our test coverage to include
running them
against
real instances of the data stores that the IO Transforms work
against
(e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to
have
real
instances of various data stores.

Additionally, if we want to do performance regression
detection,
it's
important to have instances of the services that behave
realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-------------------------
If we accept this proposal, we would create an
infrastructure for
running
real instances of data stores inside of containers, using
container
management software like mesos/marathon, kubernetes, docker
swarm,
etc…
to
manage the instances.

This would enable us to build integration tests that run
against
those
real
instances and performance tests that run against those real
instances
(like
those that Jason Kuster is proposing elsewhere.)


Why do we need one centralized set of instances vs just having
various
people host their own instances?
-------------------------
Reducing flakiness of tests is key. By not having dependencies
from
the
core project on external services/instances of data stores
we have
guaranteed access to the services and the group can fix issues
that
arise.
An exception would be something that has an ops team
supporting it
(eg,
AWS, Google Cloud or other professionally managed service) -
those
we
trust
will be stable.


There may be a lot of different data stores needed - how
will we
maintain
them?
-------------------------
It will take work above and beyond that of a normal set of unit
tests
to
build and maintain integration/performance tests & their data
store
instances.

Setup & maintenance of the data store containers and the data store
instances on them must be automated. The setup also has to be as simple
as possible, and we should avoid hand-tweaking the containers -
expecting checked-in scripts/dockerfiles is key.

Aligned with the community ownership approach of Apache, as
members
of
the
community are excited to contribute & maintain those tests
and the
integration/performance tests, people will be able to step
up and
do
that.
If there is no longer support for maintaining a particular
set of
integration & performance tests and their data store instances,
then
we
can
disable those tests. We may document on the website what IO
Transforms
have
current integration/performance tests so users know what
level of
testing
the various IO Transforms have.


What about requirements for the container management software
itself?
-------------------------
* We should have the data store instances themselves in Docker.
Docker
allows new instances to be spun up in a quick, reproducible way
and
is
fairly platform independent. It has wide support from a
variety of
different container management services.
* As little admin work required as possible. Crashing instances
should
be
restarted, setup should be simple, everything possible
should be
scripted/scriptable.
* Logs and test output should be on a publicly available
website,
without
needing to log into test execution machine. Centralized
capture of
monitoring info/logs from instances running in the containers
would
support
this. Ideally, this would just be supported by the container
software
out
of the box.
* It'd be useful to have good persistent volume support in the container
management software so that databases don't have to reload large data
sets every time.
* The containers may be a place to execute runners
themselves if
we
need
larger runner instances, so it should play well with Spark,
Flink,
etc…
As I discussed earlier on the mailing list, it looks like
hosting
docker
containers on kubernetes, docker swarm or mesos+marathon
would be
a
good
solution.

Thanks,
Stephen Sisk

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







