Re: Hosting data stores for IO Transform testing

2016-12-14 Thread Jean-Baptiste Onofré

Hi Stephen,

The purpose of having a dedicated module is to share resources, apply the 
same behavior from an IT perspective, and be able to have ITs that "cross" 
IOs (for instance, reading from JMS and sending to Kafka). I think that's 
the key idea for integration tests.


For instance, in Karaf, we have:
- utests in each module
- an itest module containing the itests for all modules together

Regards
JB

On 12/14/2016 04:59 PM, Stephen Sisk wrote:

Hi Etienne,

thanks for following up and answering my questions.

re: where to store integration tests - having them all in a separate module
is an interesting idea. I couldn't find JB's comments about moving them
into a separate module in the PR - can you share the reasons for doing so?
The IO integration/perf tests do seem like they'll need to be
treated in a special manner, but given that there is already an IO-specific
module, it may just be that we need to treat all the ITs in the IO module
the same way. I don't have strong opinions either way right now.

S

On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot 
wrote:

Hi guys,

@Stephen: I addressed all your comments directly in the PR, thanks!
I just wanted to comment here about the docker image I used: the only
official Elastic image contains only ElasticSearch. But for testing I
needed logstash (for ingestion) and kibana (not for integration tests,
but to easily test REST requests to ES using sense). This is why I use
an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under 
the Apache 2 license.


Besides, there is also a point about where to store integration tests:
JB proposed in the PR to store the integration tests in a dedicated module 
rather than directly in the IO module (as I did).



Etienne

On 01/12/2016 at 20:14, Stephen Sisk wrote:

hey!

thanks for sending this. I'm very excited to see this change. I added some
detail-oriented code review comments in addition to what I've discussed
here.

The general goal is to allow for re-usable instantiation of particular data
store instances and this seems like a good start. Looks like you also have
a script to generate test data for your tests - that's great.

The next steps (definitely not blocking your work) will be to have ways to
create instances from the docker images you have here, and use them in the
tests. We'll need support in the test framework for that since it'll be
different on developer machines and in the beam jenkins cluster, but your
scripts here allow someone running these tests locally to not have to worry
about getting the instance set up and can manually adjust, so this is a
good incremental step.

I have some thoughts now that I'm reviewing your scripts (that I didn't
have previously, so we are learning this together):
* It may be useful to try and document why we chose a particular docker
image as the base (ie, "this is the official supported elastic search
docker image" or "this image has several data stores together that can be
used for a couple different tests")  - I'm curious as to whether the
community thinks that is important

One thing that I called out in the comment that's worth mentioning on the
larger list - if you want to specify which specific runners a test uses,
that can be controlled in the pom for the module. I updated the testing doc
mentioned previously in this thread with a TODO to talk about this more. I
think we should also make it so that IO modules have that automatically, so
developers don't have to worry about it.

S

On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot 

wrote:


Stephen,

As discussed, I added an injection script, docker container scripts and
integration tests to the sdks/java/io/elasticsearch/contrib
<https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9>
directory in that PR: https://github.com/apache/incubator-beam/pull/1439.

These work well but they are a first shot. Do you have any comments about
those?

Besides, I am not sure these files should be in the IO itself
(even in the contrib directory, outside the Maven source directories). Any
thoughts?


Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:

It's great to hear more experiences.

I'm also glad to hear that people see real value in the high
volume/performance benchmark tests. I tried to capture that in the Testing
doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas of discussion I
see:
1.  People like the idea of bringing up fresh instances for each test
rather than keeping instances running all the time, since that ensures no
contamination between tests. That seems reasonable to me. If we see
flakiness in the tests or we note that setting up/tearing down instances is
taking a lot of time,
2. Deciding on cluster management software/orchestration software - I want
to make sure we land on the right tool here since choosing the wrong tool
co

Re: Hosting data stores for IO Transform testing

2016-12-14 Thread Stephen Sisk
Hi Etienne,

thanks for following up and answering my questions.

re: where to store integration tests - having them all in a separate module
is an interesting idea. I couldn't find JB's comments about moving them
into a separate module in the PR - can you share the reasons for doing so?
The IO integration/perf tests do seem like they'll need to be
treated in a special manner, but given that there is already an IO-specific
module, it may just be that we need to treat all the ITs in the IO module
the same way. I don't have strong opinions either way right now.

S

On Wed, Dec 14, 2016 at 2:39 AM Etienne Chauchot 
wrote:

Hi guys,

@Stephen: I addressed all your comments directly in the PR, thanks!
I just wanted to comment here about the docker image I used: the only
official Elastic image contains only ElasticSearch. But for testing I
needed logstash (for ingestion) and kibana (not for integration tests,
but to easily test REST requests to ES using sense). This is why I use
an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under
the Apache 2 license.


Besides, there is also a point about where to store integration tests:
JB proposed in the PR to store the integration tests in a dedicated module
rather than directly in the IO module (as I did).



Etienne

On 01/12/2016 at 20:14, Stephen Sisk wrote:
> hey!
>
> thanks for sending this. I'm very excited to see this change. I added some
> detail-oriented code review comments in addition to what I've discussed
> here.
>
> The general goal is to allow for re-usable instantiation of particular
data
> store instances and this seems like a good start. Looks like you also have
> a script to generate test data for your tests - that's great.
>
> The next steps (definitely not blocking your work) will be to have ways to
> create instances from the docker images you have here, and use them in the
> tests. We'll need support in the test framework for that since it'll be
> different on developer machines and in the beam jenkins cluster, but your
> scripts here allow someone running these tests locally to not have to
worry
> about getting the instance set up and can manually adjust, so this is a
> good incremental step.
>
> I have some thoughts now that I'm reviewing your scripts (that I didn't
> have previously, so we are learning this together):
> * It may be useful to try and document why we chose a particular docker
> image as the base (ie, "this is the official supported elastic search
> docker image" or "this image has several data stores together that can be
> used for a couple different tests")  - I'm curious as to whether the
> community thinks that is important
>
> One thing that I called out in the comment that's worth mentioning on the
> larger list - if you want to specify which specific runners a test uses,
> that can be controlled in the pom for the module. I updated the testing
doc
> mentioned previously in this thread with a TODO to talk about this more. I
> think we should also make it so that IO modules have that automatically,
so
> developers don't have to worry about it.
>
> S
>
> On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot 
wrote:
>
> Stephen,
>
> As discussed, I added an injection script, docker container scripts and
> integration tests to the sdks/java/io/elasticsearch/contrib
> <
>
https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9
> directory in that PR: https://github.com/apache/incubator-beam/pull/1439.
>
> These work well but they are a first shot. Do you have any comments about
> those?
>
> Besides, I am not sure these files should be in the IO itself
> (even in the contrib directory, outside the Maven source directories). Any
> thoughts?
>
> Thanks,
>
> Etienne
>
>
>
> On 23/11/2016 at 19:03, Stephen Sisk wrote:
>> It's great to hear more experiences.
>>
>> I'm also glad to hear that people see real value in the high
>> volume/performance benchmark tests. I tried to capture that in the
Testing
>> doc I shared, under "Reasons for Beam Test Strategy". [1]
>>
>> It does generally sound like we're in agreement here. Areas of discussion
> I
>> see:
>> 1.  People like the idea of bringing up fresh instances for each test
>> rather than keeping instances running all the time, since that ensures no
>> contamination between tests. That seems reasonable to me. If we see
>> flakiness in the tests or we note that setting up/tearing down instances
> is
>> taking a lot of time,
>> 2. Deciding on cluster management software/orchestration software - I
want
>> to make sure we land on the right tool here since choosing the wrong tool
>> could result in administration of the instances taking more work. I
> suspect
>> that's a good place for a follow up discussion, so I'll start a separate
>> thread on that. I'm happy with whatever tool we choose, but I want to
make
>> sure we take a moment to consider different options and have a reason for
>> choosing one.

Re: Hosting data stores for IO Transform testing

2016-12-14 Thread Etienne Chauchot

Hi guys,

@Stephen: I addressed all your comments directly in the PR, thanks!
I just wanted to comment here about the docker image I used: the only 
official Elastic image contains only ElasticSearch. But for testing I 
needed logstash (for ingestion) and kibana (not for integration tests, 
but to easily test REST requests to ES using sense). This is why I use 
an ELK (Elasticsearch+Logstash+Kibana) image. This one is released under 
the Apache 2 license.
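
For illustration, a minimal sketch (placeholders, not the actual scripts
from the PR) of how such a container can be brought up and torn down
around a locally-run test class; a readiness poll against the
Elasticsearch HTTP endpoint would be more robust than the fixed sleep:

import org.junit.AfterClass;
import org.junit.BeforeClass;

public abstract class ElkContainerTestBase {
  // Placeholder image/container names: substitute the ELK image used by the contrib scripts.
  private static final String IMAGE = "some-elk-image:latest";
  private static final String CONTAINER = "beam-es-it";

  @BeforeClass
  public static void startElk() throws Exception {
    exec("docker", "run", "-d", "--name", CONTAINER,
        "-p", "9200:9200", "-p", "5601:5601", IMAGE);
    Thread.sleep(30_000); // crude readiness wait
  }

  @AfterClass
  public static void stopElk() throws Exception {
    exec("docker", "rm", "-f", CONTAINER);
  }

  private static void exec(String... cmd) throws Exception {
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0) {
      throw new IllegalStateException("command failed: " + String.join(" ", cmd));
    }
  }
}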



Besides, there is also a point about where to store integration tests: 
JB proposed in the PR to store the integration tests in a dedicated module 
rather than directly in the IO module (as I did).




Etienne

On 01/12/2016 at 20:14, Stephen Sisk wrote:

hey!

thanks for sending this. I'm very excited to see this change. I added some
detail-oriented code review comments in addition to what I've discussed
here.

The general goal is to allow for re-usable instantiation of particular data
store instances and this seems like a good start. Looks like you also have
a script to generate test data for your tests - that's great.

The next steps (definitely not blocking your work) will be to have ways to
create instances from the docker images you have here, and use them in the
tests. We'll need support in the test framework for that since it'll be
different on developer machines and in the beam jenkins cluster, but your
scripts here allow someone running these tests locally to not have to worry
about getting the instance set up and can manually adjust, so this is a
good incremental step.

I have some thoughts now that I'm reviewing your scripts (that I didn't
have previously, so we are learning this together):
* It may be useful to try and document why we chose a particular docker
image as the base (ie, "this is the official supported elastic search
docker image" or "this image has several data stores together that can be
used for a couple different tests")  - I'm curious as to whether the
community thinks that is important

One thing that I called out in the comment that's worth mentioning on the
larger list - if you want to specify which specific runners a test uses,
that can be controlled in the pom for the module. I updated the testing doc
mentioned previously in this thread with a TODO to talk about this more. I
think we should also make it so that IO modules have that automatically, so
developers don't have to worry about it.

S

On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot  wrote:

Stephen,

As discussed, I added an injection script, docker container scripts and
integration tests to the sdks/java/io/elasticsearch/contrib
<
https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9
directory in that PR: https://github.com/apache/incubator-beam/pull/1439.

These work well but they are a first shot. Do you have any comments about
those?

Besides, I am not sure these files should be in the IO itself
(even in the contrib directory, outside the Maven source directories). Any thoughts?

Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:

It's great to hear more experiences.

I'm also glad to hear that people see real value in the high
volume/performance benchmark tests. I tried to capture that in the Testing
doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas of discussion

I

see:
1.  People like the idea of bringing up fresh instances for each test
rather than keeping instances running all the time, since that ensures no
contamination between tests. That seems reasonable to me. If we see
flakiness in the tests or we note that setting up/tearing down instances

is

taking a lot of time,
2. Deciding on cluster management software/orchestration software - I want
to make sure we land on the right tool here since choosing the wrong tool
could result in administration of the instances taking more work. I

suspect

that's a good place for a follow up discussion, so I'll start a separate
thread on that. I'm happy with whatever tool we choose, but I want to make
sure we take a moment to consider different options and have a reason for
choosing one.

Etienne - thanks for being willing to port your creation/other scripts
over. You might be a good early tester of whether this system works well
for everyone.

Stephen

[1]  Reasons for Beam Test Strategy -


https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#



On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
wrote:


I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the most
valuable tests we did were integration tests with ES on Docker and high
volume.

I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they should cover
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO
but exe

Re: Hosting data stores for IO Transform testing

2016-12-01 Thread Stephen Sisk
hey!

thanks for sending this. I'm very excited to see this change. I added some
detail-oriented code review comments in addition to what I've discussed
here.

The general goal is to allow for re-usable instantiation of particular data
store instances and this seems like a good start. Looks like you also have
a script to generate test data for your tests - that's great.

The next steps (definitely not blocking your work) will be to have ways to
create instances from the docker images you have here, and use them in the
tests. We'll need support in the test framework for that since it'll be
different on developer machines and in the beam jenkins cluster, but your
scripts here allow someone running these tests locally to not have to worry
about getting the instance set up and can manually adjust, so this is a
good incremental step.

I have some thoughts now that I'm reviewing your scripts (that I didn't
have previously, so we are learning this together):
* It may be useful to try and document why we chose a particular docker
image as the base (ie, "this is the official supported elastic search
docker image" or "this image has several data stores together that can be
used for a couple different tests")  - I'm curious as to whether the
community thinks that is important

One thing that I called out in the comment that's worth mentioning on the
larger list - if you want to specify which specific runners a test uses,
that can be controlled in the pom for the module. I updated the testing doc
mentioned previously in this thread with a TODO to talk about this more. I
think we should also make it so that IO modules have that automatically, so
developers don't have to worry about it.
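
For illustration, a minimal sketch of what that looks like from the test's
side: the test never hard-codes a runner, it just builds its options from
arguments supplied by the build, so a pom profile can flip the same test
between the direct runner and a cluster runner (the "testPipelineArgs"
property name below is only an example, not the actual mechanism):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerAgnosticTestHarness {
  public static Pipeline createPipeline() {
    // e.g. the pom passes -DtestPipelineArgs="--runner=DirectRunner --tempLocation=..."
    String args = System.getProperty("testPipelineArgs", "");
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args.isEmpty() ? new String[0] : args.split("\\s+"))
            .as(PipelineOptions.class);
    return Pipeline.create(options);
  }
}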

S

On Thu, Dec 1, 2016 at 9:00 AM Etienne Chauchot  wrote:

Stephen,

As discussed, I added an injection script, docker container scripts and
integration tests to the sdks/java/io/elasticsearch/contrib
<
https://github.com/apache/incubator-beam/pull/1439/files/1e7e2f0a6e1a1777d31ae2c886c920efccd708b5#diff-e243536428d06ade7d824cefcb3ed0b9
>
directory in that PR: https://github.com/apache/incubator-beam/pull/1439.

These work well but they are a first shot. Do you have any comments about
those?

Besides, I am not sure these files should be in the IO itself
(even in the contrib directory, outside the Maven source directories). Any thoughts?

Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:
> It's great to hear more experiences.
>
> I'm also glad to hear that people see real value in the high
> volume/performance benchmark tests. I tried to capture that in the Testing
> doc I shared, under "Reasons for Beam Test Strategy". [1]
>
> It does generally sound like we're in agreement here. Areas of discussion
I
> see:
> 1.  People like the idea of bringing up fresh instances for each test
> rather than keeping instances running all the time, since that ensures no
> contamination between tests. That seems reasonable to me. If we see
> flakiness in the tests or we note that setting up/tearing down instances
is
> taking a lot of time,
> 2. Deciding on cluster management software/orchestration software - I want
> to make sure we land on the right tool here since choosing the wrong tool
> could result in administration of the instances taking more work. I
suspect
> that's a good place for a follow up discussion, so I'll start a separate
> thread on that. I'm happy with whatever tool we choose, but I want to make
> sure we take a moment to consider different options and have a reason for
> choosing one.
>
> Etienne - thanks for being willing to port your creation/other scripts
> over. You might be a good early tester of whether this system works well
> for everyone.
>
> Stephen
>
> [1]  Reasons for Beam Test Strategy -
>
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>
>
>
> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
> wrote:
>
>> I second Etienne there.
>>
>> We worked together on the ElasticsearchIO and definitely, the most
>> valuable tests we did were integration tests with ES on Docker and high
>> volume.
>>
>> I think we have to distinguish the two kinds of tests:
>> 1. utests are located in the IO itself and basically they should cover
>> the core behaviors of the IO
>> 2. itests are located as contrib in the IO (they could be part of the IO
>> but executed by the integration-test plugin or a specific profile) that
>> deals with "real" backend and high volumes. The resources required by
>> the itest can be bootstrapped by Jenkins (for instance using
>> Mesos/Marathon and docker images as already discussed, and it's what I'm
>> doing on my own "server").
>>
>> It's basically what Stephen described.
>>
>> We must not rely only on itests: utests are very important and they
>> validate the core behavior.
>>
>> My $0.01 ;)
>>
>> Regards
>> JB
>>
>> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
>>> Hi Stephen,
>>>
>>> I like your proposition very much and I also agree that

Re: Hosting data stores for IO Transform testing

2016-12-01 Thread Etienne Chauchot

Stephen,

As discussed, I added an injection script, docker container scripts and 
integration tests to the sdks/java/io/elasticsearch/contrib 
 
directory in that PR: https://github.com/apache/incubator-beam/pull/1439.
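
For readers who want a feel for what the injection step amounts to without
the logstash setup, here is a minimal, hypothetical sketch that posts
generated documents to the Elasticsearch bulk endpoint (index/type names
and the document shape are made up; the real ingestion path is the
logstash script in the PR):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class InjectTestDocuments {
  public static void main(String[] args) throws Exception {
    URL bulk = new URL("http://localhost:9200/beam/test/_bulk"); // hypothetical index/type
    StringBuilder body = new StringBuilder();
    for (int i = 0; i < 1000; i++) {
      body.append("{\"index\":{}}\n"); // bulk action line, id auto-generated
      body.append("{\"id\":").append(i).append(",\"message\":\"doc ").append(i).append("\"}\n");
    }
    HttpURLConnection conn = (HttpURLConnection) bulk.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.toString().getBytes(StandardCharsets.UTF_8));
    }
    System.out.println("bulk status: " + conn.getResponseCode());
  }
}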


These work well but they are a first shot. Do you have any comments about 
those?

Besides, I am not sure these files should be in the IO itself 
(even in the contrib directory, outside the Maven source directories). Any thoughts?


Thanks,

Etienne



On 23/11/2016 at 19:03, Stephen Sisk wrote:

It's great to hear more experiences.

I'm also glad to hear that people see real value in the high
volume/performance benchmark tests. I tried to capture that in the Testing
doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas of discussion I
see:
1.  People like the idea of bringing up fresh instances for each test
rather than keeping instances running all the time, since that ensures no
contamination between tests. That seems reasonable to me. If we see
flakiness in the tests or we note that setting up/tearing down instances is
taking a lot of time,
2. Deciding on cluster management software/orchestration software - I want
to make sure we land on the right tool here since choosing the wrong tool
could result in administration of the instances taking more work. I suspect
that's a good place for a follow up discussion, so I'll start a separate
thread on that. I'm happy with whatever tool we choose, but I want to make
sure we take a moment to consider different options and have a reason for
choosing one.

Etienne - thanks for being willing to port your creation/other scripts
over. You might be a good early tester of whether this system works well
for everyone.

Stephen

[1]  Reasons for Beam Test Strategy -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#



On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
wrote:


I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the most
valuable tests we did were integration tests with ES on Docker and high
volume.

I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they should cover
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO
but executed by the integration-test plugin or a specific profile) that
deals with "real" backend and high volumes. The resources required by
the itest can be bootstrapped by Jenkins (for instance using
Mesos/Marathon and docker images as already discussed, and it's what I'm
doing on my own "server").

It's basically what Stephen described.

We must not rely only on itests: utests are very important and they
validate the core behavior.

My $0.01 ;)

Regards
JB

On 11/23/2016 09:27 AM, Etienne Chauchot wrote:

Hi Stephen,

I like your proposition very much and I also agree that docker + some
orchestration software would be great !

On the ElasticsearchIO (PR to be created this week) there are docker
container creation scripts and a logstash data ingestion script for the IT
environment, available in the contrib directory alongside the integration
tests themselves. I'll be happy to make them compliant with the new IT
environment.

What you say below about the need for an external IT environment is
particularly true. As an example, with ES what came out in the first
implementation was that there were problems starting at some high volume
of data (timeouts, ES windowing overflow...) that could not have been seen
on the embedded ES version. Also there were some particularities of the
external instance, like secondary (replica) shards, that were not visible
on the embedded instance.

Besides, I also favor bringing up instances before the test because it
ensures (amongst other things) that we start on a fresh dataset, so the
test is deterministic.

Etienne


On 23/11/2016 at 02:00, Stephen Sisk wrote:

Hi,

I'm excited we're getting lots of discussion going. There are many
threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the
questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each
section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that,

but

having lots of different thoughts on what the advantages/disadvantages

of

those are would be helpful (I'm not entirely sure of the protocol for
collaborating on su

Re: Hosting data stores for IO Transform testing

2016-11-27 Thread Jason Kuster
Hey all, figured I'd chime in here since I've been doing some work on
evaluating options for performance testing over the last few weeks.

The tl;dr version is that there are a number of tools out there, but that
the best one I was able to find was a tool called PerfKit Benchmarker
(PKB)[1]. I'll have a doc out to the mailing list tomorrow with the full
results of my investigation to bring everyone up to speed with what I've
found, but there are a few clear wins with PKB. As it turns out, they
already had the ability to benchmark Spark (I have a PR out to extend the
Spark functionality[2] and a couple more improvements in the works), and
I've put together some additional work in a branch on my repository[3] to
enable proof-of-concept Dataflow Java benchmarks. I'm pretty excited about
it overall. Specifically regarding data stores, there are two things that
make PKB attractive: they already support both container orchestration
systems we're considering (Mesos / Marathon & Kubernetes), and they have
built-in benchmark phases for spin-up and spin-down of resources that are
isolated from the actual benchmark itself, so you get unpolluted data out
of the box without having to do any log parsing or building multiple phases
yourself. This means that once we have docker images set up it'd be pretty
much turn-key to spin up / down instances of data stores for our tests.
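
The phase separation is the part that matters most for us. As a plain-Java
illustration of the principle (independent of PKB itself, with placeholder
method bodies), provisioning and teardown are timed but reported
separately, so the measured number only covers the workload:

public class PhasedBenchmark {
  public static void main(String[] args) throws Exception {
    long t0 = System.nanoTime();
    provision();   // spin-up phase: start data store containers, load test data
    long t1 = System.nanoTime();
    runWorkload(); // measured phase: run the pipeline under test
    long t2 = System.nanoTime();
    teardown();    // spin-down phase: remove the containers
    System.out.printf("provision=%dms run=%dms%n",
        (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
  }

  private static void provision() throws Exception {}
  private static void runWorkload() throws Exception {}
  private static void teardown() throws Exception {}
}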

Let me know what you think. As mentioned, I'll also have an email out
tomorrow with a doc evaluating the performance testing options I found;
please chime in there too if you're interested!

[1] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
[2] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1214
[3] https://github.com/jasonkuster/PerfKitBenchmarker/tree/beam


On Wed, Nov 23, 2016 at 3:33 PM, Stephen Sisk 
wrote:

> Since this thread has been active recently and I feel like we're
> mid-discussion, I just wanted to let folks know that I won't be checking
> mail thursday/friday (US thanksgiving holiday) - I'll be back next monday.
>
> Thanks!
> Stephen
>
> On Wed, Nov 23, 2016 at 10:03 AM Stephen Sisk  wrote:
>
> > It's great to hear more experiences.
> >
> > I'm also glad to hear that people see real value in the high
> > volume/performance benchmark tests. I tried to capture that in the
> Testing
> > doc I shared, under "Reasons for Beam Test Strategy". [1]
> >
> > It does generally sound like we're in agreement here. Areas of discussion
> > I see:
> > 1.  People like the idea of bringing up fresh instances for each test
> > rather than keeping instances running all the time, since that ensures no
> > contamination between tests. That seems reasonable to me. If we see
> > flakiness in the tests or we note that setting up/tearing down instances
> is
> > taking a lot of time,
> > 2. Deciding on cluster management software/orchestration software - I
> want
> > to make sure we land on the right tool here since choosing the wrong tool
> > could result in administration of the instances taking more work. I
> suspect
> > that's a good place for a follow up discussion, so I'll start a separate
> > thread on that. I'm happy with whatever tool we choose, but I want to
> make
> > sure we take a moment to consider different options and have a reason for
> > choosing one.
> >
> > Etienne - thanks for being willing to port your creation/other scripts
> > over. You might be a good early tester of whether this system works well
> > for everyone.
> >
> > Stephen
> >
> > [1]  Reasons for Beam Test Strategy -
> > https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-
> NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
> >
> >
> >
> > On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
> > wrote:
> >
> > I second Etienne there.
> >
> > We worked together on the ElasticsearchIO and definitely, the most
> > valuable tests we did were integration tests with ES on Docker and high
> > volume.
> >
> > I think we have to distinguish the two kinds of tests:
> > 1. utests are located in the IO itself and basically they should cover
> > the core behaviors of the IO
> > 2. itests are located as contrib in the IO (they could be part of the IO
> > but executed by the integration-test plugin or a specific profile) that
> > deals with "real" backend and high volumes. The resources required by
> > the itest can be bootstrapped by Jenkins (for instance using
> > Mesos/Marathon and docker images as already discussed, and it's what I'm
> > doing on my own "server").
> >
> > It's basically what Stephen described.
> >
> > We must not rely only on itests: utests are very important and they
> > validate the core behavior.
> >
> > My $0.01 ;)
> >
> > Regards
> > JB
> >
> > On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> > > Hi Stephen,
> > >
> > > I like your proposition very much and I also agree that docker + some
> > > orchestration software would be great !
> > >
> > > On the elasticsearchIO (PR to be created this week) there is docker
> > > container creati

Re: Hosting data stores for IO Transform testing

2016-11-23 Thread Stephen Sisk
Since this thread has been active recently and I feel like we're
mid-discussion, I just wanted to let folks know that I won't be checking
mail thursday/friday (US thanksgiving holiday) - I'll be back next monday.

Thanks!
Stephen

On Wed, Nov 23, 2016 at 10:03 AM Stephen Sisk  wrote:

> It's great to hear more experiences.
>
> I'm also glad to hear that people see real value in the high
> volume/performance benchmark tests. I tried to capture that in the Testing
> doc I shared, under "Reasons for Beam Test Strategy". [1]
>
> It does generally sound like we're in agreement here. Areas of discussion
> I see:
> 1.  People like the idea of bringing up fresh instances for each test
> rather than keeping instances running all the time, since that ensures no
> contamination between tests. That seems reasonable to me. If we see
> flakiness in the tests or we note that setting up/tearing down instances is
> taking a lot of time,
> 2. Deciding on cluster management software/orchestration software - I want
> to make sure we land on the right tool here since choosing the wrong tool
> could result in administration of the instances taking more work. I suspect
> that's a good place for a follow up discussion, so I'll start a separate
> thread on that. I'm happy with whatever tool we choose, but I want to make
> sure we take a moment to consider different options and have a reason for
> choosing one.
>
> Etienne - thanks for being willing to port your creation/other scripts
> over. You might be a good early tester of whether this system works well
> for everyone.
>
> Stephen
>
> [1]  Reasons for Beam Test Strategy -
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#
>
>
>
> On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
> wrote:
>
> I second Etienne there.
>
> We worked together on the ElasticsearchIO and definitely, the most
> valuable tests we did were integration tests with ES on Docker and high
> volume.
>
> I think we have to distinguish the two kinds of tests:
> 1. utests are located in the IO itself and basically they should cover
> the core behaviors of the IO
> 2. itests are located as contrib in the IO (they could be part of the IO
> but executed by the integration-test plugin or a specific profile) that
> deals with "real" backend and high volumes. The resources required by
> the itest can be bootstrapped by Jenkins (for instance using
> Mesos/Marathon and docker images as already discussed, and it's what I'm
> doing on my own "server").
>
> It's basically what Stephen described.
>
> We must not rely only on itests: utests are very important and they
> validate the core behavior.
>
> My $0.01 ;)
>
> Regards
> JB
>
> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> > Hi Stephen,
> >
> > I like your proposition very much and I also agree that docker + some
> > orchestration software would be great !
> >
> > On the ElasticsearchIO (PR to be created this week) there are docker
> > container creation scripts and a logstash data ingestion script for the IT
> > environment, available in the contrib directory alongside the integration
> > tests themselves. I'll be happy to make them compliant with the new IT
> > environment.
> >
> > What you say below about the need for an external IT environment is
> > particularly true. As an example, with ES what came out in the first
> > implementation was that there were problems starting at some high volume
> > of data (timeouts, ES windowing overflow...) that could not have been seen
> > on the embedded ES version. Also there were some particularities of the
> > external instance, like secondary (replica) shards, that were not visible
> > on the embedded instance.
> >
> > Besides, I also favor bringing up instances before the test because it
> > ensures (amongst other things) that we start on a fresh dataset, so the
> > test is deterministic.
> >
> > Etienne
> >
> >
> > On 23/11/2016 at 02:00, Stephen Sisk wrote:
> >> Hi,
> >>
> >> I'm excited we're getting lots of discussion going. There are many
> >> threads
> >> of conversation here, we may choose to split some of them off into a
> >> different email thread. I'm also betting I missed some of the
> >> questions in
> >> this thread, so apologies ahead of time for that. Also apologies for the
> >> amount of text, I provided some quick summaries at the top of each
> >> section.
> >>
> >> Amit - thanks for your thoughts. I've responded in detail below.
> >> Ismael - thanks for offering to help. There's plenty of work here to go
> >> around. I'll try and think about how we can divide up some next steps
> >> (probably in a separate thread.) The main next step I see is deciding
> >> between kubernetes/mesos+marathon/docker swarm - I'm working on that,
> but
> >> having lots of different thoughts on what the advantages/disadvantages
> of
> >> those are would be helpful (I'm not entirely sure of the protocol for
> >> collaborating on sub-projects like this.)
> >>
> >> These issues are all related to what ki

Re: Hosting data stores for IO Transform testing

2016-11-23 Thread Stephen Sisk
It's great to hear more experiences.

I'm also glad to hear that people see real value in the high
volume/performance benchmark tests. I tried to capture that in the Testing
doc I shared, under "Reasons for Beam Test Strategy". [1]

It does generally sound like we're in agreement here. Areas of discussion I
see:
1.  People like the idea of bringing up fresh instances for each test
rather than keeping instances running all the time, since that ensures no
contamination between tests. That seems reasonable to me. If we see
flakiness in the tests or we note that setting up/tearing down instances is
taking a lot of time,
2. Deciding on cluster management software/orchestration software - I want
to make sure we land on the right tool here since choosing the wrong tool
could result in administration of the instances taking more work. I suspect
that's a good place for a follow up discussion, so I'll start a separate
thread on that. I'm happy with whatever tool we choose, but I want to make
sure we take a moment to consider different options and have a reason for
choosing one.

Etienne - thanks for being willing to port your creation/other scripts
over. You might be a good early tester of whether this system works well
for everyone.

Stephen

[1]  Reasons for Beam Test Strategy -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#



On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
wrote:

> I second Etienne there.
>
> We worked together on the ElasticsearchIO and definitely, the most
> valuable tests we did were integration tests with ES on Docker and high
> volume.
>
> I think we have to distinguish the two kinds of tests:
> 1. utests are located in the IO itself and basically they should cover
> the core behaviors of the IO
> 2. itests are located as contrib in the IO (they could be part of the IO
> but executed by the integration-test plugin or a specific profile) that
> deals with "real" backend and high volumes. The resources required by
> the itest can be bootstrapped by Jenkins (for instance using
> Mesos/Marathon and docker images as already discussed, and it's what I'm
> doing on my own "server").
>
> It's basically what Stephen described.
>
> We must not rely only on itests: utests are very important and they
> validate the core behavior.
>
> My $0.01 ;)
>
> Regards
> JB
>
> On 11/23/2016 09:27 AM, Etienne Chauchot wrote:
> > Hi Stephen,
> >
> > I like your proposition very much and I also agree that docker + some
> > orchestration software would be great !
> >
> > On the ElasticsearchIO (PR to be created this week) there are docker
> > container creation scripts and a logstash data ingestion script for the IT
> > environment, available in the contrib directory alongside the integration
> > tests themselves. I'll be happy to make them compliant with the new IT
> > environment.
> >
> > What you say below about the need for an external IT environment is
> > particularly true. As an example, with ES what came out in the first
> > implementation was that there were problems starting at some high volume
> > of data (timeouts, ES windowing overflow...) that could not have been seen
> > on the embedded ES version. Also there were some particularities of the
> > external instance, like secondary (replica) shards, that were not visible
> > on the embedded instance.
> >
> > Besides, I also favor bringing up instances before the test because it
> > ensures (amongst other things) that we start on a fresh dataset, so the
> > test is deterministic.
> >
> > Etienne
> >
> >
> > On 23/11/2016 at 02:00, Stephen Sisk wrote:
> >> Hi,
> >>
> >> I'm excited we're getting lots of discussion going. There are many
> >> threads
> >> of conversation here, we may choose to split some of them off into a
> >> different email thread. I'm also betting I missed some of the
> >> questions in
> >> this thread, so apologies ahead of time for that. Also apologies for the
> >> amount of text, I provided some quick summaries at the top of each
> >> section.
> >>
> >> Amit - thanks for your thoughts. I've responded in detail below.
> >> Ismael - thanks for offering to help. There's plenty of work here to go
> >> around. I'll try and think about how we can divide up some next steps
> >> (probably in a separate thread.) The main next step I see is deciding
> >> between kubernetes/mesos+marathon/docker swarm - I'm working on that,
> but
> >> having lots of different thoughts on what the advantages/disadvantages
> of
> >> those are would be helpful (I'm not entirely sure of the protocol for
> >> collaborating on sub-projects like this.)
> >>
> >> These issues are all related to what kind of tests we want to write. I
> >> think a kubernetes/mesos/swarm cluster could support all the use cases
> >> we've discussed here (and thus should not block moving forward with
> >> this),
> >> but understanding what we want to test will help us understand how the
> >> cluster will be used. I'm working on a proposed user guide for testing
> 

Re: Hosting data stores for IO Transform testing

2016-11-23 Thread Jean-Baptiste Onofré

I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the most 
valuable tests we did were integration tests with ES on Docker and high 
volume.


I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they should cover 
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO 
but executed by the integration-test plugin or a specific profile) that 
deals with "real" backend and high volumes. The resources required by 
the itest can be bootstrapped by Jenkins (for instance using 
Mesos/Marathon and docker images as already discussed, and it's what I'm 
doing on my own "server").


It's basically what Stephen described.
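
As an illustration of the shape such an itest can take (hypothetical class,
property name and endpoint check; not the actual Elasticsearch tests): the
"IT" suffix keeps it out of the default unit-test run, so the
integration-test plugin or a dedicated profile can pick it up against a
backend bootstrapped by the CI environment:

import static org.junit.Assert.assertEquals;

import java.net.HttpURLConnection;
import java.net.URL;
import org.junit.Test;

public class ElasticsearchSmokeIT {
  // Assumption: the CI bootstrap (Jenkins + Mesos/Marathon or similar) injects the real endpoint.
  private static final String ES_URL =
      System.getProperty("elasticsearchUrl", "http://localhost:9200");

  @Test
  public void realInstanceIsReachable() throws Exception {
    // Minimal connectivity check against the real backend; the actual itests
    // would run the IO in a pipeline and assert on the data it reads/writes.
    HttpURLConnection conn = (HttpURLConnection) new URL(ES_URL + "/_count").openConnection();
    assertEquals(200, conn.getResponseCode());
  }
}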

We must not rely only on itests: utests are very important and they 
validate the core behavior.


My $0.01 ;)

Regards
JB

On 11/23/2016 09:27 AM, Etienne Chauchot wrote:

Hi Stephen,

I like your proposition very much and I also agree that docker + some
orchestration software would be great !

On the ElasticsearchIO (PR to be created this week) there are docker
container creation scripts and a logstash data ingestion script for the IT
environment, available in the contrib directory alongside the integration
tests themselves. I'll be happy to make them compliant with the new IT
environment.

What you say below about the need for an external IT environment is
particularly true. As an example, with ES what came out in the first
implementation was that there were problems starting at some high volume
of data (timeouts, ES windowing overflow...) that could not have been seen
on the embedded ES version. Also there were some particularities of the
external instance, like secondary (replica) shards, that were not visible
on the embedded instance.

Besides, I also favor bringing up instances before the test because it
ensures (amongst other things) that we start on a fresh dataset, so the
test is deterministic.

Etienne


On 23/11/2016 at 02:00, Stephen Sisk wrote:

Hi,

I'm excited we're getting lots of discussion going. There are many
threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the
questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each
section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that, but
having lots of different thoughts on what the advantages/disadvantages of
those are would be helpful (I'm not entirely sure of the protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to write. I
think a kubernetes/mesos/swarm cluster could support all the use cases
we've discussed here (and thus should not block moving forward with
this),
but understanding what we want to test will help us understand how the
cluster will be used. I'm working on a proposed user guide for testing IO
Transforms, and I'm going to send out a link to that + a short summary to
the list shortly so folks can get a better sense of where I'm coming
from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing

Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data
stores.
I think we should test everything we possibly can using them, and do the
majority of our correctness testing using embedded versions + the direct
runner. However, it's also important to have at least one test that
actually connects to an actual instance, so we can get coverage for
things
like credentials, real connection strings, etc...

The key point is that embedded versions definitely can't cover the
performance tests, so we need to host instances if we want to test that.

I consider the integration tests/performance benchmarks to be costly
things
that we do only for the IO transforms with large amounts of community
support/usage. A random IO transform used by a few users doesn't
necessarily need integration & perf tests, but for heavily used IO
transforms, there's a lot of community value in these tests. The
maintenance proposal below scales with the amount of community support
for
a particular IO transform.



Reusing data stores ("use the data stores across executions.")
--
Summary: I favor a hybrid approach: some frequently used, very small
instances that we keep up all the time + larger multi-container data
store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this question, but I
think
we do need to know what rang

Re: Hosting data stores for IO Transform testing

2016-11-23 Thread Etienne Chauchot

Hi Stephen,

I like your proposition very much and I also agree that docker + some 
orchestration software would be great !


On the ElasticsearchIO (PR to be created this week) there are docker 
container creation scripts and a logstash data ingestion script for the IT 
environment, available in the contrib directory alongside the integration 
tests themselves. I'll be happy to make them compliant with the new IT 
environment.

What you say below about the need for an external IT environment is 
particularly true. As an example, with ES what came out in the first 
implementation was that there were problems starting at some high volume 
of data (timeouts, ES windowing overflow...) that could not have been seen 
on the embedded ES version. Also there were some particularities of the 
external instance, like secondary (replica) shards, that were not visible 
on the embedded instance.

Besides, I also favor bringing up instances before the test because it 
ensures (amongst other things) that we start on a fresh dataset, so the 
test is deterministic.


Etienne


On 23/11/2016 at 02:00, Stephen Sisk wrote:

Hi,

I'm excited we're getting lots of discussion going. There are many threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that, but
having lots of different thoughts on what the advantages/disadvantages of
those are would be helpful (I'm not entirely sure of the protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to write. I
think a kubernetes/mesos/swarm cluster could support all the use cases
we've discussed here (and thus should not block moving forward with this),
but understanding what we want to test will help us understand how the
cluster will be used. I'm working on a proposed user guide for testing IO
Transforms, and I'm going to send out a link to that + a short summary to
the list shortly so folks can get a better sense of where I'm coming from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing

Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data stores.
I think we should test everything we possibly can using them, and do the
majority of our correctness testing using embedded versions + the direct
runner. However, it's also important to have at least one test that
actually connects to an actual instance, so we can get coverage for things
like credentials, real connection strings, etc...

The key point is that embedded versions definitely can't cover the
performance tests, so we need to host instances if we want to test that.

I consider the integration tests/performance benchmarks to be costly things
that we do only for the IO transforms with large amounts of community
support/usage. A random IO transform used by a few users doesn't
necessarily need integration & perf tests, but for heavily used IO
transforms, there's a lot of community value in these tests. The
maintenance proposal below scales with the amount of community support for
a particular IO transform.



Reusing data stores ("use the data stores across executions.")
--
Summary: I favor a hybrid approach: some frequently used, very small
instances that we keep up all the time + larger multi-container data store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this question, but I think
we do need to know what range of capabilities we need, and use that to
inform our requirements on the hosting infrastructure. I think
kubernetes/mesos + docker can support all the scenarios I discuss below.

I had been thinking of a hybrid approach - reuse some instances and don't
reuse others. Some tests require isolation from other tests (eg.
performance benchmarking), while others can easily re-use the same
database/data store instance over time, provided they are written in the
correct manner (eg. a simple read or write correctness integration tests)

To me, the question of whether to use one instance over time for a test vs
spin up an instance for each test comes down to a trade off between these
factors:
1. Flakiness of spin-up of an instance - if it's super flaky, we'll want to
keep more instances up and running rather than bring them up/down. (this
may also vary by the data store in question)
2. Frequency of testing - i

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Jean-Baptiste Onofré

Hi Ismaël,

FYI, we also test the IOs on small Spark and Flink clusters (not yet 
Apex): it's where I'm using Mesos/Marathon.


It's not a large cluster, but the integration tests are performed (by 
hand) on clusters.


We already discussed with Stephan and Jason using Marathon JSON and 
Mesos docker images bootstrapped by Jenkins for the itests.


Regards
JB

On 11/22/2016 04:58 PM, Ismaël Mejía wrote:

​Hello,

@Stephen Thanks for your proposal, it is really interesting, I would really
like to help with this. I have never played with Kubernetes but this seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics, I don’t know how programmable kubernetes is to achieve this for
example:

Let’s say we have a cluster of Cassandra or Kafka nodes: I would like to
have programmatic tests to simulate failure (e.g. kill a node), or simulate
a really slow node, to ensure that the IO behaves as expected in the Beam
pipeline for the given runner.
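
A rough sketch of what such a programmatic failure test could look like,
assuming the orchestration layer lets the test reach the containers
(container name and pipeline are placeholders): kill one node partway
through the run and assert that the pipeline still completes:

import org.junit.Test;

public class NodeFailureIT {
  @Test
  public void pipelineSurvivesLosingOneNode() throws Exception {
    Thread chaos = new Thread(() -> {
      try {
        Thread.sleep(10_000); // let the pipeline make progress first
        new ProcessBuilder("docker", "kill", "cassandra-node-2") // placeholder node name
            .inheritIO().start().waitFor();
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    });
    chaos.start();
    runPipelineAgainstCluster(); // placeholder: run the pipeline that reads/writes the cluster
    chaos.join();
  }

  private void runPipelineAgainstCluster() {
    // Placeholder for the actual pipeline under test; failure to finish fails the test.
  }
}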

Another related idea is to improve IO consistency: today the different IOs
have small differences in their failure behavior, and I really would like to
be able to predict with more precision what will happen in case of errors.
For example, what is the correct behavior if I am writing to a Kafka node and
there is a network partition: does the Kafka sink retry or not? And what
about the JdbcIO, will it work the same, e.g. assuming checkpointing? Or do
we guarantee exactly-once writes somehow? Today I am not sure what happens
(or whether the expected behavior depends on the runner), but maybe it is
just that I don't know and we already have tests to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart of the
data stores, I think it will be also really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can mount the cluster?

Maybe my ideas are a bit too far away for an initial setup, but it will be
really nice to start working on this.

Ismael​


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:


Hi Stephen,

I was wondering about how we plan to use the data stores across executions.

Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, tear down the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a deployment
which could become heavy and complicated as more IOs are supported and more
test cases introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it (such as an
embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:


Hi Stephen,

as we already discussed a bit together, it sounds great! I especially like
it as both an integration test platform and good coverage for the IOs.

I'm very late on this but, as said, I will share with you my Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit with Kubernetes and Swarm, but
it's not yet complete. I will share what I have on the same github repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of 

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Stephen Sisk
Here's a link to the doc I mentioned that discusses implementing tests for
Beam IO transforms -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
This doc is definitely not finished (the benchmark section is notably missing,
and other sections are definitely still draft :), but I thought it would be
useful to talk about the high-level goals there.

High level points from the doc
===
I think the most important part of this doc for now is the goals of testing
IO transforms that I propose, and how I propose that we cover them:
1. IO Transform is correct - corner cases incl. DWR, thread-safety (cover
with unit tests)
2. IO Transform works w/ real instance of data store (cover with
integration tests)
3.  Runners correctly run IO Transforms (integration tests)
4. IO Transform/Runner pairs scale - to "medium data" [1] (perf test)
5. IO Transform/Runner pairs - basic correctness at scale (basic output
validation on perf test)
[1] medium data = 5 data store instances and/or NNN GB data? TBD

I go into more detail about the goals in a table in the section "Test
Strategy for Beam IO"

I also discuss setting up mocks/fakes in unit tests, as well as a strategy
for testing network failures/retries (as discussed in my last email).
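As a rough illustration of the fakes idea (entirely made-up names, not code
from the doc): a fake client can inject a transient failure so a unit test
can assert retry behavior without touching the network at all.

import static org.junit.Assert.assertEquals;

import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import org.junit.Test;

public class RetryBehaviorTest {
  /** Hypothetical client interface a transform might talk to. */
  interface StoreClient {
    String get(String key) throws IOException;
  }

  /** Fake that fails the first N calls, imitating a flaky network. */
  static StoreClient flakyClient(int failures) {
    AtomicInteger calls = new AtomicInteger();
    return key -> {
      if (calls.incrementAndGet() <= failures) {
        throw new IOException("simulated network failure");
      }
      return "value-for-" + key;
    };
  }

  /** Stand-in for the retry loop a reader/writer might implement. */
  static String getWithRetries(StoreClient client, String key, int maxAttempts)
      throws IOException {
    IOException last = null;
    for (int i = 0; i < maxAttempts; i++) {
      try {
        return client.get(key);
      } catch (IOException e) {
        last = e;
      }
    }
    throw last;
  }

  @Test
  public void retriesPastTransientFailures() throws Exception {
    assertEquals("value-for-k", getWithRetries(flakyClient(2), "k", 3));
  }
}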

Stephen


On Tue, Nov 22, 2016 at 5:00 PM Stephen Sisk  wrote:

Hi,

I'm excited we're getting lots of discussion going. There are many threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that, but
having lots of different thoughts on what the advantages/disadvantages of
those are would be helpful (I'm not entirely sure of the protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to write. I
think a kubernetes/mesos/swarm cluster could support all the use cases
we've discussed here (and thus should not block moving forward with this),
but understanding what we want to test will help us understand how the
cluster will be used. I'm working on a proposed user guide for testing IO
Transforms, and I'm going to send out a link to that + a short summary to
the list shortly so folks can get a better sense of where I'm coming from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing

Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data stores.
I think we should test everything we possibly can using them, and do the
majority of our correctness testing using embedded versions + the direct
runner. However, it's also important to have at least one test that
actually connects to an actual instance, so we can get coverage for things
like credentials, real connection strings, etc...

The key point is that embedded versions definitely can't cover the
performance tests, so we need to host instances if we want to test that.

I consider the integration tests/performance benchmarks to be costly things
that we do only for the IO transforms with large amounts of community
support/usage. A random IO transform used by a few users doesn't
necessarily need integration & perf tests, but for heavily used IO
transforms, there's a lot of community value in these tests. The
maintenance proposal below scales with the amount of community support for
a particular IO transform.



Reusing data stores ("use the data stores across executions.")
--
Summary: I favor a hybrid approach: some frequently used, very small
instances that we keep up all the time + larger multi-container data store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this question, but I think
we do need to know what range of capabilities we need, and use that to
inform our requirements on the hosting infrastructure. I think
kubernetes/mesos + docker can support all the scenarios I discuss below.

I had been thinking of a hybrid approach - reuse some instances and don't
reuse others. Some tests require isolation from other tests (eg.
performance benchmarking), while others can easily re-use the same
database/data store instance over time, provided they are written in the
correct manner (eg. a simple read or write correctness integration tests)

To me, the question of whether to use one instance over time for a test vs
spin up an instance for each 

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Stephen Sisk
Hi,

I'm excited we're getting lots of discussion going. There are many threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that, but
having lots of different thoughts on what the advantages/disadvantages of
those are would be helpful (I'm not entirely sure of the protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to write. I
think a kubernetes/mesos/swarm cluster could support all the use cases
we've discussed here (and thus should not block moving forward with this),
but understanding what we want to test will help us understand how the
cluster will be used. I'm working on a proposed user guide for testing IO
Transforms, and I'm going to send out a link to that + a short summary to
the list shortly so folks can get a better sense of where I'm coming from.



Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing

Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data stores.
I think we should test everything we possibly can using them, and do the
majority of our correctness testing using embedded versions + the direct
runner. However, it's also important to have at least one test that
actually connects to an actual instance, so we can get coverage for things
like credentials, real connection strings, etc...
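To sketch what that "at least one real instance" test could look like (the
property names below are invented; in practice this would presumably flow
through pipeline options and the pom): the test reads connection details
from configuration and is skipped when none are provided, so it only runs
where an instance has actually been provisioned.

import static org.junit.Assume.assumeTrue;

import java.net.InetSocketAddress;
import java.net.Socket;
import org.junit.Test;

public class RealInstanceConnectivityIT {
  @Test
  public void connectsWithRealCredentials() throws Exception {
    // Hypothetical property names; a Jenkins job or a developer would set
    // them to point at the hosted (or locally spun-up) data store instance.
    String host = System.getProperty("io.it.host");
    String port = System.getProperty("io.it.port");
    assumeTrue("No real instance configured; skipping.", host != null && port != null);

    // The real test would run a pipeline against the instance; here we only
    // show the "real connection string" part by opening a socket to it.
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, Integer.parseInt(port)), 5_000);
    }
  }
}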

The key point is that embedded versions definitely can't cover the
performance tests, so we need to host instances if we want to test that.

I consider the integration tests/performance benchmarks to be costly things
that we do only for the IO transforms with large amounts of community
support/usage. A random IO transform used by a few users doesn't
necessarily need integration & perf tests, but for heavily used IO
transforms, there's a lot of community value in these tests. The
maintenance proposal below scales with the amount of community support for
a particular IO transform.



Reusing data stores ("use the data stores across executions.")
--
Summary: I favor a hybrid approach: some frequently used, very small
instances that we keep up all the time + larger multi-container data store
instances that we spin up for perf tests.

I don't think we need to have a strong answer to this question, but I think
we do need to know what range of capabilities we need, and use that to
inform our requirements on the hosting infrastructure. I think
kubernetes/mesos + docker can support all the scenarios I discuss below.

I had been thinking of a hybrid approach - reuse some instances and don't
reuse others. Some tests require isolation from other tests (e.g.
performance benchmarking), while others can easily re-use the same
database/data store instance over time, provided they are written in the
correct manner (e.g. a simple read or write correctness integration test).

To me, the question of whether to use one instance over time for a test vs
spin up an instance for each test comes down to a trade-off between these
factors:
1. Flakiness of spin-up of an instance - if it's super flaky, we'll want to
keep more instances up and running rather than bring them up/down. (this
may also vary by the data store in question)
2. Frequency of testing - if we are running tests every 5 minutes, it may
be wasteful to bring machines up/down every time. If we run tests once a
day or week, it seems wasteful to keep the machines up the whole time.
3. Isolation requirements - If tests must be isolated, it means we either
have to bring up the instances for each test, or we have to have some sort
of signaling mechanism to indicate that a given instance is in use. I
strongly favor bringing up an instance per test.
4. Number/size of containers - if we need a large number of machines for a
particular test, keeping them running all the time will use more resources.


The major unknown to me is how flaky it'll be to spin these up. I'm
hopeful/assuming they'll be pretty stable to bring up, but I think the best
way to test that is to start doing it.
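For what "spin up an instance per test" might look like from the test
harness side, here's a deliberately naive sketch that just shells out to a
local Docker daemon (a kubernetes/mesos-backed version would swap these
calls for the cluster's API; image names and ports are placeholders):

/** Naive per-test container lifecycle: docker run before, docker rm -f after. */
public class DockerInstance implements AutoCloseable {
  private final String containerId;

  public DockerInstance(String image, String portMapping) throws Exception {
    containerId = run("docker", "run", "-d", "-p", portMapping, image).trim();
  }

  @Override
  public void close() throws Exception {
    run("docker", "rm", "-f", containerId);
  }

  private static String run(String... cmd) throws Exception {
    Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
    try (java.util.Scanner s = new java.util.Scanner(p.getInputStream()).useDelimiter("\\A")) {
      String output = s.hasNext() ? s.next() : "";
      if (p.waitFor() != 0) {
        throw new IllegalStateException("Command failed: " + String.join(" ", cmd) + "\n" + output);
      }
      return output;
    }
  }
}

// Usage in a test (hypothetical image/port):
//   try (DockerInstance es = new DockerInstance("elasticsearch:2.4", "9200:9200")) {
//     ... wait for readiness, then run the pipeline against localhost:9200 ...
//   }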

I suspect the sweet spot is the following: have a set of very small data
store instances that stay up to support small-data-size post-commit end to
end tests (post-commits run frequently and the data size means the
instances would not use many resources), combined with the ability to spin
up larg

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Sourabh Bajaj
Makes sense, thanks for answering.

On Tue, Nov 22, 2016 at 11:24 AM Jean-Baptiste Onofré 
wrote:

> Hi Sourabh,
>
> We raised the IO versioning point couple of months ago on the mailing list.
>
> Basically, we have two options:
>
> 1. Same modules (for example sdks/java/io/kafka) with one branch per
> version (kafka-0.8 kafka-0.10)
> 2. Several modules: sdks/java/io/kafka-0.8 sdks/java/io/kafka-0.10
>
> My preferences is on 2:
> Pros:
> - the IO can still be part of the main Beam release
> - it's more visible for contribution
> Cons:
> - we might have code duplication
>
> Regards
> JB
>
> On 11/22/2016 08:12 PM, Sourabh Bajaj wrote:
> > Hi,
> >
> > One tangential question I had around the proposal was how do we currently
> > deal with versioning in IO sources/sinks.
> >
> > For example Cassandra 1.2 vs 2.1 have some differences between them, so
> the
> > checked in sources and sink probably supports a particular version right
> > now. If yes, follow questions would be around how do we handle updating ?
> > deprecating and documenting the supported versions.
> >
> > I can move this to a new thread if this seems like a different
> discussion.
> > Also if this has already been answered please feel free to direct me to a
> > doc or past thread.
> >
> > Thanks
> > Sourabh
> >
> > On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía  wrote:
> >
> >> ​Hello,
> >>
> >> @Stephen Thanks for your proposal, it is really interesting, I would
> really
> >> like to help with this. I have never played with Kubernetes but this
> seems
> >> a really nice chance to do something useful with it.
> >>
> >> We (at Talend) are testing most of the IOs using simple container images
> >> and in some particular cases ‘clusters’ of containers using
> docker-compose
> >> (a little bit like Amit’s (2) proposal). It would be really nice to have
> >> this at the Beam level, in particular to try to test more complex
> >> semantics, I don’t know how programmable kubernetes is to achieve this
> for
> >> example:
> >>
> >> Let’s think we have a cluster of Cassandra or Kafka nodes, I would like
> to
> >> have programmatic tests to simulate failure (e.g. kill a node), or
> simulate
> >> a really slow node, to ensure that the IO behaves as expected in the
> Beam
> >> pipeline for the given runner.
> >>
> >> Another related idea is to improve IO consistency: Today the different
> IOs
> >> have small differences in their failure behavior, I really would like
> to be
> >> able to predict with more precision what will happen in case of errors,
> >> e.g. what is the correct behavior if I am writing to a Kafka node and
> there
> >> is a network partition, does the Kafka sink retries or no ? and what if
> it
> >> is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or
> do
> >> we guarantee exactly once writes somehow?, today I am not sure about
> what
> >> happens (or if the expected behavior depends on the runner), but well
> maybe
> >> it is just that I don’t know and we have tests to ensure this.
> >>
> >> Of course both are really hard problems, but I think with your proposal
> we
> >> can try to tackle them, as well as the performance ones. And apart of
> the
> >> data stores, I think it will be also really nice to be able to test the
> >> runners in a distributed manner.
> >>
> >> So what is the next step? How do you imagine such integration tests? ?
> Who
> >> can provide the test machines so we can mount the cluster?
> >>
> >> Maybe my ideas are a bit too far away for an initial setup, but it will
> be
> >> really nice to start working on this.
> >>
> >> Ismael​
> >>
> >>
> >> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela 
> wrote:
> >>
> >>> Hi Stephen,
> >>>
> >>> I was wondering about how we plan to use the data stores across
> >> executions.
> >>>
> >>> Clearly, it's best to setup a new instance (container) for every test,
> >>> running a "standalone" store (say HBase/Cassandra for example), and
> once
> >>> the test is done, teardown the instance. It should also be agnostic to
> >> the
> >>> runtime environment (e.g., Docker on Kubernetes).
> >>> I'm wondering though what's the overhead of managing such a deployment
> >>> which could become heavy and complicated as more IOs are supported and
> >> more
> >>> test cases introduced.
> >>>
> >>> Another way to go would be to have small clusters of different data
> >> stores
> >>> and run against new "namespaces" (while lazily evicting old ones), but
> I
> >>> think this is less likely as maintaining a distributed instance (even a
> >>> small one) for each data store sounds even more complex.
> >>>
> >>> A third approach would be to to simply have an "embedded" in-memory
> >>> instance of a data store as part of a test that runs against it (such
> as
> >> an
> >>> embedded Kafka, though not a data store).
> >>> This is probably the simplest solution in terms of orchestration, but
> it
> >>> depends on having a proper "embedded" implementation for an IO.
> >>>
> >>> Does this make sense t

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Jean-Baptiste Onofré

Hi Sourabh,

We raised the IO versioning point a couple of months ago on the mailing list.

Basically, we have two options:

1. Same modules (for example sdks/java/io/kafka) with one branch per 
version (kafka-0.8 kafka-0.10)

2. Several modules: sdks/java/io/kafka-0.8 sdks/java/io/kafka-0.10

My preference is for option 2:
Pros:
- the IO can still be part of the main Beam release
- it's more visible for contribution
Cons:
- we might have code duplication

Regards
JB

On 11/22/2016 08:12 PM, Sourabh Bajaj wrote:

Hi,

One tangential question I had around the proposal was how do we currently
deal with versioning in IO sources/sinks.

For example Cassandra 1.2 vs 2.1 have some differences between them, so the
checked in sources and sink probably supports a particular version right
now. If yes, follow questions would be around how do we handle updating ?
deprecating and documenting the supported versions.

I can move this to a new thread if this seems like a different discussion.
Also if this has already been answered please feel free to direct me to a
doc or past thread.

Thanks
Sourabh

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía  wrote:


​Hello,

@Stephen Thanks for your proposal, it is really interesting, I would really
like to help with this. I have never played with Kubernetes but this seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics, I don’t know how programmable kubernetes is to achieve this for
example:

Let’s think we have a cluster of Cassandra or Kafka nodes, I would like to
have programmatic tests to simulate failure (e.g. kill a node), or simulate
a really slow node, to ensure that the IO behaves as expected in the Beam
pipeline for the given runner.

Another related idea is to improve IO consistency: Today the different IOs
have small differences in their failure behavior, I really would like to be
able to predict with more precision what will happen in case of errors,
e.g. what is the correct behavior if I am writing to a Kafka node and there
is a network partition, does the Kafka sink retries or no ? and what if it
is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or do
we guarantee exactly once writes somehow?, today I am not sure about what
happens (or if the expected behavior depends on the runner), but well maybe
it is just that I don’t know and we have tests to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart of the
data stores, I think it will be also really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? ? Who
can provide the test machines so we can mount the cluster?

Maybe my ideas are a bit too far away for an initial setup, but it will be
really nice to start working on this.

Ismael​


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:


Hi Stephen,

I was wondering about how we plan to use the data stores across
executions.

Clearly, it's best to setup a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, teardown the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a deployment
which could become heavy and complicated as more IOs are supported and more
test cases introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it (such as an
embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:


Hi Stephen,

as already discussed a bit together, it sounds great ! Especially I like
it as a both integration test platform and good coverage for IOs.

I'm very late on this but, as said, I will share with you my Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit kubernetes and swamp but it's
not yet complete. I will share what I have on the same github repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Sourabh Bajaj
Hi,

One tangential question I had around the proposal was how we currently
deal with versioning in IO sources/sinks.

For example, Cassandra 1.2 vs 2.1 have some differences between them, so the
checked-in source and sink probably support a particular version right
now. If yes, follow-up questions would be around how we handle updating,
deprecating, and documenting the supported versions.

I can move this to a new thread if this seems like a different discussion.
Also if this has already been answered please feel free to direct me to a
doc or past thread.

Thanks
Sourabh

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía  wrote:

> ​Hello,
>
> @Stephen Thanks for your proposal, it is really interesting, I would really
> like to help with this. I have never played with Kubernetes but this seems
> a really nice chance to do something useful with it.
>
> We (at Talend) are testing most of the IOs using simple container images
> and in some particular cases ‘clusters’ of containers using docker-compose
> (a little bit like Amit’s (2) proposal). It would be really nice to have
> this at the Beam level, in particular to try to test more complex
> semantics, I don’t know how programmable kubernetes is to achieve this for
> example:
>
> Let’s think we have a cluster of Cassandra or Kafka nodes, I would like to
> have programmatic tests to simulate failure (e.g. kill a node), or simulate
> a really slow node, to ensure that the IO behaves as expected in the Beam
> pipeline for the given runner.
>
> Another related idea is to improve IO consistency: Today the different IOs
> have small differences in their failure behavior, I really would like to be
> able to predict with more precision what will happen in case of errors,
> e.g. what is the correct behavior if I am writing to a Kafka node and there
> is a network partition, does the Kafka sink retries or no ? and what if it
> is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or do
> we guarantee exactly once writes somehow?, today I am not sure about what
> happens (or if the expected behavior depends on the runner), but well maybe
> it is just that I don’t know and we have tests to ensure this.
>
> Of course both are really hard problems, but I think with your proposal we
> can try to tackle them, as well as the performance ones. And apart of the
> data stores, I think it will be also really nice to be able to test the
> runners in a distributed manner.
>
> So what is the next step? How do you imagine such integration tests? ? Who
> can provide the test machines so we can mount the cluster?
>
> Maybe my ideas are a bit too far away for an initial setup, but it will be
> really nice to start working on this.
>
> Ismael​
>
>
> On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:
>
> > Hi Stephen,
> >
> > I was wondering about how we plan to use the data stores across
> executions.
> >
> > Clearly, it's best to setup a new instance (container) for every test,
> > running a "standalone" store (say HBase/Cassandra for example), and once
> > the test is done, teardown the instance. It should also be agnostic to
> the
> > runtime environment (e.g., Docker on Kubernetes).
> > I'm wondering though what's the overhead of managing such a deployment
> > which could become heavy and complicated as more IOs are supported and
> more
> > test cases introduced.
> >
> > Another way to go would be to have small clusters of different data
> stores
> > and run against new "namespaces" (while lazily evicting old ones), but I
> > think this is less likely as maintaining a distributed instance (even a
> > small one) for each data store sounds even more complex.
> >
> > A third approach would be to to simply have an "embedded" in-memory
> > instance of a data store as part of a test that runs against it (such as
> an
> > embedded Kafka, though not a data store).
> > This is probably the simplest solution in terms of orchestration, but it
> > depends on having a proper "embedded" implementation for an IO.
> >
> > Does this make sense to you ? have you considered it ?
> >
> > Thanks,
> > Amit
> >
> > On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Stephen,
> > >
> > > as already discussed a bit together, it sounds great ! Especially I
> like
> > > it as a both integration test platform and good coverage for IOs.
> > >
> > > I'm very late on this but, as said, I will share with you my Marathon
> > > JSON and Mesos docker images.
> > >
> > > By the way, I started to experiment a bit kubernetes and swamp but it's
> > > not yet complete. I will share what I have on the same github repo.
> > >
> > > Thanks !
> > > Regards
> > > JB
> > >
> > > On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > > > Hi everyone!
> > > >
> > > > Currently we have a good set of unit tests for our IO Transforms -
> > those
> > > > tend to run against in-memory versions of the data stores. However,
> > we'd
> > > > like to further increase our test coverage to include running 

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Ismaël Mejía
Hello,

@Stephen Thanks for your proposal, it is really interesting, I would really
like to help with this. I have never played with Kubernetes but this seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics. I don’t know how programmable kubernetes is for achieving this,
for example:

Let’s say we have a cluster of Cassandra or Kafka nodes. I would like to
have programmatic tests to simulate failure (e.g. kill a node), or simulate
a really slow node, to ensure that the IO behaves as expected in the Beam
pipeline for the given runner.
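To make the "kill a node" idea concrete, a rough sketch (purely
hypothetical, and assuming kubectl is available on the machine driving the
test; a mesos/marathon setup would use its own API instead): delete one pod
of the data store's deployment while the pipeline is running, and let the
test's normal output verification decide whether the IO coped.

/**
 * Sketch of a failure-injection step: delete one pod matching a label
 * selector (e.g. "app=cassandra") mid-pipeline.
 */
public class KillOnePod {
  public static void killOnePod(String labelSelector) throws Exception {
    // Pick the first matching pod name, then delete it.
    String pod = exec("kubectl", "get", "pods", "-l", labelSelector,
        "-o", "jsonpath={.items[0].metadata.name}").trim();
    exec("kubectl", "delete", "pod", pod);
  }

  private static String exec(String... cmd) throws Exception {
    Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
    try (java.util.Scanner s = new java.util.Scanner(p.getInputStream()).useDelimiter("\\A")) {
      String out = s.hasNext() ? s.next() : "";
      if (p.waitFor() != 0) {
        throw new IllegalStateException(String.join(" ", cmd) + " failed:\n" + out);
      }
      return out;
    }
  }
}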

Another related idea is to improve IO consistency: today the different IOs
have small differences in their failure behavior, and I would really like to
be able to predict with more precision what will happen in case of errors.
For example, what is the correct behavior if I am writing to a Kafka node and
there is a network partition: does the Kafka sink retry or not? And what if it
is the JdbcIO: will it work the same, e.g. assuming checkpointing? Or do we
guarantee exactly-once writes somehow? Today I am not sure what happens (or
whether the expected behavior depends on the runner), but maybe it is just
that I don’t know and we already have tests to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart from the
data stores, I think it would also be really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can stand up the cluster?

Maybe my ideas are a bit too far out for an initial setup, but it would be
really nice to start working on this.

Ismael


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:

> Hi Stephen,
>
> I was wondering about how we plan to use the data stores across executions.
>
> Clearly, it's best to setup a new instance (container) for every test,
> running a "standalone" store (say HBase/Cassandra for example), and once
> the test is done, teardown the instance. It should also be agnostic to the
> runtime environment (e.g., Docker on Kubernetes).
> I'm wondering though what's the overhead of managing such a deployment
> which could become heavy and complicated as more IOs are supported and more
> test cases introduced.
>
> Another way to go would be to have small clusters of different data stores
> and run against new "namespaces" (while lazily evicting old ones), but I
> think this is less likely as maintaining a distributed instance (even a
> small one) for each data store sounds even more complex.
>
> A third approach would be to to simply have an "embedded" in-memory
> instance of a data store as part of a test that runs against it (such as an
> embedded Kafka, though not a data store).
> This is probably the simplest solution in terms of orchestration, but it
> depends on having a proper "embedded" implementation for an IO.
>
> Does this make sense to you ? have you considered it ?
>
> Thanks,
> Amit
>
> On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Stephen,
> >
> > as already discussed a bit together, it sounds great ! Especially I like
> > it as a both integration test platform and good coverage for IOs.
> >
> > I'm very late on this but, as said, I will share with you my Marathon
> > JSON and Mesos docker images.
> >
> > By the way, I started to experiment a bit kubernetes and swamp but it's
> > not yet complete. I will share what I have on the same github repo.
> >
> > Thanks !
> > Regards
> > JB
> >
> > On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > > Hi everyone!
> > >
> > > Currently we have a good set of unit tests for our IO Transforms -
> those
> > > tend to run against in-memory versions of the data stores. However,
> we'd
> > > like to further increase our test coverage to include running them
> > against
> > > real instances of the data stores that the IO Transforms work against
> > (e.g.
> > > cassandra, mongodb, kafka, etc…), which means we'll need to have real
> > > instances of various data stores.
> > >
> > > Additionally, if we want to do performance regression detection, it's
> > > important to have instances of the services that behave realistically,
> > > which isn't true of in-memory or dev versions of the services.
> > >
> > >
> > > Proposed solution
> > > -
> > > If we accept this proposal, we would create an infrastructure for
> running
> > > real instances of data stores inside of containers, using container
> > > management software like mesos/marathon, kubernetes, docker swarm, etc…
> > to
> > > manage the instances.
> > >
> > > This would enable us to build integrati

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Amit Sela
Hi Stephen,

I was wondering about how we plan to use the data stores across executions.

Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, tear down the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering, though, what the overhead of managing such a deployment would
be, as it could become heavy and complicated as more IOs are supported and
more test cases are introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it (such as an
embedded Kafka, though Kafka isn't strictly a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you? Have you considered it?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:

> Hi Stephen,
>
> as already discussed a bit together, it sounds great ! Especially I like
> it as a both integration test platform and good coverage for IOs.
>
> I'm very late on this but, as said, I will share with you my Marathon
> JSON and Mesos docker images.
>
> By the way, I started to experiment a bit kubernetes and swamp but it's
> not yet complete. I will share what I have on the same github repo.
>
> Thanks !
> Regards
> JB
>
> On 11/16/2016 11:36 PM, Stephen Sisk wrote:
> > Hi everyone!
> >
> > Currently we have a good set of unit tests for our IO Transforms - those
> > tend to run against in-memory versions of the data stores. However, we'd
> > like to further increase our test coverage to include running them
> against
> > real instances of the data stores that the IO Transforms work against
> (e.g.
> > cassandra, mongodb, kafka, etc…), which means we'll need to have real
> > instances of various data stores.
> >
> > Additionally, if we want to do performance regression detection, it's
> > important to have instances of the services that behave realistically,
> > which isn't true of in-memory or dev versions of the services.
> >
> >
> > Proposed solution
> > -
> > If we accept this proposal, we would create an infrastructure for running
> > real instances of data stores inside of containers, using container
> > management software like mesos/marathon, kubernetes, docker swarm, etc…
> to
> > manage the instances.
> >
> > This would enable us to build integration tests that run against those
> real
> > instances and performance tests that run against those real instances
> (like
> > those that Jason Kuster is proposing elsewhere.)
> >
> >
> > Why do we need one centralized set of instances vs just having various
> > people host their own instances?
> > -
> > Reducing flakiness of tests is key. By not having dependencies from the
> > core project on external services/instances of data stores we have
> > guaranteed access to the services and the group can fix issues that
> arise.
> >
> > An exception would be something that has an ops team supporting it (eg,
> > AWS, Google Cloud or other professionally managed service) - those we
> trust
> > will be stable.
> >
> >
> > There may be a lot of different data stores needed - how will we maintain
> > them?
> > -
> > It will take work above and beyond that of a normal set of unit tests to
> > build and maintain integration/performance tests & their data store
> > instances.
> >
> > Setup & maintenance of the data store containers and data store instances
> > on it must be automated. It also has to be as simple of a setup as
> > possible, and we should avoid hand tweaking the containers - expecting
> > checked in scripts/dockerfiles is key.
> >
> > Aligned with the community ownership approach of Apache, as members of
> the
> > community are excited to contribute & maintain those tests and the
> > integration/performance tests, people will be able to step up and do
> that.
> > If there is no longer support for maintaining a particular set of
> > integration & performance tests and their data store instances, then we
> can
> > disable those tests. We may document on the website what IO Transforms
> have
> > current integration/performance tests so users know what level of testing
> > the various IO Transforms have.
> >
> >
> > What about requirements for the container management software itself?
> > -
> > * We should have the data store instances themselves in Docker. Docker
> > allows new instances to be spun up in a quick, reproducible way and is
> > fairly platform independent. It has wide support from a variety of
>

Re: Hosting data stores for IO Transform testing

2016-11-21 Thread Jean-Baptiste Onofré

Hi Stephen,

as already discussed a bit together, it sounds great! I especially like
it as both an integration test platform and good coverage for IOs.


I'm very late on this but, as said, I will share with you my Marathon 
JSON and Mesos docker images.


By the way, I started to experiment a bit with kubernetes and swarm but it's
not yet complete. I will share what I have on the same github repo.


Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like mesos/marathon, kubernetes, docker swarm, etc… to
manage the instances.

This would enable us to build integration tests that run against those real
instances and performance tests that run against those real instances (like
those that Jason Kuster is proposing elsewhere.)


Why do we need one centralized set of instances vs just having various
people host their own instances?
-
Reducing flakiness of tests is key. By not having dependencies from the
core project on external services/instances of data stores we have
guaranteed access to the services and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (eg,
AWS, Google Cloud or other professionally managed service) - those we trust
will be stable.


There may be a lot of different data stores needed - how will we maintain
them?
-
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and data store instances
on it must be automated. It also has to be as simple of a setup as
possible, and we should avoid hand tweaking the containers - expecting
checked in scripts/dockerfiles is key.

Aligned with the community ownership approach of Apache, as members of the
community are excited to contribute & maintain those tests and the
integration/performance tests, people will be able to step up and do that.
If there is no longer support for maintaining a particular set of
integration & performance tests and their data store instances, then we can
disable those tests. We may document on the website what IO Transforms have
current integration/performance tests so users know what level of testing
the various IO Transforms have.


What about requirements for the container management software itself?
-
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible. Crashing instances should be
restarted, setup should be simple, everything possible should be
scripted/scriptable.
* Logs and test output should be on a publicly available website, without
needing to log into test execution machine. Centralized capture of
monitoring info/logs from instances running in the containers would support
this. Ideally, this would just be supported by the container software out
of the box.
* It'd be useful to have good persistent volume in the container management
software so that databases don't have to reload large data sets every time.
* The containers may be a place to execute runners themselves if we need
larger runner instances, so it should play well with Spark, Flink, etc…

As I discussed earlier on the mailing list, it looks like hosting docker
containers on kubernetes, docker swarm or mesos+marathon would be a good
solution.

Thanks,
Stephen Sisk



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hosting data stores for IO Transform testing

2016-11-21 Thread Stephen Sisk
Thanks Aljoscha! I appreciate you taking the time to take a look.

I've opened [BEAM-1027] "Hosting data stores to enable IO Transform
testing" to track this work.

Stephen

On Mon, Nov 21, 2016 at 2:11 AM Aljoscha Krettek 
wrote:

Hi Stephen,
I really like your proposal! I don't have any comments because this seems
very well "researched" already.

I'm hoping others will also have a look at this as well because "real"
integration testing provides a new level of confidence in the code, IMHO.

Cheers,
Aljoscha


On Wed, 16 Nov 2016 at 23:36 Stephen Sisk  wrote:

> Hi everyone!
>
> Currently we have a good set of unit tests for our IO Transforms - those
> tend to run against in-memory versions of the data stores. However, we'd
> like to further increase our test coverage to include running them against
> real instances of the data stores that the IO Transforms work against
(e.g.
> cassandra, mongodb, kafka, etc…), which means we'll need to have real
> instances of various data stores.
>
> Additionally, if we want to do performance regression detection, it's
> important to have instances of the services that behave realistically,
> which isn't true of in-memory or dev versions of the services.
>
>
> Proposed solution
> -
> If we accept this proposal, we would create an infrastructure for running
> real instances of data stores inside of containers, using container
> management software like mesos/marathon, kubernetes, docker swarm, etc… to
> manage the instances.
>
> This would enable us to build integration tests that run against those
real
> instances and performance tests that run against those real instances
(like
> those that Jason Kuster is proposing elsewhere.)
>
>
> Why do we need one centralized set of instances vs just having various
> people host their own instances?
> -
> Reducing flakiness of tests is key. By not having dependencies from the
> core project on external services/instances of data stores we have
> guaranteed access to the services and the group can fix issues that arise.
>
> An exception would be something that has an ops team supporting it (eg,
> AWS, Google Cloud or other professionally managed service) - those we
trust
> will be stable.
>
>
> There may be a lot of different data stores needed - how will we maintain
> them?
> -
> It will take work above and beyond that of a normal set of unit tests to
> build and maintain integration/performance tests & their data store
> instances.
>
> Setup & maintenance of the data store containers and data store instances
> on it must be automated. It also has to be as simple of a setup as
> possible, and we should avoid hand tweaking the containers - expecting
> checked in scripts/dockerfiles is key.
>
> Aligned with the community ownership approach of Apache, as members of the
> community are excited to contribute & maintain those tests and the
> integration/performance tests, people will be able to step up and do that.
> If there is no longer support for maintaining a particular set of
> integration & performance tests and their data store instances, then we
can
> disable those tests. We may document on the website what IO Transforms
have
> current integration/performance tests so users know what level of testing
> the various IO Transforms have.
>
>
> What about requirements for the container management software itself?
> -
> * We should have the data store instances themselves in Docker. Docker
> allows new instances to be spun up in a quick, reproducible way and is
> fairly platform independent. It has wide support from a variety of
> different container management services.
> * As little admin work required as possible. Crashing instances should be
> restarted, setup should be simple, everything possible should be
> scripted/scriptable.
> * Logs and test output should be on a publicly available website, without
> needing to log into test execution machine. Centralized capture of
> monitoring info/logs from instances running in the containers would
support
> this. Ideally, this would just be supported by the container software out
> of the box.
> * It'd be useful to have good persistent volume in the container
management
> software so that databases don't have to reload large data sets every
time.
> * The containers may be a place to execute runners themselves if we need
> larger runner instances, so it should play well with Spark, Flink, etc…
>
> As I discussed earlier on the mailing list, it looks like hosting docker
> containers on kubernetes, docker swarm or mesos+marathon would be a good
> solution.
>
> Thanks,
> Stephen Sisk
>


Re: Hosting data stores for IO Transform testing

2016-11-21 Thread Aljoscha Krettek
Hi Stephen,
I really like your proposal! I don't have any comments because this seems
very well "researched" already.

I'm hoping others will have a look at this as well because "real"
integration testing provides a new level of confidence in the code, IMHO.

Cheers,
Aljoscha


On Wed, 16 Nov 2016 at 23:36 Stephen Sisk  wrote:

> Hi everyone!
>
> Currently we have a good set of unit tests for our IO Transforms - those
> tend to run against in-memory versions of the data stores. However, we'd
> like to further increase our test coverage to include running them against
> real instances of the data stores that the IO Transforms work against (e.g.
> cassandra, mongodb, kafka, etc…), which means we'll need to have real
> instances of various data stores.
>
> Additionally, if we want to do performance regression detection, it's
> important to have instances of the services that behave realistically,
> which isn't true of in-memory or dev versions of the services.
>
>
> Proposed solution
> -
> If we accept this proposal, we would create an infrastructure for running
> real instances of data stores inside of containers, using container
> management software like mesos/marathon, kubernetes, docker swarm, etc… to
> manage the instances.
>
> This would enable us to build integration tests that run against those real
> instances and performance tests that run against those real instances (like
> those that Jason Kuster is proposing elsewhere.)
>
>
> Why do we need one centralized set of instances vs just having various
> people host their own instances?
> -
> Reducing flakiness of tests is key. By not having dependencies from the
> core project on external services/instances of data stores we have
> guaranteed access to the services and the group can fix issues that arise.
>
> An exception would be something that has an ops team supporting it (eg,
> AWS, Google Cloud or other professionally managed service) - those we trust
> will be stable.
>
>
> There may be a lot of different data stores needed - how will we maintain
> them?
> -
> It will take work above and beyond that of a normal set of unit tests to
> build and maintain integration/performance tests & their data store
> instances.
>
> Setup & maintenance of the data store containers and data store instances
> on it must be automated. It also has to be as simple of a setup as
> possible, and we should avoid hand tweaking the containers - expecting
> checked in scripts/dockerfiles is key.
>
> Aligned with the community ownership approach of Apache, as members of the
> community are excited to contribute & maintain those tests and the
> integration/performance tests, people will be able to step up and do that.
> If there is no longer support for maintaining a particular set of
> integration & performance tests and their data store instances, then we can
> disable those tests. We may document on the website what IO Transforms have
> current integration/performance tests so users know what level of testing
> the various IO Transforms have.
>
>
> What about requirements for the container management software itself?
> -
> * We should have the data store instances themselves in Docker. Docker
> allows new instances to be spun up in a quick, reproducible way and is
> fairly platform independent. It has wide support from a variety of
> different container management services.
> * As little admin work required as possible. Crashing instances should be
> restarted, setup should be simple, everything possible should be
> scripted/scriptable.
> * Logs and test output should be on a publicly available website, without
> needing to log into test execution machine. Centralized capture of
> monitoring info/logs from instances running in the containers would support
> this. Ideally, this would just be supported by the container software out
> of the box.
> * It'd be useful to have good persistent volume in the container management
> software so that databases don't have to reload large data sets every time.
> * The containers may be a place to execute runners themselves if we need
> larger runner instances, so it should play well with Spark, Flink, etc…
>
> As I discussed earlier on the mailing list, it looks like hosting docker
> containers on kubernetes, docker swarm or mesos+marathon would be a good
> solution.
>
> Thanks,
> Stephen Sisk
>


Hosting data stores for IO Transform testing

2016-11-16 Thread Stephen Sisk
Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like mesos/marathon, kubernetes, docker swarm, etc… to
manage the instances.

This would enable us to build integration tests that run against those real
instances and performance tests that run against those real instances (like
those that Jason Kuster is proposing elsewhere).


Why do we need one centralized set of instances vs just having various
people host their own instances?
-
Reducing flakiness of tests is key. By not having dependencies from the
core project on external services/instances of data stores we have
guaranteed access to the services and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (eg,
AWS, Google Cloud or other professionally managed service) - those we trust
will be stable.


There may be a lot of different data stores needed - how will we maintain
them?
-
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and the data store instances
on them must be automated. The setup also has to be as simple as
possible, and we should avoid hand-tweaking the containers - expecting
checked-in scripts/Dockerfiles is key.

Aligned with the community ownership approach of Apache, as members of the
community are excited to contribute & maintain those tests and the
integration/performance tests, people will be able to step up and do that.
If there is no longer support for maintaining a particular set of
integration & performance tests and their data store instances, then we can
disable those tests. We may document on the website what IO Transforms have
current integration/performance tests so users know what level of testing
the various IO Transforms have.


What about requirements for the container management software itself?
-
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible. Crashing instances should be
restarted, setup should be simple, everything possible should be
scripted/scriptable.
* Logs and test output should be on a publicly available website, without
needing to log into test execution machine. Centralized capture of
monitoring info/logs from instances running in the containers would support
this. Ideally, this would just be supported by the container software out
of the box.
* It'd be useful to have good persistent volume support in the container
management software so that databases don't have to reload large data sets
every time.
* The containers may be a place to execute runners themselves if we need
larger runner instances, so it should play well with Spark, Flink, etc…

As I discussed earlier on the mailing list, it looks like hosting docker
containers on kubernetes, docker swarm or mesos+marathon would be a good
solution.

Thanks,
Stephen Sisk