Jenkins build is back to normal : beam_Release_NightlySnapshot #364

2017-03-21 Thread Apache Jenkins Server
See 




Re: [ANNOUNCEMENT] New committers, March 2017 edition!

2017-03-21 Thread Stephan Ewen
Welcome :-)

On Mon, Mar 20, 2017 at 11:17 PM, Ismaël Mejía  wrote:

> Thanks everyone, it feels great to be part of the team.
> Congratulations to the other new committers!
>
> -Ismaël
>
> On Mon, Mar 20, 2017 at 2:50 PM, Tyler Akidau
>  wrote:
> > Welcome!
> >
> > On Mon, Mar 20, 2017, 02:25 Jean-Baptiste Onofré 
> wrote:
> >
> >> Welcome aboard, and congrats !
> >>
> >> Really happy to count you all in the team ;)
> >>
> >> Regards
> >> JB
> >>
> >> On 03/17/2017 10:13 PM, Davor Bonaci wrote:
> >> > Please join me and the rest of Beam PMC in welcoming the following
> >> > contributors as our newest committers. They have significantly
> >> contributed
> >> > to the project in different ways, and we look forward to many more
> >> > contributions in the future.
> >> >
> >> > * Chamikara Jayalath
> >> > Chamikara has been contributing to Beam since inception, and
> previously
> >> to
> >> > Google Cloud Dataflow, accumulating a total of 51 commits (8,301 ++ /
> >> 3,892
> >> > --) since February 2016 [1]. He contributed broadly to the project,
> but
> >> > most significantly to the Python SDK, building the IO framework in
> this
> >> SDK
> >> > [2], [3].
> >> >
> >> > * Eugene Kirpichov
> >> > Eugene has been contributing to Beam since inception, and previously
> to
> >> > Google Cloud Dataflow, accumulating a total of 95 commits (22,122 ++ /
> >> > 18,407 --) since February 2016 [1]. In recent months, he’s been
> driving
> >> the
> >> > Splittable DoFn effort [4]. A true expert on the IO subsystem, Eugene has
> >> > reviewed nearly every IO contributed to Beam. Finally, Eugene
> contributed
> >> > the Beam Style Guide, and is championing it across the project.
> >> >
> >> > * Ismaël Mejia
> >> > Ismaël has been contributing to Beam since mid-2016, accumulating a
> total
> >> > of 35 commits (3,137 ++ / 1,328 --) [1]. He authored the HBaseIO
> >> connector,
> >> > helped on the Spark runner, and contributed in other areas as well,
> >> > including cross-project collaboration with Apache Zeppelin. Ismaël
> >> reported
> >> > 24 Jira issues.
> >> >
> >> > * Aviem Zur
> >> > Aviem has been contributing to Beam since early fall, accumulating a
> >> total
> >> > of 49 commits (6,471 ++ / 3,185 --) [1]. He reported 43 Jira issues,
> and
> >> > resolved ~30 issues. Aviem improved the stability of the Spark runner
> a
> >> > lot, and introduced support for metrics. Finally, Aviem is championing
> >> > dependency management across the project.
> >> >
> >> > Congratulations to all four! Welcome!
> >> >
> >> > Davor
> >> >
> >> > [1]
> >> >
> >> https://github.com/apache/beam/graphs/contributors?from=
> 2016-02-01&to=2017-03-17&type=c
> >> > [2]
> >> >
> >> https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> apache_beam/io/iobase.py#L70
> >> > [3]
> >> >
> >> https://github.com/apache/beam/blob/v0.6.0/sdks/python/
> apache_beam/io/iobase.py#L561
> >> > [4] https://s.apache.org/splittable-do-fn
> >> >
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
>


Re: why Source#validate() is not declared to throw any exception

2017-03-21 Thread Jean-Baptiste Onofré
I just discussed with Daniel (from Infra); we've had a global networking issue 
affecting mail and LDAP.

That's why the Jira notifications are in the queue for now.

Regards
JB

On 03/21/2017 03:13 PM, Ted Yu wrote:

Looks like JIRA notification is temporarily not working.

I have logged BEAM-1773

FYI

On Mon, Mar 20, 2017 at 11:26 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


I think it would make sense to allow the validate method to throw
Exception.

On Mon, Mar 20, 2017, 11:21 PM Jean-Baptiste Onofré 
wrote:


Hi Ted,

validate() is supposed to throw a runtime exception (IllegalStateException,
RuntimeException, ...) that propagates up through the executor.

Regards
JB

On 03/21/2017 01:56 AM, Ted Yu wrote:

Hi,
I was reading HDFSFileSource.java where:

  @Override
  public void validate() {
    try {
      ...
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

Why is validate() not declared to throw any exception ?
If validation doesn't pass, there is nothing to clean up ?

Thanks



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: why Source#validate() is not declared to throw any exception

2017-03-21 Thread Jean-Baptiste Onofré

Got it.

Regarding the Jira notifications, we changed the notification scheme to use 
comm...@beam.apache.org (it was using comm...@beam.incubator.apache.org), but I 
don't think that's related; I think it's a Jira service/mail issue (even if 
status.apache.org doesn't show anything). I'll ping Infra on HipChat about that.


Regards
JB

On 03/21/2017 03:13 PM, Ted Yu wrote:

Looks like JIRA notification is temporarily not working.

I have logged BEAM-1773

FYI

On Mon, Mar 20, 2017 at 11:26 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


I think it would make sense to allow the validate method to throw
Exception.

On Mon, Mar 20, 2017, 11:21 PM Jean-Baptiste Onofré 
wrote:


Hi Ted,

validate() is supposed to throw a runtime exception (IllegalStateException,
RuntimeException, ...) that propagates up through the executor.

Regards
JB

On 03/21/2017 01:56 AM, Ted Yu wrote:

Hi,
I was reading HDFSFileSource.java where:

  @Override
  public void validate() {
    try {
      ...
    } catch (IOException | InterruptedException e) {
      throw new RuntimeException(e);
    }
  }

Why is validate() not declared to throw any exception ?
If validation doesn't pass, there is nothing to clean up ?

Thanks



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: why Source#validate() is not declared to throw any exception

2017-03-21 Thread Ted Yu
Looks like JIRA notification is temporarily not working.

I have logged BEAM-1773

FYI

On Mon, Mar 20, 2017 at 11:26 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> I think it would make sense to allow the validate method to throw
> Exception.
>
> On Mon, Mar 20, 2017, 11:21 PM Jean-Baptiste Onofré 
> wrote:
>
> > Hi Ted,
> >
> > validate() is supposed to throw a runtime exception (IllegalStateException,
> > RuntimeException, ...) that propagates up through the executor.
> >
> > Regards
> > JB
> >
> > On 03/21/2017 01:56 AM, Ted Yu wrote:
> > > Hi,
> > > I was reading HDFSFileSource.java where:
> > >
> > >   @Override
> > >   public void validate() {
> > >     try {
> > >       ...
> > >     } catch (IOException | InterruptedException e) {
> > >       throw new RuntimeException(e);
> > >     }
> > >   }
> > >
> > > Why is validate() not declared to throw any exception ?
> > > If validation doesn't pass, there is nothing to clean up ?
> > >
> > > Thanks
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
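The change Eugene suggests above can be sketched as follows. This is a hypothetical, simplified interface (not the actual Beam Source API): validate() declares `throws Exception`, so implementations can surface checked exceptions like IOException directly instead of wrapping them in RuntimeException as HDFSFileSource does today.

```java
import java.io.IOException;

// Hypothetical, simplified sketch -- not the actual Beam Source API.
public class SourceValidationSketch {

  interface ValidatableSource {
    // Declaring "throws Exception" lets implementations surface checked
    // exceptions (e.g. IOException) directly, as suggested in the thread.
    void validate() throws Exception;
  }

  static class HdfsLikeSource implements ValidatableSource {
    private final boolean pathExists; // stands in for a real HDFS check

    HdfsLikeSource(boolean pathExists) {
      this.pathExists = pathExists;
    }

    @Override
    public void validate() throws Exception {
      if (!pathExists) {
        // No need to wrap in RuntimeException.
        throw new IOException("input path does not exist");
      }
    }
  }

  // Small helper so callers can check validity without try/catch noise.
  static boolean validates(ValidatableSource source) {
    try {
      source.validate();
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    if (!validates(new HdfsLikeSource(true))) throw new AssertionError();
    if (validates(new HdfsLikeSource(false))) throw new AssertionError();
    System.out.println("ok");
  }
}
```

Since nothing needs cleaning up when validation fails, the caller (the runner) simply lets the exception abort pipeline construction.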


[DISCUSSION] using NexMark for Beam

2017-03-21 Thread Etienne Chauchot

Hi all,

Ismael and I are working on upgrading the Nexmark implementation for 
Beam. See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and 
https://issues.apache.org/jira/browse/BEAM-160. We are continuing the 
work done by Mark Shields. See https://github.com/apache/beam/pull/366 
for the original PR.


The PR contains queries that have a wide coverage of the Beam model and 
that represent a realistic end user use case (some come from client 
experience on Google Cloud Dataflow).


So far, we have upgraded the implementation to the latest Beam snapshot, 
and we are able to execute a good subset of the queries on the different 
runners. We upgraded the Nexmark drivers accordingly: the direct driver 
(upgraded from inProcessDriver) and the Flink driver, and we added a new 
one for Spark.


There is still a good amount of work to do, and we would like to know 
whether you think this contribution could eventually have a place in Beam.


The interests of having Nexmark on Beam that we have seen so far are:

- Rich batch/streaming test

- A-B testing of runners or runtimes (non-regression, performance 
comparison between versions ...)


- Integration testing (sdk/runners, runner/runtime, ...)

- Validate beam capability matrix

- It can be used as part of the ongoing PerfKit work (if there is any 
interest).


As a final note, we are tracking the issues in the same repo. If anyone 
is interested in contributing or has more ideas, you are welcome :)


Etienne



Re: [DISCUSSION] using NexMark for Beam

2017-03-21 Thread Dan Halperin
Not a deep response, but this is awesome! We'd really like to have some
good benchmarks, and I'm excited you're updating Nexmark. This will be
great!

On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot 
wrote:

> Hi all,
>
> Ismael and I are working on upgrading the Nexmark implementation for Beam.
> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> for the original PR.
>
> The PR contains queries that have a wide coverage of the Beam model and
> that represent a realistic end user use case (some come from client
> experience on Google Cloud Dataflow).
>
> So far, we have upgraded the implementation to the latest Beam snapshot.
> And we are able to execute a good subset of the queries in the different
> runners. We upgraded the nexmark drivers to do so: direct driver (upgraded
> from inProcessDriver) and flink driver and we added a new one for spark.
>
> There is still a good amount of work to do and we would like to know if
> you think that this contribution can have its place into Beam eventually.
>
> The interests of having Nexmark on Beam that we have seen so far are:
>
> - Rich batch/streaming test
>
> - A-B testing of runners or runtimes (non-regression, performance
> comparison between versions ...)
>
> - Integration testing (sdk/runners, runner/runtime, ...)
>
> - Validate beam capability matrix
>
> - It can be used as part of the ongoing PerfKit work (if there is any
> interest).
>
> As a final note, we are tracking the issues in the same repo. If someone
> is interested in contributing, or have more ideas, you are welcome :)
>
> Etienne
>
>


Re: [DISCUSSION] using NexMark for Beam

2017-03-21 Thread Jean-Baptiste Onofré

Hi Etienne,

That's great news, and good job!

By "having Nexmark on Beam", I guess you mean the translation of the NEXMark 
queries in Beam, not NEXMark itself, right ?


If you mean the later, I'm not sure as NEXMark is not Beam related (it's more 
generic) and it could be tricky in terms of legal (license, SGA, ...).


Regards
JB

On 03/21/2017 05:38 PM, Etienne Chauchot wrote:

Hi all,

Ismael and I are working on upgrading the Nexmark implementation for Beam. See
https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
https://issues.apache.org/jira/browse/BEAM-160. We are continuing the work done
by Mark Shields. See https://github.com/apache/beam/pull/366 for the original 
PR.

The PR contains queries that have a wide coverage of the Beam model and that
represent a realistic end user use case (some come from client experience on
Google Cloud Dataflow).

So far, we have upgraded the implementation to the latest Beam snapshot. And we
are able to execute a good subset of the queries in the different runners. We
upgraded the nexmark drivers to do so: direct driver (upgraded from
inProcessDriver) and flink driver and we added a new one for spark.

There is still a good amount of work to do and we would like to know if you
think that this contribution can have its place into Beam eventually.

The interests of having Nexmark on Beam that we have seen so far are:

- Rich batch/streaming test

- A-B testing of runners or runtimes (non-regression, performance comparison
between versions ...)

- Integration testing (sdk/runners, runner/runtime, ...)

- Validate beam capability matrix

- It can be used as part of the ongoing PerfKit work (if there is any interest).

As a final note, we are tracking the issues in the same repo. If someone is
interested in contributing, or have more ideas, you are welcome :)

Etienne



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam spark 2.x runner status

2017-03-21 Thread Ted Yu
I have done some work over in HBASE-16179 where compatibility modules are
created to isolate changes in Spark 2.x API so that code in hbase-spark
module can be reused.

FYI


Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Mingmin Xu
Moving this discussion to the dev list.

Savepoints in Flink, and likewise checkpoints in Spark, should be good enough
to handle this case.

When people don't enable these features, for example when they only need
at-most-once semantics, each unbounded IO should try its best to restore from
the last offset, even though the CheckpointMark is null. Any ideas?

Mingmin

On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin  wrote:

> hey,
>
> The native Beam UnboundedSource API supports resuming from checkpoint --
> that specifically happens here
> 
>  when
> the KafkaCheckpointMark is non-null.
>
> The FlinkRunner should be providing the KafkaCheckpointMark from the most
> recent savepoint upon restore.
>
> There shouldn't be any "special" Flink runner support needed, nor is the
> State API involved.
>
> Dan
>
> On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Would not it be Flink runner specific ?
>>
>> Maybe the State API could do the same in a runner agnostic way (just
>> thinking loud) ?
>>
>> Regards
>> JB
>>
>> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
>>
>>> From KafkaIO itself, looks like it either start_from_beginning or
>>> start_from_latest. It's designed to leverage
>>> `UnboundedSource.CheckpointMark`
>>> during initialization, but so far I don't see it provided by runners. At
>>> the moment Flink savepoints are a good option; I created a JIRA (BEAM-1775)
>>> to handle it in KafkaIO.
>>>
>>> Mingmin
>>>
>>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek >> > wrote:
>>>
>>> Hi,
>>> Are you using Flink savepoints [1] when restoring your application?
>>> If you
>>> use this the Kafka offset should be stored in state and it should
>>> restart
>>> from the correct position.
>>>
>>> Best,
>>> Aljoscha
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
>>> setup/savepoints.html
>>> >> /setup/savepoints.html>
>>> > On 21 Mar 2017, at 01:50, Jins George >> > wrote:
>>> >
>>> > Hello,
>>> >
>>> > I am writing a Beam pipeline(streaming) with Flink runner to
>>> consume data
>>> from Kafka and apply some transformations and persist to Hbase.
>>> >
>>> > If I restart the application (due to failure/manual restart), the
>>> > consumer does not resume from the offset where it was prior to restart.
>>> > It always resumes from the latest offset.
>>> >
>>> > If I enable Flink checkpointing with the HDFS state backend, the system
>>> > appears to be resuming from the earliest offset.
>>> >
>>> > Is there a recommended way to resume from the offset where it was
>>> stopped ?
>>> >
>>> > Thanks,
>>> > Jins George
>>>
>>>
>>>
>>>
>>> --
>>> 
>>> Mingmin
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


-- 

Mingmin
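The resume behavior Dan describes (use the CheckpointMark when the runner restores one, otherwise fall back to the configured default) can be sketched roughly like this. All names here are illustrative, not the actual KafkaIO code:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of CheckpointMark-based resume; not the real KafkaIO.
public class CheckpointResumeSketch {

  // A checkpoint mark is essentially a snapshot of per-partition offsets.
  static class OffsetCheckpointMark {
    final Map<Integer, Long> partitionOffsets;

    OffsetCheckpointMark(Map<Integer, Long> partitionOffsets) {
      this.partitionOffsets = partitionOffsets;
    }
  }

  // Decide the start offset for one partition: if the runner handed us a
  // checkpoint mark (e.g. restored from a Flink savepoint), resume from it;
  // otherwise fall back to the configured default (here: "latest").
  static long startOffset(OffsetCheckpointMark mark, int partition, long latestOffset) {
    if (mark != null && mark.partitionOffsets.containsKey(partition)) {
      return mark.partitionOffsets.get(partition);
    }
    return latestOffset;
  }

  public static void main(String[] args) {
    Map<Integer, Long> saved = new HashMap<>();
    saved.put(0, 42L);
    OffsetCheckpointMark mark = new OffsetCheckpointMark(saved);

    // With a restored mark we resume at the saved offset...
    if (startOffset(mark, 0, 100L) != 42L) throw new AssertionError();
    // ...without one (mark == null) we start from latest, which matches the
    // behavior observed in the thread when no savepoint is used.
    if (startOffset(null, 0, 100L) != 100L) throw new AssertionError();
    System.out.println("ok");
  }
}
```

The question raised in the thread is exactly the `mark == null` branch: whether the IO should make a best effort to recover the last offset from somewhere else when the runner provides no mark.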


Re: Style: how much testing for transform builder classes?

2017-03-21 Thread Robert Bradshaw
On Wed, Mar 15, 2017 at 2:11 AM, Ismaël Mejía  wrote:

> +1 to Vikas' point: maybe the right place to enforce correct construction
> is in validate(), and like this we reduce the test boilerplate and only
> test validate(), but I wonder if this totally covers both cases (the
> buildsCorrectly and buildsCorrectlyInDifferentOrder ones).
>
> I'll answer Eugene’s question here, even if you are aware of it now since
> you commented on the PR, so everyone understands the case.
>
> The case is pretty simple: when you extend an IO and add a new
> configuration parameter, suppose we have withFoo(String foo) and we
> want to add withBar(String bar). In some cases the implementation, or
> even worse the combination of both, is not built correctly, so the
> only way to guarantee that this works is to have code that tests the
> complete parameter combination, or tests that at least assert that the
> object is built correctly.
>
> This is something that can happen both with or without AutoValue
> because the with method is hand-written and the natural tendency with
> boilerplate methods like this is to copy/paste, so we can end up doing
> silly things like:
>
> private Read(String foo, String bar) { … }
>
> public Read withBar(String bar) {
>   return new Read(foo, null);
> }
>
> in this case the reference to bar is not stored or assigned (this is
> similar to the case of the DatastoreIO PR), and AutoValue may seem to
> solve this issue but you can also end up with this situation if you
> copy paste the withFoo method and just change the method name:
>
> public Read withBar(String foo) {
>   return builder().setFoo(foo).build();
> }
>
> Of course both seem silly, but both can happen, and the tests at least
> help to discover those.
>

Such mistakes should be entirely discovered by tests of feature Bar. If Bar
is not actually being tested, that's a bigger problem with coverage that a
construction-only test actually obscures (giving it negative value).


>
> On Wed, Mar 15, 2017 at 1:05 AM, vikas rk  wrote:
> > Yes, what I meant is: necessary tests are ones that block users if not
> > present. Trivial or non-trivial shouldn't be the issue in such cases.
> >
> > Some of the boilerplate code and tests is because IO PTransforms are
> > returned to the user before they are fully constructed and actual
> > validation happens in the validate method rather than at construction. I
> > understand that the reasoning here is that we want to support
> constructing
> > them with options in any order and using Builder pattern can be
> confusing.
> >
> > If the validate method is where all the validation happens, then we should
> > be able to eliminate some redundant checks and tests during construction
> > time, like
> > in *withOption* methods here
> >  google-cloud-platform/src/main/java/org/apache/beam/sdk/
> io/gcp/bigtable/BigtableIO.java#L199>
> >  and here
> >  google-cloud-platform/src/main/java/org/apache/beam/sdk/
> io/gcp/datastore/DatastoreV1.java#L387>
> > as
> > these are also checked in the validate method.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > -Vikas
> >
> >
> >
> > On 14 March 2017 at 15:40, Eugene Kirpichov  >
> > wrote:
> >
> >> Thanks all. Looks like people are on board with the general direction
> >> though it remains to refine it to concrete guidelines to go into style
> >> guide.
> >>
> >> Ismaël, can you give more details about the situation you described in
> the
> >> first paragraph? Is it perhaps that really a RunnableOnService test was
> >> missing (and perhaps still is), rather than a builder test?
> >>
> >> Vikas, regarding trivial tests and user waiting for a work-around: in
> the
> >> situation I described, they don't really need a workaround - they
> specified
> >> an invalid value and have been minorly inconvenienced because the error
> >> they got about it was not very readable, so fixing their value took
> them a
> >> little longer than it could have, but they fixed it and their work is
> not
> >> blocked. I think Robert's arguments about the cost of trivial tests
> apply.
> >>
> >> I agree that the author should be at liberty to choose which validation
> to
> >> unit-test and which to skip as trivial, so documentation on this topic
> >> should be in the form of guidelines, high-quality example code (i.e.
> clean
> >> up the unit tests of IOs bundled with Beam SDK), and informal knowledge
> in
> >> the heads of readers of this thread, rather than hard rules.
> >>
> >> On Tue, Mar 14, 2017 at 8:07 AM Ismaël Mejía  wrote:
> >>
> >> > +0.5
> >> >
> >> > I used to think that some of those tests were not worth, for example
> >> > testBuildRead and
> >> > testBuildReadAlt. However the reality is that these tests allowed me
> to
> >> > find bugs both during the development of HBaseIO and just yesterday
> when
> >> I
> >> > tried to test the write support for the emula
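A construction test of the kind discussed in this thread can be sketched like this. The Read class and its fields are hypothetical, but the shape mirrors the copy/paste bug Ismaël describes: the test asserts that chained with-methods preserve every previously set field, in either order.

```java
// Hypothetical immutable transform configuration, sketching the
// construction tests discussed in the thread; not an actual Beam IO.
public class BuilderTestSketch {

  static final class Read {
    private final String foo;
    private final String bar;

    private Read(String foo, String bar) {
      this.foo = foo;
      this.bar = bar;
    }

    static Read create() {
      return new Read(null, null);
    }

    // Correct with-method: carries the existing bar forward. The bug in
    // the thread was returning new Read(foo, null) here instead.
    Read withFoo(String foo) {
      return new Read(foo, this.bar);
    }

    Read withBar(String bar) {
      return new Read(this.foo, bar);
    }

    String getFoo() { return foo; }
    String getBar() { return bar; }
  }

  public static void main(String[] args) {
    // "buildsCorrectly": both parameters survive chaining.
    Read read = Read.create().withFoo("f").withBar("b");
    if (!"f".equals(read.getFoo()) || !"b".equals(read.getBar())) {
      throw new AssertionError("withBar dropped a previously set field");
    }
    // "buildsCorrectlyInDifferentOrder": order must not matter.
    Read reversed = Read.create().withBar("b").withFoo("f");
    if (!"f".equals(reversed.getFoo()) || !"b".equals(reversed.getBar())) {
      throw new AssertionError("withFoo dropped a previously set field");
    }
    System.out.println("ok");
  }
}
```

With the buggy `return new Read(foo, null)` variant, the first assertion fails immediately, which is exactly the kind of mistake these construction tests are meant to catch.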

[PROPOSAL] "Requires deterministic input"

2017-03-21 Thread Kenneth Knowles
Problem:

I will drop all nuance and say that the `Write` transform as it exists in
the SDK is incorrect until we add some specification and APIs. We can't
keep shipping an SDK with an unsafe transform in it, and IMO this certainly
blocks a stable release.

Specifically, there is pseudorandom data generated and once it has been
observed and used to produce a side effect, it cannot be regenerated
without erroneous results.

This generalizes: For some side-effecting user-defined functions, it is
vital that even across retries/replays they have a consistent view of the
contents of their input PCollection, because their effect on the outside
world cannot be retracted if/when they fail and are retried. Once the
runner ensures a consistent view of the input, it is then their own
responsibility to be idempotent.

Ideally we should specify this requirement for the user-defined function
without imposing any particular implementation strategy on Beam runners.

Proposal:

1. Let a DoFn declare (mechanism not important right now) that it "requires
deterministic input".

2. Each runner will need a way to induce deterministic input - the obvious
choice being a materialization.

I want to keep the discussion focused, so I'm leaving out any possibilities
of taking this further.

Regarding performance: Today places that require this tend to be already
paying the cost via GroupByKey / Reshuffle operations, since that was a
simple way to induce determinism in batch Dataflow* (doesn't work for most
other runners nor for streaming Dataflow). This change will replace a
hard-coded implementation strategy with a requirement that may be fulfilled
in the most efficient way available.

Thoughts?

Kenn (w/ lots of consult from colleagues, especially Ben)

* There is some overlap with the reshuffle/redistribute discussion because
of this historical situation, but I would like to leave that broader
discussion out of this correctness issue.


Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Amit Sela
On Tue, Mar 21, 2017 at 7:26 PM Mingmin Xu  wrote:

> Move discuss to dev-list
>
> Savepoint in Flink, also checkpoint in Spark, should be good enough to
> handle this case.
>
> When people don't enable these features, for example only need at-most-once
>
The Spark runner forces checkpointing on any streaming (Beam) application,
mostly because it uses mapWithState for reading from UnboundedSource and
updateStateByKey for GroupByKey - so by design, the Spark runner is
at-least-once. Generally, I always thought that applications that require
at-most-once are more focused on processing time only, as they only care
about whatever gets ingested into the pipeline at a specific time and
don't care (up to the point of losing data) about correctness.
I would be happy to hear more about your use case.

> semantic, each unbounded IO should try its best to restore from last
> offset, although CheckpointMark is null. Any ideas?
>
> Mingmin
>
> On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin  wrote:
>
> > hey,
> >
> > The native Beam UnboundedSource API supports resuming from checkpoint --
> > that specifically happens here
> > <
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674>
> when
> > the KafkaCheckpointMark is non-null.
> >
> > The FlinkRunner should be providing the KafkaCheckpointMark from the most
> > recent savepoint upon restore.
> >
> > There shouldn't be any "special" Flink runner support needed, nor is the
> > State API involved.
> >
> > Dan
> >
> > On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Would not it be Flink runner specific ?
> >>
> >> Maybe the State API could do the same in a runner agnostic way (just
> >> thinking loud) ?
> >>
> >> Regards
> >> JB
> >>
> >> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
> >>
> >>> From KafkaIO itself, looks like it either start_from_beginning or
> >>> start_from_latest. It's designed to leverage
> >>> `UnboundedSource.CheckpointMark`
> >>> during initialization, but so far I don't see it's provided by runners.
> >>> At the
> >>> moment Flink savepoints is a good option, created a JIRA(BEAM-1775
> >>> )  to handle it in
> >>> KafkaIO.
> >>>
> >>> Mingmin
> >>>
> >>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek  >>> > wrote:
> >>>
> >>> Hi,
> >>> Are you using Flink savepoints [1] when restoring your application?
> >>> If you
> >>> use this the Kafka offset should be stored in state and it should
> >>> restart
> >>> from the correct position.
> >>>
> >>> Best,
> >>> Aljoscha
> >>>
> >>> [1]
> >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
> >>> setup/savepoints.html
> >>>  >>> /setup/savepoints.html>
> >>> > On 21 Mar 2017, at 01:50, Jins George  >>> > wrote:
> >>> >
> >>> > Hello,
> >>> >
> >>> > I am writing a Beam pipeline(streaming) with Flink runner to
> >>> consume data
> >>> from Kafka and apply some transformations and persist to Hbase.
> >>> >
> >>> > If I restart the application (due to failure/manual restart), the
> >>> > consumer does not resume from the offset where it was prior to restart.
> >>> > It always resumes from the latest offset.
> >>> >
> >>> > If I enable Flink checkpointing with the HDFS state backend, the system
> >>> > appears to be resuming from the earliest offset.
> >>> >
> >>> > Is there a recommended way to resume from the offset where it was
> >>> stopped ?
> >>> >
> >>> > Thanks,
> >>> > Jins George
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> 
> >>> Mingmin
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
> >
>
>
> --
> 
> Mingmin
>


Re: [PROPOSAL] "Requires deterministic input"

2017-03-21 Thread Stephen Sisk
Hey Kenn-

this seems important, but I don't have all the context on what the problem
is.

Can you explain this sentence "Specifically, there is pseudorandom data
generated and once it has been observed and used to produce a side effect,
it cannot be regenerated without erroneous results." ?

Where is the pseudorandom data coming from? Perhaps a concrete example
would help?

S


On Tue, Mar 21, 2017 at 1:22 PM Kenneth Knowles 
wrote:

> Problem:
>
> I will drop all nuance and say that the `Write` transform as it exists in
> the SDK is incorrect until we add some specification and APIs. We can't
> keep shipping an SDK with an unsafe transform in it, and IMO this certainly
> blocks a stable release.
>
> Specifically, there is pseudorandom data generated and once it has been
> observed and used to produce a side effect, it cannot be regenerated
> without erroneous results.
>
> This generalizes: For some side-effecting user-defined functions, it is
> vital that even across retries/replays they have a consistent view of the
> contents of their input PCollection, because their effect on the outside
> world cannot be retracted if/when they fail and are retried. Once the
> runner ensures a consistent view of the input, it is then their own
> responsibility to be idempotent.
>
> Ideally we should specify this requirement for the user-defined function
> without imposing any particular implementation strategy on Beam runners.
>
> Proposal:
>
> 1. Let a DoFn declare (mechanism not important right now) that it "requires
> deterministic input".
>
> 2. Each runner will need a way to induce deterministic input - the obvious
> choice being a materialization.
>
> I want to keep the discussion focused, so I'm leaving out any possibilities
> of taking this further.
>
> Regarding performance: Today places that require this tend to be already
> paying the cost via GroupByKey / Reshuffle operations, since that was a
> simple way to induce determinism in batch Dataflow* (doesn't work for most
> other runners nor for streaming Dataflow). This change will replace a
> hard-coded implementation strategy with a requirement that may be fulfilled
> in the most efficient way available.
>
> Thoughts?
>
> Kenn (w/ lots of consult from colleagues, especially Ben)
>
> * There is some overlap with the reshuffle/redistribute discussion because
> of this historical situation, but I would like to leave that broader
> discussion out of this correctness issue.
>


Re: [PROPOSAL] "Requires deterministic input"

2017-03-21 Thread vikas rk
+1 for the general idea of runners handling it, rather than a hard-coded
implementation strategy.

For the Write transform, I believe you are talking about ApplyShardingKey,
which introduces non-deterministic behavior when retried?


*Let a DoFn declare (mechanism not important right now) that it "requires
deterministic input"*

*Each runner will need a way to induce deterministic input - the obvious
choice being a materialization.*

Does this mean that a runner will always materialize (or whatever the
strategy is) an input PCollection to this DoFn even though the PCollection
might have been produced by deterministic transforms? Would it make sense
to also let DoFns declare if they produce non-deterministic output?

-Vikas


On 21 March 2017 at 13:52, Stephen Sisk  wrote:

> Hey Kenn-
>
> this seems important, but I don't have all the context on what the problem
> is.
>
> Can you explain this sentence "Specifically, there is pseudorandom data
> generated and once it has been observed and used to produce a side effect,
> it cannot be regenerated without erroneous results." ?
>
> Where is the pseudorandom data coming from? Perhaps a concrete example
> would help?
>
> S
>
>
> On Tue, Mar 21, 2017 at 1:22 PM Kenneth Knowles 
> wrote:
>
> > Problem:
> >
> > I will drop all nuance and say that the `Write` transform as it exists in
> > the SDK is incorrect until we add some specification and APIs. We can't
> > keep shipping an SDK with an unsafe transform in it, and IMO this
> certainly
> > blocks a stable release.
> >
> > Specifically, there is pseudorandom data generated and once it has been
> > observed and used to produce a side effect, it cannot be regenerated
> > without erroneous results.
> >
> > This generalizes: For some side-effecting user-defined functions, it is
> > vital that even across retries/replays they have a consistent view of the
> > contents of their input PCollection, because their effect on the outside
> > world cannot be retracted if/when they fail and are retried. Once the
> > runner ensures a consistent view of the input, it is then their own
> > responsibility to be idempotent.
> >
> > Ideally we should specify this requirement for the user-defined function
> > without imposing any particular implementation strategy on Beam runners.
> >
> > Proposal:
> >
> > 1. Let a DoFn declare (mechanism not important right now) that it
> "requires
> > deterministic input".
> >
> > 2. Each runner will need a way to induce deterministic input - the
> obvious
> > choice being a materialization.
> >
> > I want to keep the discussion focused, so I'm leaving out any
> possibilities
> > of taking this further.
> >
> > Regarding performance: Today places that require this tend to be already
> > paying the cost via GroupByKey / Reshuffle operations, since that was a
> > simple way to induce determinism in batch Dataflow* (doesn't work for
> most
> > other runners nor for streaming Dataflow). This change will replace a
> > hard-coded implementation strategy with a requirement that may be
> fulfilled
> > in the most efficient way available.
> >
> > Thoughts?
> >
> > Kenn (w/ lots of consult from colleagues, especially Ben)
> >
> > * There is some overlap with the reshuffle/redistribute discussion
> because
> > of this historical situation, but I would like to leave that broader
> > discussion out of this correctness issue.
> >
>


First IO IT Running!

2017-03-21 Thread Jason Kuster
Hi all,

Exciting news! As of yesterday, we have checked in the Jenkins
configuration for our first continuously running IO Integration Test! You
can check it out in Jenkins here[1]. We’re also publishing results to a
database, and we’ve turned up a basic dashboarding system where you can see
the results here[2]. Caveat: there are only two runs, and we’ll be tweaking
the underlying system still, so don’t panic that we’re up and to the right
currently. ;)

This is the first test running continuously on top of the performance / IO
testing infrastructure described in this doc[3].  Initial support for Beam
is now present in PerfKit Benchmarker; given what they had already, it was
easiest to add support for Dataflow and Java. We need your help to add
additional support! The doc lists a number of JIRA issues to build out
support for other systems. I’m happy to work with people to help them
understand what is necessary for these tasks; just send an email to the
list if you need help and I’ll help you move forwards.

Looking forward to it!

Jason

[1] https://builds.apache.org/job/beam_PerformanceTests_JDBC/
[2]
https://apache-beam-testing.appspot.com/explore?dashboard=5714163003293696
[3]
https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOzph5FnL2DhaRDz0/edit?ts=58a78e73

-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow


Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Mingmin Xu
In SparkRunner, the default checkpoint storage is TmpCheckpointDirFactory.
Can it restore during job restart? (I haven't tested the runner in streaming
for some time.)

Regarding data completeness, I would use at-most-once when losing a little
data (mostly from task-node failures) is tolerated, compared to the
performance cost introduced by 'state'/'checkpoint'.

On Tue, Mar 21, 2017 at 1:36 PM, Amit Sela  wrote:

> On Tue, Mar 21, 2017 at 7:26 PM Mingmin Xu  wrote:
>
> > Move discuss to dev-list
> >
> > Savepoint in Flink, also checkpoint in Spark, should be good enough to
> > handle this case.
> >
> > When people don't enable these features, for example only need
> at-most-once
> >
> The Spark runner forces checkpointing on any streaming (Beam) application,
> mostly because it uses mapWithState for reading from UnboundedSource and
> updateStateByKey for GroupByKey - so by design, Spark runner is
> at-least-once. Generally, I always thought that applications that require
> at-most-once are more focused on processing time only, as they only care
> about whatever gets ingested into the pipeline at a specific time and
> don't care (up to the point of losing data) about correctness.
> I would be happy to hear more about your use case.
>
> > semantic, each unbounded IO should try its best to restore from last
> > offset, although CheckpointMark is null. Any ideas?
> >
> > Mingmin
> >
> > On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin 
> wrote:
> >
> > > hey,
> > >
> > > The native Beam UnboundedSource API supports resuming from checkpoint
> --
> > > that specifically happens here
> > > <https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674>
> > > when
> > > the KafkaCheckpointMark is non-null.
> > >
> > > The FlinkRunner should be providing the KafkaCheckpointMark from the
> most
> > > recent savepoint upon restore.
> > >
> > > There shouldn't be any "special" Flink runner support needed, nor is
> the
> > > State API involved.
> > >
> > > Dan
> > >
> > > On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré
> > > wrote:
> > >
> > >> Would not it be Flink runner specific ?
> > >>
> > >> Maybe the State API could do the same in a runner agnostic way (just
> > >> thinking loud) ?
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
> > >>
> > >>> From KafkaIO itself, looks like it either start_from_beginning or
> > >>> start_from_latest. It's designed to leverage
> > >>> `UnboundedSource.CheckpointMark`
> > >>> during initialization, but so far I don't see it provided by runners.
> > >>> At the
> > >>> moment Flink savepoints is a good option, created a JIRA(BEAM-1775
> > >>> )  to handle it in
> > >>> KafkaIO.
> > >>>
> > >>> Mingmin
> > >>>
> > >>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek <aljos...@apache.org>
> > >>> wrote:
> > >>>
> > >>> Hi,
> > >>> Are you using Flink savepoints [1] when restoring your
> application?
> > >>> If you
> > >>> use this the Kafka offset should be stored in state and it should
> > >>> restart
> > >>> from the correct position.
> > >>>
> > >>> Best,
> > >>> Aljoscha
> > >>>
> > >>> [1]
> > >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/savepoints.html
> > >>> > On 21 Mar 2017, at 01:50, Jins George wrote:
> > >>> >
> > >>> > Hello,
> > >>> >
> > >>> > I am writing a Beam pipeline(streaming) with Flink runner to
> > >>> consume data
> > >>> from Kafka and apply some transformations and persist to Hbase.
> > >>> >
> > >>> > If I restart the application ( due to failure/manual restart),
> > >>> consumer
> > >>> does not resume from the offset where it was prior to restart. It
> > >>> always
> > >>> resume from the latest offset.
> > >>> >
> > >>> > If I enable Flink checkpionting with hdfs state back-end,
> system
> > >>> appears
> > >>> to be resuming from the earliest offset
> > >>> >
> > >>> > Is there a recommended way to resume from the offset where it
> was
> > >>> stopped ?
> > >>> >
> > >>> > Thanks,
> > >>> > Jins George
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> 
> > >>> Mingmin
> > >>>
> > >>
> > >> --
> > >> Jean-Baptiste Onofré
> > >> jbono...@apache.org
> > >> http://blog.nanthrax.net
> > >> Talend - http://www.talend.com
> > >>
> > >
> > >
> >
> >
> > --
> > 
> > Mingmin
> >
>



-- 

Mingmin


Re: First IO IT Running!

2017-03-21 Thread Stephen Sisk
I'm really excited to see these tests are running!

These Jdbc tests are testing against a postgres instance - that instance is
running on the kubernetes cluster I've set up for beam IO ITs as discussed
in the "Hosting data stores for IO transform testing" thread[0]. I set up
that postgres instance using the kubernetes scripts for Jdbc[1]. Anyone can
run their own kubernetes cluster and do the same thing for themselves to
run the ITs. (I'd actually love to hear about it if anyone does.)

I'm excited to get a few more ITs using this infrastructure so we can test
it out/smooth out the remaining rough edges in creating ITs. I'm happy to
answer questions about that on the mailing list, but we obviously have to
have the process written down - the Testing IO Transforms in Apache Beam
doc [2] covers how to do this, but is still rough. I'm working on getting
that up on the website and ironing out the rough edges [3], but generally
reading that doc plus checking out how the JdbcIO or ElasticsearchIO tests
work should give you a sense of how to get it working. I'm also thinking we
might want to simplify the way we do data loading, so I don't consider this
process fully stabilized, but I'll port code written according to the
current standards to the new standards if we make changes.

ElasticsearchIO has all the prerequisites, so I'd like to get them going in
the near future. I know JB has started on this in his RedisIO PR, and the
HadoopInputFormatIO also has ITs & k8 scripts, so there's more in the pipe.
For now, each datastore has to be manually set up, but I'd like to automate
that process - I'll file a JIRA ticket shortly for that.

Thanks,
Stephen
[0] Hosting data stores for IO transform testing -
https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
[1] Postgres k8 scripts -
https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes
[2] IO testing guide -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
[3] Jira for IO guide - https://issues.apache.org/jira/browse/BEAM-1025

On Tue, Mar 21, 2017 at 2:28 PM Jason Kuster 
wrote:

> Hi all,
>
> Exciting news! As of yesterday, we have checked in the Jenkins
> configuration for our first continuously running IO Integration Test! You
> can check it out in Jenkins here[1]. We’re also publishing results to a
> database, and we’ve turned up a basic dashboarding system where you can see
> the results here[2]. Caveat: there are only two runs, and we’ll be tweaking
> the underlying system still, so don’t panic that we’re up and to the right
> currently. ;)
>
> This is the first test running continuously on top of the performance / IO
> testing infrastructure described in this doc[3].  Initial support for Beam
> is now present in PerfKit Benchmarker; given what they had already, it was
> easiest to add support for Dataflow and Java. We need your help to add
> additional support! The doc lists a number of JIRA issues to build out
> support for other systems. I’m happy to work with people to help them
> understand what is necessary for these tasks; just send an email to the
> list if you need help and I’ll help you move forwards.
>
> Looking forward to it!
>
> Jason
>
> [1] https://builds.apache.org/job/beam_PerformanceTests_JDBC/
> [2]
> https://apache-beam-testing.appspot.com/explore?dashboard=5714163003293696
> [3]
>
> https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOzph5FnL2DhaRDz0/edit?ts=58a78e73
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>


Re: [PROPOSAL] "Requires deterministic input"

2017-03-21 Thread Kenneth Knowles
Good points & questions. I'll try to be more clear.



> On 21 March 2017 at 13:52, Stephen Sisk  wrote:
>
> > Hey Kenn-
> >
> > this seems important, but I don't have all the context on what the
> problem
> > is.
> >
> > Can you explain this sentence "Specifically, there is pseudorandom data
> > generated and once it has been observed and used to produce a side
> effect,
> > it cannot be regenerated without erroneous results." ?
>

On Tue, Mar 21, 2017 at 2:04 PM, vikas rk  wrote:

>
> For the Write transform I believe you are talking about ApplyShardingKey
> <https://github.com/apache/beam/blob/d66029cafde152c0a46ebd276ddfa4c3e7fd3433/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Write.java#L304>
> which introduces non-deterministic behavior when retried?


Yes, exactly this. If the sharding key changes, then the rest of the
transform doesn't function correctly.


> Where is the pseudorandom data coming from? Perhaps a concrete example
> > would help?
>

I think the Write transform is a particularly complex example because of
the layers of abstraction. A simplified strawman might be:

Transform 1: Build RPC write descriptors identified by pseudo-random UUIDs.
Transform 2: Issue RPCs with those identifiers, so the endpoint will ignore
repeats of the same UUID (I tend to call this an "idempotency key" so I
might slip into that terminology sometimes)

In this case, transform 2 requires deterministic input because if the write
fails and is retried a new UUID means the endpoint won't know it is a
retry, resulting in duplicate data.

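To make the strawman concrete, here is a minimal, self-contained sketch (plain Java with illustrative names only, not Beam API) of why the idempotency key must be deterministic across retries:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: an endpoint that deduplicates writes by an
// idempotency key, and a bundle that fails and is retried.
public class IdempotencyKeySketch {
  static Map<String, String> endpoint = new HashMap<>(); // key -> payload

  // Transform 2: issue the RPC; the endpoint ignores repeats of the same key.
  static void write(String idempotencyKey, String payload) {
    endpoint.putIfAbsent(idempotencyKey, payload);
  }

  public static void main(String[] args) {
    // Non-deterministic input: the key is regenerated on retry, so the
    // endpoint sees two distinct keys and stores the payload twice.
    write(UUID.randomUUID().toString(), "record-1"); // first attempt
    write(UUID.randomUUID().toString(), "record-1"); // retry after failure
    if (endpoint.size() != 2) throw new AssertionError("expected duplicate");

    // Deterministic input: the key was materialized before the side effect,
    // so the retry replays the *same* key and the write is idempotent.
    endpoint.clear();
    String stableKey = "checkpointed-uuid-for-record-1"; // illustrative value
    write(stableKey, "record-1"); // first attempt
    write(stableKey, "record-1"); // retry after failure
    if (endpoint.size() != 1) throw new AssertionError("expected single write");
    System.out.println("OK");
  }
}
```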
Is this clearer?

Kenn


Re: Docker image dependencies

2017-03-21 Thread Stephen Sisk
Hey Ismael,

I definitely agree with you that we want something that developers will
actually be able to/want to use.

in my experience *all* the container orchestration engines are non-trivial
to set up. When I started examining solutions for beam hosting, I did
installs of mesos, kubernetes and docker. Docker is easier in the "run only
on my local machine" case if devs have it set up, but to do anything
interesting (ie, interact with machines that aren't already yours), they
all involve work to get them setup on each machine you want to use[4].

Kubernetes has some options that make it extremely simple to set up - both
AWS[2] and GCE[3] seem to be straightforward to set up for simple dev
clusters, with scripts to automate the process (I'm assuming docker has
similar setups.)

Once kubernetes is set up, it's also a simple yaml file + command to set up
multiple machines. The kubernetes setup for postgres[5] shows a simple one
machine example, and the kubernetes setups for HIFIO[6] show multi-machine
examples.

We've spent a lot of time discussing the various options - when we talked
about this earlier [1] we decided we would move forward with investigating
kubernetes, so that's what I used for the IO ITs work I've been doing,
which we've now gotten working.

Do you feel the advantages of docker are such that we should re-open the
discussion and potentially re-do the work we've done so far to get k8
working?

I took a genuine look at docker earlier in the process and it didn't seem
like it was better than the other options in any dimensions (other than
"developers usually have it installed already"), and kubernetes/mesos
seemed to be more stable/have more of the features discussed in [1].
Perhaps that's changed?

I think we are just starting to use container orchestration engines, and so
while I don't want to throw away the work we've done so far, I also don't
want to have to redo it later for reasons we could have known about now. :)

S

[1]
https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E

[2] k8 AWS - https://kubernetes.io/docs/getting-started-guides/aws/
[3] k8 GKE - https://cloud.google.com/container-engine/docs/quickstart or
https://kubernetes.io/docs/getting-started-guides/gce/
[4] docker swarm on GCE -
https://rominirani.com/docker-swarm-on-google-compute-engine-364765b400ed#.gzvruzis9

[5] postgres k8 script -
https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes

[6]
https://github.com/diptikul/incubator-beam/tree/HIFIO-CS-ES/sdks/java/io/hadoop/jdk1.8-tests/src/test/resources/kubernetes


On Mon, Mar 20, 2017 at 3:25 PM Ismaël Mejía  wrote:

I have somehow forgotten this one.

> Basically - I'm trying to keep number of tools at a minimum while still
> providing good support for the functionality we need. Does docker-compose
> provide something beyond the functionality that k8 does? I'm not familiar
> with docker-compose, but looking at
> https://docs.docker.com/ it doesn't
> seem to provide anything that k8 doesn't already.

I agree we should have the most minimal set of tools. I mentioned
docker-compose because its installation is trivial compared to kubernetes
(or even minikube for a local install), but docker-compose does not have
any significant advantage over kubernetes apart from being easier to
install/use.

But it's better to be consistent and go all in with kubernetes. However, we
need to find a way to help IO authors bootstrap this, because from my
experience creating a cluster with docker-compose is a yaml file + a
command, and I'm not sure the basic installation and run of kubernetes is
that easy.

Ismaël

On Wed, Mar 15, 2017 at 8:09 PM, Stephen Sisk 
wrote:
> thanks for the discussion! In general, I agree with the sentiments
> expressed here. I updated
>
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.hlirex1vus1a
> to
> reflect this discussion. (The plan is still that I will put that on the
> website.)
>
> Apache Docker Repository - are you talking about
> https://hub.docker.com/u/apache/ ? If not, can you point me at more info? I
> can't seem to find info about this on the publicly visible apache-infra
> mailing lists, and the apache infra website doesn't seem
> to mention a docker repository.
>
>
>
>> However the current Beam Elasticsearch IO does not support Elasticsearch
> 5, and elastic does not have an image for version 2, so in this particular case
> following the priority order we should use the official docker image (2)
> for the tests (assuming that both require the same version). Do you agree
> with this ?
>
> Yup, that makes sense to me.
>
>
>
>> How do we deal with IOs that require more than one base image, this is a
> common scenario for projects that depend on Zookeeper?
>
> Is there a reason not to just run a kubernetes ReplicaController+Service
> for these cases? k8 can easily sup

Re: [PROPOSAL] "Requires deterministic input"

2017-03-21 Thread Ben Chambers
Allowing an annotation on DoFn's that produce deterministic output could be
added in the future, but doesn't seem like a great option.

1. It is a correctness issue to assume a DoFn is deterministic and be
wrong, so we would need to assume all transform outputs are
non-deterministic unless annotated. Getting this correct is difficult (for
example, GBK is surprisingly non-deterministic except in specific cases).

2. It is unlikely to be a major performance improvement, given that any
non-deterministic transform prior to a sink (sinks are the most likely to
require deterministic input) will cause additional work to be needed.

Based on this, it seems like the risk of allowing an annotation is high
while the potential for performance improvements is low. The current
proposal (not allowing an annotation) makes sense for now, until we can
demonstrate that the impact on performance is high in cases that could be
avoided with an annotation (in real-world use).

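The GBK point above can be illustrated with a small sketch (plain Java, illustrative only, not Beam API): even when grouping gets the *membership* right, the iteration order of grouped values is not guaranteed, so any downstream logic sensitive to that order is non-deterministic across retries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a downstream fn that concatenates grouped values depends on
// arrival order, which can differ between two replays of the same bundle.
public class GbkOrderSketch {
  static String concatGrouped(List<String> arrival) {
    Map<String, StringBuilder> grouped = new HashMap<>();
    for (String v : arrival) {
      grouped.computeIfAbsent("k", k -> new StringBuilder()).append(v);
    }
    return grouped.get("k").toString();
  }

  public static void main(String[] args) {
    // Two replays of the same logical bundle, with elements arriving in a
    // different order (e.g. from parallel upstream workers).
    String attempt1 = concatGrouped(Arrays.asList("a", "b", "c"));
    String attempt2 = concatGrouped(Arrays.asList("c", "a", "b"));
    if (attempt1.equals(attempt2)) throw new AssertionError("orders happened to match");
    System.out.println(attempt1 + " vs " + attempt2); // prints "abc vs cab"
  }
}
```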
-- Ben

On Tue, Mar 21, 2017 at 2:05 PM vikas rk  wrote:

+1 for the general idea of runners handling it over hard-coded
implementation strategy.

For the Write transform I believe you are talking about ApplyShardingKey
<https://github.com/apache/beam/blob/d66029cafde152c0a46ebd276ddfa4c3e7fd3433/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Write.java#L304>
which introduces non-deterministic behavior when retried?


*Let a DoFn declare (mechanism not important right now) that it "requires
deterministic input"*



*Each runner will need a way to induce deterministic input - the obvious
choice being a materialization.*

Does this mean that a runner will always materialize (or whatever the
strategy is) an input PCollection to this DoFn even though the PCollection
might have been produced by deterministic transforms? Would it make sense
to also let DoFns declare if they produce non-deterministic output?

-Vikas


On 21 March 2017 at 13:52, Stephen Sisk  wrote:

> Hey Kenn-
>
> this seems important, but I don't have all the context on what the problem
> is.
>
> Can you explain this sentence "Specifically, there is pseudorandom data
> generated and once it has been observed and used to produce a side effect,
> it cannot be regenerated without erroneous results." ?
>
> Where is the pseudorandom data coming from? Perhaps a concrete example
> would help?
>
> S
>
>
> On Tue, Mar 21, 2017 at 1:22 PM Kenneth Knowles 
> wrote:
>
> > Problem:
> >
> > I will drop all nuance and say that the `Write` transform as it exists
in
> > the SDK is incorrect until we add some specification and APIs. We can't
> > keep shipping an SDK with an unsafe transform in it, and IMO this
> certainly
> > blocks a stable release.
> >
> > Specifically, there is pseudorandom data generated and once it has been
> > observed and used to produce a side effect, it cannot be regenerated
> > without erroneous results.
> >
> > This generalizes: For some side-effecting user-defined functions, it is
> > vital that even across retries/replays they have a consistent view of
the
> > contents of their input PCollection, because their effect on the outside
> > world cannot be retracted if/when they fail and are retried. Once the
> > runner ensures a consistent view of the input, it is then their own
> > responsibility to be idempotent.
> >
> > Ideally we should specify this requirement for the user-defined function
> > without imposing any particular implementation strategy on Beam runners.
> >
> > Proposal:
> >
> > 1. Let a DoFn declare (mechanism not important right now) that it
> "requires
> > deterministic input".
> >
> > 2. Each runner will need a way to induce deterministic input - the
> obvious
> > choice being a materialization.
> >
> > I want to keep the discussion focused, so I'm leaving out any
> possibilities
> > of taking this further.
> >
> > Regarding performance: Today places that require this tend to be already
> > paying the cost via GroupByKey / Reshuffle operations, since that was a
> > simple way to induce determinism in batch Dataflow* (doesn't work for
> most
> > other runners nor for streaming Dataflow). This change will replace a
> > hard-coded implementation strategy with a requirement that may be
> fulfilled
> > in the most efficient way available.
> >
> > Thoughts?
> >
> > Kenn (w/ lots of consult from colleagues, especially Ben)
> >
> > * There is some overlap with the reshuffle/redistribute discussion
> because
> > of this historical situation, but I would like to leave that broader
> > discussion out of this correctness issue.
> >
>


Re: Style: how much testing for transform builder classes?

2017-03-21 Thread Dan Halperin
https://github.com/apache/beam/commit/b202548323b4d59b11bbdf06c99d0f99e6a947ef
is one example where tests of feature Bar exist but did not discover bugs
that could be introduced by builders.

AutoValue likely alleviates many, but not all, of these concerns - as Ismaël
points out.



On Tue, Mar 21, 2017 at 1:18 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Wed, Mar 15, 2017 at 2:11 AM, Ismaël Mejía  wrote:
>
> > +1 to Vikas point maybe the right place to enforce things correct
> > build tests is in the validate and like this reduce the test
> > boilerplate and only test the validate, but I wonder if this totally
> > covers both cases (the buildsCorrectly and
> > buildsCorrectlyInDifferentOrder ones).
> >
> > I answer Eugene’s question here even if you are aware now since you
> > commented in the PR, so everyone understands the case.
> >
> > The case is pretty simple, when you extend an IO and add a new
> > configuration parameter, suppose we have withFoo(String foo) and we
> > want to add withBar(String bar). In some cases the implementation or
> > even worse the combination of those are not built correctly, so the
> > only way to guarantee that this works is to have code that tests the
> > complete parameter combination or tests that at least assert that the
> > object is built correctly.
> >
> > This is something that can happen both with or without AutoValue
> > because the with method is hand-written and the natural tendency with
> > boilerplate methods like this is to copy/paste, so we can end up doing
> > silly things like:
> >
> > private Read(String foo, String bar) { … }
> >
> > public Read withBar(String bar) {
> >   return new Read(foo, null);
> > }
> >
> > in this case the reference to bar is not stored or assigned (this is
> > similar to the case of the DatastoreIO PR), and AutoValue may seem to
> > solve this issue but you can also end up with this situation if you
> > copy paste the withFoo method and just change the method name:
> >
> > public Read withBar(String foo) {
> >   return builder().setFoo(foo).build();
> > }
> >
> > Of course both seem silly but both can happen and the tests at least
> > help to discover those,
> >
>
> Such mistakes should be entirely discovered by tests of feature Bar. If Bar
> is not actually being tested, that's a bigger problem with coverage that a
> construction-only test actually obscures (giving it negative value).
>
>
> >
> > On Wed, Mar 15, 2017 at 1:05 AM, vikas rk  wrote:
> > > Yes, what I meant is: Necessary tests are ones that blocks users if not
> > > present. Trivial or non-trivial shouldn't be the issue in such cases.
> > >
> > > Some of the boilerplate code and tests is because IO PTransforms are
> > > returned to the user before they are fully constructed and actual
> > > validation happens in the validate method rather than at construction.
> I
> > > understand that the reasoning here is that we want to support
> > constructing
> > > them with options in any order and using Builder pattern can be
> > confusing.
> > >
> > > If validate method is where all the validation happens, then we should
> > able
> > > to eliminate some redundant checks and tests during construction time
> > like
> > > in *withOption* methods here
> > > <google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L199>
> > > and here
> > > <google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L387>
> > > as
> > > these are also checked in the validate method.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > -Vikas
> > >
> > >
> > >
> > > On 14 March 2017 at 15:40, Eugene Kirpichov
> > > wrote:
> > >
> > >> Thanks all. Looks like people are on board with the general direction
> > >> though it remains to refine it to concrete guidelines to go into style
> > >> guide.
> > >>
> > >> Ismaël, can you give more details about the situation you described in
> > the
> > >> first paragraph? Is it perhaps that really a RunnableOnService test
> was
> > >> missing (and perhaps still is), rather than a builder test?
> > >>
> > >> Vikas, regarding trivial tests and user waiting for a work-around: in
> > the
> > >> situation I described, they don't really need a workaround - they
> > specified
> > >> an invalid value and have been minorly inconvenienced because the
> error
> > >> they got about it was not very readable, so fixing their value took
> > them a
> > >> little longer than it could have, but they fixed it and their work is
> > not
> > >> blocked. I think Robert's arguments about the cost of trivial tests
> > apply.
> > >>
> > >> I agree that the author should be at liberty to choose which
> validation
> > to
> > >> unit-test and which to skip as trivial, so documentation on this topic
> > >> should be in the form

Re: Style: how much testing for transform builder classes?

2017-03-21 Thread Robert Bradshaw
On Tue, Mar 21, 2017 at 5:14 PM, Dan Halperin 
wrote:

> https://github.com/apache/beam/commit/b202548323b4d59b11bbdf06c99d0f
> 99e6a947ef
> is one example where tests of feature Bar exist but did not discover bugs
> that could be introduced by builders.
>

True, though one would need to test the full cross-product of builder
orderings to discover this, a simple test of setValidate() wouldn't suffice
either. (And if not the full cross product, what subset?) Maybe some
automated fuzz testing would be a fair thing to do here (cheaper and more
comprehensive than manual tests...)?

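A minimal sketch of such an ordering test, using the hypothetical Read/withFoo/withBar from earlier in the thread (illustrative names, not a real Beam IO):

```java
import java.util.Objects;

// Sketch of the copy/paste bug discussed above, plus a tiny cross-product
// check over builder-call orderings that catches it.
public class BuilderOrderingSketch {
  static class Read {
    final String foo, bar;
    Read(String foo, String bar) { this.foo = foo; this.bar = bar; }
    Read withFoo(String foo) { return new Read(foo, bar); }
    // Bug: drops the existing foo instead of carrying it over.
    Read withBar(String bar) { return new Read(null, bar); }
  }

  public static void main(String[] args) {
    Read a = new Read(null, null).withFoo("f").withBar("b");
    Read b = new Read(null, null).withBar("b").withFoo("f");
    // A single-ordering test of withBar alone would pass; only comparing
    // the two orderings exposes that withFoo's value is lost in ordering a.
    boolean bugCaught = !Objects.equals(a.foo, b.foo);
    if (!bugCaught) throw new AssertionError("orderings agree; bug not caught");
    System.out.println("ordering test caught the bug");
  }
}
```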
AutoValue likely alleviates many, but not all, of these concerns - as Ismaël
> points out.
>

If two features are not orthogonal, that perhaps merits more test (and
documentation).


>
>
>
> On Tue, Mar 21, 2017 at 1:18 PM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > On Wed, Mar 15, 2017 at 2:11 AM, Ismaël Mejía  wrote:
> >
> > > +1 to Vikas point maybe the right place to enforce things correct
> > > build tests is in the validate and like this reduce the test
> > > boilerplate and only test the validate, but I wonder if this totally
> > > covers both cases (the buildsCorrectly and
> > > buildsCorrectlyInDifferentOrder ones).
> > >
> > > I answer Eugene’s question here even if you are aware now since you
> > > commented in the PR, so everyone understands the case.
> > >
> > > The case is pretty simple, when you extend an IO and add a new
> > > configuration parameter, suppose we have withFoo(String foo) and we
> > > want to add withBar(String bar). In some cases the implementation or
> > > even worse the combination of those are not built correctly, so the
> > > only way to guarantee that this works is to have code that tests the
> > > complete parameter combination or tests that at least assert that the
> > > object is built correctly.
> > >
> > > This is something that can happen both with or without AutoValue
> > > because the with method is hand-written and the natural tendency with
> > > boilerplate methods like this is to copy/paste, so we can end up doing
> > > silly things like:
> > >
> > > private Read(String foo, String bar) { … }
> > >
> > > public Read withBar(String bar) {
> > >   return new Read(foo, null);
> > > }
> > >
> > > in this case the reference to bar is not stored or assigned (this is
> > > similar to the case of the DatastoreIO PR), and AutoValue may seem to
> > > solve this issue but you can also end up with this situation if you
> > > copy paste the withFoo method and just change the method name:
> > >
> > > public Read withBar(String foo) {
> > >   return builder().setFoo(foo).build();
> > > }
> > >
> > > Of course both seem silly but both can happen and the tests at least
> > > help to discover those,
> > >
> >
> > Such mistakes should be entirely discovered by tests of feature Bar. If
> Bar
> > is not actually being tested, that's a bigger problem with coverage that
> a
> > construction-only test actually obscures (giving it negative value).
> >
> >
> > >
> > > On Wed, Mar 15, 2017 at 1:05 AM, vikas rk  wrote:
> > > > Yes, what I meant is: Necessary tests are ones that blocks users if
> not
> > > > present. Trivial or non-trivial shouldn't be the issue in such cases.
> > > >
> > > > Some of the boilerplate code and tests is because IO PTransforms are
> > > > returned to the user before they are fully constructed and actual
> > > > validation happens in the validate method rather than at
> construction.
> > I
> > > > understand that the reasoning here is that we want to support
> > > constructing
> > > > them with options in any order and using Builder pattern can be
> > > confusing.
> > > >
> > > > If validate method is where all the validation happens, then we
> should
> > > able
> > > > to eliminate some redundant checks and tests during construction time
> > > like
> > > > in *withOption* methods here
> > > > <google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java#L199>
> > > > and here
> > > > <google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/datastore/DatastoreV1.java#L387>
> > > > as
> > > > these are also checked in the validate method.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > -Vikas
> > > >
> > > >
> > > >
> > > > On 14 March 2017 at 15:40, Eugene Kirpichov
> > > > wrote:
> > > >
> > > >> Thanks all. Looks like people are on board with the general
> direction
> > > >> though it remains to refine it to concrete guidelines to go into
> style
> > > >> guide.
> > > >>
> > > >> Ismaël, can you give more details about the situation you described
> in
> > > the
> > > >> first paragraph? Is it perhaps that really a RunnableOnService test
> > was
> > > >> missing (and perhaps still is), rather than a builder tes

Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Dan Halperin
[We should keep user list involved if that's where the discussion
originally was :)]

Jins George's original question was a good one. The right way to resume
from the previous offset here is what we're already doing – use the
KafkaCheckpointMark. In Beam, the runner maintains the state and not the
external system. Beam runners are responsible for maintaining the
checkpoint marks, and for redoing all uncommitted (uncheckpointed) work. If
a user disables checkpointing, then they are explicitly opting into "redo
all work" on restart.

--> If checkpointing is enabled but the KafkaCheckpointMark is not being
provided, then I'm inclined to agree with Amit that there may simply be a
bug in the FlinkRunner. (+aljoscha)

For what Mingmin Xu asked about: presumably if the Kafka source is
initially configured to "read from latest offset", when it restarts with no
checkpoint this will automatically go find the latest offset. That would
mimic at-most-once semantics in a buggy runner that did not provide
checkpointing.

Dan
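
The resume contract described above can be sketched in a runner-agnostic way. This is a simplified model, not Beam's actual API: `Mark` and `Reader` below are illustrative stand-ins for `UnboundedSource.CheckpointMark` and an unbounded reader.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, runner-agnostic model of the contract: the *runner* persists
// the checkpoint mark, and the source resumes from it. Illustrative
// stand-ins only, not Beam's real classes.
class CheckpointDemo {

    // Stand-in for UnboundedSource.CheckpointMark.
    static class Mark {
        final long nextOffset;
        Mark(long nextOffset) { this.nextOffset = nextOffset; }
    }

    static class Reader {
        private long offset;

        // With a mark, resume exactly where the last run left off; without
        // one (checkpointing disabled, or a buggy runner), fall back to the
        // configured start position, e.g. "latest".
        Reader(Mark mark, long configuredStart) {
            this.offset = (mark != null) ? mark.nextOffset : configuredStart;
        }

        long readNext() { return offset++; }

        // The runner calls this when it persists state.
        Mark getCheckpointMark() { return new Mark(offset); }
    }

    // Simulate a restart: build a reader from a (possibly null) mark and
    // return the offsets it reads.
    static List<Long> restartAndRead(Mark mark, long configuredStart, int n) {
        Reader reader = new Reader(mark, configuredStart);
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(reader.readNext());
        }
        return out;
    }
}
```

With a mark the reader resumes exactly; with none it redoes (or skips) work from the configured start, which is the "redo all work" opt-in described above.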

On Tue, Mar 21, 2017 at 2:59 PM, Mingmin Xu  wrote:

> In SparkRunner, the default checkpoint storage is TmpCheckpointDirFactory.
> Can it restore during job restart? (I haven't tested the runner in
> streaming for some time.)
>
> Regarding data completeness, I would use at-most-once when a small amount
> of missing data (mostly from task-node failures) is tolerated, compared to
> the performance cost introduced by 'state'/'checkpoint'.
>
> On Tue, Mar 21, 2017 at 1:36 PM, Amit Sela  wrote:
>
> > On Tue, Mar 21, 2017 at 7:26 PM Mingmin Xu  wrote:
> >
> > > Move discuss to dev-list
> > >
> > > Savepoints in Flink, and checkpoints in Spark, should be good enough to
> > > handle this case.
> > >
> > > When people don't enable these features, for example only need
> > at-most-once
> > >
> > The Spark runner forces checkpointing on any streaming (Beam)
> application,
> > mostly because it uses mapWithState for reading from UnboundedSource and
> > updateStateByKey from GroupByKey - so by design, the Spark runner is
> > at-least-once. Generally, I always thought that applications that require
> > at-most-once are more focused on processing time only, as they only care
> > about whatever gets ingested into the pipeline at a specific time and
> > don't care (up to the point of losing data) about correctness.
> > I would be happy to hear more about your use case.
> >
> > > semantics, each unbounded IO should try its best to restore from the
> > > last offset, even though the CheckpointMark is null. Any ideas?
> > >
> > > Mingmin
> > >
> > > On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin 
> > wrote:
> > >
> > > > hey,
> > > >
> > > > The native Beam UnboundedSource API supports resuming from checkpoint
> > --
> > > > that specifically happens here
> > > > <
> > > https://github.com/apache/beam/blob/master/sdks/java/io/kafk
> > a/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674>
> > > when
> > > > the KafkaCheckpointMark is non-null.
> > > >
> > > > The FlinkRunner should be providing the KafkaCheckpointMark from the
> > most
> > > > recent savepoint upon restore.
> > > >
> > > > There shouldn't be any "special" Flink runner support needed, nor is
> > the
> > > > State API involved.
> > > >
> > > > Dan
> > > >
> > > > On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > >> Would not it be Flink runner specific ?
> > > >>
> > > >> Maybe the State API could do the same in a runner agnostic way (just
> > > >> thinking loud) ?
> > > >>
> > > >> Regards
> > > >> JB
> > > >>
> > > >> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
> > > >>
> > > >>> From KafkaIO itself, it looks like it either starts from the
> > > >>> beginning or from the latest offset. It's designed to leverage
> > > >>> `UnboundedSource.CheckpointMark` during initialization, but so far I
> > > >>> don't see it provided by runners. At the moment Flink savepoints are
> > > >>> a good option; I created a JIRA (BEAM-1775) to handle it in KafkaIO.
> > > >>>
> > > >>> Mingmin
> > > >>>
> > > >>> On Tue, Mar 21, 2017 at 3:40 AM, Aljoscha Krettek <
> > aljos...@apache.org
> > > >>> > wrote:
> > > >>>
> > > >>> Hi,
> > > >>> Are you using Flink savepoints [1] when restoring your
> > application?
> > > >>> If you
> > > >>> use this the Kafka offset should be stored in state and it
> should
> > > >>> restart
> > > >>> from the correct position.
> > > >>>
> > > >>> Best,
> > > >>> Aljoscha
> > > >>>
> > > >>> [1]
> > > >>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/
> > > >>> setup/savepoints.html
> > > >>> > On 21 Mar 2017, at 01:50, Jins George  > > >>> > wrote:
> > > >>> >
> > > >>> > Hello,
> > > >>> >
> > > >>> > I am writing a

Re: Kafka Offset handling for Restart/failure scenarios.

2017-03-21 Thread Raghu Angadi
Expanding a bit more on what Dan wrote:

   - In Dataflow, there are two modes of restarting a job : regular stop
   and then start & an *update*. The checkpoint is carried over only in the
   case of update.
   - Update is the only way to keep 'exactly-once' semantics.
   - If the requirements are not very strict, you can enable offset commits
   in Kafka itself. KafkaIO lets you configure this. Here the pipeline would
   start reading from approximately where it left off in the previous run.
  - When offset commits are enabled, KafkaIO could do this by
  implementing the 'finalize()' API on KafkaCheckpointMark [1].
  - This is runner independent.
  - The compromise is that this might skip a few records or read a few
  old records when the pipeline is restarted.
  - This does not override 'resume from checkpoint' support when runner
  provides KafkaCheckpointMark. Externally committed offsets are used only
  when KafkaIO's own CheckpointMark is not available.

[1]:
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaCheckpointMark.java#L50
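
The preference order in the bullets above (runner-provided mark, then externally committed offset, then the configured default) can be sketched as follows. This is a sketch under stated assumptions: the `Map` stands in for Kafka's consumer-group offset storage, and the names are illustrative rather than KafkaIO's real API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fallback: commit offsets externally when a checkpoint mark
// is finalized, and on restart prefer the runner's mark, then the external
// commit, then the configured default. The Map stands in for Kafka's
// consumer-group offset storage; names are illustrative.
class OffsetCommitDemo {
    static final Map<String, Long> externalCommits = new HashMap<>();

    static class Mark {
        final String group;
        final long nextOffset;

        Mark(String group, long nextOffset) {
            this.group = group;
            this.nextOffset = nextOffset;
        }

        // finalizeCheckpoint() is called once the mark is durably persisted;
        // that is the safe point to commit offsets externally.
        void finalizeCheckpoint() { externalCommits.put(group, nextOffset); }
    }

    static long resumeOffset(Mark runnerMark, String group, long configuredDefault) {
        if (runnerMark != null) {
            return runnerMark.nextOffset; // exact resume
        }
        Long committed = externalCommits.get(group);
        // Approximate resume: an external commit can lag the true position,
        // so a few records may be re-read (or skipped) after restart.
        return (committed != null) ? committed : configuredDefault;
    }
}
```

The external commit only matters when no runner mark is available, which is exactly the "does not override 'resume from checkpoint'" point above.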

On Tue, Mar 21, 2017 at 5:28 PM, Dan Halperin  wrote:

> [We should keep user list involved if that's where the discussion
> originally was :)]
>
> Jins George's original question was a good one. The right way to resume
> from the previous offset here is what we're already doing – use the
> KafkaCheckpointMark. In Beam, the runner maintains the state and not the
> external system. Beam runners are responsible for maintaining the
> checkpoint marks, and for redoing all uncommitted (uncheckpointed) work. If
> a user disables checkpointing, then they are explicitly opting into "redo
> all work" on restart.
>
> --> If checkpointing is enabled but the KafkaCheckpointMark is not being
> provided, then I'm inclined to agree with Amit that there may simply be a
> bug in the FlinkRunner. (+aljoscha)
>
> For what Mingmin Xu asked about: presumably if the Kafka source is
> initially configured to "read from latest offset", when it restarts with no
> checkpoint this will automatically go find the latest offset. That would
> mimic at-most-once semantics in a buggy runner that did not provide
> checkpointing.
>
> Dan
>
> On Tue, Mar 21, 2017 at 2:59 PM, Mingmin Xu  wrote:
>
>> In SparkRunner, the default checkpoint storage is TmpCheckpointDirFactory.
>> Can it restore during job restart? (I haven't tested the runner in
>> streaming for some time.)
>>
>> Regarding data completeness, I would use at-most-once when a small amount
>> of missing data (mostly from task-node failures) is tolerated, compared to
>> the performance cost introduced by 'state'/'checkpoint'.
>>
>> On Tue, Mar 21, 2017 at 1:36 PM, Amit Sela  wrote:
>>
>> > On Tue, Mar 21, 2017 at 7:26 PM Mingmin Xu  wrote:
>> >
>> > > Move discuss to dev-list
>> > >
>> > > Savepoints in Flink, and checkpoints in Spark, should be good enough
>> > > to handle this case.
>> > >
>> > > When people don't enable these features, for example only need
>> > at-most-once
>> > >
>> > The Spark runner forces checkpointing on any streaming (Beam)
>> application,
>> > mostly because it uses mapWithState for reading from UnboundedSource and
>> > updateStateByKey from GroupByKey - so by design, the Spark runner is
>> > at-least-once. Generally, I always thought that applications that
>> require
>> > at-most-once are more focused on processing time only, as they only care
>> > about whatever gets ingested into the pipeline at a specific time and
>> > don't care (up to the point of losing data) about correctness.
>> > I would be happy to hear more about your use case.
>> >
>> > > semantic, each unbounded IO should try its best to restore from last
>> > > offset, although CheckpointMark is null. Any ideas?
>> > >
>> > > Mingmin
>> > >
>> > > On Tue, Mar 21, 2017 at 9:39 AM, Dan Halperin 
>> > wrote:
>> > >
>> > > > hey,
>> > > >
>> > > > The native Beam UnboundedSource API supports resuming from
>> checkpoint
>> > --
>> > > > that specifically happens here
>> > > > <
>> > > https://github.com/apache/beam/blob/master/sdks/java/io/kafk
>> > a/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L674>
>> > > when
>> > > > the KafkaCheckpointMark is non-null.
>> > > >
>> > > > The FlinkRunner should be providing the KafkaCheckpointMark from the
>> > most
>> > > > recent savepoint upon restore.
>> > > >
>> > > > There shouldn't be any "special" Flink runner support needed, nor is
>> > the
>> > > > State API involved.
>> > > >
>> > > > Dan
>> > > >
>> > > > On Tue, Mar 21, 2017 at 9:01 AM, Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> > >
>> > > > wrote:
>> > > >
>> > > >> Would not it be Flink runner specific ?
>> > > >>
>> > > >> Maybe the State API could do the same in a runner agnostic way
>> (just
>> > > >> thinking loud) ?
>> > > >>
>> > > >> Regards
>> > > >> JB
>> > > >>
>> > > >> On 03/21/2017 04:56 PM, Mingmin Xu wrote:
>> > > >>
>> > > >>> From KafkaIO itself, looks like it either start_from

Re: First IO IT Running!

2017-03-21 Thread Kenneth Knowles
This is a really exciting development! I would definitely like to help out.
Still ingesting the docs and JIRAs.

On Tue, Mar 21, 2017 at 3:01 PM, Stephen Sisk 
wrote:

> I'm really excited to see these tests are running!
>
> These Jdbc tests are testing against a postgres instance - that instance is
> running on the kubernetes cluster I've set up for beam IO ITs as discussed
> in the "Hosting data stores for IO transform testing" thread[0]. I set up
> that postgres instance using the kubernetes scripts for Jdbc[1]. Anyone can
> run their own kubernetes cluster and do the same thing for themselves to
> run the ITs. (I'd actually love to hear about that if anyone does it.)
>
> I'm excited to get a few more ITs using this infrastructure so we can test
> it out/smooth out the remaining rough edges in creating ITs. I'm happy to
> answer questions about that on the mailing list, but we obviously have to
> have the process written down - the Testing IO Transforms in Apache Beam
> doc [2] covers how to do this, but is still rough. I'm working on getting
> that up on the website and ironing out the rough edges [3], but generally
> reading that doc plus checking out how the JdbcIO or ElasticsearchIO tests
> work should give you a sense of how to get it working. I'm also thinking we
> might want to simplify the way we do data loading, so I don't consider this
> process fully stabilized, but I'll port code written according to the
> current standards to the new standards if we make changes.
>
> ElasticsearchIO has all the prerequisites, so I'd like to get them going in
> the near future. I know JB has started on this in his RedisIO PR, and the
> HadoopInputFormatIO also has ITs & k8 scripts, so there's more in the pipe.
> For now, each datastore has to be manually set up, but I'd like to automate
> that process - I'll file a JIRA ticket shortly for that.
>
> Thanks,
> Stephen
> [0] Hosting data stores for IO transform testing -
> https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6a
> c438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
> [1] Postgres k8 scripts -
> https://github.com/apache/beam/tree/master/sdks/java/io/
> jdbc/src/test/resources/kubernetes
> [2] IO testing guide -
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-
> NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
> [3] Jira for IO guide - https://issues.apache.org/jira/browse/BEAM-1025
>
> > On Tue, Mar 21, 2017 at 2:28 PM Jason Kuster
> wrote:
>
> > Hi all,
> >
> > Exciting news! As of yesterday, we have checked in the Jenkins
> > configuration for our first continuously running IO Integration Test! You
> > can check it out in Jenkins here[1]. We’re also publishing results to a
> > database, and we’ve turned up a basic dashboarding system where you can
> see
> > the results here[2]. Caveat: there are only two runs, and we’ll be
> tweaking
> > the underlying system still, so don’t panic that we’re up and to the
> right
> > currently. ;)
> >
> > This is the first test running continuously on top of the performance /
> IO
> > testing infrastructure described in this doc[3].  Initial support for
> Beam
> > is now present in PerfKit Benchmarker; given what they had already, it
> was
> > easiest to add support for Dataflow and Java. We need your help to add
> > additional support! The doc lists a number of JIRA issues to build out
> > support for other systems. I’m happy to work with people to help them
> > understand what is necessary for these tasks; just send an email to the
> > list if you need help and I’ll help you move forwards.
> >
> > Looking forward to it!
> >
> > Jason
> >
> > [1] https://builds.apache.org/job/beam_PerformanceTests_JDBC/
> > [2]
> > https://apache-beam-testing.appspot.com/explore?dashboard=
> 5714163003293696
> > [3]
> >
> > https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOz
> ph5FnL2DhaRDz0/edit?ts=58a78e73
> >
> > --
> > ---
> > Jason Kuster
> > Apache Beam / Google Cloud Dataflow
> >
>


Re: [DISCUSSION] using NexMark for Beam

2017-03-21 Thread Kenneth Knowles
This is great! Having a variety of realistic-ish pipelines running on all
runners complements the validation suite and IO IT work.

If I recall, some of these involve heavy and esoteric uses of state, so
definitely give me a ping if you hit any trouble.

Kenn

On Tue, Mar 21, 2017 at 9:38 AM, Etienne Chauchot 
wrote:

> Hi all,
>
> Ismael and I are working on upgrading the Nexmark implementation for Beam.
> See https://github.com/iemejia/beam/tree/BEAM-160-nexmark and
> https://issues.apache.org/jira/browse/BEAM-160. We are continuing the
> work done by Mark Shields. See https://github.com/apache/beam/pull/366
> for the original PR.
>
> The PR contains queries that have a wide coverage of the Beam model and
> that represent a realistic end user use case (some come from client
> experience on Google Cloud Dataflow).
>
> So far, we have upgraded the implementation to the latest Beam snapshot.
> And we are able to execute a good subset of the queries in the different
> runners. We upgraded the nexmark drivers to do so: direct driver (upgraded
> from inProcessDriver) and flink driver and we added a new one for spark.
>
> There is still a good amount of work to do and we would like to know if
> you think that this contribution can have its place into Beam eventually.
>
> The interests of having Nexmark on Beam that we have seen so far are:
>
> - Rich batch/streaming test
>
> - A-B testing of runners or runtimes (non-regression, performance
> comparison between versions ...)
>
> - Integration testing (sdk/runners, runner/runtime, ...)
>
> - Validate beam capability matrix
>
> - It can be used as part of the ongoing PerfKit work (if there is any
> interest).
>
> As a final note, we are tracking the issues in the same repo. If someone
> is interested in contributing or has more ideas, you are welcome :)
>
> Etienne
>
>


Re: First IO IT Running!

2017-03-21 Thread Jean-Baptiste Onofré

Awesome !!! Great news !

Thanks guys for that !

I started to implement ITs in the JMS, MQTT, Redis, and Cassandra IOs. I'll keep you posted.

Regards
JB

On 03/21/2017 11:01 PM, Stephen Sisk wrote:

I'm really excited to see these tests are running!

These Jdbc tests are testing against a postgres instance - that instance is
running on the kubernetes cluster I've set up for beam IO ITs as discussed
in the "Hosting data stores for IO transform testing" thread[0]. I set up
that postgres instance using the kubernetes scripts for Jdbc[1]. Anyone can
run their own kubernetes cluster and do the same thing for themselves to
run the ITs. (I'd actually love to hear about that if anyone does it.)

I'm excited to get a few more ITs using this infrastructure so we can test
it out/smooth out the remaining rough edges in creating ITs. I'm happy to
answer questions about that on the mailing list, but we obviously have to
have the process written down - the Testing IO Transforms in Apache Beam
doc [2] covers how to do this, but is still rough. I'm working on getting
that up on the website and ironing out the rough edges [3], but generally
reading that doc plus checking out how the JdbcIO or ElasticsearchIO tests
work should give you a sense of how to get it working. I'm also thinking we
might want to simplify the way we do data loading, so I don't consider this
process fully stabilized, but I'll port code written according to the
current standards to the new standards if we make changes.

ElasticsearchIO has all the prerequisites, so I'd like to get them going in
the near future. I know JB has started on this in his RedisIO PR, and the
HadoopInputFormatIO also has ITs & k8 scripts, so there's more in the pipe.
For now, each datastore has to be manually set up, but I'd like to automate
that process - I'll file a JIRA ticket shortly for that.

Thanks,
Stephen
[0] Hosting data stores for IO transform testing -
https://lists.apache.org/thread.html/9fd3c51cb679706efa4d0df2111a6ac438b851818b639aba644607af@%3Cdev.beam.apache.org%3E
[1] Postgres k8 scripts -
https://github.com/apache/beam/tree/master/sdks/java/io/jdbc/src/test/resources/kubernetes
[2] IO testing guide -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?usp=sharing
[3] Jira for IO guide - https://issues.apache.org/jira/browse/BEAM-1025

On Tue, Mar 21, 2017 at 2:28 PM Jason Kuster 
wrote:


Hi all,

Exciting news! As of yesterday, we have checked in the Jenkins
configuration for our first continuously running IO Integration Test! You
can check it out in Jenkins here[1]. We’re also publishing results to a
database, and we’ve turned up a basic dashboarding system where you can see
the results here[2]. Caveat: there are only two runs, and we’ll be tweaking
the underlying system still, so don’t panic that we’re up and to the right
currently. ;)

This is the first test running continuously on top of the performance / IO
testing infrastructure described in this doc[3].  Initial support for Beam
is now present in PerfKit Benchmarker; given what they had already, it was
easiest to add support for Dataflow and Java. We need your help to add
additional support! The doc lists a number of JIRA issues to build out
support for other systems. I’m happy to work with people to help them
understand what is necessary for these tasks; just send an email to the
list if you need help and I’ll help you move forwards.

Looking forward to it!

Jason

[1] https://builds.apache.org/job/beam_PerformanceTests_JDBC/
[2]
https://apache-beam-testing.appspot.com/explore?dashboard=5714163003293696
[3]

https://docs.google.com/document/d/1PsjGPSN6FuorEEPrKEP3u3m16tyOzph5FnL2DhaRDz0/edit?ts=58a78e73

--
---
Jason Kuster
Apache Beam / Google Cloud Dataflow





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[DISCUSSION] Consistent use of loggers

2017-03-21 Thread Aviem Zur
Hi all,

There have been a few reports lately (On JIRA [1] and on Slack) from users
regarding inconsistent loggers used across Beam's modules.

While we use SLF4J, different modules use a different logger behind it
(JUL, log4j, etc)
So when people add a log4j.properties file to their classpath for instance,
they expect this to affect all of their dependencies on Beam modules, but
it doesn’t and they miss out on some logs they thought they would see.

I think we should strive for consistency in which logger is used behind
SLF4J, and try to enforce this in our modules.
I for one think it should be slf4j-log4j. However, if performance of
logging is critical we might want to consider logback.

Note: SLF4J will still be the facade for logging across the project. The
only change would be the logger SLF4J delegates to.

Once we have something like this it would also be useful to add
documentation on logging in Beam to the website.

[1] https://issues.apache.org/jira/browse/BEAM-1757
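
The facade-vs-binding split under discussion can be modeled with a toy sketch. The classes below are illustrative stand-ins for slf4j-api and a binding such as slf4j-log4j12, not SLF4J's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the facade pattern: code logs against an API-only
// interface, and whichever binding is present at runtime decides where the
// output goes. Illustrative stand-ins for slf4j-api and a binding.
class FacadeDemo {
    interface Binding { void write(String line); }

    // The "api-only, no binding on the classpath" case: output is silently
    // dropped, which is why shipping without a default binding needs to be
    // documented clearly.
    static class NoOpBinding implements Binding {
        public void write(String line) {}
    }

    // A concrete binding chosen by the user, standing in for log4j/logback.
    static class ListBinding implements Binding {
        final List<String> lines = new ArrayList<>();
        public void write(String line) { lines.add(line); }
    }

    static Binding binding = new NoOpBinding();

    static void info(String msg) { binding.write("INFO " + msg); }
}
```

The point of the pattern is that `info()` never changes; only the binding on the classpath does, which is exactly what keeping Beam's modules on slf4j-api alone would give users.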


Re: [DISCUSSION] Consistent use of loggers

2017-03-21 Thread Jean-Baptiste Onofré

Hi Aviem,

Good point.

I think, in our dependency set, we should just depend on slf4j-api and let the
user provide the binding they want (slf4j-log4j12, slf4j-simple, whatever).


We define a binding only with test scope in our modules.

Regards
JB

On 03/22/2017 04:58 AM, Aviem Zur wrote:

Hi all,

There have been a few reports lately (On JIRA [1] and on Slack) from users
regarding inconsistent loggers used across Beam's modules.

While we use SLF4J, different modules use a different logger behind it
(JUL, log4j, etc)
So when people add a log4j.properties file to their classpath for instance,
they expect this to affect all of their dependencies on Beam modules, but
it doesn’t and they miss out on some logs they thought they would see.

I think we should strive for consistency in which logger is used behind
SLF4J, and try to enforce this in our modules.
I for one think it should be slf4j-log4j. However, if performance of
logging is critical we might want to consider logback.

Note: SLF4J will still be the facade for logging across the project. The
only change would be the logger SLF4J delegates to.

Once we have something like this it would also be useful to add
documentation on logging in Beam to the website.

[1] https://issues.apache.org/jira/browse/BEAM-1757



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSSION] Consistent use of loggers

2017-03-21 Thread Aviem Zur
+1 to what JB said.

It will just have to be documented well: if we provide no binding, there will
be no logging out of the box unless the user adds one.

On Wed, Mar 22, 2017 at 6:24 AM Jean-Baptiste Onofré 
wrote:

> Hi Aviem,
>
> Good point.
>
> I think, in our dependency set, we should just depend on slf4j-api and let
> the user provide the binding they want (slf4j-log4j12, slf4j-simple,
> whatever).
>
> We define a binding only with test scope in our modules.
>
> Regards
> JB
>
> On 03/22/2017 04:58 AM, Aviem Zur wrote:
> > Hi all,
> >
> > There have been a few reports lately (On JIRA [1] and on Slack) from
> users
> > regarding inconsistent loggers used across Beam's modules.
> >
> > While we use SLF4J, different modules use a different logger behind it
> > (JUL, log4j, etc)
> > So when people add a log4j.properties file to their classpath for
> instance,
> > they expect this to affect all of their dependencies on Beam modules, but
> > it doesn’t and they miss out on some logs they thought they would see.
> >
> > I think we should strive for consistency in which logger is used behind
> > SLF4J, and try to enforce this in our modules.
> > I for one think it should be slf4j-log4j. However, if performance of
> > logging is critical we might want to consider logback.
> >
> > Note: SLF4J will still be the facade for logging across the project. The
> > only change would be the logger SLF4J delegates to.
> >
> > Once we have something like this it would also be useful to add
> > documentation on logging in Beam to the website.
> >
> > [1] https://issues.apache.org/jira/browse/BEAM-1757
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>