Re: [VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-20 Thread Aljoscha Krettek
+1

- verified signatures
- ran Quickstart using the staging repository on Flink cluster
- verified build from source

(We'll probably do a 0.5.0 release shortly after this so we can fix the
BigQuery issues.)

On Mon, 19 Dec 2016 at 21:22 Dan Halperin 
wrote:

> I vetted the binary artifacts accompanying the release by running several
> jobs on the Dataflow and Direct runners. At a high level, the release looks
> fine -- I ran some of my favorite jobs and they all worked swimmingly.
>
> There are some severe bugs in BigQueryIO in the release. Specifically, we
> broke the ability to write to BigQuery using different tables for every
> window. To a large degree, this makes BigQuery useless when working with
> unbounded data (streaming pipelines). The bugs have been fixed (and
> accompanying tests added) in PRs #1651 and #1400.
>
> Conclusion: +0.8
>
> * 0.4.0-incubating RC3 is largely an improvement over 0.3.0-incubating,
> especially in the user getting started experience.
> * The bugs in BigQueryIO are blockers for BigQuery users, but this is
> likely a relatively small fraction of the Beam community. I would not
> retract RC3 based on this alone. Unless we plan to cut an RC4 for other
> reasons, we should move forward with RC3.
>
> I'd hope that we hear from key users of the Apex, Flink, and Spark runners
> before closing the vote, even though it's technically been 72+ hours. I
> suggest we wait to ensure they have an opportunity to chime in.
>
> Thanks,
> Dan
>
>
> Appendix: pom.xml changes to use binary releases from Apache Staging:
>
>   <repositories>
>     <repository>
>       <id>apache.staging</id>
>       <name>Apache Development Staging Repository</name>
>       <url>https://repository.apache.org/content/repositories/staging/</url>
>       <releases>
>         <enabled>true</enabled>
>       </releases>
>       <snapshots>
>         <enabled>false</enabled>
>       </snapshots>
>     </repository>
>   </repositories>
>
> On Sun, Dec 18, 2016 at 10:14 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi guys,
> >
> > The good thing is that my issue to access repository.apache.org Nexus is
> > now fixed.
> >
> > To update the signature files, we have to drop the Nexus repository to
> > stage a new one,
> > meaning cancel the current vote to do a new RC4.
> >
> > I can do that, up to you.
> >
> > Anyway, regarding the release content, +1 (binding).
> >
> > Regards
> > JB
> >
> >
> > On 12/18/2016 06:56 PM, Davor Bonaci wrote:
> >
> >> Indeed -- I did help JB with the release ever so slightly, due to the
> >> networking connectivity issue reaching repository.apache.org, which JB
> >> further described and is tracked in INFRA-13086 [1]. This is not
> >> Beam-specific.
> >>
> >> The current signature shouldn't be a problem at all, but, since others
> >> are asking about it, I think it would be best to simply re-sign the
> >> source .zip archive and continue this vote. JB, what do you think?
> >>
> >> Regarding the release itself, I think we need to keep raising the
> quality
> >> and maturity release-over-release, and test signals are an excellent way
> >> to
> >> demonstrate that. Due to the recent upgrades to Jenkins, usage of the
> DSL,
> >> etc. (thanks INFRA and Jason Kuster), we can now, for the first time,
> >> formally show that the release candidate clearly passes all Jenkins
> suites
> >> that we have:
> >> * All unit tests across the project, plus example ITs across all runners
> >> [2], [3].
> >> * All integration tests on the Apex runner [4].
> >> * All integration tests on the Flink runner [5].
> >> * All integration tests on the Spark runner [6].
> >> * All integration tests on the Dataflow runner [7].
> >>
> >> That said, I know of a few issues/regressions in the areas that are not
> >> well tested today. I think Dan Halperin has more context, so I'll let
> him
> >> speak of the details, and quote relevant JIRA issues.
> >>
> >> With the known issues in 0.3.0-incubating, such as trouble running
> >> examples
> >> out-of-the-box, I think this release candidate is a clear win. Of
> course,
> >> that may change if more issues are discovered.
> >>
> >> For me, this release candidate is +1 (at this time), contingent upon no
> >> known major issues affecting Apex, Flink and Spark runners.
> >>
> >> Davor
> >>
> >> [1] https://issues.apache.org/jira/browse/INFRA-13086
> >> [2] https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/5994/
> >> [3] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2116/
> >> [4] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Apex/10/
> >> [5] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Flink/1120/
> >> [6] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Spark/430/
> >> [7] https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1830/
> >>
> >>
> >> On Sat, Dec 17, 2016 at 4:13 PM, Kenneth Knowles  >
> >> wrote:
> >>
> 

Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-07 Thread Aljoscha Krettek
+1

I've seen this mistake myself in some PRs.

On Thu, 8 Dec 2016 at 06:10 Ben Chambers 
wrote:

> +1 -- This seems like the best option. It's a mechanical change, and the
> compiler will let users know it needs to be made. It will make the mistake
> much less common, and when it occurs it will be much clearer what is wrong.
>
> It would be great if we could make the misuse a compiler problem or a
> pipeline construction time error without changing the names, but neither of
> these options is practical. We can't hide the expansion method, since
> it is what PTransform implementations need to override. We can't make this
> a construction time exception since it would require adding code to every
> PTransform implementation.
>
> On Wed, Dec 7, 2016 at 1:55 PM Thomas Groh 
> wrote:
>
> > +1; This is probably the best way to make sure users don't reverse the
> > polarity of the PCollection flow.
> >
> > This also brings PInput.expand(), POutput.expand(), and
> > PTransform.expand(PInput) into line - namely, for some composite thing,
> > "represent yourself as some collection of primitives" (potentially
> > recursively).
> >
> > On Wed, Dec 7, 2016 at 1:37 PM, Kenneth Knowles 
> > wrote:
> >
> > > Hi all,
> > >
> > > I want to bring up another major backwards-incompatible change before
> it
> > is
> > > too late, to resolve [BEAM-438].
> > >
> > > Summary: Leave PInput.apply the same but rename PTransform.apply to
> > > PTransform.expand. I have opened [PR #1538] just for reference (it took
> > 30
> > > seconds using IDE automated refactor)
> > >
> > > This change affects *PTransform authors* but does *not* affect pipeline
> > > authors.
> > >
> > > This issue was filed a long time ago. It has been a problem many times
> > with
> > > actual users since before Beam started incubating. This is what goes
> > wrong
> > > (often):
> > >
> > >PCollection input = ...
> > >PTransform transform = ...
> > >
> > >transform.apply(input)
> > >
> > > This type checks and even looks perfectly normal. Do you see the error?
> > >
> > > ... what we need the user to write is:
> > >
> > > input.apply(transform)
> > >
> > > What a confusing difference! After all, the first one type-checks and
> the
> > > first one is how you apply a Function or Predicate or
> > SerializableFunction,
> > > etc. But it is broken. With transform.apply(input) the transform is not
> > > registered with the pipeline at all.
> > >
> > > We obviously can't (and don't want to) change the most core way that
> > > pipeline authors use Beam, so PInput.apply (aka PCollection.apply) must
> > > remain the same. But we do need a way to make it impossible to mix
> these
> > > up.
> > >
> > > The simplest way I can think of is to choose a new name for the other
> > > method involved. Users probably won't write transform.expand(input)
> since
> > > they will never have seen it in any examples, etc. This will just make
> > > PTransform authors need to do a global rename, and the type system will
> > > direct them to all cases so there is no silent failure possible.
> > >
> > > What do you think?
> > >
> > > Kenn
> > >
> > > [BEAM-438] https://issues.apache.org/jira/browse/BEAM-438
> > > [PR #1538] https://github.com/apache/incubator-beam/pull/1538
> > >
> > > p.s. there is a really amusing and confusing call chain:
> > PCollection.apply
> > > -> Pipeline.applyTransform -> Pipeline.applyInternal ->
> > > PipelineRunner.apply -> PTransform.apply
> > >
> > > After this change and work to get the runner out of the loop, it
> becomes
> > > PCollection.apply -> Pipeline.applyTransform -> PTransform.expand
> > >
> >
>
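
A minimal sketch of what the proposed rename would look like for a PTransform author (the
CountWords composite below is a hypothetical example, not from PR #1538): the composite
overrides expand() instead of apply(), while pipeline authors keep writing input.apply(transform).

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Composite transform: counts how often each distinct input string occurs.
    public class CountWords
        extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {

      @Override
      public PCollection<KV<String, Long>> expand(PCollection<String> words) {
        // The body of a composite is still built by applying other transforms to the input.
        return words.apply(Count.<String>perElement());
      }
    }

Writing transform.expand(input) by accident would now stand out immediately in review, and the
compiler points PTransform authors at every override that needs the mechanical rename.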


Re: Flink runner. Optimization for sideOutput with tags

2016-12-06 Thread Aljoscha Krettek
I'm having a look at your PRs now. I think the change is good, and it's
actually quite simple too.

Thanks for looking into this!

On Mon, 5 Dec 2016 at 05:48 Alexey Demin <diomi...@gmail.com> wrote:

> Aljoscha
>
> I was mistaken about the Flink runtime =)
>
> What do you think about this modification to FlinkStreamingTransformTranslators:
>
> move split out of for-loop:
>
> SplitStream<RawUnionValue> splitStream = unionOutputStream.split(new
> OutputSelector<RawUnionValue>() {
> @Override
> public Iterable<String> select(RawUnionValue value) {
>   return Lists.newArrayList(String.valueOf(value.getUnionTag()));
> }
>   });
>
> and change filtered to
>
> DataStream<Object> filtered = splitStream.select(String.valueOf(outputTag))
>                 .flatMap(new FlatMapFunction<RawUnionValue,
> Object>() {
> @Override
> public void flatMap(RawUnionValue value,
> Collector<Object> out) throws Exception {
> out.collect(value.getValue());
> }
>   }).returns(outputTypeInfo);
>
> With this implementation we transfer data only to the outputs that need it,
> instead of broadcasting every value to all outputs.
>
> p.s. I know this code is not production ready; it's only an idea for discussion.
> But for people who use side outputs only for alerting it can reduce CPU
> usage (serialization is applied only to the targeted value, not to all
> elements for every output).
>
> Thanks,
> Alexey Diomin
>
>
> 2016-12-04 23:57 GMT+04:00 Alexey Demin <diomi...@gmail.com>:
>
> > Hi
> >
> > very simple example
> > https://gist.github.com/xhumanoid/287af191314d5d867acf509129bd4931
> >
> > Sometimes we need meta-information about the processed element
> >
> > If I understood the code in FlinkStreamingTransformTranslators (line 557)
> > correctly, the main problem is not in the translators but in the Flink
> > runtime, which doesn't know about the tags and simply broadcasts when a
> > transformation has two outputs.
> >
> > Correct me if I'm mistaken
> >
> >
> > >> this is a bit of a dangerous setting
> >
> > I know about the dangers of object-reuse, but we never use an object after
> > collect.
> > In some cases we need more performance, and serialization on every
> > transformation is very expensive,
> > but merging all the business logic into one DoFn would make the processing
> > unmaintainable.
> >
> > >> I think your stack trace is not complete, at least I can't seem to see
> > the root exception.
> >
> > We took this stack trace on a live system with jstack. It's not an exception.
> >
> > Thanks,
> > Alexey Diomin
> >
> >
> > 2016-11-29 21:33 GMT+04:00 Aljoscha Krettek <aljos...@apache.org>:
> >
> >> Hi Alexey,
> >> I think it should be possible to optimise this particular transformation
> >> by
> >> using a split/select pattern in Flink. (See split and select here:
> >> https://ci.apache.org/projects/flink/flink-docs-release-1.2/
> >> dev/datastream_api.html#datastream-transformations).
> >> The current implementation is not very optimised, my main goal was to
> make
> >> all features of Beam work before going into individual optimisations.
> >>
> >> About object-reuse in Flink Streaming: this is a bit of a dangerous
> >> setting
> >> and can lead to unexpected results with certain patterns. I think your
> >> stack trace is not complete, at least I can't seem to see the root
> >> exception.
> >>
> >> Cheers,
> >> Aljoscha
> >>
> >> On Mon, 28 Nov 2016 at 07:33 Alexey Demin <diomi...@gmail.com> wrote:
> >>
> >> > Hi
> >> >
> >> > If we try use sideOutput with TupleTag and flink config
> >> enableObjectReuse
> >> > then we have stacktrace
> >> >
> >> > at
> >> >
> >> > org.apache.beam.sdk.transforms.DoFnAdapters$SimpleDoFnAdapte
> >> r.processElement(DoFnAdapters.java:234)
> >> > at
> >> >
> >> > org.apache.beam.runners.core.SimpleOldDoFnRunner.invokeProce
> >> ssElement(SimpleOldDoFnRunner.java:118)
> >> > at
> >> >
> >> > org.apache.beam.runners.core.SimpleOldDoFnRunner.processElem
> >> ent(SimpleOldDoFnRunner.java:104)
> >> > at
> >> >
> >> > org.apache.beam.runners.core.PushbackSideInputDoFn

Re: Increase stream parallelism after reading from UnboundedSource

2016-12-05 Thread Aljoscha Krettek
Hi,
I can only speak for Flink: there you usually fan out/parallelise the
stream after a non-parallel source.

Cheers,
Aljoscha

On Mon, 5 Dec 2016 at 15:48 Amit Sela  wrote:

> Hi all,
>
> I have a general question about how stream-processing frameworks/engines
> usually behave in the following scenario:
>
> Say I have a Pipeline that consumes from 1 Kafka partition, so that my
> initial (optimal) parallelism is 1 as well.
>
> For any downstream computation, is it common for stream processors to
> "fan-out/parallelise" the stream by shuffling the data into more
> streams/partitions/bundles ?
>
> Thanks,
> Amit
>
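
For Flink specifically, a minimal sketch of the fan-out Aljoscha describes (the socket source,
host, and port are placeholders, not from the thread): run the non-parallel source as-is and
rebalance() before the downstream operators so they run at the configured parallelism.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FanOutAfterSource {
      public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4); // desired downstream parallelism

        DataStream<String> lines = env
            .socketTextStream("localhost", 9999) // stand-in for a 1-partition source; runs non-parallel
            .rebalance();                        // round-robin elements across the 4 downstream subtasks

        lines
            .map(String::toUpperCase) // this operator now runs with parallelism 4
            .print();

        env.execute("fan-out after a non-parallel source");
      }
    }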


Re: How to create a Pipeline with Cycles

2016-11-30 Thread Aljoscha Krettek
Don't worry, we're all constantly learning. :-)

On Wed, 30 Nov 2016 at 12:00 Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello, thanks for clarifying, Aljoscha, you are absolutely right; sorry
> for my imprecision. I should do my homework and work more with Flink :).
>
> For ref, the iteration API:
>
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#iterations
>
> And the implementation of the benchmark that effectively does not use
> iterations:
>
>
> https://github.com/yahoo/streaming-benchmarks/blob/master/flink-benchmarks/src/main/java/flink/benchmark/AdvertisingTopologyNative.java
>
> My excuses again,
> Ismael
>
> On Wed, Nov 30, 2016 at 11:30 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> > there is support for cycles in Flink but the Yahoo benchmark is not
> making
> > use of that feature, if I'm not completely mistaken.
> >
> > Cheers,
> > Aljoscha
> >
> > On Wed, 30 Nov 2016 at 09:57 Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > Shen you should probably first check the benchmark implementation at
> > > github, I am not sure you need cycles to implement the yahoo benchmark.
> > > Notice for example that there is a flink based implementation and AFAIK
> > > there is no support for cycles in Flink (or at least there wasn't at
> the
> > > moment they published the benchmarks).
> > >
> > > However if you are implementing the yahoo benchmark on Beam, that would
> > be
> > > a nice scenario to test the runner performance (vs the native
> > > implementations), so it would be nice if you can share this.
> > >
> > > Regards,
> > > Ismael
> > >
> > > On Tue, Nov 29, 2016 at 7:32 PM, Shen LI <cs.she...@gmail.com> wrote:
> > >
> > > > Hi Maria, Bobby,
> > > >
> > > > Thanks for the explanation.
> > > >
> > > > Regards,
> > > >
> > > > Shen
> > > >
> > > > On Tue, Nov 29, 2016 at 12:37 PM, Bobby Evans
> > > <ev...@yahoo-inc.com.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > In my experience almost all of the time cycles are bad and cause a
> > lot
> > > of
> > > > > debugging problems. Most of the time you can implement what you
> want
> > by
> > > > > using a windowed join or group by instead.
> > > > > - Bobby
> > > > >
> > > > > On Tuesday, November 29, 2016, 11:06:44 AM CST, María García
> Herrero
> > > > > <mari...@google.com.INVALID> wrote:Hi Shen,
> > > > >
> > > > > No. Beam pipelines are DAGs:
> > > > > http://beam.incubator.apache.org/documentation/sdks/
> > > > > javadoc/0.3.0-incubating/org/apache/beam/sdk/Pipeline.html
> > > > > Best,
> > > > >
> > > > > María
> > > > >
> > > > > On Tue, Nov 29, 2016 at 7:44 AM, Shen LI <cs.she...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can I use Beam to create a pipeline with cycles? For example, to
> > > > > implement
> > > > > > the Yahoo! Streaming benchmark(
> > > > > >
> > > https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-
> > > > > > computation-engines-at),
> > > > > > can a ParDo transform consume a downstream output as a side
> input?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Shen
> > > > > >
> > > > >
> > > >
> > >
> >
>
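
A minimal Beam sketch of the windowed group-by alternative Bobby describes (the element types
and window size are made up for illustration): instead of feeding results back into an upstream
stage, window the stream and aggregate per key within each window.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class WindowedAggregationInsteadOfCycle {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Stand-in for an unbounded stream of (campaignId, count) events, e.g. from KafkaIO.
        PCollection<KV<String, Long>> events = p.apply(
            Create.of(KV.of("campaign-1", 1L), KV.of("campaign-2", 1L), KV.of("campaign-1", 1L)));

        // Aggregate per key within 10-second windows rather than looping results back upstream.
        events
            .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardSeconds(10))))
            .apply(Sum.<String>longsPerKey());

        p.run();
      }
    }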


Re: How to create a Pipeline with Cycles

2016-11-30 Thread Aljoscha Krettek
Hi,
there is support for cycles in Flink but the Yahoo benchmark is not making
use of that feature, if I'm not completely mistaken.

Cheers,
Aljoscha

On Wed, 30 Nov 2016 at 09:57 Ismaël Mejía  wrote:

> Hello,
>
> Shen you should probably first check the benchmark implementation at
> github, I am not sure you need cycles to implement the yahoo benchmark.
> Notice for example that there is a flink based implementation and AFAIK
> there is no support for cycles in Flink (or at least there wasn't at the
> moment they published the benchmarks).
>
> However if you are implementing the yahoo benchmark on Beam, that would be
> a nice scenario to test the runner performance (vs the native
> implementations), so it would be nice if you can share this.
>
> Regards,
> Ismael
>
> On Tue, Nov 29, 2016 at 7:32 PM, Shen LI  wrote:
>
> > Hi Maria, Bobby,
> >
> > Thanks for the explanation.
> >
> > Regards,
> >
> > Shen
> >
> > On Tue, Nov 29, 2016 at 12:37 PM, Bobby Evans
>  > >
> > wrote:
> >
> > > In my experience almost all of the time cycles are bad and cause a lot
> of
> > > debugging problems. Most of the time you can implement what you want by
> > > using a windowed join or group by instead.
> > > - Bobby
> > >
> > > On Tuesday, November 29, 2016, 11:06:44 AM CST, María García Herrero
> > >  wrote:Hi Shen,
> > >
> > > No. Beam pipelines are DAGs:
> > > http://beam.incubator.apache.org/documentation/sdks/
> > > javadoc/0.3.0-incubating/org/apache/beam/sdk/Pipeline.html
> > > Best,
> > >
> > > María
> > >
> > > On Tue, Nov 29, 2016 at 7:44 AM, Shen LI  wrote:
> > >
> > > > Hi,
> > > >
> > > > Can I use Beam to create a pipeline with cycles? For example, to
> > > implement
> > > > the Yahoo! Streaming benchmark(
> > > >
> https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-
> > > > computation-engines-at),
> > > > can a ParDo transform consume a downstream output as a side input?
> > > >
> > > > Thanks,
> > > >
> > > > Shen
> > > >
> > >
> >
>


Re: Meet up at Strata+Hadoop World in Singapore

2016-11-29 Thread Aljoscha Krettek
Hi,
I'll also be there to give a talk (and also at the Beam tutorial).

Cheers,
Aljoscha

On Wed, Nov 30, 2016, 00:51 Dan Halperin  wrote:

> Hey folks,
>
> Who will be attending Strata+Hadoop World next week in Singapore? Tyler and
> I will be there, giving a Beam tutorial [0] and some talks [2,3].
>
> I'd love to sync in person with anyone who wants to talk Beam. Please reach
> out to me directly if you'd like to meet.
>
> Thanks!
> Dan
>
> [0]
>
> http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54331
> [1]
>
> http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54343
> [2]
>
> http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54325
> (Slava Chernyak, our Google colleague)
>


Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Aljoscha Krettek
+1

I'm quite enthusiastic about the growth of the community and the open
discussions!

On Tue, 22 Nov 2016 at 19:51 Jason Kuster 
wrote:

> An enthusiastic +1!
>
> In particular it's been really great to see the commitment and interest of
> the community in different kinds of testing. Between what we currently have
> on Jenkins and Travis and the in-progress work on IO integration tests and
> performance tests (plus, I'm sure, other things I'm not aware of) we're in
> a really good place.
>
> On Tue, Nov 22, 2016 at 10:49 AM, Amit Sela  wrote:
>
> > +1, super exciting!
> >
> > Thanks to JB, Davor and the whole team for creating this community. I
> think
> > we've achieved a lot in a short time.
> >
> > Amit.
> >
> > On Tue, Nov 22, 2016, 20:36 Tyler Akidau 
> > wrote:
> >
> > > +1, thanks to everyone who's invested time getting us to this point.
> :-)
> > >
> > > -Tyler
> > >
> > > On Tue, Nov 22, 2016 at 10:33 AM Jean-Baptiste Onofré  >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > First of all, I would like to thank the whole team, and especially
> > Davor
> > > > for the great work and commitment to Apache and the community.
> > > >
> > > > Of course, a big +1 to move forward on graduation !
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 11/22/2016 07:19 PM, Davor Bonaci wrote:
> > > > > Hi everyone,
> > > > > With all the progress we’ve had recently in Apache Beam, I think it
> > is
> > > > time
> > > > > we start the discussion about graduation as a new top-level project
> > at
> > > > the
> > > > > Apache Software Foundation.
> > > > >
> > > > > Graduation means we are a self-sustaining and self-governing
> > community,
> > > > and
> > > > > ready to be a full participant in the Apache Software Foundation.
> It
> > > does
> > > > > not imply that our community growth is complete or that a
> particular
> > > > level
> > > > > of technical maturity has been reached, rather that we are on a
> solid
> > > > > trajectory in those areas. After graduation, we will still
> > periodically
> > > > > report to, and be overseen by, the ASF Board to ensure continued
> > growth
> > > > of
> > > > > a healthy community.
> > > > >
> > > > > Graduation is an important milestone for the project. It is also
> key
> > to
> > > > > further grow the user community: many users (incorrectly) see
> > > incubation
> > > > as
> > > > > a sign of instability and are much less likely to consider us for a
> > > > > production use.
> > > > >
> > > > > A way to think about graduation readiness is through the Apache
> > > Maturity
> > > > > Model [1]. I think we clearly satisfy all the requirements [2]. It
> is
> > > > > probably worth emphasizing the recent community growth: over each
> of
> > > the
> > > > > past three months, no single organization contributing to Beam has
> > had
> > > > more
> > > > > than ~50% of the unique contributors per month [2, see
> assumptions].
> > > > That’s
> > > > > a great statistic that shows how much we’ve grown our diversity!
> > > > >
> > > > > Process-wise, graduation consists of drafting a board resolution,
> > which
> > > > > needs to identify the full Project Management Committee, and
> getting
> > it
> > > > > approved by the community, the Incubator, and the Board. Within the
> > > Beam
> > > > > community, most of these discussions and votes have to be on the
> > > private@
> > > > > mailing list, but, as usual, we’ll try to keep dev@ updated as
> much
> > as
> > > > > possible.
> > > > >
> > > > > With that in mind, let’s use this discussion on dev@ for two
> things:
> > > > > * Collect additional data points on our progress that we may want
> to
> > > > > present to the Incubator as a part of the proposal to accept our
> > > > graduation.
> > > > > * Determine whether the community supports graduation. Please reply
> > > +1/-1
> > > > > with any additional comments, as appropriate. I’d encourage
> everyone
> > to
> > > > > participate -- regardless whether you are an occasional visitor or
> > > have a
> > > > > specific role in the project -- we’d love to hear your perspective.
> > > > >
> > > > > Data points so far:
> > > > > * Project’s maturity self-assessment [2].
> > > > > * 1500 pull requests in incubation, which makes us one of the most
> > > active
> > > > > project across all of ASF on this metric.
> > > > > * 3 releases, each driven by a different release manager.
> > > > > * 120+ individual contributors.
> > > > > * 3 new committers added, 2 of which aren’t from the largest
> > > > organization.
> > > > > * 1027 issues created, 515 resolved.
> > > > > * 442 dev@ emails in October alone, sent by 51 individuals.
> > > > > * 50 user@ emails in the last 30 days, sent by 22 individuals.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Davor
> > > > >
> > > > > [1] http://community.apache.org/apache-way/apache-project-
> > > > > maturity-model.html
> > > > > [2] 

Re: Flink runner. Wrapper for DoFn

2016-11-21 Thread Aljoscha Krettek
I think we should change the Flink runner to always enable checkpointing
and have a default checkpointing interval. But this has some tricky
implications, so I'm not yet sure whether it will be possible.

On Sun, 20 Nov 2016 at 13:43 Alexey Demin <diomi...@gmail.com> wrote:

Hi, Aljoscha

I'm back with a question about the wrapper =)

kafka1 -> transformer1 -> kafka2

Load data from kafka1, split it into 10+ events and push the results to kafka2

The processing uses PushbackSideInputDoFnRunner and chaining,

but the pushback path uses the streaming wrapper DoFnOperator, which invokes finishBundle
on every element, and as a result I have:

1) load from kafka
2) parsing and invoke context.collect(element1)
3) chaining to kafka
4) finishBundle kafka => kafka.flush()
5) context.collect(element2)
6) chaining to kafka
7) finishBundle kafka => kafka.flush()
...
etc

Do you have an idea how I can prevent flush() on every element? It's now a
bottleneck for me.

Thanks,
Alexey Diomin


2016-11-19 11:59 GMT+04:00 Aljoscha Krettek <aljos...@apache.org>:

@amir, what do you mean? Naming a ParDo "startBundle" is not the same thing
as having a @StartBundle or startBundle() (for OldDoFn) method in your
ParDo.

On Sat, 19 Nov 2016 at 00:22 amir bahmanyari <amirto...@yahoo.com.invalid>
wrote:

> Interesting. I have been including "startBundle" in KafkaIO() thus
> far.What could happen as far as Flink cluster performance in the
> following?Thanks Aljoscha.
> PCollection<KV<String, String>> kafkarecords = p
>
.apply(KafkaIO.read().withBootstrapServers("kafka01:9092").withTopics(topics)
> .withValueCoder(StringUtf8Coder.of()).withoutMetadata())
> .apply("startBundle", ParDo.of( new DoFn<KV<byte[], String>, KV<String,
> String>>() {
> Amir-
>
>   From: Aljoscha Krettek <aljos...@apache.org>
>  To: amir bahmanyari <amirto...@yahoo.com>; Eugene Kirpichov <
> kirpic...@google.com>; "dev@beam.incubator.apache.org" <
> dev@beam.incubator.apache.org>
>  Sent: Friday, November 18, 2016 2:54 PM
>  Subject: Re: Flink runner. Wrapper for DoFn
>
> Regarding the Flink runner and how it calls startBundle()/finishBundle():
> it's currently done like this because it is correct and because there is
no
> other "natural" point where it could be called. Flink continuously
> processes elements and at some (user defined) interval performs
checkpoints
> to persist state. We could call startBundle()/finishBundle() when this
> happens but I chose not to (for the time being) because this could lead to
> problems if the user sets a rather large interval. Users can even disable
> checkpointing, in which case we would never call these methods.
>
> --
> Aljoscha
>
> On Fri, 18 Nov 2016 at 22:17 amir bahmanyari <amirto...@yahoo.com.invalid>
> wrote:
>
> > Oops! sorry :-) Thanks Eugene ...
> >
> >  From: Eugene Kirpichov <kirpic...@google.com>
> >  To: dev@beam.incubator.apache.org; amir bahmanyari <amirto...@yahoo.com
> >
> >  Sent: Friday, November 18, 2016 1:09 PM
> >  Subject: Re: Flink runner. Wrapper for DoFn
> >
> > Amir - @Setup is a regular Java annotation; class names in Java
> (including
> > names of annotation classes), like all other names, are case-sensitive.
> >
> > On Fri, Nov 18, 2016 at 12:54 PM amir bahmanyari
> > <amirto...@yahoo.com.invalid> wrote:
> >
> > Thanks Alexey.I just fired up the whole thing. With @Setup. BTW, does it
> > matter if lowercase @setup or @Setup?I hope not. :-))Will update you
when
> > its done and share my observations.Cheers+have a great weekend.Amir-
> >
> >  From: Alexey Demin <diomi...@gmail.com>
> >  To: dev@beam.incubator.apache.org; amir bahmanyari <amirto...@yahoo.com
> >
> >  Sent: Friday, November 18, 2016 12:38 PM
> >  Subject: Re: Flink runner. Wrapper for DoFn
> >
> > In my case it's:
> > 1) i don't rebuild index by filters every time, only one time on start
> > processing
> > 2) connection for remote db does not open hundreds times in second
> >
> > as result all pipeline work more stable and faster
> >
> > 2016-11-19 0:06 GMT+04:00 amir bahmanyari <amirto...@yahoo.com.invalid>:
> >
> > > Hi Alexey,What improvements do you expect by replacing @StartBundle
> > > with @Setup?I am going to give it a try & see what diff it
> > > makes.Interesting & thanks for bringing it up...
> > > Cheers
> > >
> > >  From: Demin Alexey <diomi...@gmail.com>
> > >  To: dev@beam.incubator.apache.org
> > >  Sent: Friday, November 18, 2016 11:12 AM
> > >  Subject: Re: Fl
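
A minimal sketch of the lifecycle split Eugene recommends in this thread (the buffering is faked
with a list and System.out so the example stays self-contained; a real sink would hold a client
connection instead): expensive one-time work goes in @Setup/@Teardown, and per-bundle flushing in
@FinishBundle, so a runner that finishes bundles very often only pays for the flush, not for
reconnecting.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    class WriteToSink extends DoFn<String, Void> {

      private transient List<String> buffer; // stands in for a connection/batch held by a real client

      @Setup
      public void setup() {
        // One-time work (open connections, build indexes) belongs here, not in @StartBundle:
        // the same DoFn instance is reused across many bundles.
        buffer = new ArrayList<>();
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        buffer.add(c.element()); // cheap per-element work
      }

      @FinishBundle
      public void finishBundle(Context c) {
        // Flush once per bundle. If the runner finishes a bundle per element (as the Flink
        // wrapper discussed above currently does), this is the cost paid for every element.
        System.out.println("flushing " + buffer.size() + " elements");
        buffer.clear();
      }

      @Teardown
      public void teardown() {
        buffer = null; // a real sink would close its connection here
      }
    }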

Re: Hosting data stores for IO Transform testing

2016-11-21 Thread Aljoscha Krettek
Hi Stephen,
I really like your proposal! I don't have any comments because this seems
very well "researched" already.

I'm hoping others will also have a look at this as well because "real"
integration testing provides a new level of confidence in the code, IMHO.

Cheers,
Aljoscha


On Wed, 16 Nov 2016 at 23:36 Stephen Sisk  wrote:

> Hi everyone!
>
> Currently we have a good set of unit tests for our IO Transforms - those
> tend to run against in-memory versions of the data stores. However, we'd
> like to further increase our test coverage to include running them against
> real instances of the data stores that the IO Transforms work against (e.g.
> cassandra, mongodb, kafka, etc…), which means we'll need to have real
> instances of various data stores.
>
> Additionally, if we want to do performance regression detection, it's
> important to have instances of the services that behave realistically,
> which isn't true of in-memory or dev versions of the services.
>
>
> Proposed solution
> -
> If we accept this proposal, we would create an infrastructure for running
> real instances of data stores inside of containers, using container
> management software like mesos/marathon, kubernetes, docker swarm, etc… to
> manage the instances.
>
> This would enable us to build integration tests that run against those real
> instances and performance tests that run against those real instances (like
> those that Jason Kuster is proposing elsewhere.)
>
>
> Why do we need one centralized set of instances vs just having various
> people host their own instances?
> -
> Reducing flakiness of tests is key. By not having dependencies from the
> core project on external services/instances of data stores we have
> guaranteed access to the services and the group can fix issues that arise.
>
> An exception would be something that has an ops team supporting it (eg,
> AWS, Google Cloud or other professionally managed service) - those we trust
> will be stable.
>
>
> There may be a lot of different data stores needed - how will we maintain
> them?
> -
> It will take work above and beyond that of a normal set of unit tests to
> build and maintain integration/performance tests & their data store
> instances.
>
> Setup & maintenance of the data store containers and data store instances
> on it must be automated. It also has to be as simple of a setup as
> possible, and we should avoid hand tweaking the containers - expecting
> checked in scripts/dockerfiles is key.
>
> Aligned with the community ownership approach of Apache, as members of the
> community are excited to contribute & maintain those tests and the
> integration/performance tests, people will be able to step up and do that.
> If there is no longer support for maintaining a particular set of
> integration & performance tests and their data store instances, then we can
> disable those tests. We may document on the website what IO Transforms have
> current integration/performance tests so users know what level of testing
> the various IO Transforms have.
>
>
> What about requirements for the container management software itself?
> -
> * We should have the data store instances themselves in Docker. Docker
> allows new instances to be spun up in a quick, reproducible way and is
> fairly platform independent. It has wide support from a variety of
> different container management services.
> * As little admin work required as possible. Crashing instances should be
> restarted, setup should be simple, everything possible should be
> scripted/scriptable.
> * Logs and test output should be on a publicly available website, without
> needing to log into test execution machine. Centralized capture of
> monitoring info/logs from instances running in the containers would support
> this. Ideally, this would just be supported by the container software out
> of the box.
> * It'd be useful to have good persistent volume in the container management
> software so that databases don't have to reload large data sets every time.
> * The containers may be a place to execute runners themselves if we need
> larger runner instances, so it should play well with Spark, Flink, etc…
>
> As I discussed earlier on the mailing list, it looks like hosting docker
> containers on kubernetes, docker swarm or mesos+marathon would be a good
> solution.
>
> Thanks,
> Stephen Sisk
>


Re: Flink runner. Wrapper for DoFn

2016-11-19 Thread Aljoscha Krettek
@amir, what do you mean? Naming a ParDo "startBundle" is not the same thing
as having a @StartBundle or startBundle() (for OldDoFn) method in your
ParDo.

On Sat, 19 Nov 2016 at 00:22 amir bahmanyari <amirto...@yahoo.com.invalid>
wrote:

> Interesting. I have been including "startBundle" in KafkaIO() thus
> far.What could happen as far as Flink cluster performance in the
> following?Thanks Aljoscha.
> PCollection<KV<String, String>> kafkarecords = p
> .apply(KafkaIO.read().withBootstrapServers("kafka01:9092").withTopics(topics)
> .withValueCoder(StringUtf8Coder.of()).withoutMetadata())
> .apply("startBundle", ParDo.of( new DoFn<KV<byte[], String>, KV<String,
> String>>() {
> Amir-
>
>   From: Aljoscha Krettek <aljos...@apache.org>
>  To: amir bahmanyari <amirto...@yahoo.com>; Eugene Kirpichov <
> kirpic...@google.com>; "dev@beam.incubator.apache.org" <
> dev@beam.incubator.apache.org>
>  Sent: Friday, November 18, 2016 2:54 PM
>  Subject: Re: Flink runner. Wrapper for DoFn
>
> Regarding the Flink runner and how it calls startBundle()/finishBundle():
> it's currently done like this because it is correct and because there is no
> other "natural" point where it could be called. Flink continuously
> processes elements and at some (user defined) interval performs checkpoints
> to persist state. We could call startBundle()/finishBundle() when this
> happens but I chose not to (for the time being) because this could lead to
> problems if the user sets a rather large interval. Users can even disable
> checkpointing, in which case we would never call these methods.
>
> --
> Aljoscha
>
> On Fri, 18 Nov 2016 at 22:17 amir bahmanyari <amirto...@yahoo.com.invalid>
> wrote:
>
> > Oops! sorry :-) Thanks Eugene ...
> >
> >  From: Eugene Kirpichov <kirpic...@google.com>
> >  To: dev@beam.incubator.apache.org; amir bahmanyari <amirto...@yahoo.com
> >
> >  Sent: Friday, November 18, 2016 1:09 PM
> >  Subject: Re: Flink runner. Wrapper for DoFn
> >
> > Amir - @Setup is a regular Java annotation; class names in Java
> (including
> > names of annotation classes), like all other names, are case-sensitive.
> >
> > On Fri, Nov 18, 2016 at 12:54 PM amir bahmanyari
> > <amirto...@yahoo.com.invalid> wrote:
> >
> > Thanks Alexey.I just fired up the whole thing. With @Setup. BTW, does it
> > matter if lowercase @setup or @Setup?I hope not. :-))Will update you when
> > its done and share my observations.Cheers+have a great weekend.Amir-
> >
> >  From: Alexey Demin <diomi...@gmail.com>
> >  To: dev@beam.incubator.apache.org; amir bahmanyari <amirto...@yahoo.com
> >
> >  Sent: Friday, November 18, 2016 12:38 PM
> >  Subject: Re: Flink runner. Wrapper for DoFn
> >
> > In my case it's:
> > 1) i don't rebuild index by filters every time, only one time on start
> > processing
> > 2) connection for remote db does not open hundreds times in second
> >
> > as result all pipeline work more stable and faster
> >
> > 2016-11-19 0:06 GMT+04:00 amir bahmanyari <amirto...@yahoo.com.invalid>:
> >
> > > Hi Alexey,What improvements do you expect by replacing @StartBundle
> > > with @Setup?I am going to give it a try & see what diff it
> > > makes.Interesting & thanks for bringing it up...
> > > Cheers
> > >
> > >  From: Demin Alexey <diomi...@gmail.com>
> > >  To: dev@beam.incubator.apache.org
> > >  Sent: Friday, November 18, 2016 11:12 AM
> > >  Subject: Re: Flink runner. Wrapper for DoFn
> > >
> > > Oh, this is my mistake
> > >
> > > Yes correct way its use @Setup.
> > >
> > > Thank you Eugene.
> > >
> > >
> > > 2016-11-18 22:54 GMT+04:00 Eugene Kirpichov
> <kirpic...@google.com.invalid
> > >
> > > :
> > >
> > > > Hi Alexey,
> > > >
> > > > In general, things like establishing connections and initializing
> > caches
> > > > are better done in @Setup and @TearDown methods, rather than
> > @StartBundle
> > > > and @FinishBundle, because DoFn's can be reused between bundles and
> > this
> > > > way you get more benefit from reuse.
> > > >
> > > > Bundles can be pretty small, especially in streaming pipelines. That
> > > said,
> > > > they normally shouldn't be 1-element-small. Hopefully someone working
> > on
> > > > the Flink runner

Fwd: Jenkins build became unstable: beam_PostCommit_RunnableOnService_FlinkLocal #813

2016-11-11 Thread Aljoscha Krettek
This looks like it was introduced by the new commit mentioned here but it's
actually caused by a pre-existing (unknown) bug in the Flink runner:
https://issues.apache.org/jira/browse/BEAM-965.

I also have a fix ready.

Btw, how should we deal with these messages from Jenkins? I'm writing to
the dev list so that no-one else spends energy looking into this.

-- Forwarded message -
From: Apache Jenkins Server 
Date: Fri, 11 Nov 2016 at 20:17
Subject: Jenkins build became unstable:
beam_PostCommit_RunnableOnService_FlinkLocal #813
To: , 


See <
https://builds.apache.org/job/beam_PostCommit_RunnableOnService_FlinkLocal/813/changes
>


[VOTE] Apache Beam release 0.3.0-incubating

2016-10-28 Thread Aljoscha Krettek
Hi everyone,
Please review and vote on the release candidate #1 for the Apache Beam
version 0.3.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)


The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.3.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The Apache Beam community has unanimously approved this release [6].

As customary, the vote will be open for at least 72 hours. It is adopted by
a majority approval with at least three PMC affirmative votes. If approved,
we will proceed with the release.

Thanks!

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
[3]
https://repository.apache.org/content/repositories/staging/org/apache/beam/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=5d86ff7f04862444c266142b0d5acecb5a6b7144
[5] https://github.com/apache/incubator-beam-site/pull/52
[6]
https://lists.apache.org/thread.html/b3736acb5edcea247a5a6a64c09ecacab794461bf1ea628152faeb82@%3Cdev.beam.apache.org%3E


[RESULT] [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-28 Thread Aljoscha Krettek
I'm happy to announce that we have unanimously approved this release.

There are 10 approving votes 6 of which are binding:
 * Jean-Baptiste (binding)
 * Ismaël
 * Sergio
 * Seetharam
 * Davor (binding)
 * Dan (binding)
 * Kenneth (binding)
 * Maximilian (binding)
 * Amit (binding)
 * Neelesh

There are no disapproving votes.

As Sergio pointed out, the PPMC votes are not really binding while we are
still in incubator, only the IPMC votes will be. I'm nevertheless listing
them here as binding because we also did it on past votes.

I'll now go ahead and post this result to the IPMC for final voting.

Thanks everyone!

On Fri, 28 Oct 2016 at 09:09 Aljoscha Krettek <aljos...@apache.org> wrote:

> The voting time has elapsed. I'm hereby closing this vote and will tally
> the results in a separate thread.
>
> On Thu, 27 Oct 2016 at 17:38 Neelesh Salian <nsal...@cloudera.com> wrote:
>
> +1 (non-binding)
> Thank you for putting this together
>
>
> On Thu, Oct 27, 2016 at 12:00 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > No problem for the vote.
> >
> > For graduation, we are already thinking about it yes.
> >
> > Regards
> > JB
> >
> > ⁣​
> >
> > On Oct 27, 2016, 08:54, at 08:54, "Sergio Fernández" <wik...@apache.org>
> > wrote:
> > >Hi JB,
> > >
> > >On Tue, Oct 25, 2016 at 12:00 PM, Jean-Baptiste Onofré
> > ><j...@nanthrax.net>
> > >wrote:
> > >
> > >> Thanks Sergio ;)
> > >>
> > >
> > >You are welcome.
> > >
> > >
> > >> Just tried to explain to the others what is a binding vote ;)
> > >>
> > >
> > >It's a common mistake in many podlings that PPMC members think they
> > >have
> > >binding votes over developers who are not part of the project. But
> > >during
> > >incubation only IPMC are binding votes. I hope that's clear.
> > >
> > >In theory it's simple. So sorry if I've made some noise with that. I'll
> > >repeat my vote later at general@incubator if you prefer it in that way.
> > >
> > >Cheers,
> > >
> > >P.S.: after 0.3.0-incubating, are you thinking about graduation? I
> > >think
> > >you should ;-)
> > >
> > >
> > >
> > >On Oct 25, 2016, 11:53, at 11:53, "Sergio Fernández"
> > ><wik...@apache.org>
> > >> wrote:
> > >> >On Tue, Oct 25, 2016 at 11:36 AM, Jean-Baptiste Onofré
> > >> ><j...@nanthrax.net>
> > >> >wrote:
> > >> >
> > >> >> By the way, your vote is not binding from a podling perspective
> > >(you
> > >> >are
> > >> >> not PPMC). Your vote is binding from IPMC perspective (so when you
> > >> >will
> > >> >> vote on the incubator mailing list).
> > >> >>
> > >> >
> > >> >Well, PPMC are never binding votes, only IPMC are actually binding.
> > >> >That
> > >> >I'm not part of the PPMC is not much relevant. Therefore I think my
> > >> >vote is
> > >> >still a valid binding one; but I can vote again on
> > >general@incubator,
> > >> >no
> > >> >problem.
> > >> >
> > >> >Sorry for jumping-in too early. Besides a IPMC, I'm also a developer
> > >> >interested in Beam ;-)
> > >> >
> > >> >Cheers,
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >> On Oct 25, 2016, 11:33, at 11:33, "Sergio Fernández"
> > >> ><wik...@apache.org>
> > >> >> wrote:
> > >> >> >+1 (binding)
> > >> >> >
> > >> >> >So far I've successfully checked:
> > >> >> >* signatures and digests
> > >> >> >* source releases file layouts
> > >> >> >* matched git tags and commit ids
> > >> >> >* incubator suffix and disclaimer
> > >> >> >* NOTICE and LICENSE files
> > >> >> >* license headers
> > >> >> >* clean build (Java 1.8.0_91, Scala, 2.11.7, SBT 0.13.9, Debian
> > >> >amd64)
> > >> >> >
> > >> >> >
> > >> >> >Couple of minor issues I've seen it'd be great to have fixed in
> > >> >> >upcoming
> > >> >> >rel

Re: Start of release 0.3.0-incubating

2016-10-26 Thread Aljoscha Krettek
The release guide [1] has a section about that. Before doing a release we
check whether there are blocker issues or issues that have the
to-be-released version as the fix version. If there are any, those have to
be resolved before going forward with the release.

[1] http://beam.incubator.apache.org/contribute/release-guide/

On Wed, 26 Oct 2016 at 10:00 Maximilian Michels <m...@apache.org> wrote:

> For releases, legal matters have top priority, e.g. licensing issues
> can really get a project into trouble. Apart from that, what about
> testing various functionality of Beam with different runners before an
> actual release? Also, should we have a look at the list of open issues
> and decide whether we want to fix some of those for the upcoming
> release?
>
> For example, it would have been nice to update the Flink version of
> the Flink Runner to 1.1.3. Perhaps we can do that for the first minor
> release :)
>
> -Max
>
>
> On Mon, Oct 24, 2016 at 4:28 PM, Dan Halperin
> <dhalp...@google.com.invalid> wrote:
> > Thanks JB! (et al.) Excellent suggestions.
> >
> > Thanks,
> > Dan
> >
> > On Thu, Oct 20, 2016 at 9:32 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> Hi Dan,
> >>
> >> No problem, MQTT and other IOs will be in the next release..
> >>
> >> IMHO, it would be great to have:
> >> 1. A release reminder couple of days before a release. Just to ask
> >> everyone if there's no objection (something like this:
> >> https://lists.apache.org/thread.html/80de75df0115940ca402132
> >> 338b221e5dd5f669fd1bf915cd95e15c3@%3Cdev.karaf.apache.org%3E)
> >> 2. A roughly release schedule on the website (something like this:
> >> http://karaf.apache.org/download.html#container-schedule for instance).
> >>
> >> Just my $0.01 ;)
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 10/20/2016 06:30 PM, Dan Halperin wrote:
> >>
> >>> Hi JB,
> >>>
> >>> This is a great discussion to have! IMO, there's no special
> functionality
> >>> requirements for these pre-TLP releases. It's more important to make
> sure
> >>> we keep the process going. (I think we should start the release as
> soon as
> >>> possible, because it's been 2 months since the last one.)
> >>>
> >>> If we hold a release a week for MQTT, we'll hold it another week for
> some
> >>> other new feature, and then hold it again for some other new feature.
> >>>
> >>> Can you make a strong argument for why MQTT in particular should be
> >>> release
> >>> blocking?
> >>>
> >>> Dan
> >>>
> >>> On Thu, Oct 20, 2016 at 9:26 AM, Jean-Baptiste Onofré <j...@nanthrax.net
> >
> >>> wrote:
> >>>
> >>> +1
> >>>>
> >>>> Thanks Aljosha !!
> >>>>
> >>>> Do you mind to wait the week end or Monday to start the release ? I
> would
> >>>> like to include MqttIO if possible.
> >>>>
> >>>> Thanks !
> >>>> Regards
> >>>> JB
> >>>>
> >>>> ⁣
> >>>>
> >>>> On Oct 20, 2016, 18:07, at 18:07, Dan Halperin
> >>>> <dhalp...@google.com.INVALID>
> >>>> wrote:
> >>>>
> >>>>> On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
> >>>>> <aljos...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>> thanks for taking the time and writing this extensive doc!
> >>>>>>
> >>>>>> If no-one is against this I would like to be the release manager for
> >>>>>>
> >>>>> the
> >>>>>
> >>>>>> next (0.3.0-incubating) release. I would work with the guide and
> >>>>>>
> >>>>> update it
> >>>>>
> >>>>>> with anything that I learn along the way. Should I open a new thread
> >>>>>>
> >>>>> for
> >>>>>
> >>>>>> this or is it ok of nobody objects here?
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Aljoscha
> >>>>>>
> >>>>>>
> >>>>> Spinning this out as a separate thread.
> >>>>>
> >&g

Re: [DISCUSS] Current ongoing work on runners

2016-10-25 Thread Aljoscha Krettek
I think we might need to update the capability matrix with some of the new
features that have popped up. Immediate things that come to mind are:
 * Timer/State API for user DoFns (coupled with new-style DoFn) (not yet
completely in master)
 * SplittableDoFn

This would allow tracking the progress of each of these for each runner and
would not require hunting for that information in email threads.

On Tue, 25 Oct 2016 at 08:12 Jean-Baptiste Onofré  wrote:

> +1. For me it's one of the most important point for the new website. We
> should give a clear and exhaustive list of what we have, both for runners
> and IOs (with supported features).
>
> Regards
> JB
>
> ⁣​
>
> On Oct 24, 2016, 21:52, at 21:52, "Ismaël Mejía" 
> wrote:
> >Hello,
> >
> >I am really happy to see new runners been contributed to our community
> >(e.g. GearPump and Apex recently). We have not discussed a lot about
> >the
> >current capabilities of both runners.
> >
> >Following the recent discussion about making ongoing work more explicit
> >in
> >the mailing list, I would like to ask the people involved about the
> >current
> >status of them, I think it is important to discuss this (apart of
> >creating
> >the given JIRAs + updating the capability matrix docs) because more
> >people
> >can eventually jump and give a hand on open issues.
> >
> >I remember there was a google doc for the  capabilities of each runner,
> >is
> >this doc still available (sorry I lost the link). I suppose that once
> >these
> >ongoing runners mature we can add this doc also to the website.
> >https://beam.apache.org/learn/runners/capability-matrix/
> >
> >Regards,
> >Ismaël
> >
> >ps. @Amit, given that the spark 2 (Dataset based) runner has also a
> >feature
> >branch, if you consider it worth, can you please share a bit about that
> >work too.
> >
> >ps2. Can anyone please share the link to the google doc I was talking
> >about, I can't find it after the recent changes to the website.
> >​
>


Re: [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-25 Thread Aljoscha Krettek
Hi Sergio,
I used this line from the Apache release signing doc (
https://www.apache.org/dev/release-signing.html#md5):
$ gpg --print-md MD5 [fileName] > [fileName].md5

What is normally used to checksum/verify the md5 and sha hash? As it is
now, they can be manually verified by comparing the checksum in the file
with one that was derived from the zip. I hope that's still ok for the
release to go through?

Cheers,
Aljoscha

On Tue, 25 Oct 2016 at 11:53 Sergio Fernández <wik...@apache.org> wrote:

> On Tue, Oct 25, 2016 at 11:36 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > By the way, your vote is not binding from a podling perspective (you are
> > not PPMC). Your vote is binding from IPMC perspective (so when you will
> > vote on the incubator mailing list).
> >
>
> Well, PPMC are never binding votes, only IPMC are actually binding. That
> I'm not part of the PPMC is not much relevant. Therefore I think my vote is
> still a valid binding one; but I can vote again on general@incubator, no
> problem.
>
> Sorry for jumping-in too early. Besides a IPMC, I'm also a developer
> interested in Beam ;-)
>
> Cheers,
>
>
>
>
> > On Oct 25, 2016, 11:33, at 11:33, "Sergio Fernández" <wik...@apache.org>
> > wrote:
> > >+1 (binding)
> > >
> > >So far I've successfully checked:
> > >* signatures and digests
> > >* source releases file layouts
> > >* matched git tags and commit ids
> > >* incubator suffix and disclaimer
> > >* NOTICE and LICENSE files
> > >* license headers
> > >* clean build (Java 1.8.0_91, Scala, 2.11.7, SBT 0.13.9, Debian amd64)
> > >
> > >
> > >Couple of minor issues I've seen it'd be great to have fixed in
> > >upcoming
> > >releases:
> > >* MongoDbIOTest fails (addr already in use) when a Mongo service is
> > >locally
> > >running. I'd say the port should be random in the test suite.
> > >* How did you generate the checksums? Because both SHA1/MD5 can't be
> > >automatically checked because "no properly formatted SHA1/MD5 checksum
> > >lines found".
> > >
> > >Great to see the project moving forward at this speed :-)
> > >
> > >Cheers,
> > >
> > >
> > >
> > >On Mon, Oct 24, 2016 at 11:30 PM, Aljoscha Krettek
> > ><aljos...@apache.org>
> > >wrote:
> > >
> > >> Hi Team!
> > >>
> > >> Please review and vote at your leisure on release candidate #1 for
> > >version
> > >> 0.3.0-incubating, as follows:
> > >> [ ] +1, Approve the release
> > >> [ ] -1, Do not approve the release (please provide specific comments)
> > >>
> > >> The complete staging area is available for your review, which
> > >includes:
> > >> * JIRA release notes [1],
> > >> * the official Apache source release to be deployed to
> > >dist.apache.org
> > >> [2],
> > >> * all artifacts to be deployed to the Maven Central Repository [3],
> > >> * source code tag "v0.3.0-incubating-RC1" [4],
> > >> * website pull request listing the release and publishing the API
> > >reference
> > >> manual [5].
> > >>
> > >> Please keep in mind that this release is not focused on providing new
> > >> functionality. We want to refine the release process and make stable
> > >source
> > >> and binary artefacts available to our users.
> > >>
> > >> The vote will be open for at least 72 hours. It is adopted by
> > >majority
> > >> approval, with at least 3 PPMC affirmative votes.
> > >>
> > >> Cheers,
> > >> Aljoscha
> > >>
> > >> [1]
> > >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
> > >> [2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-
> > >> incubating/
> > >> [3]
> > >> https://repository.apache.org/content/repositories/staging/
> > >> org/apache/beam/
> > >> [4]
> > >> https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=
> > >> 5d86ff7f04862444c266142b0d5acecb5a6b7144
> > >> [5] https://github.com/apache/incubator-beam-site/pull/52
> > >>
> > >
> > >
> > >
> > >--
> > >Sergio Fernández
> > >Partner Technology Manager
> > >Redlink GmbH
> > >m: +43 6602747925 <+43%20660%202747925>
> > >e: sergio.fernan...@redlink.co
> > >w: http://redlink.co
> >
>
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925 <+43%20660%202747925>
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


Re: The Availability of PipelineOptions

2016-10-25 Thread Aljoscha Krettek
+1 This sounds quite straightforward.

On Tue, 25 Oct 2016 at 01:36 Thomas Groh  wrote:

> Hey everyone,
>
> I've been working on a declaration of intent for how we want to use
> PipelineOptions and an API change to be consistent with that intent. This
> is generally part of the move to the Runner API, specifically the desire to
> be able to reuse Pipelines and the ability to choose runner at the time of
> the call to run.
>
The high-level summary is that I want to remove the Pipeline.getPipelineOptions
> method.
>
> I believe this will be compatible with other in-flight proposals,
> especially Dynamic PipelineOptions, but would love to see what everyone
> else thinks. The document is available at the link below.
>
>
> https://docs.google.com/document/d/1Wr05cYdqnCfrLLqSk--XmGMGgDwwNwWZaFbxLKvPqEQ/edit?usp=sharing
>
> Thanks,
>
> Thomas
>


[VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-24 Thread Aljoscha Krettek
Hi Team!

Please review and vote at your leisure on release candidate #1 for version
0.3.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.3.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

Please keep in mind that this release is not focused on providing new
functionality. We want to refine the release process and make stable source
and binary artefacts available to our users.

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Cheers,
Aljoscha

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
[3]
https://repository.apache.org/content/repositories/staging/org/apache/beam/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=5d86ff7f04862444c266142b0d5acecb5a6b7144
[5] https://github.com/apache/incubator-beam-site/pull/52


Re: Maven Release Plugin Does Not Update Version of Archetypes

2016-10-24 Thread Aljoscha Krettek
Hi,
to unblock the release I'm changing the version manually now, yes. Would be
good to fix though.

Cheers,
Aljoscha

On Mon, 24 Oct 2016 at 20:30 Dan Halperin <dhalp...@google.com.invalid>
wrote:

> Hmm, this is new in 0.3.0, looks caused by
>
> https://github.com/apache/incubator-beam/commit/1f30255edcdd9c1e445b69248191c8552724f086#diff-4795b1d27449c01332aad192348eL111
> <
> https://www.google.com/url?q=https%3A%2F%2Fgithub.com%2Fapache%2Fincubator-beam%2Fcommit%2F1f30255edcdd9c1e445b69248191c8552724f086%23diff-4795b1d27449c01332aad192348eL111=D=1=AFQjCNGOYTW7DSiNZuGnOKWuHhggzsnztQ
> >
>
> Thinking if we can revert this part of the commit. Pei, Luke -- remember
> what's up?
>
> On Mon, Oct 24, 2016 at 11:17 AM, Dan Halperin <dhalp...@google.com>
> wrote:
>
> > Would it unblock the release to manually configure the version in the
> > 0.3.0-release branch?
> >
> > On Mon, Oct 24, 2016 at 11:09 AM, Dan Halperin <dhalp...@google.com>
> > wrote:
> >
> >> Correct issue link: https://issues.apache.org/jira/browse/BEAM-806
> >>
> >> No answers, but looking around.
> >>
> >> On Mon, Oct 24, 2016 at 10:10 AM, Aljoscha Krettek <aljos...@apache.org
> >
> >> wrote:
> >>
> >>> Hi,
> >>> are there any Maven mavens who happen to know how
> >>> https://issues.apache.org/jira/browse/BEAM-108 can be fixed? By the
> way,
> >>> the release plugin also does not update the version of the archetypes
> >>> when
> >>> setting the next SNAPSHOT version.
> >>>
> >>> IMHO, it's a bit of a release blocker so I'm hoping we can get this
> >>> sorted
> >>> quickly. I did some preliminary research but couldn't find a solution
> but
> >>> if no-one knows how to fix it it seems I have to dig deeper myself.
> >>>
> >>> Cheers,
> >>> Aljoscha
> >>>
> >>
> >>
> >
>


Re: [DISCUSS] Deferring (pre) combine for merging windows.

2016-10-24 Thread Aljoscha Krettek
@Amit: Yes, Flink is more "what you write is what you get". For example, in
Flink we have a Fold function for windows which cannot be efficiently
computed with merging windows (it would require using a "group by" window
and then folding the iterable). We just don't allow this.

For Beam, I think it's ok if we clearly define Combine in terms of
GroupByKey | CombineValues (which we do). With different runners it's hard
to enforce common optimisation strategies.
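
To spell that composite definition out, here is a minimal sketch (p is an
existing Pipeline, sumFn stands for any associative CombineFn<Integer, ?,
Integer>, and the exact generics may differ between SDK versions):

PCollection<KV<String, Integer>> input =
    p.apply(Create.of(KV.of("a", 1), KV.of("a", 2), KV.of("b", 3)));

// What the user writes:
PCollection<KV<String, Integer>> combined =
    input.apply(Combine.<String, Integer, Integer>perKey(sumFn));

// What it is defined to mean, result-wise:
PCollection<KV<String, Integer>> composite =
    input
        .apply(GroupByKey.<String, Integer>create())
        .apply(Combine.<String, Integer, Integer>groupedValues(sumFn));

// A runner may pre-combine before the shuffle for efficiency, but only where
// that cannot produce a result the composite form above could not have produced.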

On Sun, 23 Oct 2016 at 06:02 Robert Bradshaw 
wrote:

> On Sat, Oct 22, 2016 at 2:38 AM, Amit Sela  wrote:
> > I understand the semantics, but I feel like there might be a different
> > point of view for open-source runners.
>
> It seems we're losing a major promise of the runner interchangeability
> story if different runners can give different results for a
> well-defined transformation. I strongly feel we should avoid that path
> whenever possible. Specifically in this case Combine.perKey should
> mean the same thing on all runners (namely its composite definition),
> and only be executed differently when it's safe to do so.
>
> > Dataflow is a service, and it tries to do its best to optimize execution
> > while users don't have to worry about internal implementation (they are
> not
> > aware of it).
> > I can assure
> > <
> https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
> >
> > you that for Spark users, applying groupByKey instead of combinePerKey is
> > an important note.
>
> For sure. Dataflow calls this out too. See the second star at
> https://cloud.google.com/dataflow/model/combine#using-combine-perkey
> (though it's not called out as prominently as it is for Spark
> users--likely should be more). Beam documentation should make this
> point as well.
>
> > @Aljoscha do Flink users (working on Flink native API) usually care about
> > this difference of implementation ?
> > Any other runners that can provide input ?
>
> IIRC, Flink and Dataflow (and, trivially, the direct runner) all avoid
> this unsafe optimization when merging windows are mixed with
> non-global side inputs.
>
> Note also that the user of the Combine.perKey transform may not know
> the choice of windowing of the main or side inputs, so can't make this
> determination of whether it's safe to use this optimization. (As a
> concrete example, suppose I created a TopNPercent transform that did a
> global count and passed that as a side input to the Top CombineFn.)
>
> > On Sat, Oct 22, 2016 at 2:25 AM Robert Bradshaw
> 
> > wrote:
> >
> > Combine.perKey() is defined as GroupByKey() | Combine.values().
> >
> > A runner is free, in fact encouraged, to take advantage of the
> > associative properties of CombineFn to compute the result of
> > GroupByKey() | Combine.values() as cheaply as possible, but it is
> > incorrect to produce something that could not have been produced by
> > this composite implementation. (In the case of deterministic trigger
> > firing, (e.g. the default trigger), plus assuming of course a
> > associative, deterministic CombineFn, there is exactly one correct
> > output for every input no matter the WindowFns).
> >
> > A corollary to this is that we cannot apply combining operations that
> > inspect the main input window (including side inputs where the mapping
> > is anything but the constant map (like to GlobalWindow)) until the
> > main input window is known.
> >
> >
> > On Fri, Oct 21, 2016 at 3:50 PM, Amit Sela  wrote:
> >> Please excuse my typos and apply "s/differ/defer/g" ;-).
> >> Amit.
> >>
> >> On Fri, Oct 21, 2016 at 2:59 PM Amit Sela  wrote:
> >>
> >>> I'd like to raise an issue that was discussed in BEAM-696
> >>> .
> >>> I won't recap here because it would be extensive (and probably
> >>> exhaustive), and I'd also like to restart the discussion here rather
> than
> >>> summarize it.
> >>>
> >>> *The problem*
> >>> In the case of (main) input in a merging window (e.g. Sessions) with
> >>> sideInputs, pre-combining might lead to non-deterministic behaviour,
> for
> >>> example:
> >>> Main input: e1 (time: 3), e2 (time: 5)
> >>> Session: gap duration of 3 -> e1 alone belongs to [3, 6), e2 alone [5,
> > 8),
> >>> combined together the merging of their windows yields [3, 8).
> >>> Matching SideInputs with FixedWindows of size 2 should yield - e1
> > matching
> >>> sideInput window [4, 6), e2 [6, 8), merged [6, 8).
> >>> Now, if the sideInput is used in a merging step of the combine, and
> both
> >>> elements are a part of the same bundle, the sideInput accessed will
> >>> correspond to [6, 8) which is the expected behaviour, but if e1 is
> >>> pre-combined in a separate bundle, it will access sideInput for [4, 6)
> >>> which is wrong.
> >>> ** this can tend to be a bit confusing, so any
> > 

Re: Tracking backward-incompatible changes for Beam

2016-10-22 Thread Aljoscha Krettek
Very good idea!

Should we already start thinking about automatic tests for backwards
compatibility of the API?

On Fri, 21 Oct 2016 at 10:56 Jean-Baptiste Onofré  wrote:

> Hi Dan,
>
> +1, good idea.
>
> Regards
> JB
>
> On 10/21/2016 02:21 AM, Dan Halperin wrote:
> > Hey everyone,
> >
> > In the Beam codebase, we’ve improved, rewritten, or deleted many APIs.
> > While this has improved the model and gives us great freedom to
> experiment,
> > we are also causing churn on users authoring Beam libraries and
> pipelines.
> >
> > To really kick off Beam as something users can depend on, we need to
> > stabilize the Beam API. Stabilizing means a commitment to not making
> > breaking changes -- except between major versions as per standard
> semantic
> > versioning.
> >
> > To get there, I’ve started a process for tracking these changes by
> applying
> > the `backward-incompatible` label [1] to the corresponding JIRA issues.
> > Naturally, open `backward-incompatible` changes are “blocking issues” for
> > the first stable release. (Or we’ll have to put them off for the next
> major
> > version!)
> >
> > So here are some requests for help:
> > * Please review and appropriately label the components I skipped:
> > runner-{apex, flink, gearpump, spark}, sdk-py.
> > * Please proactively file JIRA issues for breaking API changes you still
> > want to make, and label them.
> >
> > Thanks everyone!
> > Dan
> >
> >
> > [1]
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20labels%20%3D%20backward-incompatible
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [ANNOUNCEMENT] New committers!

2016-10-22 Thread Aljoscha Krettek
Welcome everyone! +3 :-)

On Sat, 22 Oct 2016 at 06:43 Jean-Baptiste Onofré  wrote:

> Just a small thing.
>
> If it's not already done, don't forget to sign an ICLA and let us know
> your apache ID.
>
> Thanks,
> Regards
> JB
>
> On 10/22/2016 12:18 AM, Davor Bonaci wrote:
> > Hi everyone,
> > Please join me and the rest of Beam PPMC in welcoming the following
> > contributors as our newest committers. They have significantly
> contributed
> > to the project in different ways, and we look forward to many more
> > contributions in the future.
> >
> > * Thomas Weise
> > Thomas authored the Apache Apex runner for Beam [1]. This is an exciting
> > new runner that opens a new user base. It is a large contribution, which
> > starts the whole new component with a great potential.
> >
> > * Jesse Anderson
> > Jesse has contributed significantly by promoting Beam. He has
> co-developed
> > a Beam tutorial and delivered it at a top big data conference. He
> published
> > several blog posts positioning Beam, a Q&A with the Apache Beam team, and a
> > demo video how to run Beam on multiple runners [2]. On the side, he has
> > authored 7 pull requests and reported 6 JIRA issues.
> >
> > * Thomas Groh
> > Since starting incubation, Thomas has contributed the most commits to the
> > project [3], a total of 226 commits, which is more than anybody else. He
> > has contributed broadly to the project, most significantly by developing
> > from scratch the DirectRunner that supports the full model semantics.
> > Additionally, he has contributed a new set of APIs for testing unbounded
> > pipelines. He published a blog highlighting this work.
> >
> > Congratulations to all three! Welcome!
> >
> > Davor
> >
> > [1] https://github.com/apache/incubator-beam/tree/apex-runner
> > [2] http://www.smokinghand.com/
> > [3] https://github.com/apache/incubator-beam/graphs/contributors
> > ?from=2016-02-01&to=2016-10-14&type=c
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Start of release 0.3.0-incubating

2016-10-21 Thread Aljoscha Krettek
+1 @JB

We should definitely keep that in mind for the next releases. I think this
one is now sufficiently announced so I'll get started on the process.
(Which will take me a while since I have to do all the initial setup.)



On Fri, 21 Oct 2016 at 06:32 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Hi Dan,
>
> No problem, MQTT and other IOs will be in the next release..
>
> IMHO, it would be great to have:
> 1. A release reminder couple of days before a release. Just to ask
> everyone if there's no objection (something like this:
>
> https://lists.apache.org/thread.html/80de75df0115940ca402132338b221e5dd5f669fd1bf915cd95e15c3@%3Cdev.karaf.apache.org%3E
> )
> 2. A roughly release schedule on the website (something like this:
> http://karaf.apache.org/download.html#container-schedule for instance).
>
> Just my $0.01 ;)
>
> Regards
> JB
>
> On 10/20/2016 06:30 PM, Dan Halperin wrote:
> > Hi JB,
> >
> > This is a great discussion to have! IMO, there's no special functionality
> > requirements for these pre-TLP releases. It's more important to make sure
> > we keep the process going. (I think we should start the release as soon
> as
> > possible, because it's been 2 months since the last one.)
> >
> > If we hold a release a week for MQTT, we'll hold it another week for some
> > other new feature, and then hold it again for some other new feature.
> >
> > Can you make a strong argument for why MQTT in particular should be
> release
> > blocking?
> >
> > Dan
> >
> > On Thu, Oct 20, 2016 at 9:26 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> +1
> >>
> >> Thanks Aljosha !!
> >>
> >> Do you mind to wait the week end or Monday to start the release ? I
> would
> >> like to include MqttIO if possible.
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> ⁣​
> >>
> >> On Oct 20, 2016, 18:07, at 18:07, Dan Halperin
> <dhalp...@google.com.INVALID>
> >> wrote:
> >>> On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
> >>> <aljos...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>> thanks for taking the time and writing this extensive doc!
> >>>>
> >>>> If no-one is against this I would like to be the release manager for
> >>> the
> >>>> next (0.3.0-incubating) release. I would work with the guide and
> >>> update it
> >>>> with anything that I learn along the way. Should I open a new thread
> >>> for
> >>>> this or is it ok of nobody objects here?
> >>>>
> >>>> Cheers,
> >>>> Aljoscha
> >>>>
> >>>
> >>> Spinning this out as a separate thread.
> >>>
> >>> +1 -- Sounds great to me!
> >>>
> >>> Dan
> >>>
> >>> On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
> >>> <aljos...@apache.org>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>> thanks for taking the time and writing this extensive doc!
> >>>>
> >>>> If no-one is against this I would like to be the release manager for
> >>> the
> >>>> next (0.3.0-incubating) release. I would work with the guide and
> >>> update it
> >>>> with anything that I learn along the way. Should I open a new thread
> >>> for
> >>>> this or is it ok of nobody objects here?
> >>>>
> >>>> Cheers,
> >>>> Aljoscha
> >>>>
> >>>> On Thu, 20 Oct 2016 at 07:10 Jean-Baptiste Onofré <j...@nanthrax.net>
> >>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> well done.
> >>>>>
> >>>>> As already discussed, it looks good to me ;)
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 10/20/2016 01:24 AM, Davor Bonaci wrote:
> >>>>>> Hi everybody,
> >>>>>> As a project, I think we should have a Release Guide to document
> >>> the
> >>>>>> process, have consistent releases, on-board additional release
> >>>> managers,
> >>>>>> and generally share knowledge. It is also one of the project
> >>> graduation
> >>>>>> guidelines.
> >>>>>>
> >>>>>> Dan and I wrote a draft version, documenting the process we did
> >>> for the
> >>>>>> first two releases. It is currently in a pull request [1]. I'd
> >>> invite
> >>>>>> everyone interested to take a peek and comment, either on the
> >>> pull
> >>>>> request
> >>>>>> itself or here on mailing list, as appropriate.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Davor
> >>>>>>
> >>>>>> [1] https://github.com/apache/incubator-beam-site/pull/49
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> jbono...@apache.org
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Release Guide

2016-10-20 Thread Aljoscha Krettek
Hi,
thanks for taking the time and writing this extensive doc!

If no-one is against this I would like to be the release manager for the
next (0.3.0-incubating) release. I would work with the guide and update it
with anything that I learn along the way. Should I open a new thread for
this or is it ok if nobody objects here?

Cheers,
Aljoscha

On Thu, 20 Oct 2016 at 07:10 Jean-Baptiste Onofré  wrote:

> Hi,
>
> well done.
>
> As already discussed, it looks good to me ;)
>
> Regards
> JB
>
> On 10/20/2016 01:24 AM, Davor Bonaci wrote:
> > Hi everybody,
> > As a project, I think we should have a Release Guide to document the
> > process, have consistent releases, on-board additional release managers,
> > and generally share knowledge. It is also one of the project graduation
> > guidelines.
> >
> > Dan and I wrote a draft version, documenting the process we did for the
> > first two releases. It is currently in a pull request [1]. I'd invite
> > everyone interested to take a peek and comment, either on the pull
> request
> > itself or here on mailing list, as appropriate.
> >
> > Thanks,
> > Davor
> >
> > [1] https://github.com/apache/incubator-beam-site/pull/49
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-17 Thread Aljoscha Krettek
Congrats! :-)

On Mon, 17 Oct 2016 at 18:55 Kenneth Knowles  wrote:

> *I would like to :-)
>
> On Mon, Oct 17, 2016 at 9:51 AM Kenneth Knowles  wrote:
>
> > Hi all,
> >
> > I would to, once again, call attention to a great addition to Beam: a
> > runner for Apache Apex.
> >
> > After lots of review and much thoughtful revision, pull request #540 has
> > been merged to the apex-runner feature branch today. Please do take a
> look,
> > and help us put the finishing touches on it to get it ready for the
> master
> > branch.
> >
> > And please also congratulate and thank Thomas Weise for this large
> > endeavor, Vlad Rosov who helped get the integration tests working, and
> > Guarav Gupta who contributed review comments.
> >
> > Kenn
> >
>


Re: Simplifying User-Defined Metrics in Beam

2016-10-13 Thread Aljoscha Krettek
I finally found the time to have a look. :-)

The API looks very good! (It's very similar to an API we recently added to
Flink, which is inspired by the same Codahale/Dropwizard metrics).

About the semantics, the "A", "B" and "C" you mention in the doc: doesn't
this mean that we have to keep the metrics in some fault-tolerant way?
Almost in something like the StateInternals, because they should survive
failures and contain the metrics over the successful runs. (Side note: in
Flink the metrics are just "since the last restart from failure" in case of
failures.)

About querying the metrics, what we have mostly seen is that people want to
integrate metrics into a Metrics system that they already have in place.
They use Graphite, StatsD or simply JMX for this. In Flink we provide an
API for reporters that users can plug in to export the metrics to their
system of choice. I'm sure some people will like the option of having the
metrics queryable on the PipelineResult but I would assume that for most
production use cases integration with a metrics system is more important.
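
To make the reporter idea concrete, here is a purely hypothetical sketch of
what such a plug-in could look like in Beam (none of these types exist in the
SDK today; the shape is simply borrowed from Flink's reporter API):

public interface MetricsReporter {
  // Called once at pipeline start, e.g. to open a Graphite/StatsD connection
  // or register JMX beans.
  void open(PipelineOptions options);

  // Called periodically by the runner with current metric values (simplified
  // here to counters and distribution means).
  void report(Map<String, Long> counters, Map<String, Double> distributionMeans);

  // Called when the pipeline finishes or is cancelled.
  void close();
}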

Regarding removal of Aggregators I'm for B, to quote Saint Exupéry:
  "It seems that perfection is attained not when there is nothing more to
add, but when there is nothing more to remove."

Cheers,
Aljoscha

On Wed, 12 Oct 2016 at 20:22 Robert Bradshaw 
wrote:

> +1 to the new metrics design. I strongly favor B as well.
>
> On Wed, Oct 12, 2016 at 10:54 AM, Kenneth Knowles
>  wrote:
> > Correction: In my eagerness to see the end of aggregators, I mistook the
> > intention. Both A and B leave aggregators in place until there is a
> > replacement. In which case, I am strongly in favor of B. As soon as we
> can
> > remove aggregators, I think we should.
> >
> > On Wed, Oct 12, 2016 at 10:48 AM Kenneth Knowles  wrote:
> >
> >> Huzzah! This is IMO a really great change. I agree that we can get
> >> something in to allow work to continue, and improve the API as we learn.
> >>
> >> On Wed, Oct 12, 2016 at 10:20 AM Ben Chambers
> 
> >> wrote:
> >>
> >> 3. One open question is what to do with Aggregators. In the doc I
> mentioned
> >>
> >> that long term I'd like to consider whether we can improve Aggregators
> to
> >> be a better fit for the model by supporting windowing and allowing them
> to
> >> serve as input for future steps. In the interim it's not clear what we
> >> should do with them. The two obvious (and extreme) options seem to be:
> >>
> >>
> >>
> >>   Option A: Do nothing, leave aggregators as they are until we revisit.
> >>
> >>
> >>   Option B: Remove aggregators from the SDK until we revisit.
> >>
> >> I'd like to suggest removing Aggregators once the existing runners have
> >> reasonable support for Metrics. Doing so reduces the surface area we
> need
> >> to maintain/support and simplifies other changes being made. It will
> also
> >> allow us to revisit them from a clean slate.
> >>
> >>
> >> +1 to removing aggregators, either of A or B. The new metrics design
> >> addresses aggregator use cases as well or better.
> >>
> >> So A vs B is a choice of whether we have a gap with no aggregator or
> >> metrics-like functionality. I think that is perhaps a bit of a bummer
> for
> >> users, and we will likely port over the runner code for it, so we
> wouldn't
> >> want to actually delete it, right? Can we do it in a week or two?
> >>
> >> One thing motivating me to do this quickly: Currently the new DoFn does
> >> not have its own implementation of aggregators, but leverages that of
> >> OldDoFn, so we cannot remove OldDoFn until either (1) new DoFn
> >> re-implements the aggregator instantiation and worker-side delegation
> (not
> >> hard, but it is throwaway code) or (2) aggregators are removed. This
> >> dependency also makes running the new DoFn directly (required for the
> state
> >> API) a bit more annoying.
> >>
>


Re: Simplifying User-Defined Metrics in Beam

2016-10-06 Thread Aljoscha Krettek
Hi,
I'm currently in holidays but I'll put some thought into this and give my
comments once I get back.

Aljoscha

On Wed, Oct 5, 2016, 21:51 Ben Chambers 
wrote:

> To provide some more background I threw together a quick doc outlining my
> current thinking for this Metrics API. You can find it at
> http://s.apache.org/beam-metrics-api.
>
> The first PR (https://github.com/apache/incubator-beam/pull/1024)
> introducing these APIs for the direct runner is hopefully nearing
> completion. If there are no objections, I'd like to check it in and start
> working on hooking this up to other runners to flesh out how this will
> interact with them. We can continue to iterate on the API and concepts in
> the doc and create follow-up PRs for any changes we'd like to make.
>
> As always, let me know if there are any questions or comments!
>
> -- Ben
>
> On Wed, Sep 28, 2016 at 5:05 PM Ben Chambers  wrote:
>
> I started looking at BEAM-147: “Rename Aggregator to [P]Metric”. Rather
> than renaming the existing concept I’d like to introduce Metrics as a
> simpler mechanism to provide information during pipeline execution (I have
> updated the issue accordingly).
>
> Here is what I'm thinking would lead to a simpler API focused on reporting
> metrics about pipeline execution:
>
> 1. Rather than support arbitrary Combine functions, Metrics support a set
>    of specific aggregations with documented use-cases (e.g., Counter, Meter,
>    Distribution, etc.) and an API inspired by the Dropwizard Metrics library.
> 2. Rather than requiring declaration during pipeline construction (like
>    Aggregators), Metrics allow declaration at any point because it is easier
>    to use.
> 3. Metrics provide more documented flexibility in how runners support them,
>    by allowing each runner to provide different details about metrics and
>    support different kinds of metrics, while clearly documenting what the
>    kinds are and what should happen if they aren’t supported. This allows
>    users to use metrics in a reliable way even though runners may implement
>    them differently.
>
>
> # What does the Metrics API look like?
>
> The API for using metrics would be relatively simple:
>
> // Metrics can be used as fields:
> private final Counter cnt = Metrics.counter("mycode", "odd-elements");
>
> @ProcessElement
> public void processElement(ProcessContext c) {
>   if (c.element() % 2 == 1) {
>     cnt.inc();
>   }
>   // Metrics can be created dynamically:
>   Metrics.distribution("mycode", "elements").report(c.element());
>   ...
> }
>
> # What Kinds of Metrics could there be?
>
> There are many kinds of metrics that seem like they could be useful. We
> could eventually support metrics like the following:
>
> - Counter: Can be incremented/decremented. Will be part of the initial
>   implementation.
> - Distribution: Values can be reported and various statistics are reported.
>   The initial implementation will support “easy” statistics like
>   MIN/MAX/MEAN/SUM/COUNT. We’d like to support quantiles in the future to
>   make this more comparable to Dropwizard’s Histogram.
> - (Future) Meter: Method to indicate something happened. Computes the rate
>   of occurrences.
> - (Future) Timer: A meter measuring how often something happens plus a
>   distribution of how long it took each time.
> - (Future) Frequent Elements: Reports values that occurred more than N% of
>   the time.
>
>
> # What are the next steps?
>
> I’ve started work prototyping the new API by implementing it for the Java
> DirectRunner. To see an example pipeline that reports a Counter and a
> Distribution, take a look at the first PR
> https://github.com/apache/incubator-beam/pull/1024
>
> # Where does that leave Aggregators?
> Hopefully, this new Metrics API addresses the goals of monitoring a
> pipeline more cleanly than Aggregators. In the long term, it would be good
> to make Aggregators a more complete participant in the model, by adding
> support for windowing and allowing the results to be used as input to later
> steps in the pipeline. Or to make them completely unnecessary by making it
> easy to use side-outputs with the new reflective DoFn approach. Once
> Metrics are available, we may want to deprecate or remove Aggregators until
> we’re ready to figure out what the right API is.
>


About Finishing Triggers

2016-09-14 Thread Aljoscha Krettek
Hi,
I had a chat with Kenn at Flink Forward and he made an off-hand remark about
how it might be better if triggers were not allowed to mark a window as
finished and instead always behaved as "Repeatedly" (if I understood correctly).

Maybe you (Kenn) could go a bit more in depth about what you meant by this
and if we should actually change this in Beam. Would this mean that we then
have the opposite of Repeatedly, i.e. Once, or Only.once(T)?

I also noticed some inconsistencies in when triggers behave as repeated
triggers and once triggers. For example, AfterPane.elementCountAtLeast(5)
only fires once if used alone but fires repeatedly if used as the
speculative trigger in
AfterWatermark.pastEndOfWindow().withEarlyFirings(...). (This is true for
all "once" triggers.)

Cheers,
Aljoscha


Re: Anyone @scale tomorrow?

2016-09-09 Thread Aljoscha Krettek
Cool, thanks for letting us know!

On Fri, Sep 9, 2016, 18:45 Dan Halperin  wrote:

> Hey folks,
>
> Wanted to let you know that the Beam talk went pretty well. People were
> *very* excited about Beam -- loved the idea of not having to rewrite their
> pipelines every time they want to try out a new runner. There was a large
> Flink user contingent especially. We answered questions for almost two
> hours afterwards.
>
> If you weren't able to attend, here is a video of the talk (warning: I
> haven't watched it...)
>
> https://atscaleconference.com/videos/no-shard-left-behind-apis-for-massive-parallel-efficiency/
>
> Thanks,
> Dan
>
> On Tue, Aug 30, 2016 at 7:32 PM, Dan Halperin  wrote:
>
> > I'll be giving a talk at the Facebook @scale conference tomorrow.
> >
> > Sorry for the late notice, but if anyone is around to meet in the hallway
> > track or have lunch or drinks, reach out. I'd love to connect.
> >
> > Dan
> >
>


Re: KafkaIO Windowing Fn

2016-09-02 Thread Aljoscha Krettek
Ah, now I remember that the Flink runner never supported processing-time
timers. I created a Jira issue for this:
https://issues.apache.org/jira/browse/BEAM-615

On Thu, 1 Sep 2016 at 19:20 Chawla,Sumit <sumitkcha...@gmail.com> wrote:

> Thanks Ajioscha\Thomas
>
> I will explore on the option to upgrade.  Meanwhile here is what observed
> with the above code in my local Flink Cluster.
>
> 1.  To start there are 0 records in Kafka
> 2.  Deploy the pipeline.  Two records are received in Kafka at time
> 10:00:00 AM
> 3.  The Pane with 100 records would not fire because expected data is not
> there.  I would expect the 30 sec based filter to fire and downstream to
> receive the record around 10:00:30 AM.
>
> 4.  No new records are arriving.  The downstream received the above record
> around 10 minutes later around 10:10:00 AM
>
> I am not sure whats actually triggering the window firing here.  ( does not
> look like to be 30 sec trigger)
>
>
>
> Regards
> Sumit Chawla
>
>
> On Wed, Aug 31, 2016 at 11:14 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Ah I see, the Flink Runner had quite some updates in 0.2.0-incubating and
> > even more for the upcoming 0.3.0-incubating.
> >
> > On Thu, 1 Sep 2016 at 04:09 Thomas Groh <tg...@google.com.invalid>
> wrote:
> >
> > > In 0.2.0-incubating and beyond we've replaced the DirectPipelineRunner
> > with
> > > the DirectRunner (formerly InProcessPipelineRunner), which is capable
> of
> > > handling Unbounded Pipelines. Is it possible for you to upgrade?
> > >
> > > On Wed, Aug 31, 2016 at 5:17 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> > > wrote:
> > >
> > > > @Ajioscha,  My assumption is here that atleast one trigger should
> fire.
> > > > Either the 100 elements or the 30 second since first element.
> > (whichever
> > > > happens first)
> > > >
> > > > @Thomas - here is the error i get: I am using 0.1.0-incubating
> > > >
> > > > *ava.lang.IllegalStateException: no evaluator registered for
> > > > Read(UnboundedKafkaSource)*
> > > >
> > > > * at
> > > > org.apache.beam.sdk.runners.DirectPipelineRunner$Evaluator.
> > > > visitPrimitiveTransform(DirectPipelineRunner.java:890)*
> > > > * at
> > > > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > > > TransformTreeNode.java:225)*
> > > > * at
> > > > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > > > TransformTreeNode.java:220)*
> > > > * at
> > > > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > > > TransformTreeNode.java:220)*
> > > > * a*
> > > >
> > > > Regards
> > > > Sumit Chawla
> > > >
> > > >
> > > > On Wed, Aug 31, 2016 at 10:19 AM, Aljoscha Krettek <
> > aljos...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > could the reason for the second part of the trigger never firing be
> > > that
> > > > > there are never at least 100 elements per key. The trigger would
> only
> > > > fire
> > > > > if it saw 100 elements and with only 540 elements that seems
> unlikely
> > > if
> > > > > you have more than 6 keys.
> > > > >
> > > > > Cheers,
> > > > > Aljoscha
> > > > >
> > > > > On Wed, 31 Aug 2016 at 17:47 Thomas Groh <tg...@google.com.invalid
> >
> > > > wrote:
> > > > >
> > > > > > KafkaIO is implemented using the UnboundedRead API, which is
> > > supported
> > > > by
> > > > > > the DirectRunner. You should be able to run without the
> > > > > withMaxNumRecords;
> > > > > > if you can't, I'd be very interested to see the stack trace that
> > you
> > > > get
> > > > > > when you try to run the Pipeline.
> > > > > >
> > > > > > On Tue, Aug 30, 2016 at 11:24 PM, Chawla,Sumit <
> > > sumitkcha...@gmail.com
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Yes.  I added it only for DirectRunner as it cannot translate
> > > > > > > Read(UnboundedSourceOfKafka)
> > > > > > >
> > > > > > > Regards
> > > > > > > Sumit Chawla
> >

Re: KafkaIO Windowing Fn

2016-09-01 Thread Aljoscha Krettek
Ah I see, the Flink Runner had quite some updates in 0.2.0-incubating and
even more for the upcoming 0.3.0-incubating.

On Thu, 1 Sep 2016 at 04:09 Thomas Groh <tg...@google.com.invalid> wrote:

> In 0.2.0-incubating and beyond we've replaced the DirectPipelineRunner with
> the DirectRunner (formerly InProcessPipelineRunner), which is capable of
> handling Unbounded Pipelines. Is it possible for you to upgrade?
>
> On Wed, Aug 31, 2016 at 5:17 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
> > @Aljoscha,  My assumption here is that at least one trigger should fire.
> > Either the 100 elements or the 30 second since first element. (whichever
> > happens first)
> >
> > @Thomas - here is the error i get: I am using 0.1.0-incubating
> >
> > *java.lang.IllegalStateException: no evaluator registered for
> > Read(UnboundedKafkaSource)*
> >
> > * at
> > org.apache.beam.sdk.runners.DirectPipelineRunner$Evaluator.
> > visitPrimitiveTransform(DirectPipelineRunner.java:890)*
> > * at
> > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > TransformTreeNode.java:225)*
> > * at
> > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > TransformTreeNode.java:220)*
> > * at
> > org.apache.beam.sdk.runners.TransformTreeNode.visit(
> > TransformTreeNode.java:220)*
> > * a*
> >
> > Regards
> > Sumit Chawla
> >
> >
> > On Wed, Aug 31, 2016 at 10:19 AM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > Hi,
> > > could the reason for the second part of the trigger never firing be
> that
> > > there are never at least 100 elements per key. The trigger would only
> > fire
> > > if it saw 100 elements and with only 540 elements that seems unlikely
> if
> > > you have more than 6 keys.
> > >
> > > Cheers,
> > > Aljoscha
> > >
> > > On Wed, 31 Aug 2016 at 17:47 Thomas Groh <tg...@google.com.invalid>
> > wrote:
> > >
> > > > KafkaIO is implemented using the UnboundedRead API, which is
> supported
> > by
> > > > the DirectRunner. You should be able to run without the
> > > withMaxNumRecords;
> > > > if you can't, I'd be very interested to see the stack trace that you
> > get
> > > > when you try to run the Pipeline.
> > > >
> > > > On Tue, Aug 30, 2016 at 11:24 PM, Chawla,Sumit <
> sumitkcha...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Yes.  I added it only for DirectRunner as it cannot translate
> > > > > Read(UnboundedSourceOfKafka)
> > > > >
> > > > > Regards
> > > > > Sumit Chawla
> > > > >
> > > > >
> > > > > On Tue, Aug 30, 2016 at 11:18 PM, Aljoscha Krettek <
> > > aljos...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > Ah ok, this might be a stupid question but did you remove this
> line
> > > > when
> > > > > > running it with Flink:
> > > > > > .withMaxNumRecords(500)
> > > > > >
> > > > > > Cheers,
> > > > > > Aljoscha
> > > > > >
> > > > > > On Wed, 31 Aug 2016 at 08:14 Chawla,Sumit <
> sumitkcha...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Aljoscha
> > > > > > >
> > > > > > > The code is not different while running on Flink.  It have
> > removed
> > > > > > business
> > > > > > > specific transformations only.
> > > > > > >
> > > > > > > Regards
> > > > > > > Sumit Chawla
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Aug 30, 2016 at 4:49 AM, Aljoscha Krettek <
> > > > aljos...@apache.org
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > > could you maybe also post the complete that you're using with
> > the
> > > > > > > > FlinkRunner? I could have a look into it.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Aljoscha
> > > > > > > >
> > > > > > > > On Tue, 30 Aug 2016 at 09:01 Chawla,Sumit <

Re: KafkaIO Windowing Fn

2016-08-31 Thread Aljoscha Krettek
Hi,
could the reason for the second part of the trigger never firing be that
there are never at least 100 elements per key? The trigger would only fire
if it saw 100 elements and with only 540 elements that seems unlikely if
you have more than 6 keys.

Cheers,
Aljoscha

On Wed, 31 Aug 2016 at 17:47 Thomas Groh <tg...@google.com.invalid> wrote:

> KafkaIO is implemented using the UnboundedRead API, which is supported by
> the DirectRunner. You should be able to run without the withMaxNumRecords;
> if you can't, I'd be very interested to see the stack trace that you get
> when you try to run the Pipeline.
>
> On Tue, Aug 30, 2016 at 11:24 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
> > Yes.  I added it only for DirectRunner as it cannot translate
> > Read(UnboundedSourceOfKafka)
> >
> > Regards
> > Sumit Chawla
> >
> >
> > On Tue, Aug 30, 2016 at 11:18 PM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > Ah ok, this might be a stupid question but did you remove this line
> when
> > > running it with Flink:
> > > .withMaxNumRecords(500)
> > >
> > > Cheers,
> > > Aljoscha
> > >
> > > On Wed, 31 Aug 2016 at 08:14 Chawla,Sumit <sumitkcha...@gmail.com>
> > wrote:
> > >
> > > > Hi Aljoscha
> > > >
> > > > The code is not different while running on Flink.  It have removed
> > > business
> > > > specific transformations only.
> > > >
> > > > Regards
> > > > Sumit Chawla
> > > >
> > > >
> > > > On Tue, Aug 30, 2016 at 4:49 AM, Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > could you maybe also post the complete that you're using with the
> > > > > FlinkRunner? I could have a look into it.
> > > > >
> > > > > Cheers,
> > > > > Aljoscha
> > > > >
> > > > > On Tue, 30 Aug 2016 at 09:01 Chawla,Sumit <sumitkcha...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Thomas
> > > > > >
> > > > > > Sorry i tried with DirectRunner but ran into some kafka issues.
> > > > > Following
> > > > > > is the snippet i am working on, and will post more details once i
> > get
> > > > it
> > > > > > working ( as of now i am unable to read messages from Kafka using
> > > > > > DirectRunner)
> > > > > >
> > > > > >
> > > > > > PipelineOptions pipelineOptions =
> PipelineOptionsFactory.create();
> > > > > > pipelineOptions.setRunner(DirectPipelineRunner.class);
> > > > > > Pipeline pipeline = Pipeline.create(pipelineOptions);
> > > > > > pipeline.apply(KafkaIO.read()
> > > > > > .withMaxNumRecords(500)
> > > > > > .withTopics(ImmutableList.of("mytopic"))
> > > > > > .withBootstrapServers("localhost:9092")
> > > > > > .updateConsumerProperties(ImmutableMap.of(
> > > > > > ConsumerConfig.GROUP_ID_CONFIG, "test1",
> > > > > > ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true",
> > > > > > ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
> > "earliest"
> > > > > > ))).apply(ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>,
> > > > > > KV<String, String>>() {
> > > > > > @Override
> > > > > > public void processElement(ProcessContext c) throws
> Exception {
> > > > > > KV<byte[], byte[]> record = c.element().getKV();
> > > > > > c.output(KV.of(new String(record.getKey()), new
> > > > > > String(record.getValue(;
> > > > > > }
> > > > > > }))
> > > > > > .apply("WindowByMinute", Window.<KV<String,
> > > > > > String>>into(FixedWindows.of(Duration.standardSeconds(10)))
> > > > > > .withAllowedLateness(Duration.standardSeconds(1))
> > > > > > .triggering(
> > > > > > Repeatedly.forever(
> > > > > > After

Re: KafkaIO Windowing Fn

2016-08-31 Thread Aljoscha Krettek
Ah ok, this might be a stupid question but did you remove this line when
running it with Flink:
.withMaxNumRecords(500)

Cheers,
Aljoscha

On Wed, 31 Aug 2016 at 08:14 Chawla,Sumit <sumitkcha...@gmail.com> wrote:

> Hi Aljoscha
>
> The code is not different while running on Flink.  I have removed business
> specific transformations only.
>
> Regards
> Sumit Chawla
>
>
> On Tue, Aug 30, 2016 at 4:49 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> > could you maybe also post the complete that you're using with the
> > FlinkRunner? I could have a look into it.
> >
> > Cheers,
> > Aljoscha
> >
> > On Tue, 30 Aug 2016 at 09:01 Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
> >
> > > Hi Thomas
> > >
> > > Sorry i tried with DirectRunner but ran into some kafka issues.
> > Following
> > > is the snippet i am working on, and will post more details once i get
> it
> > > working ( as of now i am unable to read messages from Kafka using
> > > DirectRunner)
> > >
> > >
> > > PipelineOptions pipelineOptions = PipelineOptionsFactory.create();
> > > pipelineOptions.setRunner(DirectPipelineRunner.class);
> > > Pipeline pipeline = Pipeline.create(pipelineOptions);
> > > pipeline.apply(KafkaIO.read()
> > > .withMaxNumRecords(500)
> > > .withTopics(ImmutableList.of("mytopic"))
> > > .withBootstrapServers("localhost:9092")
> > > .updateConsumerProperties(ImmutableMap.of(
> > > ConsumerConfig.GROUP_ID_CONFIG, "test1",
> > > ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true",
> > > ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"
> > > ))).apply(ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>,
> > > KV<String, String>>() {
> > > @Override
> > > public void processElement(ProcessContext c) throws Exception {
> > > KV<byte[], byte[]> record = c.element().getKV();
> > > c.output(KV.of(new String(record.getKey()), new
> > > String(record.getValue(;
> > > }
> > > }))
> > > .apply("WindowByMinute", Window.<KV<String,
> > > String>>into(FixedWindows.of(Duration.standardSeconds(10)))
> > > .withAllowedLateness(Duration.standardSeconds(1))
> > > .triggering(
> > > Repeatedly.forever(
> > > AfterFirst.of(
> > >
> > > AfterProcessingTime.pastFirstElementInPane()
> > >
> > > .plusDelayOf(Duration.standardSeconds(30)),
> > > AfterPane.elementCountAtLeast(
> > 100)
> > > )))
> > > .discardingFiredPanes())
> > > .apply("GroupByTenant", GroupByKey.create())
> > > .apply(ParDo.of(new DoFn<KV<String, Iterable>, Void>()
> {
> > > @Override
> > > public void processElement(ProcessContext c) throws
> > Exception {
> > > KV<String, Iterable> element = c.element();
> > > Iterator iterator =
> > element.getValue().iterator();
> > > int count = 0;
> > > while (iterator.hasNext()) {
> > > iterator.next();
> > > count++;
> > > }
> > > System.out.println(String.format("Key %s Value Count
> > > %d", element.getKey(), count));
> > > }
> > > }));
> > > pipeline.run();
> > >
> > >
> > >
> > > Regards
> > > Sumit Chawla
> > >
> > >
> > > On Fri, Aug 26, 2016 at 9:46 AM, Thomas Groh <tg...@google.com.invalid
> >
> > > wrote:
> > >
> > > > If you use the DirectRunner, do you observe the same behavior?
> > > >
> > > > On Thu, Aug 25, 2016 at 4:32 PM, Chawla,Sumit <
> sumitkcha...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Thomas
> > > > >
> > > > > I am using FlinkRunner.  Yes the second part of trigger never fires
> > for
> > > > me,
> > > > >
> > > > > Regards
> > > >

Re: KafkaIO Windowing Fn

2016-08-30 Thread Aljoscha Krettek
Hi,
could you maybe also post the complete code that you're using with the
FlinkRunner? I could have a look into it.

Cheers,
Aljoscha

On Tue, 30 Aug 2016 at 09:01 Chawla,Sumit  wrote:

> Hi Thomas
>
> Sorry i tried with DirectRunner but ran into some kafka issues.  Following
> is the snippet i am working on, and will post more details once i get it
> working ( as of now i am unable to read messages from Kafka using
> DirectRunner)
>
>
> PipelineOptions pipelineOptions = PipelineOptionsFactory.create();
> pipelineOptions.setRunner(DirectPipelineRunner.class);
> Pipeline pipeline = Pipeline.create(pipelineOptions);
> pipeline.apply(KafkaIO.read()
> .withMaxNumRecords(500)
> .withTopics(ImmutableList.of("mytopic"))
> .withBootstrapServers("localhost:9092")
> .updateConsumerProperties(ImmutableMap.of(
> ConsumerConfig.GROUP_ID_CONFIG, "test1",
> ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true",
> ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"
> ))).apply(ParDo.of(new DoFn<KafkaRecord<byte[], byte[]>,
> KV<String, String>>() {
> @Override
> public void processElement(ProcessContext c) throws Exception {
> KV<byte[], byte[]> record = c.element().getKV();
> c.output(KV.of(new String(record.getKey()), new
> String(record.getValue())));
> }
> }))
> .apply("WindowByMinute", Window.<KV<String, String>>into(FixedWindows.of(Duration.standardSeconds(10)))
> .withAllowedLateness(Duration.standardSeconds(1))
> .triggering(
> Repeatedly.forever(
> AfterFirst.of(
>
> AfterProcessingTime.pastFirstElementInPane()
>
> .plusDelayOf(Duration.standardSeconds(30)),
> AfterPane.elementCountAtLeast(100)
> )))
> .discardingFiredPanes())
> .apply("GroupByTenant", GroupByKey.create())
> .apply(ParDo.of(new DoFn<KV<String, Iterable>, Void>() {
> @Override
> public void processElement(ProcessContext c) throws Exception {
> KV<String, Iterable> element = c.element();
> Iterator iterator = element.getValue().iterator();
> int count = 0;
> while (iterator.hasNext()) {
> iterator.next();
> count++;
> }
> System.out.println(String.format("Key %s Value Count
> %d", element.getKey(), count));
> }
> }));
> pipeline.run();
>
>
>
> Regards
> Sumit Chawla
>
>
> On Fri, Aug 26, 2016 at 9:46 AM, Thomas Groh 
> wrote:
>
> > If you use the DirectRunner, do you observe the same behavior?
> >
> > On Thu, Aug 25, 2016 at 4:32 PM, Chawla,Sumit 
> > wrote:
> >
> > > Hi Thomas
> > >
> > > I am using FlinkRunner.  Yes the second part of trigger never fires for
> > me,
> > >
> > > Regards
> > > Sumit Chawla
> > >
> > >
> > > On Thu, Aug 25, 2016 at 4:18 PM, Thomas Groh  >
> > > wrote:
> > >
> > > > Hey Sumit;
> > > >
> > > > What runner are you using? I can set up a test with the same trigger
> > > > reading from an unbounded input using the DirectRunner and I get the
> > > > expected output panes.
> > > >
> > > > Just to clarify, the second half of the trigger ('when the first
> > element
> > > > has been there for at least 30+ seconds') simply never fires?
> > > >
> > > > On Thu, Aug 25, 2016 at 2:38 PM, Chawla,Sumit <
> sumitkcha...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Thomas
> > > > >
> > > > > That did not work.
> > > > >
> > > > > I tried following instead:
> > > > >
> > > > > .triggering(
> > > > > Repeatedly.forever(
> > > > > AfterFirst.of(
> > > > >   AfterProcessingTime.
> > > > pastFirstElementInPane()
> > > > > .plusDelayOf(Duration.standard
> > > > > Seconds(30)),
> > > > >   AfterPane.elementCountAtLeast(100)
> > > > > )))
> > > > > .discardingFiredPanes()
> > > > >
> > > > > What i am trying to do here.  This is to make sure that followup
> > > > > operations receive batches of records.
> > > > >
> > > > > 1.  Fire when at Pane has 100+ elements
> > > > >
> > > > > 2.  Or Fire when the first element has been there for atleast 30
> > sec+.
> > > > >
> > > > > However,  2 point does not seem to work.  e.g. I have 540 records
> in
> > > > > Kafka.  The first 500 records are available immediately,
> > > > >
> > > > > but the remaining 40 don't pass through. I was expecting 2nd to
> > > > > trigger to help here.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > Regards
> > > > > Sumit Chawla
> > > > >
> > > > >
> > > > > On Thu, Aug 25, 2016 at 1:13 PM, Thomas Groh
> > 

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-30 Thread Aljoscha Krettek
Thanks for the explanation Eugene and JB.

By the way, I'm not trying to find holes in this; I really like the
feature. I just sometimes wonder how a specific thing might be implemented
with this.

On Mon, 29 Aug 2016 at 18:00 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> Hi Aljoscha,
>
> The watermark reporting is done via
> ProcessContinuation.futureOutputWatermark, at the granularity of returning
> from individual processElement() calls - you return from the call and give
> a watermark on your future output. We assume that updating watermark is
> sufficient at a per-bundle level (or, if not, then that you can make
> bundles small enough) cause that's the same level at which state changes,
> timers etc. are committed.
> It can be implemented by setting a per-key watermark hold and updating it
> when each call for this element returns. That's the way it is implemented
> in my current prototype https://github.com/apache/incubator-beam/pull/896
> (see
> SplittableParDo.ProcessFn)
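
A schematic sketch of that mechanism (only processElement, the OffsetRange
restriction and ProcessContinuation.futureOutputWatermark are named in the
proposal; every other name below is an assumption about the prototype, not a
real signature):

@ProcessElement
public ProcessContinuation processElement(ProcessContext c, OffsetRange range) {
  // Hypothetical helper: reads and outputs records until the runner asks this
  // call to checkpoint, and returns the timestamp of the oldest record that
  // has not been emitted yet.
  Instant oldestUnemitted = readSomeRecordsAndOutput(c, range);

  // Returning hands the runner a watermark covering everything this element
  // will output after it resumes; as described above, the runner can keep it
  // as a per-key watermark hold until the resumption timer fires.
  return ProcessContinuation.resume().withFutureOutputWatermark(oldestUnemitted);
}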
>
> On Mon, Aug 29, 2016 at 2:55 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> > I have another question about this: currently, unbounded sources have
> > special logic for determining the watermark and the system periodically
> > asks the sources for the current watermark. As I understood it,
> watermarks
> > are only "generated" at the sources. How will this work when sources are
> > implemented as a combination of DoFns and SplittableDoFns? Will
> > SplittableDoFns be asked for a watermark, does this mean that watermarks
> > can then be "generated" at any operation?
> >
> > Cheers,
> > Aljoscha
> >
> > On Sun, 21 Aug 2016 at 22:34 Eugene Kirpichov
> <kirpic...@google.com.invalid
> > >
> > wrote:
> >
> > > Hi JB,
> > >
> > > Yes, I'm assuming you're referring to the "magic" part on the transform
> > > expansion diagram. This is indeed runner-specific, and timers+state are
> > > likely the simplest way to do this for an SDF that does unbounded
> amount
> > of
> > > work.
> > >
> > > On Sun, Aug 21, 2016 at 12:14 PM Jean-Baptiste Onofré <j...@nanthrax.net
> >
> > > wrote:
> > >
> > > > Anyway, from a runner perspective, we will have kind of API (part of
> > the
> > > > Runner API) to "orchestrate" the SDF as we discussed during the call,
> > > > right ?
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:
> > > > > Hi Aljoscha,
> > > > > This is an excellent question! And the answer is, we don't need any
> > new
> > > > > concepts like "SDF executor" and can rely on the per-key state and
> > > timers
> > > > > machinery that already exists in all runners because it's necessary
> > to
> > > > > implement windowing/triggering properly.
> > > > >
> > > > > Note that this is already somewhat addressed in the previously
> posted
> > > > State
> > > > > and Timers proposal https://s.apache.org/beam-state , under
> "per-key
> > > > > workflows".
> > > > >
> > > > > Think of it this way, using the Kafka example: we'll expand it
> into a
> > > > > transform:
> > > > >
> > > > > (1) ParDo { topic -> (unique key, topic, partition, [0, inf))) for
> > > > > partition in topic.listPartitions() }
> > > > > (2) GroupByKey
> > > > > (3) ParDo { key, topic, partition, R -> Kafka reader code in the
> > > > > proposal/slides }
> > > > >   - R is the OffsetRange restriction which in this case will be
> > always
> > > of
> > > > > the form [startOffset, inf).
> > > > >   - there'll be just 1 value per key, but we use GBK to just get
> > access
> > > > to
> > > > > the per-key state/timers machinery. This may be runner-specific;
> > maybe
> > > > some
> > > > > runners don't need a GBK to do that.
> > > > >
> > > > > Now suppose the topic has two partitions, P1 and P2, and they get
> > > > assigned
> > > > > unique keys K1, K2.
> > > > > Then the input to (3) will be a collection of: (K1, topic, P1, [0,
> > > inf)),
> > > > > (K2, topic, P2, [0, inf)).
> > > > > Suppose we have just 1 worker with 

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-29 Thread Aljoscha Krettek
Hi,
I have another question about this: currently, unbounded sources have
special logic for determining the watermark and the system periodically
asks the sources for the current watermark. As I understood it, watermarks
are only "generated" at the sources. How will this work when sources are
implemented as a combination of DoFns and SplittableDoFns? Will
SplittableDoFns be asked for a watermark, does this mean that watermarks
can then be "generated" at any operation?

Cheers,
Aljoscha

On Sun, 21 Aug 2016 at 22:34 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> Hi JB,
>
> Yes, I'm assuming you're referring to the "magic" part on the transform
> expansion diagram. This is indeed runner-specific, and timers+state are
> likely the simplest way to do this for an SDF that does unbounded amount of
> work.
>
> On Sun, Aug 21, 2016 at 12:14 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Anyway, from a runner perspective, we will have kind of API (part of the
> > Runner API) to "orchestrate" the SDF as we discussed during the call,
> > right ?
> >
> > Regards
> > JB
> >
> > On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:
> > > Hi Aljoscha,
> > > This is an excellent question! And the answer is, we don't need any new
> > > concepts like "SDF executor" and can rely on the per-key state and
> timers
> > > machinery that already exists in all runners because it's necessary to
> > > implement windowing/triggering properly.
> > >
> > > Note that this is already somewhat addressed in the previously posted
> > State
> > > and Timers proposal https://s.apache.org/beam-state , under "per-key
> > > workflows".
> > >
> > > Think of it this way, using the Kafka example: we'll expand it into a
> > > transform:
> > >
> > > (1) ParDo { topic -> (unique key, topic, partition, [0, inf))) for
> > > partition in topic.listPartitions() }
> > > (2) GroupByKey
> > > (3) ParDo { key, topic, partition, R -> Kafka reader code in the
> > > proposal/slides }
> > >   - R is the OffsetRange restriction which in this case will be always
> of
> > > the form [startOffset, inf).
> > >   - there'll be just 1 value per key, but we use GBK to just get access
> > to
> > > the per-key state/timers machinery. This may be runner-specific; maybe
> > some
> > > runners don't need a GBK to do that.
> > >
> > > Now suppose the topic has two partitions, P1 and P2, and they get
> > assigned
> > > unique keys K1, K2.
> > > Then the input to (3) will be a collection of: (K1, topic, P1, [0,
> inf)),
> > > (K2, topic, P2, [0, inf)).
> > > Suppose we have just 1 worker with just 1 thread. Now, how will this
> > thread
> > > be able to produce elements from both P1 and P2? here's how.
> > >
> > > The thread will process (K1, topic, P1, [0, inf)), checkpoint after a
> > > certain time or after a certain number of elements are output (just
> like
> > > with the current UnboundedSource reading code) producing a residual
> > > restriction R1' (basically a new start timestamp), put R11 into the
> > per-key
> > > state and set a timer T1 to resume.
> > > Then it will process (K2, topic, P2, [0, inf)), do the same producing a
> > > residual restriction R2' and setting a timer T2 to resume.
> > > Then timer T1 will fire in the context of the key K1. The thread will
> > call
> > > processElement again, this time supplying R1' as the restriction; the
> > > process repeats and after a while it checkpoints and stores R1'' into
> > state
> > > of K1.
> > > Then timer T2 will fire in the context of K2, run processElement for a
> > > while, set a new timer and store R2'' into the state of K2.
> > > Etc.
> > > If partition 1 goes away, the processElement call will return "do not
> > > resume", so a timer will not be set and instead the state associated
> with
> > > K1 will be GC'd.
> > >
> > > So basically it's almost like cooperative thread scheduling: things run
> > for
> > > a while, until the runner tells them to checkpoint, then they set a
> timer
> > > to resume themselves, and the runner fires the timers, and the process
> > > repeats. And, again, this only requires things that runners can already
> > do
> > > - state and timers, but no new concept of SDF executor (and
> consequently
> > no
> > > necessity to

Re: Remove legacy import-order?

2016-08-24 Thread Aljoscha Krettek
+1 on the import order

+1 on also starting a discussion about enforced formatting

On Wed, 24 Aug 2016 at 06:43 Jean-Baptiste Onofré  wrote:

> Agreed.
>
> It makes sense for the import order.
>
> Regards
> JB
>
> On 08/24/2016 02:32 AM, Ben Chambers wrote:
> > I think introducing formatting should be a separate discussion.
> >
> > Regarding the import order: this PR demonstrates the change
> > https://github.com/apache/incubator-beam/pull/869
> >
> > I would need to update the second part (applying optimize imports) prior
> to
> > actually merging.
> >
> > On Tue, Aug 23, 2016 at 5:08 PM Eugene Kirpichov
> >  wrote:
> >
> >> Two cents: While we're at it, we could consider enforcing formatting as
> >> well (https://github.com/google/google-java-format). That's a bigger
> >> change
> >> though, and I don't think it has checkstyle integration or anything like
> >> that.
> >>
> >> On Tue, Aug 23, 2016 at 4:54 PM Dan Halperin
> 
> >> wrote:
> >>
> >>> yeah I think that we would be SO MUCH better off if we worked with an
> >>> out-of-the-box IDE. We don't even distribute an IntelliJ/Eclipse config
> >>> file right now, and I'd like to not have to.
> >>>
> >>> But, ugh, it will mess up ongoing PRs. I guess committers could fix
> them
> >> in
> >>> merge, or we could just make proposers rebase. (Since committers are
> most
> >>> proposers, probably little harm in the latter).
> >>>
> >>> On Tue, Aug 23, 2016 at 4:11 PM, Jesse Anderson  >
> >>> wrote:
> >>>
>  Please. That's the one that always trips me up.
> 
>  On Tue, Aug 23, 2016, 4:10 PM Ben Chambers 
> >> wrote:
> 
> > When Beam was contributed it inherited an import order [1] that was
>  pretty
> > arbitrary. We've added org.apache.beam [2], but continue to use this
> > ordering.
> >
> > Both Eclipse and IntelliJ default to grouping imports into alphabetic
> > order. I think it would simplify development if we switched our
>  checkstyle
> > ordering to agree with these IDEs. This also removes special
> >> treatment
>  for
> > specific packages.
> >
> > If people agree, I'll send out a PR that changes the checkstyle
> > configuration and runs IntelliJ's sort-imports on the existing files.
> >
> > -- Ben
> >
> > [1]
> > org.apache.beam,com.google,android,com,io,Jama,junit,net,
>  org,sun,java,javax
> > [2] com.google,android,com,io,Jama,junit,net,org,sun,java,javax
> >
> 
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-21 Thread Aljoscha Krettek
Hi Eugene,
thanks for the long description! With the interleaving of execution it
completely makes sense.

Best,
Aljoscha

On Sun, 21 Aug 2016 at 21:14 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

> Anyway, from a runner perspective, we will have kind of API (part of the
> Runner API) to "orchestrate" the SDF as we discussed during the call,
> right ?
>
> Regards
> JB
>
> On 08/21/2016 07:24 PM, Eugene Kirpichov wrote:
> > Hi Aljoscha,
> > This is an excellent question! And the answer is, we don't need any new
> > concepts like "SDF executor" and can rely on the per-key state and timers
> > machinery that already exists in all runners because it's necessary to
> > implement windowing/triggering properly.
> >
> > Note that this is already somewhat addressed in the previously posted
> State
> > and Timers proposal https://s.apache.org/beam-state , under "per-key
> > workflows".
> >
> > Think of it this way, using the Kafka example: we'll expand it into a
> > transform:
> >
> > (1) ParDo { topic -> (unique key, topic, partition, [0, inf))) for
> > partition in topic.listPartitions() }
> > (2) GroupByKey
> > (3) ParDo { key, topic, partition, R -> Kafka reader code in the
> > proposal/slides }
> >   - R is the OffsetRange restriction which in this case will be always of
> > the form [startOffset, inf).
> >   - there'll be just 1 value per key, but we use GBK to just get access
> to
> > the per-key state/timers machinery. This may be runner-specific; maybe
> some
> > runners don't need a GBK to do that.
> >
> > Now suppose the topic has two partitions, P1 and P2, and they get
> assigned
> > unique keys K1, K2.
> > Then the input to (3) will be a collection of: (K1, topic, P1, [0, inf)),
> > (K2, topic, P2, [0, inf)).
> > Suppose we have just 1 worker with just 1 thread. Now, how will this
> thread
> > be able to produce elements from both P1 and P2? here's how.
> >
> > The thread will process (K1, topic, P1, [0, inf)), checkpoint after a
> > certain time or after a certain number of elements are output (just like
> > with the current UnboundedSource reading code) producing a residual
> > restriction R1' (basically a new start timestamp), put R11 into the
> per-key
> > state and set a timer T1 to resume.
> > Then it will process (K2, topic, P2, [0, inf)), do the same producing a
> > residual restriction R2' and setting a timer T2 to resume.
> > Then timer T1 will fire in the context of the key K1. The thread will
> call
> > processElement again, this time supplying R1' as the restriction; the
> > process repeats and after a while it checkpoints and stores R1'' into
> state
> > of K1.
> > Then timer T2 will fire in the context of K2, run processElement for a
> > while, set a new timer and store R2'' into the state of K2.
> > Etc.
> > If partition 1 goes away, the processElement call will return "do not
> > resume", so a timer will not be set and instead the state associated with
> > K1 will be GC'd.
> >
> > So basically it's almost like cooperative thread scheduling: things run
> for
> > a while, until the runner tells them to checkpoint, then they set a timer
> > to resume themselves, and the runner fires the timers, and the process
> > repeats. And, again, this only requires things that runners can already
> do
> > - state and timers, but no new concept of SDF executor (and consequently
> no
> > necessity to choose/tune how many you need).
> >
> > Makes sense?
> >
> > On Sat, Aug 20, 2016 at 9:46 AM Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> >> Hi,
> >> I have another question that I think wasn't addressed in the meeting. At
> >> least it wasn't mentioned in the notes.
> >>
> >> In the context of replacing sources by a combination of two SDFs, how do
> you
> >> determine how many "SDF executor" instances you need downstream? For the
> >> sake of argument assume that both SDFs are executed with parallelism 1
> (or
> >> one per worker). Now, if you have a file source that reads from a static
> >> set of files the first SDF would emit the filenames while the second SDF
> >> would receive the filenames and emit their contents. This works well and
> >> the downstream SDF can process one filename after the other. Now, think
> of
> >> something like a Kafka source. The first SDF would emit the partitions
> (say
> >> 4 partitions, in this example) and the second SDF would be responsible
&

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-20 Thread Aljoscha Krettek
Hi,
I have another question that I think wasn't addressed in the meeting. At
least it wasn't mentioned in the notes.

In the context of replacing sources by a combination of two SDFs, how do you
determine how many "SDF executor" instances you need downstream? For the
sake of argument assume that both SDFs are executed with parallelism 1 (or
one per worker). Now, if you have a file source that reads from a static
set of files the first SDF would emit the filenames while the second SDF
would receive the filenames and emit their contents. This works well and
the downstream SDF can process one filename after the other. Now, think of
something like a Kafka source. The first SDF would emit the partitions (say
4 partitions, in this example) and the second SDF would be responsible for
reading from a topic and emitting elements. Reading from one topic never
finishes so you can't process the topics in series. I think you would need
to have 4 downstream "SDF executor" instances. The question now is: how do
you determine whether you are in the first or the second situation?

Probably I'm just overlooking something and this is already dealt with
somewhere... :-)

Cheers,
Aljoscha

On Fri, 19 Aug 2016 at 21:02 Ismaël Mejía  wrote:

> Hello,
>
> Thanks for the notes both Dan and Eugene, and for taking the time to do the
> presentation and  answer our questions.
>
> I mentioned the ongoing work on dynamic scaling on Flink because I suppose
> that it will address dynamic rebalancing eventually (there are multiple
> changes going on for dynamic scaling).
>
>
> https://docs.google.com/document/d/1G1OS1z3xEBOrYD4wSu-LuBCyPUWyFd9l3T9WyssQ63w/edit#heading=h.2suhu1fjxmp4
>
> https://lists.apache.org/list.html?d...@flink.apache.org:lte=1M:FLIP-8
>
> Anyway I am far from an expert on flink, but probably the flink guys can
> give their opinion about this and refer to a more precise document than the
> ones I mentioned.
>
> ​Thanks again,
> Ismaël​
>
> On Fri, Aug 19, 2016 at 8:52 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Great summary Eugene and Dan.
> >
> > And thanks again for the details, explanation, and discussion.
> >
> > Regards
> > JB
> >
> >
> > On 08/19/2016 08:16 PM, Eugene Kirpichov wrote:
> >
> >> Thanks for attending, everybody!
> >>
> >> Here are meeting notes (thanks Dan!).
> >>
> >> Q: Will SplittableDoFn enable better repartitioning of the input/output
> >> data?
> >> A: Not really; repartitioning is orthogonal to SDF.
> >>
> >> Current Source API suffers from lack of composition and scalability
> >> because
> >> we treat sources too much as metadata, not enough as data.
> >>
> >> Q(slide with transform expansion): who does the "magic"?
> >> A: The runner. Checkpointing and dynamically splitting restrictions will
> >> require collaboration with the runner.
> >>
> >> Q: How does the runner interact with the DoFn to control the
> restrictions?
> >> Is it related to the centralized job tracker etc.?
> >> A: RestrictionTracker is a simple helper object, that exists purely on
> the
> >> worker while executing a single partition, and interacts with the worker
> >> harness part of the runner. Not to be confused with the centralized job
> >> tracker (master) - completely unrelated. Worker harness, of course,
> >> interacts with the master in some relevant ways (e.g. Dataflow master
> can
> >> tell "you're a straggler, you should split").
> >>
> >> Q: Is this a new DoFn subclass, or how will this integrate with the
> >> existing code?
> >> A: It's a feature of reflection-based DoFn (
> https://s.apache.org/a-new-do
> >> fn)
> >> - just another optional parameter of type RestrictionTracker to
> >> processElement() which is dynamically bound via reflection, so fully
> >> backward/forward compatible, and looks to users like a regular DoFn.
> >>
> >> Q: why is fractionClaimed a double?
> >> A: It's a number in 0..1. Job-wide magic (e.g. autoscaling, dynamic
> >> rebalancing) requires a uniform way to represent progress through
> >> different
> >> sources.
> >>
> >> Q: Spark runner is microbatch-based, so this seems to map well onto
> >> checkpoint/resume, right?
> >> A: Yes; actually the Dataflow runner is, at a worker level, also
> >> microbatch-based. The way SDF interacts with a runner will be very
> similar
> >> to how a Bounded/UnboundedSource interacts with a runner.
> >>
> >> Q: Using SDF, what would be the "packaging" of the IO?
> >> A: Same as currently: package IO's as PTransforms and their
> implementation
> >> under the hood can be anything: Source, simple ParDo's, SDF, etc. E.g.
> >> Datastore was recently refactored from BoundedSource to ParDo (ended up
> >> simpler and more scalable), transparently to users.
> >>
> >> Q: What's the timeline; what to do with the IOs currently in
> development?
> >> A: Timeline is O(months). Keep doing what you're doing and working on
> top
> >> of Source APIs when necessary and simple ParDo's otherwise.
> >>
> >> Q: What's the impact for the 

Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-08 Thread Aljoscha Krettek
Jip, thanks, that answers it.

On Fri, 5 Aug 2016 at 19:51 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> Hi Aljoscha,
>
> AFAIK, the effect of .requiresDeduping() is that the runner inserts a
> GBK/dedup transform on top of the read. This seems entirely compatible with
> SDF, except it will be decoupled from the SDF itself: if an SDF produces
> output that potentially contains duplicates, and there's no easy way to fix
> it in the SDF itself, and you (developer of the connector) would like to
> eliminate them, you can explicitly compose the SDF with a canned deduping
> transform. Does this address your question?
>
> Thanks!
>
> On Fri, Aug 5, 2016 at 7:14 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I really like the proposal, especially how it unifies a lot of things.
> >
> > One question: How would this work with sources that (right now) return
> true
> > from UnboundedSource.requiresDeduping(). As I understand it the code that
> > executes such sources has to do bookkeeping to ensure that we don't get
> > duplicate values. Would we add such a feature for the output of DoFns or
> > would we work towards removing the deduping functionality from Beam and
> > push it into the source implementations?
> >
> > Cheers,
> > Aljoscha
> >
> > On Fri, 5 Aug 2016 at 03:27 Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
> >
> > > By the way I like the use cases you are introducing: we discussed about
> > > similar use cases with Dan.
> > >
> > > Just wonder about the existing IO.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On August 4, 2016 7:46:14 PM Eugene Kirpichov
> > > <kirpic...@google.com.INVALID> wrote:
> > >
> > > > Hello Beam community,
> > > >
> > > > We (myself, Daniel Mills and Robert Bradshaw) would like to propose
> > > > "Splittable DoFn" - a major generalization of DoFn, which allows
> > > processing
> > > > of a single element to be non-monolithic, i.e. checkpointable and
> > > > parallelizable, as well as doing an unbounded amount of work per
> > element.
> > > >
> > > > This allows effectively replacing the current Bounded/UnboundedSource
> > > APIs
> > > > with DoFn's that are much easier to code, more scalable and
> composable
> > > with
> > > > the rest of the Beam programming model, and enables many use cases
> that
> > > > were previously difficult or impossible, as well as some non-obvious
> > new
> > > > use cases.
> > > >
> > > > This proposal has been mentioned before in JIRA [BEAM-65] and some
> Beam
> > > > meetings, and now the whole thing is written up in a document:
> > > >
> > > > https://s.apache.org/splittable-do-fn
> > > >
> > > > Here are some things that become possible with Splittable DoFn:
> > > > - Efficiently read a filepattern matching millions of files
> > > > - Read a collection of files that are produced by an earlier step in
> > the
> > > > pipeline (e.g. easily implement a connector to a storage system that
> > can
> > > > export itself to files)
> > > > - Implement a Kafka reader by composing a "list partitions" DoFn
> with a
> > > > DoFn that simply polls a consumer and outputs new records in a
> while()
> > > loop
> > > > - Implement a log tailer by composing a DoFn that incrementally
> returns
> > > new
> > > > files in a directory and a DoFn that tails a file
> > > > - Implement a parallel "count friends in common" algorithm (matrix
> > > > squaring) with good work balancing
> > > >
> > > > Here is the meaningful part of a hypothetical Kafka reader written
> > > against
> > > > this API:
> > > >
> > > > ProcessContinuation processElement(
> > > > ProcessContext context, OffsetRangeTracker tracker) {
> > > >   try (KafkaConsumer<String, String> consumer =
> > > > Kafka.subscribe(context.element().topic,
> > > > context.element().partition)) {
> > > > consumer.seek(tracker.start());
> > > > while (true) {
> > > >   ConsumerRecords<String, String> records =
> > consumer.poll(100ms);
> > > >   if (records == null) return done();
>

Re: [PROPOSAL] Website page or Jira to host all current proposal discussion and docs

2016-08-08 Thread Aljoscha Krettek
Please have a look at this:
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

We recently started using this process in Flink and so far are quite happy
with it.

On Mon, 8 Aug 2016 at 06:52 Jean-Baptiste Onofré  wrote:

> Good point Ben.
>
> I would say a "discussion" Jira can "evolve" to a implementation "Jira"
> (just changing the component).
>
> WDYT ?
>
> Regards
> JB
>
> On 08/08/2016 06:50 AM, Ben Chambers wrote:
> > Would we use the same Jira to track the series of PRs implementing the
> > proposal (if accepted) or would it be discussion only (possibly linked to
> > the implementation tasks)?
> >
> > On Sun, Aug 7, 2016, 9:48 PM Frances Perry 
> wrote:
> >
> >> I'm a huge fan of keeping all the details related to a topic in a
> relevant
> >> jira issue.
> >>
> >> On Sun, Aug 7, 2016 at 9:31 PM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >>> Hi guys,
> >>>
> >>> we have now several technical discussions, sent on the mailing list
> with
> >>> link to document for details.
> >>>
> >>> I think it's not easy for people to follow the different discussions,
> and
> >>> to look for the e-mail containing the document links.
> >>>
> >>> Of course, it's required to have the discussion on the mailing list
> (per
> >>> Apache rules). However, maybe it could be helpful to have a place to
> find
> >>> open discussions, with the link to the mailing list discussion thread,
> >> and
> >>> to the detailed document.
> >>> It could be on the website (but maybe not easy to maintain and
> publish),
> >>> or on Jira (one Jira per discussion), or a wiki.
> >>>
> >>> WDYT ?
> >>>
> >>> Regards
> >>> JB
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbono...@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Proposal: Dynamic PIpelineOptions

2016-08-05 Thread Aljoscha Krettek
+1

It's true that Flink provides a way to pass dynamic parameters to operator
instances. That's not used in any of the built-in sources and operators,
however. They are instantiated with their parameters when the graph is
constructed. So what you are suggesting for Beam would actually provide
more functionality than what we currently have in Flink. :-)

Out of the options I think (4) would be the best. (1) and (2) are not type
safe, correct? and (3) seems very boilerplate-y.

Cheers,
Aljoscha

On Thu, 4 Aug 2016 at 21:53 Frances Perry  wrote:

> +Amit, Aljoscha, Manu
>
> Any comments from folks on the Flink, Spark, or Gearpump runners?
>
> On Tue, Aug 2, 2016 at 11:10 AM, Robert Bradshaw <
> rober...@google.com.invalid> wrote:
>
> > Being able to "late-bind" parameters like input paths to a
> > pre-constructed program would be a very useful feature, and I think is
> > worth adding to Beam.
> >
> > Of the four API proposals, I have a strong preference for (4).
> > Further, it seems that these need not be bound to the PipelineOptions
> > object itself (i.e. a named RuntimeValueSupplier could be constructed
> > off of a pipeline object), which the Python API makes less heavy use
> > of (encouraging the user to use familiar, standard libraries for
> > argument parsing), though of course such integration is useful to
> > provide for convenience.
> >
> > - Robert
> >
> > On Fri, Jul 29, 2016 at 12:14 PM, Sam McVeety 
> > wrote:
> > > During the graph construction phase, the given SDK generates an initial
> > > execution graph for the program.  At execution time, this graph is
> > > executed, either locally or by a service.  Currently, Beam only
> supports
> > > parameterization at graph construction time.  Both Flink and Spark
> supply
> > > functionality that allows a pre-compiled job to be run without SDK
> > > interaction with updated runtime parameters.
> > >
> > > In its current incarnation, Dataflow can read values of PipelineOptions
> > at
> > > job submission time, but this requires the presence of an SDK to
> properly
> > > encode these values into the job.  We would like to build a common
> layer
> > > into the Beam model so that these dynamic options can be properly
> > provided
> > > to jobs.
> > >
> > > Please see
> > > https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_I
> > K1r1YAJ90JG5Fz0_28o/edit
> > > for the high-level model, and
> > > https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMM
> > kOSgGi8ZUH-MOnFatZ8/edit
> > > for
> > > the specific API proposal.
> > >
> > > Cheers,
> > > Sam
> >
>
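As a sketch of what option (4) could look like from the user's side: the ValueProvider name matches what Beam eventually adopted, but at the time of this thread it was only a proposal, so the exact types and methods here are assumptions for illustration.

public interface TemplateOptions extends PipelineOptions {
  ValueProvider<String> getInputPrefix();     // not resolved at graph construction time
  void setInputPrefix(ValueProvider<String> value);
}

class AddPrefixFn extends DoFn<String, String> {
  private final ValueProvider<String> prefix;

  AddPrefixFn(ValueProvider<String> prefix) {
    this.prefix = prefix;                     // only the (serializable) provider is captured at construction
  }

  @ProcessElement
  public void process(ProcessContext c) {
    // get() is only meaningful at execution time, once the runtime value has been bound.
    c.output(prefix.get() + c.element());
  }
}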


Re: [PROPOSAL] Splittable DoFn - Replacing the Source API with non-monolithic element processing in DoFn

2016-08-05 Thread Aljoscha Krettek
I really like the proposal, especially how it unifies a lot of things.

One question: How would this work with sources that (right now) return true
from UnboundedSource.requiresDeduping(). As I understand it the code that
executes such sources has to do bookkeeping to ensure that we don't get
duplicate values. Would we add such a feature for the output of DoFns or
would we work towards removing the deduping functionality from Beam and
push it into the source implementations?

Cheers,
Aljoscha

On Fri, 5 Aug 2016 at 03:27 Jean-Baptiste Onofré  wrote:

> By the way I like the use cases you are introducing: we discussed about
> similar use cases with Dan.
>
> Just wonder about the existing IO.
>
> Regards
> JB
>
>
> On August 4, 2016 7:46:14 PM Eugene Kirpichov
>  wrote:
>
> > Hello Beam community,
> >
> > We (myself, Daniel Mills and Robert Bradshaw) would like to propose
> > "Splittable DoFn" - a major generalization of DoFn, which allows
> processing
> > of a single element to be non-monolithic, i.e. checkpointable and
> > parallelizable, as well as doing an unbounded amount of work per element.
> >
> > This allows effectively replacing the current Bounded/UnboundedSource
> APIs
> > with DoFn's that are much easier to code, more scalable and composable
> with
> > the rest of the Beam programming model, and enables many use cases that
> > were previously difficult or impossible, as well as some non-obvious new
> > use cases.
> >
> > This proposal has been mentioned before in JIRA [BEAM-65] and some Beam
> > meetings, and now the whole thing is written up in a document:
> >
> > https://s.apache.org/splittable-do-fn
> >
> > Here are some things that become possible with Splittable DoFn:
> > - Efficiently read a filepattern matching millions of files
> > - Read a collection of files that are produced by an earlier step in the
> > pipeline (e.g. easily implement a connector to a storage system that can
> > export itself to files)
> > - Implement a Kafka reader by composing a "list partitions" DoFn with a
> > DoFn that simply polls a consumer and outputs new records in a while()
> loop
> > - Implement a log tailer by composing a DoFn that incrementally returns
> new
> > files in a directory and a DoFn that tails a file
> > - Implement a parallel "count friends in common" algorithm (matrix
> > squaring) with good work balancing
> >
> > Here is the meaningful part of a hypothetical Kafka reader written
> against
> > this API:
> >
> > ProcessContinuation processElement(
> > ProcessContext context, OffsetRangeTracker tracker) {
> >   try (KafkaConsumer<String, String> consumer =
> > Kafka.subscribe(context.element().topic,
> > context.element().partition)) {
> > consumer.seek(tracker.start());
> > while (true) {
> >   ConsumerRecords<String, String> records = consumer.poll(100ms);
> >   if (records == null) return done();
> >   for (ConsumerRecord<String, String> record : records) {
> > if (!tracker.tryClaim(record.offset())) {
> >   return
> resume().withFutureOutputWatermark(record.timestamp());
> > }
> > context.output(record);
> >   }
> > }
> >   }
> > }
> >
> > The document describes in detail the motivations behind this feature, the
> > basic idea and API, open questions, and outlines an incremental delivery
> > plan.
> >
> > The proposed API builds on the reflection-based new DoFn [new-do-fn] and
> is
> > loosely related to "State and Timers for DoFn" [beam-state].
> >
> > Please take a look and comment!
> >
> > Thanks.
> >
> > [BEAM-65] https://issues.apache.org/jira/browse/BEAM-65
> > [new-do-fn] https://s.apache.org/a-new-do-fn
> > [beam-state] https://s.apache.org/beam-state
>
>
>


Re: [PROPOSAL] Pipeline Runner API design doc

2016-08-02 Thread Aljoscha Krettek
Hi,
thanks for putting this together. Now that I'm seeing them side by side I
think the Avro schema looks a lot nicer than the JSON schema but it's
probably alright since we don't want to change this often (as you already
said). The advantage of JSON is that the (intermediate) plans can easily be
inspected by humans.

I think at this stage there is not much left to discuss on the plan
representation. To me it seems pretty straightforward what has to be in
there and that is already more or less in. The only real thing missing are
triggers but there isn't yet a discussion about how that is going to work
out, correct?

Cheers,
Aljoscha

On Thu, 14 Jul 2016 at 21:34 Kenneth Knowles  wrote:

> Hi everyone,
>
> I wanted to circle back on this thread and with another invitation to a
> discussion. Work on the high level refactorings to align the Java SDK with
> the primitives of the proposed model is pretty far along, as is moving out
> the stuff that we don't want in the user-facing SDK.
>
> Since our runners are all Java-based, and we tend to discuss the model in
> Java first, I think part of the proposal that may have received less
> attention was the concrete Avro schema towards the bottom of the doc. Since
> our serialization tech discussion seemed to favor JSON on the front end, I
> just spent a few minutes to port the Avro schema to a JSON schema and do
> some project set up to demonstrate where & how it would incorporate into
> the project structure. I'd done the same for Avro previously, so we can see
> how they compare.
>
> I put the code in a PR, for discussion only at this point, at
> https://github.com/apache/incubator-beam/pull/662. I'd love if you took a
> look at the notes on the PR and briefly at the schema; I'll continue to
> evolve it according to current & future feedback.
>
> Kenn
>
> On Wed, Mar 23, 2016 at 2:17 PM, Kenneth Knowles  wrote:
>
> > Hi everyone,
> >
> > Incorporating the feedback from the 1-pager I circulated a week ago, I
> > have put together a concrete design document for the new API(s).
> >
> >
> >
> https://docs.google.com/document/d/1bao-5B6uBuf-kwH1meenAuXXS0c9cBQ1B2J59I3FiyI/edit?usp=sharing
> >
> > I appreciate any and all feedback on the design.
> >
> > Kenn
> >
>


Re: [RESULT] Release version 0.2.0-incubating

2016-07-31 Thread Aljoscha Krettek
Ah, it seems I always have to mention it? I would also make mine "+1
(binding)"

On Sun, 31 Jul 2016 at 12:47 Dan Halperin <dhalp...@google.com.invalid>
wrote:

> My apologies: a slight revision. We have 4 approving votes, including 3
> binding votes.
>
> On Sun, Jul 31, 2016 at 12:29 PM, Dan Halperin <dhalp...@google.com>
> wrote:
>
> > I'm happy to announce that we have unanimously approved this release.
> >
> > There are 3 binding approving votes:
> > * Dan Halperin
> > * Jean-Baptiste Onofré
> > * Amit Sela
> >
> > There is a fourth approving vote:
>
> > * Aljoscha Krettek (not binding)
> >
>
>
> > There are no disapproving votes.
> >
> > At this point, this proposal will be presented to the Apache Incubator
> for
> > their review.
> >
> > Thanks,
> > Dan
> >
>


Re: [VOTE] Release version 0.2.0-incubating

2016-07-31 Thread Aljoscha Krettek
+1

I checked:
- MD5 and SHA1 are good
- read DISCLAIMER, NOTICE, README.md and README of Flink Runnner
- source distribution builds with "mvn clean verify"
- tested the built-in generic org.apache.beam.examples.WordCount with a
local Flink cluster
- files names have "incubating"
- source release contains no binaries
- sources files have license headers (I'm relying on the rat plugin here,
though)

(As I said above I fixed the "fixed version" tag on two issues.)


On Sun, 31 Jul 2016 at 08:22 Aljoscha Krettek <aljos...@apache.org> wrote:

> Just a quick note, these two were not fixed for 0.2.0:
>  - [BEAM-478 <https://issues.apache.org/jira/browse/BEAM-478>] - Create
> Vector, Matrix types and operations to enable linear algebra API
>  - [BEAM-322 <https://issues.apache.org/jira/browse/BEAM-322>] - Compare
> encoded keys in streaming mode
>
> I removed the "fix version" of 0.2.0-incubating from them, that should fix
> the release notes.
>
>
> On Thu, 28 Jul 2016 at 06:55 Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>
>> Just one note.
>>
>> The key used to sign the artefacts is present in the Beam KEYS file
>> located:
>>
>> https://dist.apache.org/repos/dist/release/incubator/beam/KEYS
>>
>> You can find the 0.2.0-incubating source distribution here:
>>
>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip
>>
>> with the corresponding MD5, SHA1, and ASC files:
>>
>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.asc
>>
>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.md5
>>
>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.sha1
>>
>> Regards
>> JB
>>
>> On 07/28/2016 02:57 PM, Jean-Baptiste Onofré wrote:
>> > +1 (binding)
>> >
>> > I checked:
>> > - artefact names contain incubating
>> > - signatures and hashes good
>> > - DISCLAIMER exists
>> > - LICENSE file looks good
>> > - NOTICE file looks good
>> > - Source files have ASF headers
>> > - source distribution exists
>> > - Tested with GDELT samples and wordcount with Direct, Flink, and Spark
>> > runner (I plan to run on Google Dataflow asap).
>> >
>> > Regards
>> > JB
>> >
>> > On 07/28/2016 11:51 AM, Dan Halperin wrote:
>> >> Hey folks!
>> >>
>> >> I'm excited to be kicking off the first vote for the second release of
>> >> Apache Beam: version 0.2.0-incubating!
>> >>
>> >> As with 0.1.0-incubating, we are not looking for any specific new
>> >> functionality. Instead, we're
>> >> continuing to execute and refine the release process, as well as making
>> >> stable source code
>> >> and binary artifacts available for our users.
>> >>
>> >> The complete staging area is available for your review, which includes:
>> >> * the official Apache source release to be deployed to dist.apache.org
>> >> [1],
>> >> and
>> >> * all artifacts to be deployed to the Maven Central Repository [2].
>> >>
>> >> This corresponds to the tag "v0.2.0-incubating-RC2" in source control,
>> >> [3].
>> >>
>> >> New for this release: Release notes are available in JIRA [4].
>> >>
>> >> Please vote as follows:
>> >> [ ] +1, Approve the release
>> >> [ ] -1, Do not approve the release (please provide specific comments)
>> >>
>> >> Thanks,
>> >> Dan
>> >>
>> >> As a reminder for those of us still learning the Apache way, the
>> >> release checklist
>> >> is
>> >> here [5]. This is a "package release"-type of the Apache voting process
>> >> [6]. As
>> >> customary, the vote will be open for 72 hours. It is adopted by
>> majority
>> >> approval with at
>> >> least 3 PPMC affirmative votes. If approved, the proposal will be
>> >> presented
>> >> to the
>> >> Apache Incubator for their review.
>> >>
>> >> [1]
>> >>
>> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip
>> >>
>> >> [2]
>> >> https://repository.apache.org/content/repositories/orgapachebeam-1004/
>> >> [3]
>> https://github.com/apache/incubator-beam/tree/v0.2.0-incubating-RC2
>> >> [4]
>> >>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12335766
>> >>
>> >> [5]
>> http://incubator.apache.org/guides/releasemanagement.html#check-list
>> >> [6] http://www.apache.org/foundation/voting.html
>> >>
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Re: [VOTE] Release version 0.2.0-incubating

2016-07-31 Thread Aljoscha Krettek
Just a quick note, these two were not fixed for 0.2.0:
 - [BEAM-478 <https://issues.apache.org/jira/browse/BEAM-478>] - Create
Vector, Matrix types and operations to enable linear algebra API
 - [BEAM-322 <https://issues.apache.org/jira/browse/BEAM-322>] - Compare
encoded keys in streaming mode

I removed the "fix version" of 0.2.0-incubating from them, that should fix
the release notes.


On Thu, 28 Jul 2016 at 06:55 Jean-Baptiste Onofré  wrote:

> Just one note.
>
> The key used to sign the artefacts is present in the Beam KEYS file
> located:
>
> https://dist.apache.org/repos/dist/release/incubator/beam/KEYS
>
> You can find the 0.2.0-incubating source distribution here:
>
>
> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip
>
> with the corresponding MD5, SHA1, and ASC files:
>
>
> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.asc
>
>
> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.md5
>
>
> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip.sha1
>
> Regards
> JB
>
> On 07/28/2016 02:57 PM, Jean-Baptiste Onofré wrote:
> > +1 (binding)
> >
> > I checked:
> > - artefact names contain incubating
> > - signatures and hashes good
> > - DISCLAIMER exists
> > - LICENSE file looks good
> > - NOTICE file looks good
> > - Source files have ASF headers
> > - source distribution exists
> > - Tested with GDELT samples and wordcount with Direct, Flink, and Spark
> > runner (I plan to run on Google Dataflow asap).
> >
> > Regards
> > JB
> >
> > On 07/28/2016 11:51 AM, Dan Halperin wrote:
> >> Hey folks!
> >>
> >> I'm excited to be kicking off the first vote for the second release of
> >> Apache Beam: version 0.2.0-incubating!
> >>
> >> As with 0.1.0-incubating, we are not looking for any specific new
> >> functionality. Instead, we're
> >> continuing to execute and refine the release process, as well as making
> >> stable source code
> >> and binary artifacts available for our users.
> >>
> >> The complete staging area is available for your review, which includes:
> >> * the official Apache source release to be deployed to dist.apache.org
> >> [1],
> >> and
> >> * all artifacts to be deployed to the Maven Central Repository [2].
> >>
> >> This corresponds to the tag "v0.2.0-incubating-RC2" in source control,
> >> [3].
> >>
> >> New for this release: Release notes are available in JIRA [4].
> >>
> >> Please vote as follows:
> >> [ ] +1, Approve the release
> >> [ ] -1, Do not approve the release (please provide specific comments)
> >>
> >> Thanks,
> >> Dan
> >>
> >> As a reminder for those of us still learning the Apache way, the
> >> release checklist
> >> is
> >> here [5]. This is a "package release"-type of the Apache voting process
> >> [6]. As
> >> customary, the vote will be open for 72 hours. It is adopted by majority
> >> approval with at
> >> least 3 PPMC affirmative votes. If approved, the proposal will be
> >> presented
> >> to the
> >> Apache Incubator for their review.
> >>
> >> [1]
> >>
> https://repository.apache.org/content/repositories/orgapachebeam-1004/org/apache/beam/beam-parent/0.2.0-incubating/beam-parent-0.2.0-incubating-source-release.zip
> >>
> >> [2]
> >> https://repository.apache.org/content/repositories/orgapachebeam-1004/
> >> [3] https://github.com/apache/incubator-beam/tree/v0.2.0-incubating-RC2
> >> [4]
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12335766
> >>
> >> [5]
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> >> [6] http://www.apache.org/foundation/voting.html
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] State and Timers for DoFn (aka per-key workflows)

2016-07-29 Thread Aljoscha Krettek
+1 Very nice proposal and the API already looks very good. I guess the only
thing people still like to discuss on this is naming of things. :-)

I just have one general remark about giving users access to state and
timers. The Beam model takes great care to mostly shield users from the
reality of out-of-order events. The windowing mostly deals with this
internally and the watermarks provide some level of completeness
guarantees. If users directly modify their state based on each arriving
element they might run into problems if they don't take into account that
elements can (will) arrive out-of-order. For example, let's say they have
three types of event: "start", "in-between", and "end". In the state
machine they probably assume that the "start" event will arrive first and
that the "end" event will arrive last. Due to slowdowns anywhere in the
system they might not arrive in that order, however, and the state machine
will trip up. This is an artificial example but I imagine there could be
real-world cases where this plays a role.

Do we have any ideas on mitigating those kinds of problems or will we rely
on users properly understanding that this could happen in their pipeline?

Cheers,
Aljoscha

On Wed, 27 Jul 2016 at 05:20 Kenneth Knowles  wrote:

> Hi everyone,
>
>
> I would like to offer a proposal for a much-requested feature in Beam:
> Stateful processing in a DoFn. Please check out and comment on the proposal
> at this URL:
>
>
>   https://s.apache.org/beam-state
>
>
> This proposal includes user-facing APIs for persistent state and timers.
> Together, these provide rich capabilities that have been called "per-key
> workflows", the subject of [BEAM-23].
>
>
> Note that this proposal has an important prerequisite: a new design for
> DoFn. The new DoFn is strongly motivated by this design for state and
> timers, but we should discuss it separately. I will start a separate thread
> for that.
>
>
> On this email thread, I'd like to try to focus the discussion on state &
> timers. And of course, please do comment on the particulars in the
> document.
>
>
> Kenn
>
>
> [BEAM-23] https://issues.apache.org/jira/browse/BEAM-23
>
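One way pipeline authors can defend against the reordering described above is to buffer early arrivals in state until the expected "start" event shows up. A minimal sketch, assuming the state annotations and StateSpecs helpers from the proposal (the exact names were still under discussion at this point):

class StartEndFn extends DoFn<KV<String, String>, String> {
  @StateId("seenStart")
  private final StateSpec<ValueState<Boolean>> seenStartSpec =
      StateSpecs.value(BooleanCoder.of());

  @StateId("early")
  private final StateSpec<BagState<String>> earlySpec =
      StateSpecs.bag(StringUtf8Coder.of());

  @ProcessElement
  public void process(ProcessContext c,
      @StateId("seenStart") ValueState<Boolean> seenStart,
      @StateId("early") BagState<String> early) {
    String event = c.element().getValue();
    if ("start".equals(event)) {
      seenStart.write(true);
      for (String e : early.read()) {   // replay anything that arrived before "start"
        c.output(e);
      }
      early.clear();
    } else if (Boolean.TRUE.equals(seenStart.read())) {
      c.output(event);                  // normal, in-order case
    } else {
      early.add(event);                 // out-of-order arrival: hold it until "start" shows up
    }
  }
}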


Re: [PROPOSAL] A brand new DoFn

2016-07-28 Thread Aljoscha Krettek
+1

At first I liked the API but was skeptical because I thought that this would
require reflective invocation. Then I read on and saw that code generation
is used and was convinced. :-)

I especially like how it both cleans up the API and allows more
optimizations in the future, especially with side inputs and the different
methods for emitting.

On Wed, 27 Jul 2016 at 06:49 Jean-Baptiste Onofré  wrote:

>
>
> +1
> I like the proposal and great description.
> Thanks. Regards, JB
>
>  Original message 
> From: Kenneth Knowles 
> Date: 27/07/2016  05:29  (GMT+01:00)
> To: dev@beam.incubator.apache.org
> Subject: [PROPOSAL] A brand new DoFn
>
> Hi all,
>
> I have a major new feature to propose: the next generation of DoFn.
>
> It sounds a bit grandiose, but I think it is the best way to understand the
> proposal.
>
> This is strongly motivated by the design for state and timers, aka "per-key
> workflows". Since the two features are separable and have separate design
> docs, I have started a separate thread for each.
>
> To get a quick overview of the proposal for a new DoFn, and how it improves
> upon the flexibility and validation of DoFn, browse this presentation:
>
>   https://s.apache.org/presenting-a-new-dofn
>
> Due to the extent of this proposal, Ben & I have also prepared an in-depth
> document at https://s.apache.org/a-new-dofn with additional details.
> Please
> comment on particulars there, or just reply to this email.
>
> The remainder of this email is yet another summary of the proposal, to
> entice you to read the documents above and respond with a "+1".
>
> This is a feature that has been an experimental feature of the Java SDK for
> some time, under the name DoFnWithContext. For the purposes of this email
> and the linked documents, I will call it NewDoFn and I will call the status
> quo OldDoFn.
>
> The differences between NewDoFn and and OldDoFn are most easily understood
> with a quick code snippet:
>
> new OldDoFn() {
>   @Override
>   public void processElement(ProcessContext c) { … }
> }
>
> new NewDoFn() {
>   @ProcessElement   // <-- This is the only difference
>   public void processElement(ProcessContext c) { … }
> }
>
> What changed? NewDoFn uses annotation-based dispatch instead of method
> overrides. The method annotated with @ProcessElement is used to process
> elements. It can have any name or signature, and validation is performed at
> pipeline construction time.
>
> Why do this? It allows the argument list for processElement to change. This
> approach gives NewDoFn many advantages, which are demonstrated at length in
> the linked documents. Here are some highlights:
>
>  - Simpler backwards-compatible approaches to new features
>  - Simpler composition of advanced features
>  - Greater pipeline construction-time validation
>  - Easier evolution of a simple anonymous DoFn into one that uses advanced
> features
>
> Here are some abbreviated demonstrations of things that work today or could
> work easily with NewDoFn but require complex interrelated designs without
> it:
>
> Access the element's window:
>
> new NewDoFn() {
>   @ProcessElement
>   public void processElement(ProcessContext c, BoundedWindow w) { … }
> }
>
> Use persistent state:
>
> new NewDoFn() {
>   @ProcessElement
>   public void processElement(
>   ProcessContext c,
>   @StateId("cell-id") ValueState state) {
> …
>   }
> }
>
> Set and receive timers:
>
> new NewDoFn() {
>   @ProcessElement
>   public void processElement(
>   ProcessContext c,
>   @TimerId("timer-id") Timer state) {
> …
>   }
>
>   @OnTimer("timer-id")
>   void onMyTimer(OnTimerContext) { … }
> }
>
> Receive a side input as a parameter:
>
> new NewDoFn() {
>   @ProcessElement
>   public void processElement(
>   ProcessContext c,
>   @SideInput Supplier side) {
> …
>   }
> }
>
> So this is what I am proposing: We should move the Beam Java SDK to
> NewDoFn!
>
> My proposed migration plan is:
>
> 1. leave a git tag before anything, so users can pin to it
> 2. mv DoFn OldDoFn && mv DoFnWithContext DoFn
> 3. get everything working with all runners
> 4. rm OldDoFn # a few weeks later
>
> This will affect bleeding edge users, who will need to replace @Override
> with @ProcessElement in all their DoFns. They can also pin to a commit
> prior to the change or temporarily replace DoFn with OldDoFn everywhere.
>
> I've already done step 2 in a branch at
> https://github.com/kennknowles/incubator-beam/DoFnWithContext and ported a
> few examples in their own commits. If you view those commits, you can see
> how simple the migration path is.
>
> Please let me know what you think. It is a big change, but one that I think
> yields pretty nice rewards.
>
> Kenn
>

Re: Beam/Flink : State access

2016-07-26 Thread Aljoscha Krettek
Hi,
the purpose of Beam is to abstract the user from the underlying execution
engine. IMHO, allowing access to state of the underlying execution engine
will never be a goal for the Beam project.

If you want/need to access Flink state, I think this is a good indicator
that you should use Flink directly because your programs would never run on
another runner anyways.

Cheers,
Aljoscha

On Mon, 25 Jul 2016 at 17:04 Aparup Banerjee (apbanerj) 
wrote:

> I am looking for a way to access streaming engine state (Flink) in my beam
> transformations. I understand this can be accessed only though the runtime
> context. Has any one tried accessing flink runtime context in beam? May be
> roll it up as a custom API of some sort. Might need some changes in
> FlinkRunner is my hunch.
>
> Thanks,
> Aparup
>


Re: Help understand how Flink Runner translate triggering information

2016-07-25 Thread Aljoscha Krettek
Hi,
for that you would have to look at how Combine.PerKey and GroupByKey are
translated. We use a GroupAlsoByWindowViaWindowSetDoFn that internally uses
a ReduceFnRunner to manage all the windowing. The windowing strategy as
well as the SystemReduceFn is passed to
GroupAlsoByWindowViaWindowSetDoFn.create() to create an actual instance of
 GroupAlsoByWindowViaWindowSetDoFn.

Cheers,
Aljoscha

On Mon, 25 Jul 2016 at 17:55 Shen Li  wrote:

> Hi,
>
> I am trying to understand how Flink Runner tells the Flink system about the
> triggers defined using Beam API. In the source code of Flink runner, the
> WindowBoundTranslator passes the windowingStrategy to the
> FlinkParDoBoundWrapper which does not seem to use it? How is the triggering
> information passed to Flink?
>
> Thanks,
>
> Shen
>


Re: [PROPOSAL] CoGBK as primitive transform instead of GBK

2016-07-21 Thread Aljoscha Krettek
+1

Out of curiosity, does Cloud Dataflow have a CoGBK primitive or will it
also be executed as a GBK there?

On Thu, 21 Jul 2016 at 02:29 Kam Kasravi  wrote:

> +1 - awesome Manu.
>
> On Wednesday, July 20, 2016 1:53 PM, Kenneth Knowles
>  wrote:
>
>
>  +1
>
> I assume that the intent is for the semantics of both GBK and CoGBK to be
> unchanged, just swapping their status as primitives.
>
> This seems like a good change, with strictly positive impact on users and
> SDK authors, with only an extremely minor burden (doing an insertion of the
> provided implementation in the worst case) on runner authors.
>
> Kenn
>
>
> On Wed, Jul 20, 2016 at 10:38 AM, Lukasz Cwik 
> wrote:
>
> > I would like to propose a change to Beam to make CoGBK the basis for
> > grouping instead of GBK. The idea behind this proposal is that CoGBK is a
> > more powerful operator than GBK, allowing for two key benefits:
> >
> > 1) SDKs are simplified: transforming a CoGBK into a GBK is trivial while
> > the reverse is not.
> > 2) It will be easier for runners to provide more efficient
> implementations
> > of CoGBK as they will be responsible for the logic which takes their own
> > internal grouping implementation and maps it onto a CoGBK.
> >
> > This requires the following modifications to the Beam code base:
> >
> > 1) Make GBK a composite transform in terms of CoGBK.
> > 2) Move the CoGBK from contrib to runners-core as an adapter*. Runners
> that
> > more naturally support GBK can just use this and everything executes
> > exactly as before.
> >
> > *just like GroupByKeyViaGroupByKeyOnly and UnboundedReadFromBoundedSource
> >
>
>
>
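For illustration, expressing GBK as a composite on top of CoGBK is roughly a tag, a CoGroupByKey, and an unpacking step. This is only a sketch against the existing KeyedPCollectionTuple/CoGroupByKey classes (annotation-style DoFn as in the new-DoFn proposal elsewhere in this archive), not the actual change:

static <K, V> PCollection<KV<K, Iterable<V>>> groupByKeyViaCoGbk(PCollection<KV<K, V>> input) {
  final TupleTag<V> tag = new TupleTag<>();
  return KeyedPCollectionTuple.of(tag, input)
      .apply(CoGroupByKey.<K>create())
      .apply(ParDo.of(new DoFn<KV<K, CoGbkResult>, KV<K, Iterable<V>>>() {
        @ProcessElement
        public void process(ProcessContext c) {
          // With a single tag, the CoGbkResult holds exactly the Iterable a GBK would produce.
          c.output(KV.of(c.element().getKey(), c.element().getValue().getAll(tag)));
        }
      }));
}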


Re: Adding DoFn Setup and Teardown methods

2016-07-18 Thread Aljoscha Krettek
Did you mean "usual" or "useful"? ;-)

On Mon, 18 Jul 2016 at 12:42 Maximilian Michels <m...@apache.org> wrote:

> +1 for setup() and teardown() methods. Very usual for proper initialization
> and cleanup of DoFn related data structures.
>
> On Wed, Jun 29, 2016 at 9:34 PM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > +1 I think some people might already mistake the
> > startBundle()/finishBundle() methods for what the new methods are
> supposed
> > to be
> >
> > On Tue, 28 Jun 2016 at 19:38 Raghu Angadi <rang...@google.com.invalid>
> > wrote:
> >
> > > This is terrific!
> > > Thanks for the proposal.
> > >
> > > On Tue, Jun 28, 2016 at 9:06 AM, Thomas Groh <tg...@google.com.invalid
> >
> > > wrote:
> > >
> > > > Hey Everyone:
> > > >
> > > > We've recently started to be permitted to reuse DoFn instances in
> > > Beam[1].
> > > > Beyond the efficiency gains from not having to deserialize new DoFn
> > > > instances for every bundle, DoFn reuse also provides the ability to
> > > > minimize expensive setup work done per-bundle, which hasn't formerly
> > been
> > > > possible. Additionally, it has also enabled more failure cases, where
> > > > element-based state leaks improperly across bundles.
> > > >
> > > > I've written a document proposing that two methods are added to the
> API
> > > of
> > > > DoFn, setup and teardown, which both provides hooks for users to
> write
> > > > efficient DoFns, as well as signals that DoFns will be reused.
> > > >
> > > > The document is located at
> > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#
> > > > and committers have edit access
> > > >
> > > > Thanks,
> > > >
> > > > Thomas
> > > >
> > > > [1] https://github.com/apache/incubator-beam/pull/419
> > > >
> > >
> >
>
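A rough sketch of the intended usage, with annotation names as they were later adopted in the SDK (at the time of this thread they were still only proposed):

class EnrichFn extends DoFn<String, String> {
  private transient ExecutorService pool;   // expensive, non-serializable resource

  @Setup
  public void setup() {
    // Runs once when the runner creates (or deserializes) this DoFn instance,
    // before any bundle is processed; the pool is then reused across bundles.
    pool = Executors.newFixedThreadPool(4);
  }

  @ProcessElement
  public void process(ProcessContext c) {
    c.output(c.element());
  }

  @Teardown
  public void teardown() {
    // Runs when the runner retires this instance; release whatever setup() acquired.
    pool.shutdown();
  }
}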


Re: Display Data Runner Support

2016-07-04 Thread Aljoscha Krettek
Thanks Scott for this compilation of information! I'll look into how this
can be incorporated into the Flink runner once I have some time on my hands.

On Thu, 30 Jun 2016 at 17:05 Scott Wegner 
wrote:

> Hi Beam Dev community,
>
> I wanted to circle-back on a recent Beam feature, Display Data, which we
> proposed back in March [1] and is now implemented in the Beam SDK. Display
> Data provides a method for Runners to collect additional metadata about a
> pipeline during construction, suitable for display in UI. PipelineOptions,
> PTransforms, and user-defined function types (DoFn, CombineFn, WindowFn)
> will register their display data, and the SDK hooks are provided for users
> to integrate display data from their own components.
>
> Alex Amato and I wrote a blog post describing how Google Dataflow is now
> surfacing display data in its monitoring interface [2]. I encourage other
> Runner authors to take a look and consider how display data could fit into
> your runner. Integrating display data is relatively straightforward as most
> of the heavy-lifting is done in the SDK. The Dataflow Runner collects
> display data during pipeline translation for PipelineOptions [3] and
> PTransforms [4].
>
> Please have a look at the display data API docs [5] and let me know if you
> have any questions.
>
> - Scott
>
> [1]
>
> http://mail-archives.apache.org/mod_mbox/incubator-beam-dev/201603.mbox/raw/%3CCAN-7FgbR%3DyXPHZj-GrPO3aGSkkj11NXwAoyOGEzWc9r3ApnOpg%40mail.gmail.com%3E/1
> [2]
>
> https://cloud.google.com/blog/big-data/2016/06/dataflow-updates-see-more-details-about-your-pipelines
> [3]
>
> https://github.com/apache/incubator-beam/blob/7d767056a90e769eff68d4347e1b3a7bc43f415c/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslator.java#L406
> [4]
>
> https://github.com/apache/incubator-beam/blob/7d767056a90e769eff68d4347e1b3a7bc43f415c/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowPipelineTranslator.java#L548
> [5]
>
> http://beam.incubator.apache.org/javadoc/0.1.0-incubating/org/apache/beam/sdk/transforms/display/HasDisplayData.html#populateDisplayData-org.apache.beam.sdk.transforms.display.DisplayData.Builder-
>
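For transform authors who have not seen the API yet, a minimal sketch of wiring display data into a DoFn, against the HasDisplayData/DisplayData.Builder API linked above:

class RegexFilterFn extends DoFn<String, String> {
  private final String pattern;

  RegexFilterFn(String pattern) {
    this.pattern = pattern;
  }

  @ProcessElement
  public void process(ProcessContext c) {
    if (c.element().matches(pattern)) {
      c.output(c.element());
    }
  }

  @Override
  public void populateDisplayData(DisplayData.Builder builder) {
    super.populateDisplayData(builder);
    // Shows up in a runner's UI next to the step that uses this DoFn.
    builder.add(DisplayData.item("pattern", pattern).withLabel("Filter pattern"));
  }
}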


Re: Adding DoFn Setup and Teardown methods

2016-06-29 Thread Aljoscha Krettek
+1 I think some people might already mistake the
startBundle()/finishBundle() methods for what the new methods are supposed
to be

On Tue, 28 Jun 2016 at 19:38 Raghu Angadi 
wrote:

> This is terrific!
> Thanks for the proposal.
>
> On Tue, Jun 28, 2016 at 9:06 AM, Thomas Groh 
> wrote:
>
> > Hey Everyone:
> >
> > We've recently started to be permitted to reuse DoFn instances in
> Beam[1].
> > Beyond the efficiency gains from not having to deserialize new DoFn
> > instances for every bundle, DoFn reuse also provides the ability to
> > minimize expensive setup work done per-bundle, which hasn't formerly been
> > possible. Additionally, it has also enabled more failure cases, where
> > element-based state leaks improperly across bundles.
> >
> > I've written a document proposing that two methods are added to the API
> of
> > DoFn, setup and teardown, which both provides hooks for users to write
> > efficient DoFns, as well as signals that DoFns will be reused.
> >
> > The document is located at
> >
> >
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#
> > and committers have edit access
> >
> > Thanks,
> >
> > Thomas
> >
> > [1] https://github.com/apache/incubator-beam/pull/419
> >
>


Re: [DISCUSS] Beam data plane serialization tech

2016-06-29 Thread Aljoscha Krettek
My bad, I didn't know that. Thanks for the clarification!

On Wed, 29 Jun 2016 at 16:38 Daniel Kulp <dk...@apache.org> wrote:

>
> > On Jun 27, 2016, at 10:24 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
> >
> > Out of the systems you suggested Thrift and ProtoBuf3 + gRPC are probably
> > best suited for the task. Both of these provide a way for generating
> > serializers as well as for specifying an RPC interface. Avro and
> > FlatBuffers are only dealing in serializers and we would have to roll our
> > own RPC system on top of these.
>
>
> Just a point of clarification, Avro does handle RPC as well as
> serialization.   It's one of the main bullets on their overview page:
>
> http://avro.apache.org/docs/current/index.html
>
> Unfortunately, their documentation around the subject really sucks.  Some
> info at:
>
> https://cwiki.apache.org/confluence/display/AVRO/Porting+Existing+RPC+Frameworks
>
> and a “quick start”:
>
> https://github.com/phunt/avro-rpc-quickstart
>
>
>
> --
> Daniel Kulp
> dk...@apache.org - http://dankulp.com/blog
> Talend Community Coder - http://coders.talend.com
>
>


Re: Improvements to issue/version tracking

2016-06-28 Thread Aljoscha Krettek
+1 The release view and especially the automatic generation of release
notes should come in quite handy.

On Tue, 28 Jun 2016 at 01:01 Davor Bonaci  wrote:

> Hi everyone,
> I'd like to propose a simple change in Beam JIRA that will hopefully
> improve our issue and version tracking -- to actually use the "Fix
> Versions" field as intended [1].
>
> The goal would be to simplify issue tracking, streamline generation of
> release notes, add a view of outstanding work towards a release, and
> clearly communicate which Beam version contains fixes for each issue.
>
> The standard usage of the field is:
> * For open (or in-progress/re-opened) issues, "Fix Versions" field is
> optional and indicates an unreleased version that this issue is targeting.
> The release is not expected to proceed unless this issue is fixed, or the
> field is changed.
> * For closed (or resolved) issues, "Fix Versions" field indicates a
> released or unreleased version that has the fix.
>
> I think the field should be mandatory once the issue is resolved/closed
> [4], so we make a deliberate choice about this. I propose we use "Not
> applicable" for all those issues that aren't being resolved as Fixed (e.g.,
> duplicates, working as intended, invalid, etc.) and those that aren't
> released (e.g., website, build system, etc.).
>
> We can then trivially view outstanding work for the next release [2], or
> generate release notes [3].
>
> I'd love to hear if there are any comments! I know that at least JB agrees,
> as he was convincing me on this -- thanks ;).
>
> Thanks,
> Davor
>
> [1]
>
> https://confluence.atlassian.com/adminjiraserver071/managing-versions-802592484.html
> [2]
>
> https://issues.apache.org/jira/browse/BEAM/fixforversion/12335766/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> [3]
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12335764
> [4] https://issues.apache.org/jira/browse/INFRA-12120
>


Re: [DISCUSS] Beam data plane serialization tech

2016-06-27 Thread Aljoscha Krettek
; code generation of client libraries
> > > HTTP/2
> > >
> > > As for the encoding, any SDK can choose any serialization it wants such
> > as
> > > Kryo but to get interoperability with other languages that would
> require
> > > others to implement parts of the Kryo serialization spec to be able to
> > > interpret the "bytes". Thus certain types like KV & WindowedValue
> should
> > be
> > > encoded in a way which allows for this interoperability.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jun 17, 2016 at 3:20 AM, Amit Sela <amitsel...@gmail.com>
> wrote:
> > >
> > > > +1 on Aljoscha comment, not sure where's the benefit in having a
> > > > "schematic" serialization.
> > > >
> > > > I know that Spark and I think Flink as well, use Kryo
> > > > <https://github.com/EsotericSoftware/kryo> for serialization (to be
> > > > accurate it's Chill <https://github.com/twitter/chill> for Spark)
> and
> > I
> > > > found it very impressive even comparing to "manual" serializations,
> > > >  i.e., it seems to outperform Spark's "native" Encoders (1.6+) for
> > > > primitives..
> > > > In addition it clearly supports Java and Scala, and there are 3rd
> party
> > > > libraries for Clojure and Objective-C.
> > > >
> > > > I guess my bottom-line here agrees with Kenneth - performance and
> > > > interoperability - but I'm just not sure if schema based serializers
> > are
> > > > *always* the fastest.
> > > >
> > > > As for pipeline serialization, since performance is not the main
> issue,
> > > and
> > > > I think usability would be very important, I say +1 for JSON.
> > > >
> > > > For anyone who spent sometime on benchmarking serialization
> libraries,
> > > know
> > > > is the time to speak up ;)
> > > >
> > > > Thanks,
> > > > Amit
> > > >
> > > > On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek <
> aljos...@apache.org
> > >
> > > > wrote:
> > > >
> > > > > Hi,
> > > > > am I correct in assuming that the transmitted envelopes would
> mostly
> > > > > contain coder-serialized values? If so, wouldn't the header of an
> > > > envelope
> > > > > just be the number of contained bytes and number of values? I'm
> > > probably
> > > > > missing something but with these assumptions I don't see the
> benefit
> > of
> > > > > using something like Avro/Thrift/Protobuf for serializing the
> > > main-input
> > > > > value envelopes. We would just need a system that can send byte
> data
> > > > really
> > > > > fast between languages/VMs.
> > > > >
> > > > > By the way, another interesting question (at least for me) is how
> > other
> > > > > data, such as side-inputs, is going to arrive at the DoFn if we
> want
> > to
> > > > > support a general interface for different languages.
> > > > >
> > > > > Cheers,
> > > > > Aljoscha
> > > > >
> > > > > On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles
> <k...@google.com.invalid
> > >
> > > > > wrote:
> > > > >
> > > > > > (Apologies for the formatting)
> > > > > >
> > > > > > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <
> k...@google.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hello everyone!
> > > > > > >
> > > > > > > We are busily working on a Runner API (for building and
> > > transmitting
> > > > > > > pipelines)
> > > > > > > and a Fn API (for invoking user-defined functions found within
> > > > > pipelines)
> > > > > > > as
> > > > > > > outlined in the Beam technical vision [1]. Both of these
> require
> > a
> > > > > > > language-independent serialization technology for
> > interoperability
> > > > > > between
> > > > > > > SDKs
> > > > > > > and runners.
> > > > > > >
> > > > > > > The Fn API includes a 

Re: Sliding-Windowed PCollectionView as SideInput

2016-06-27 Thread Aljoscha Krettek
Hi,
the WindowFn is responsible for mapping from main-input window to
side-input window. Have a look at WindowFn.getSideInputWindow(). For
SlidingWindows this takes the last possible sliding window as the
side-input window.
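
For intuition, here is a minimal, self-contained sketch of that mapping (not
Beam's actual implementation; it assumes windows start at multiples of the
period, offset zero, and non-negative timestamps):

import org.apache.beam.sdk.transforms.windowing.IntervalWindow;
import org.joda.time.Duration;
import org.joda.time.Instant;

class SideInputWindowSketch {
  // Picks the last sliding window [start, start + size) that still contains
  // the given timestamp, which is the choice described above.
  static IntervalWindow lastWindowContaining(Instant ts, Duration size, Duration period) {
    long start = ts.getMillis() - (ts.getMillis() % period.getMillis());
    return new IntervalWindow(new Instant(start), new Instant(start).plus(size));
  }
}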

Cheers,
Aljoscha

On Sun, 26 Jun 2016 at 22:30 Shen Li  wrote:

> Hi,
>
> I am a little confused about how the runner should handle SideInput if it
> comes from a sliding-windowed PCollection.
>
> Say we have two PCollections A and B. Apply
> Window.into(SlidingWindows.of...) on B, and create a View from it (call it
> VB).
>
> Then, a Pardo takes the PCollection A as the main input and VB as side
> input: A.apply(ParDo.withSideInputs(VB).of(new DoFn() {...})).
>
> In the DoFn.processElement(), when the user code calls
> ProcessContext.sideInput(VB), the view of which window in VB should be
> returned if the event time of the current element in A corresponds to
> multiple sliding windows in B?
>
>
> Thanks,
>
> Shen
>


Re: Scala DSL

2016-06-26 Thread Aljoscha Krettek
I'm also in favor of branding it a DSL rather than an SDK, mostly because
it uses the Java SDK and because it does not (necessarily) follow/implement
the Beam model the way the Java SDK does and the Python SDK apparently aims
to.

On Sat, 25 Jun 2016 at 10:04 Amit Sela  wrote:

> Just looked at some Scio examples - and saw Spark Scala code ;-)
>
> For me, this made some sense - Spark is written in Scala (let's call it a
> Scala SDK?) but it also provides a Java API, and the new version has a
> unified API (Java-Scala interop). So I see Scio in a similar way: it's a
> Scala API because it's built on top of the Java SDK.
> Having said that, Scio could offer more than just Scala API over the Java
> SDK (i.e., repl) so in the lack of a native fit, I'd go with DSL.  And to
> relate to the very valid notes people had about saying "Hi, we support
> Scala!", we can call it Scala API, even if it's under dsls/scio.
>
> So +1 for dsls/scio
>
> Thanks,
> Amit
>
> On Sat, Jun 25, 2016 at 5:06 AM Dan Halperin 
> wrote:
>
> > On Fri, Jun 24, 2016 at 7:05 PM, Dan Halperin 
> wrote:
> >
> > > On Fri, Jun 24, 2016 at 2:03 PM, Raghu Angadi
>  > >
> > > wrote:
> > >
> > >> DSL is a pretty generic term..
> > >>
> > >
> > > I agree and am not married to it. Neville?
> > >
> > >
> > >> The fact that scio uses Java SDK is an implementation detail.
> > >
> > >
> > > Reasonable, which is why I am also not pushing hard for '/java/scio' to
> > be
> > > in the path.
> > >
> > >
> > >> I love the
> > >> name scio. But I think sdks/scala might be most appropriate and would
> > make
> > >> it a first class citizen for Beam.
> > >>
> > >
> > > I am strongly against it being in the 'sdks/' top-level module -- it's
> > not
> > > a Beam SDK. Unlike DSL, SDK is a very specific term in Beam.
> > >
> > >
> > >> Where would a future python sdk reside?
> > >>
> > >
> > > The Python SDK is in the python-sdk branch on Apache already, and it
> > lives
> > > in `sdks/python`. (And it is aiming to become a proper Beam SDK. ;)
> > >
> >
> > Now with a link:
> > https://github.com/apache/incubator-beam/tree/python-sdk/sdks
> >
> > >
> > > Thanks,
> > > Dan
> > >
> > > On Fri, Jun 24, 2016 at 1:50 PM, Jean-Baptiste Onofré  >
> > >> wrote:
> > >>
> > >> > Agree for dsls/scio
> > >> >
> > >> > Regards
> > >> > JB
> > >> >
> > >> >
> > >> > On 06/24/2016 10:22 PM, Lukasz Cwik wrote:
> > >> >
> > >> >> +1 for dsls/scio for the already listed reasons
> > >> >>
> > >> >> On Fri, Jun 24, 2016 at 11:21 AM, Rafal Wojdyla
> > >> 
> > >> >> wrote:
> > >> >>
> > >> >> Hello. When it comes to SDK vs DSL - I fully agree with Frances.
> > About
> > >> >>> dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion,
> > scio
> > >> >>> is a
> > >> >>> scala DSL but lives under java directory (?) - that makes sense
> only
> > >> once
> > >> >>> you get that scio is using java SDK under the hood. Thus, +1 to
> > >> >>> dsls/scio.
> > >> >>> - Rafal
> > >> >>>
> > >> >>> On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles
> > >>  > >> >>> >
> > >> >>> wrote:
> > >> >>>
> > >> >>> My +1 goes to dsls/scio. It already has a cool name, so let's use
> > it.
> > >> And
> > >>  there might be other Scala-based DSLs.
> > >> 
> > >>  On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía  >
> > >>  wrote:
> > >> 
> > >>  ​Hello everyone,
> > >> >
> > >> > Neville, thanks a lot for your contribution. Your work is
> amazing
> > >> and I
> > >> >
> > >>  am
> > >> 
> > >> > really happy that this scala integration is finally happening.
> > >> > Congratulations to you and your team.
> > >> >
> > >> > I *strongly* disagree about the DSL classification for scio for
> > one
> > >> >
> > >>  reason,
> > >> 
> > >> > if you go to the root of the term, Domain Specific Languages are
> > >> about
> > >> >
> > >>  a
> > >> >>>
> > >>  domain, and the domain in this case is writing Beam pipelines,
> > which
> > >> >
> > >>  is a
> > >> >>>
> > >>  really broad domain.
> > >> >
> > >> > I agree with Frances’ argument that scio is not an SDK e.g. it
> > >> reuses
> > >> >
> > >>  the
> > >> >>>
> > >>  existing Beam java SDK. My proposition is that scio will be
> called
> > >> the
> > >> > Scala API because in the end this is what it is. I think the
> > >> confusion
> > >> > comes from the common definition of SDK which is normally an API
> > + a
> > >> > Runtime. In this case scio will share the runtime with what we
> > call
> > >> the
> > >> > Beam Java SDK.
> > >> >
> > >> > One additional point of using the term API is that it sends the
> > >> clear
> > >> > message that Beam has a Scala API too (which is good for
> > visibility
> > >> as
> > >> >
> > >>  JB
> > 

Re: [DISCUSS] PTransform.named vs. named apply

2016-06-23 Thread Aljoscha Krettek
±1 for the named apply

On Thu, Jun 23, 2016, 07:07 Robert Bradshaw 
wrote:

> +1, I think it makes more sense to name the application of a transform
> rather than the transform itself. (Still mulling on how best to do
> this with Python...)
>
> On Wed, Jun 22, 2016 at 9:27 PM, Jean-Baptiste Onofré 
> wrote:
> > +1
> >
> > Regards
> > JB
> >
> >
> > On 06/23/2016 12:17 AM, Ben Chambers wrote:
> >>
> >> Based on a recent PR (https://github.com/apache/incubator-beam/pull/468)
> I
> >> was reminded of the confusion around the use of
> >> .apply(transform.named(someName)) and .apply(someName, transform). This
> is
> >> one of things I’ve wanted to cleanup for a while. I’d like to propose a
> >> path towards removing this redundancy.
> >>
> >> First, some background -- why are there two ways to name things? When we
> >> added support for updating existing pipelines, we needed all
> applications
> >> to have unique user-provided names to allow diff’ing the pipelines. We
> >> found a few problems with the first approach -- using .named() to
> create a
> >> new transform -- which led to the introduction of the named apply:
> >>
> >> 1. When receiving an error about an application not having a name, it is
> >> not obvious that a name should be given to the *transform*
> >> 2. When using .named() to construct a new transform either the type
> >> information is lost or the composite transform has to override .named()
> >>
> >> We now generally suggest the use of .apply(someName, transform). It is
> >> easier to use and doesn’t lead to as much confusion around PTransform
> >> names
> >> and PTransform application names.
> >>
> >> To that end, I'd like to propose the following changes to the code and
> >> documentation:
> >> 1. Replace the usage of .named(name) in all examples and composites with
> >> the named-apply syntax.
> >> 2. Replace .named(name) with a protected PTransform constructor which
> >> takes
> >> a default name. If not provided, the default name will be derived from
> the
> >> class of the PTransform.
> >> 3. Use the protected constructor in composites (where appropriate) to
> >> ensure that the default application has a reasonable name.
> >>
> >> Users will benefit from having a single way of naming applications while
> >> building a pipeline. Any breakages due to the removal of .named should
> be
> >> easily fixed by either using the named application or by passing the
> name
> >> to the constructor of a composite.
> >>
> >> I’d like to hear any comments or opinions on this topic from the wider
> >> community. Please let me know what you think!
> >>
> >> -- Ben
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>
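
To make the two styles discussed in this thread concrete, here is a small,
self-contained sketch against the Java SDK of the time (the input path is
made up):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class NamingStylesExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Style 1: name the *transform* itself via .named(...).
    p.apply(TextIO.Read.named("ReadLinesOld").from("/tmp/input.txt"));

    // Style 2 (the one the proposal keeps): name the *application*.
    p.apply("ReadLinesNew", TextIO.Read.from("/tmp/input.txt"));

    p.run();
  }
}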


Re: [DISCUSS] Beam data plane serialization tech

2016-06-17 Thread Aljoscha Krettek
Hi,
am I correct in assuming that the transmitted envelopes would mostly
contain coder-serialized values? If so, wouldn't the header of an envelope
just be the number of contained bytes and number of values? I'm probably
missing something but with these assumptions I don't see the benefit of
using something like Avro/Thrift/Protobuf for serializing the main-input
value envelopes. We would just need a system that can send byte data really
fast between languages/VMs.
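
A minimal sketch of the kind of envelope header described here -- just a value
count and a byte count in front of the coder-encoded payload (illustrative
only, not an actual Beam/Fn API wire format):

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class MinimalEnvelopeWriter {
  // Writes [numValues][numBytes][payload] to the stream; the payload is
  // assumed to already be a concatenation of coder-encoded elements.
  static void write(OutputStream out, int numValues, byte[] coderEncodedValues)
      throws IOException {
    DataOutputStream data = new DataOutputStream(out);
    data.writeInt(numValues);
    data.writeInt(coderEncodedValues.length);
    data.write(coderEncodedValues);
    data.flush();
  }
}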

By the way, another interesting question (at least for me) is how other
data, such as side-inputs, is going to arrive at the DoFn if we want to
support a general interface for different languages.

Cheers,
Aljoscha

On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles  wrote:

> (Apologies for the formatting)
>
> On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles  wrote:
>
> > Hello everyone!
> >
> > We are busily working on a Runner API (for building and transmitting
> > pipelines)
> > and a Fn API (for invoking user-defined functions found within pipelines)
> > as
> > outlined in the Beam technical vision [1]. Both of these require a
> > language-independent serialization technology for interoperability
> between
> > SDKs
> > and runners.
> >
> > The Fn API includes a high-bandwidth data plane where bundles are
> > transmitted
> > via some serialization/RPC envelope (inside the envelope, the stream of
> > elements is encoded with a coder) to transfer bundles between the runner
> > and
> > the SDK, so performance is extremely important. There are many choices
> for
> > high
> > performance serialization, and we would like to start the conversation
> > about
> > what serialization technology is best for Beam.
> >
> > The goal of this discussion is to arrive at consensus on the question:
> > What
> > serialization technology should we use for the data plane envelope of the
> > Fn
> > API?
> >
> > To facilitate community discussion, we looked at the available
> > technologies and
> > tried to narrow the choices based on three criteria:
> >
> >  - Performance: What is the size of serialized data? How do we expect the
> >technology to affect pipeline speed and cost? etc
> >
> >  - Language support: Does the technology support the most widespread
> > language
> >for data processing? Does it have a vibrant ecosystem of contributed
> >language bindings? etc
> >
> >  - Community: What is the adoption of the technology? How mature is it?
> > How
> >active is development? How is the documentation? etc
> >
> > Given these criteria, we came up with four technologies that are good
> > contenders. All have similar & adequate schema capabilities.
> >
> >  - Apache Avro: Does not require code gen, but embedding the schema in
> the
> > data
> >could be an issue. Very popular.
> >
> >  - Apache Thrift: Probably a bit faster and more compact than Avro. A huge
> >    number of languages supported.
> >
> >  - Protocol Buffers 3: Incorporates the lessons that Google has learned
> > through
> >long-term use of Protocol Buffers.
> >
> >  - FlatBuffers: Some benchmarks imply great performance from the
> zero-copy
> > mmap
> >idea. We would need to run representative experiments.
> >
> > I want to emphasize that this is a community decision, and this thread is
> > just
> > the conversation starter for us all to weigh in. We just wanted to do
> some
> > legwork to focus the discussion if we could.
> >
> > And there's a minor follow-up question: Once we settle here, is that
> > technology
> > also suitable for the low-bandwidth Runner API for defining pipelines, or
> > does
> > anyone think we need to consider a second technology (like JSON) for
> > usability
> > reasons?
> >
> > [1]
> >
> https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
> >
> >
>


Re: [NOTICE] Change on Filter

2016-06-17 Thread Aljoscha Krettek
There has been an issue about this for a while now:
https://issues.apache.org/jira/browse/BEAM-234

On Fri, 17 Jun 2016 at 09:55 Jean-Baptiste Onofré  wrote:

> Hi Ismaël,
>
> I wasn't talking about a change between the Dataflow SDK and Beam; I'm
> talking about a change between two Beam SNAPSHOTs ;)
>
> For the naming of the DirectRunner, I saw it also, and we should align
> the runners naming (we have FlinkPipelineRunner and
> SparkPipelineRunner). I sent an e-mail to Davor and Frances to discuss
> about that.
>
> Regards
> JB
>
> On 06/17/2016 09:42 AM, Ismaël Mejía wrote:
> > Do we have a list of breaking changes (from the Google Dataflow SDK to
> > Beam), because this is going to be important considering other recent
> > breaking changes, for example these two that I found yesterday:
> >
> > DirectPipelineRunner -> DirectRunner
> > DoFnTester.processBatch -> DoFnTester.processBundle (?)
> >
> > Ismael.
> >
> >
> >
> >
> > On Fri, Jun 17, 2016 at 9:19 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> >> Hi guys,
> >>
> > I tested the latest Beam SNAPSHOT this morning, and code which was
> > working yesterday is now broken by the latest changes.
> >>
> >> I'm using a filter by predicate:
> >>
> >>  .apply("Filtering", Filter.byPredicate(new
> >> SerializableFunction() {
> >>  public Boolean apply(String input) {
> >>  ...
> >>  })).
> >>
> >> The filter method has been renamed from byPredicate() to by().
> >>
> >> Just to let others know that it can impact their pipelines.
> >>
> >> Regards
> >> JB
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
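
For reference, the same filter written against the renamed method would look
roughly like this (illustrative sketch; the predicate body is made up):

import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;

class FilterRenameSketch {
  static PCollection<String> nonEmptyLines(PCollection<String> lines) {
    return lines.apply("Filtering",
        Filter.by(new SerializableFunction<String, Boolean>() {
          @Override
          public Boolean apply(String input) {
            return !input.isEmpty();
          }
        }));
  }
}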


Re: Testing and the Capability Matrix

2016-06-14 Thread Aljoscha Krettek
@Thomas Completely agree, this is also how it is currently handled in the
Flink runner. I was talking about the presentation of the compatibility
matrix on the web site, whether we should have separate columns for Flink
Stream/Batch and Spark Stream/Batch. (And possibly other runners in the
future)

On Tue, 14 Jun 2016 at 18:57 Thomas Groh  wrote:

> It is also worth noting that this document is a snapshot rather than the
> long-term plan. As the SDK evolves, the annotations will almost certainly
> change with it (and will certainly expand).
>
> +Aljoscha
>
> For streaming/batch execution separation, this is better served by
> configuration in the runner's build (e.g. specifying two separate
> executions in the pom.xml, one for streaming and one for batch). Given that
> the tests live in a separate module from the runner, this is akin to how
> RunnableOnService tests are currently executed by all of the runners.
>
> For sink, I think given the current implementations of sink there isn't a
> huge need; however, most sinks should be annotated with some form of
> superclass (although the implementation of sink requires side inputs, so
> this is also worth considering).
>
> +jb
>
> These would live on the tests proper, yes.
>
> On Sun, Jun 12, 2016 at 11:05 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Thomas,
> >
> > it looks good to me.
> >
> > Just curious: the proposed annotations will be directly in the Java SDK
> > Test jar right ?
> >
> > Thanks,
> > Regards
> > JB
> >
> >
> > On 06/11/2016 01:34 AM, Thomas Groh wrote:
> >
> >> Hey Beamers!
> >>
> >> We have a lovely Capability Matrix (
> >> http://beam.incubator.apache.org/capability-matrix/) which describes
> what
> >> runners can do, and what's in the model. However, right now we only have
> >> one way to specify that a test is useful to be executed in a runner, the
> >> RunnableOnService category.
> >>
> >> I've worked on a document to expand the number of annotations to be more
> >> in
> >> line with the capability matrix, which should help runner writers test
> >> more
> >> precisely with regards to the Beam model. The document is located at
> >>
> >>
> https://docs.google.com/document/d/1fICxq32t9yWn9qXhmT07xpclHeHX2VlUyVtpi2WzzGM/edit?usp=sharing
> >> ,
> >> and I've added edit access for all of our committers.
> >>
> >> Feel free to take a look and leave any comments you may have,
> >>
> >> Thanks,
> >>
> >> Thomas
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
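
For context, this is roughly how a test is marked with the existing
RunnableOnService category today (a minimal sketch assuming the 2016-era SDK
test utilities):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.RunnableOnService;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Test;
import org.junit.experimental.categories.Category;

public class ExampleTransformTest {
  @Test
  @Category(RunnableOnService.class)  // picked up by each runner's RunnableOnService suite
  public void runsOnAnyRunner() {
    TestPipeline p = TestPipeline.create();
    PCollection<String> output = p.apply(Create.of("a", "b"));
    PAssert.that(output).containsInAnyOrder("a", "b");
    p.run();
  }
}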


Re: [VOTE] Release version 0.1.0-incubating

2016-06-09 Thread Aljoscha Krettek
+1 (binding)

I ran "mvn clean verify" on the source package, executed WordCount using
the FlinkPipelineRunner. NOTICE, LICENSE and DISCLAIMER also look good

On Thu, 9 Jun 2016 at 18:50 Dan Halperin 
wrote:

> +1 (binding)
>
> per checklist 2.1, I decompressed the source-release zip and ensured that
> `mvn clean verify` passed. per 3.6, I confirmed that there are no binary
> files. I also did a few other miscellaneous checks.
>
> On Thu, Jun 9, 2016 at 8:48 AM, Kenneth Knowles 
> wrote:
>
> > +1 (binding)
> >
> > Confirmed that we can run pipelines on Dataflow.
> >
> > Looks good. Very exciting!
> >
> >
> > On Thu, Jun 9, 2016 at 8:16 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Team work ! Special thanks to Davor and Dan ! And thanks to the entire
> > > team: it's a major step forward (the first release is always the
> hardest
> > > one ;)). Let's see how the release will be taken by the IPMC :)
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 06/09/2016 04:32 PM, Scott Wegner wrote:
> > >
> > >> +1
> > >>
> > >> Thanks JB and Davor for all your hard work putting together this
> > release!
> > >>
> > >> On Wed, Jun 8, 2016, 11:02 PM Jean-Baptiste Onofré 
> > >> wrote:
> > >>
> > >> By the way, I forgot to mention that we will create a 0.1.0-incubating
> > >>> tag (kind of alias to RC3) when the vote passed.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 06/09/2016 01:20 AM, Davor Bonaci wrote:
> > >>>
> >  Hi everyone,
> >  Here's the first vote for the first release of Apache Beam --
> version
> >  0.1.0-incubating!
> > 
> >  As a reminder, we aren't looking for any specific new functionality,
> > but
> >  would like to release the existing code, get something to our users'
> > 
> > >>> hands,
> > >>>
> >  and test the processes. Previous discussions and iterations on this
> > 
> > >>> release
> > >>>
> >  have been archived on the dev@ mailing list.
> > 
> >  The complete staging area is available for your review, which
> > includes:
> >  * the official Apache source release to be deployed to
> > dist.apache.org
> > 
> > >>> [1],
> > >>>
> >  and
> >  * all artifacts to be deployed to the Maven Central Repository [2].
> > 
> >  This corresponds to the tag "v0.1.0-incubating-RC3" in source
> control,
> > 
> > >>> [3].
> > >>>
> > 
> >  Please vote as follows:
> >  [ ] +1, Approve the release
> >  [ ] -1, Do not approve the release (please provide specific
> comments)
> > 
> >  For those of us enjoying our first voting experience -- the release
> >  checklist is here [4]. This is a "package release"-type of the
> Apache
> >  voting process [5]. As customary, the vote will be open for 72
> hours.
> > It
> > 
> > >>> is
> > >>>
> >  adopted by majority approval with at least 3 PPMC affirmative votes.
> > If
> >  approved, the proposal will be presented to the Apache Incubator for
> > 
> > >>> their
> > >>>
> >  review.
> > 
> >  Thanks,
> >  Davor
> > 
> >  [1]
> > 
> > 
> > >>>
> >
> https://repository.apache.org/content/repositories/orgapachebeam-1002/org/apache/beam/beam-parent/0.1.0-incubating/beam-parent-0.1.0-incubating-source-release.zip
> > >>>
> >  [2]
> > 
> > >>>
> https://repository.apache.org/content/repositories/orgapachebeam-1002/
> > >>>
> >  [3]
> > https://github.com/apache/incubator-beam/tree/v0.1.0-incubating-RC3
> >  [4]
> > 
> http://incubator.apache.org/guides/releasemanagement.html#check-list
> >  [5] http://www.apache.org/foundation/voting.html
> > 
> > 
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbono...@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: DoFn Reuse

2016-06-08 Thread Aljoscha Krettek
Ahh, what we could do is artificially induce bundles using either count or
processing time or both. Just so that finishBundle() is called once in a
while.

On Wed, 8 Jun 2016 at 17:12 Aljoscha Krettek <aljos...@apache.org> wrote:

> Pretty sure, yes. The Iterable in a MapPartitionFunction should give you
> all the values in a given partition.
>
> I checked again for streaming execution. We're doing the opposite, right
> now: every element is a bundle in itself, startBundle()/finishBundle() are
> called for every element which seems a bit wasteful. The only other option
> is to see all elements as one bundle, because Flink does not bundle/micro
> batch elements in streaming execution.
>
> On Wed, 8 Jun 2016 at 16:38 Bobby Evans <ev...@yahoo-inc.com.invalid>
> wrote:
>
>> Are you sure about that for Flink?  I thought the iterable finished when
>> you processed a maximum number of elements or the input queue was empty so
>> that it could return control back to Akka for better sharing of the
>> thread pool.
>>
>>
>> https://github.com/apache/incubator-beam/blob/af8f5935ca1866012ceb102b9472c8b1ef102d73/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkDoFnFunction.java#L99
>> Also in the javadocs for DoFn.Context it explicitly states that you can
>> emit from the finishBundle method.
>>
>>
>> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L104-L110
>> I thought I had seen some example of this being used for batching output
>> to something downstream, like HDFS or Kafka, but I'm not sure on that.  If
>> you can emit from finishBundle and a new instance of the DoFn will be
>> created around each bundle then I can see some people trying to do
>> aggregations inside a DoFn and then emitting them at the end of the bundle
>> knowing that if a batch fails or is rolled back the system will handle it.
>> If that is not allowed we should really update the javadocs around it to
>> explain the pitfalls of doing this.
>>  - Bobby
>>
>> On Wednesday, June 8, 2016 4:24 AM, Aljoscha Krettek <
>> aljos...@apache.org> wrote:
>>
>>
>>  Hi,
>> a quick related question: In the Flink runner we basically see everything
>> as one big bundle, i.e. we call startBundle() once at the beginning and
>> then keep processing indefinitely, never calling finishBundle(). Is this
>> also correct behavior?
>>
>> Best,
>> Aljoscha
>>
>> On Tue, 7 Jun 2016 at 20:44 Thomas Groh <tg...@google.com.invalid> wrote:
>>
>> > Hey everyone;
>> >
>> > I'm starting to work on BEAM-38 (
>> > https://issues.apache.org/jira/browse/BEAM-38), which enables an
>> > optimization for runners with many small bundles. BEAM-38 allows
>> runners to
>> > reuse DoFn instances so long as that DoFn has not terminated abnormally.
>> > This replaces the previous requirement that a DoFn be used for only a
>> > single bundle if either of startBundle or finishBundle have been
>> > overwritten.
>> >
>> > DoFn deserialization-per-bundle can be a significant performance
>> > bottleneck when there are many small bundles, as is common in streaming
>> > executions. It has also surfaced as the cause of much of the current
>> > slowness in the new InProcessRunner.
>> >
>> > Existing Runners do not require any changes; they may choose to take
>> > advantage of the new optimization opportunity. However, user DoFns
>> may
>> > need to be revised to properly set up and tear down state in startBundle
>> > and finishBundle, respectively, if they depended on only being used for a
>> > single bundle.
>> >
>> > The first two updates are already in pull requests:
>> >
>> > PR #419 (https://github.com/apache/incubator-beam/pull/419) updates the
>> > Javadoc to the new spec
>> > PR #418 (https://github.com/apache/incubator-beam/pull/418) updates the
>> > DirectRunner to reuse DoFns according to the new policy.
>> >
>> > Yours,
>> >
>> > Thomas
>> >
>>
>>
>>
>
>
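
A rough sketch of the batching pattern mentioned in this thread -- buffer per
bundle and flush in finishBundle() (illustrative only; the external write is a
made-up placeholder, and the 2016-era DoFn API is assumed):

import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;

class BatchingWriteFn extends DoFn<String, Integer> {
  private transient List<String> buffer;

  @Override
  public void startBundle(Context c) {
    buffer = new ArrayList<>();  // fresh buffer for every bundle
  }

  @Override
  public void processElement(ProcessContext c) {
    buffer.add(c.element());
  }

  @Override
  public void finishBundle(Context c) {
    // Flush the whole bundle downstream in one call, then emit the batch size.
    writeBatchToExternalSystem(buffer);
    c.output(buffer.size());
    buffer = null;
  }

  private void writeBatchToExternalSystem(List<String> batch) {
    // placeholder for a single bulk write to e.g. HDFS or Kafka
  }
}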


Re: DoFn Reuse

2016-06-08 Thread Aljoscha Krettek
Pretty sure, yes. The Iterable in a MapPartitionFunction should give you
all the values in a given partition.

I checked again for streaming execution. We're doing the opposite, right
now: every element is a bundle in itself, startBundle()/finishBundle() are
called for every element which seems a bit wasteful. The only other option
is to see all elements as one bundle, because Flink does not bundle/micro
batch elements in streaming execution.

On Wed, 8 Jun 2016 at 16:38 Bobby Evans <ev...@yahoo-inc.com.invalid> wrote:

> Are you sure about that for Flink?  I thought the iterable finished when
> you processed a maximum number of elements or the input queue was empty so
> that it could return control back to Akka for better sharing of the
> thread pool.
>
>
> https://github.com/apache/incubator-beam/blob/af8f5935ca1866012ceb102b9472c8b1ef102d73/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/functions/FlinkDoFnFunction.java#L99
> Also in the javadocs for DoFn.Context it explicitly states that you can
> emit from the finishBundle method.
>
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L104-L110
> I thought I had seen some example of this being used for batching output
> to something downstream, like HDFS or Kafka, but I'm not sure on that.  If
> you can emit from finishBundle and a new instance of the DoFn will be
> created around each bundle then I can see some people trying to do
> aggregations inside a DoFn and then emitting them at the end of the bundle
> knowing that if a batch fails or is rolled back the system will handle it.
> If that is not allowed we should really update the javadocs around it to
> explain the pitfalls of doing this.
>  - Bobby
>
> On Wednesday, June 8, 2016 4:24 AM, Aljoscha Krettek <
> aljos...@apache.org> wrote:
>
>
>  Hi,
> a quick related question: In the Flink runner we basically see everything
> as one big bundle, i.e. we call startBundle() once at the beginning and
> then keep processing indefinitely, never calling finishBundle(). Is this
> also correct behavior?
>
> Best,
> Aljoscha
>
> On Tue, 7 Jun 2016 at 20:44 Thomas Groh <tg...@google.com.invalid> wrote:
>
> > Hey everyone;
> >
> > I'm starting to work on BEAM-38 (
> > https://issues.apache.org/jira/browse/BEAM-38), which enables an
> > optimization for runners with many small bundles. BEAM-38 allows runners
> to
> > reuse DoFn instances so long as that DoFn has not terminated abnormally.
> > This replaces the previous requirement that a DoFn be used for only a
> > single bundle if either of startBundle or finishBundle have been
> > overwritten.
> >
> > DoFn deserialization-per-bundle can be a significant performance
> > bottleneck when there are many small bundles, as is common in streaming
> > executions. It has also surfaced as the cause of much of the current
> > slowness in the new InProcessRunner.
> >
> > Existing Runners do not require any changes; they may choose to take
> > advantage of the new optimization opportunity. However, user DoFns may
> > need to be revised to properly set up and tear down state in startBundle
> > and finishBundle, respectively, if the depended on only being used for a
> > single bundle.
> >
> > The first two updates are already in pull requests:
> >
> > PR #419 (https://github.com/apache/incubator-beam/pull/419) updates the
> > Javadoc to the new spec
> > PR #418 (https://github.com/apache/incubator-beam/pull/418) updates the
> > DirectRunner to reuse DoFns according to the new policy.
> >
> > Yours,
> >
> > Thomas
> >
>
>
>


Re: 0.1.0-incubating release

2016-06-07 Thread Aljoscha Krettek
By the way, is there any document where we keep track of what checks to run
for a release? Maybe I missed something there.

On Tue, 7 Jun 2016 at 21:29 Jean-Baptiste Onofré  wrote:

> Just submitted: https://github.com/apache/incubator-beam/pull/428
>
> to fix the src distribution content.
>
> Regards
> JB
>
> On 06/07/2016 09:26 PM, Jean-Baptiste Onofré wrote:
> > I have to revert my vote to -1:
> >
> > the source distribution zip file is empty.
> >
> > I'm going to submit a new PR to fix that.
> >
> > Sorry about that.
> >
> > Regards
> > JB
> >
> > On 06/07/2016 09:12 PM, Jean-Baptiste Onofré wrote:
> >> +1
> >>
> >> it looks good to me:
> >>
> >> - all files have incubating
> >> - signatures check out (asc, md5, sha1) (and KEYS there)
> >> - disclaimer exists
> >> - LICENSE and NOTICE good
> >> - No unexpected binary in source
> >> - All ASF licensed files have ASF headers
> >> - Source distribution is available, with a correct name, and correct
> >> content:
> >>
> >>
> https://repository.apache.org/content/repositories/orgapachebeam-1001/org/apache/beam/apache-beam/0.1.0-incubating/apache-beam-0.1.0-incubating-src.zip
> >>
> >>
> >> I'm more comfortable to move forward on a formal release vote with this
> >> staging, and forward to the IPMC review.
> >>
> >> Thanks all, and especially Davor (for supporting me when I bother him a
> >> bunch of times a day ;)).
> >>
> >> Regards
> >> JB
> >>
> >> On 06/07/2016 08:58 PM, Davor Bonaci wrote:
> >>> The second release candidate is available for everyone's review [1].
> >>>
> >>> We plan to call for a vote shortly; please comment if there's any
> >>> additional feedback.
> >>>
> >>> [1]
> >>> https://repository.apache.org/content/repositories/orgapachebeam-1001
> >>>
> >>> On Tue, Jun 7, 2016 at 9:33 AM, Kenneth Knowles  >
> >>> wrote:
> >>>
>  +1
> 
>  Lovely. Very readable.
> 
>  The "-parent" artifacts are just leaked implementation details of our
>  build
>  configuration that no one should ever actually reference, right?
> 
>  Kenn
> 
>  On Tue, Jun 7, 2016 at 8:54 AM, Dan Halperin
>  
>  wrote:
> 
> > +2! This seems most concordant with other Apache products and the
> most
> > future-proof.
> >
> > On Mon, Jun 6, 2016 at 9:35 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > wrote:
> >
> >> +1
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 06/07/2016 02:49 AM, Davor Bonaci wrote:
> >>
> >>> After a few rounds of discussions and examining patterns of other
> >>> projects,
> >>> I think we are converging towards:
> >>>
> >>> * A flat group structure, where all artifacts belong to the
> >>> org.apache.beam
> >>> group.
> >>> * Prefix all artifact ids with "beam-".
> >>> * Name artifacts according to the existing directory/module layout:
> >>> beam-sdks-java-core, beam-runners-google-cloud-dataflow-java, etc.
> >>> * Suffix all parents with "-parent", e.g., "beam-parent",
> >>> "sdks-java-parent", etc.
> >>> * Create a "distributions" module, for the purpose of packaging the
> > source
> >>> code for the ASF release.
> >>>
> >>> I believe this approach takes into account everybody's feedback so
>  far,
> >>> and
> >>> I think opposing positions have been retracted.
> >>>
> >>> Please comment if that's not the case, or if there are any
> >>> additional
> >>> points that we may have missed. If not, this is implemented in
> >>> pending
> >>> pull
> >>> requests #420 and #423.
> >>>
> >>> Thanks!
> >>>
> >>> On Fri, Jun 3, 2016 at 9:59 AM, Thomas Weise
> >>> 
> >>> wrote:
> >>>
> >>> Another consideration for potential future packaging/distribution
>  solutions
>  is how the artifacts line up as files in a flat directory. For
> that
>  it
>  may
>  be good to have a common prefix in the artifactId and unique
> > artifactId.
> 
>  The name for the source archive (when relying on ASF parent POM)
>  can
> > also
>  be controlled without expanding the artifactId:
> 
> 
>    
>  
>    maven-assembly-plugin
>    
>  apache-beam
>    
>  
>    
> 
> 
>  Thanks,
>  Thomas
> 
>  On Fri, Jun 3, 2016 at 9:39 AM, Davor Bonaci
>   >>
>  wrote:
> 
>  BEAM-315 is definitely important. Normally, I'd always advocate
> for
> >
>  holding
> 
> > the release to pick that fix. For the very first release,
> however,
>  I'd
> > prefer to 

Re: [PROPOSAL] Writing More Expressive Beam Tests

2016-05-24 Thread Aljoscha Krettek
I see, I haven't thought about watermarks for that but it makes sense. In
Flink, we could just observe watermarks in the sources and shut down
sources when we see a Long.MAX_VALUE watermark. This in turn would bring
down the whole pipeline, starting from the sources.

On Tue, 24 May 2016 at 00:07 Thomas Groh <tg...@google.com.invalid> wrote:

> This is different than the quiescence property proposed in the document -
> quiescence is an idleness property ("the pipeline cannot make progress"),
> but not a completeness property ("the pipeline will never make progress").
>
> However, the existing property of watermarks does mostly solve this problem
> - given that allowed lateness is finite, if all of the root transforms of a
> Pipeline advance their watermarks to positive infinity, the pipeline will
> be complete (as all new inputs are droppably late, so they will be
> dropped). This also causes all windows that contain elements to be closed
> (which will cause the execution of the appropriate PAsserts on those
> windows). This notion of completion is runner-independent; however,
> shutting down the pipeline requires runners to either provide hooks to
> allow users to observe this completed state, or the runner to notice that
> all PTransforms have completed and shut down the pipeline. Notably, this
> notion of completion is simpler than quiescence (as it only requires access
> to the watermarks of the system), so runners can implement it
> independently.
>
> On Mon, May 16, 2016 at 9:00 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Hi,
> > sorry for resurrecting such an old thread but are there already thoughts
> on
> > how the quiescence handling will work for runner-independent tests?
> >
> > I was thinking about how to make the RunnableOnService tests work when
> > executed in "true-streaming" mode, i.e. when the job would normally never
> > finish? Right now, the tests work because the sources finish at some
> point
> > and we verify that the PAssert DoFn sees the correct results. With
> > streaming runners this "finished" bit is hard to do and I feel that it is
> > related to the quiescence idea expressed in the document.
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 31 Mar 2016 at 19:32 Ben Chambers <bchamb...@google.com.invalid>
> > wrote:
> >
> > > On Mon, Mar 28, 2016 at 4:29 PM Robert Bradshaw
> > > <rober...@google.com.invalid>
> > > wrote:
> > >
> > > > On Fri, Mar 25, 2016 at 4:28 PM, Ben Chambers
> > > > <bchamb...@google.com.invalid> wrote:
> > > > > My only concern is that in the example, you first need to declare
> all
> > > the
> > > > > inputs, then the pipeline to be tested, then all the outputs. This
> > can
> > > > lead
> > > > > to tests that are hard to follow, since what you're really testing
> is
> > > an
> > > > > interleaving more like "When these inputs arrive, I get this
> output.
> > > Then
> > > > > when this happens, I get that output. Etc.".
> > > >
> > > > +1 to pursuing this direction.
> > > >
> > > > > What if instea of returning a PTransform<PBegin, PCollection>
> > we
> > > > had
> > > > > a "TestSource".
> > > >
> > > > I think TestSource is a PTransform<PBegin, PCollection>.
> > > >
> > >
> > > Maybe? If we want it to easily support multiple inputs, maybe you do
> > > `testSource.getInput(tag)` to get the `PTransform<PBegin,
> > PCollection>`
> > > associated with a given tag? But yes, I intended the `TestSource` to be
> > > usable within the pipeline to actually produce the data.
> > >
> > > >
> > > > > so we did something like:
> > > > >
> > > > > TestPipeline p = TestPipeline.create();
> > > > > TestSource source = p.testSource();
> > > > >
> > > > > // Set up pipeline reading from source.
> > > > > PCollection sum = ...;
> > > >
> > > > I'm really curious what the "..." looks like. How are we using the
> > > source?
> > > >
> > >
> > > Either `p.apply(source)` or `p.apply(source.forTag(tag))`. Not sure
> about
> > > naming, of course.
> > >
> > > >
> > > > > BeamAssert sumAssert = BeamAssert.sum();
> > > >
> > > > Did you mean BeamAssert.that(sum)?
> >

Re: Dynamic work rebalancing for Beam

2016-05-19 Thread Aljoscha Krettek
Interesting read, thanks for the link!

On Thu, 19 May 2016 at 07:09 Dan Halperin 
wrote:

> Hey folks,
>
> This morning, my colleagues Eugene & Malo posted *No shard left behind:
> dynamic work rebalancing in Google Cloud Dataflow
> <
> https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
> >*.
> This article discusses Cloud Dataflow’s solution to the well-known
> straggler problem.
>
> In a large batch processing job with many tasks executing in parallel, some
> of the tasks – the stragglers – can take a much longer time to complete
> than others, perhaps due to imperfect splitting of the work into parallel
> chunks when issuing the job. Typically, waiting for stragglers means that
> the overall job completes later than it should, and may also reserve too
> many machines that may be underutilized at the end. Cloud Dataflow’s
> dynamic work rebalancing can mitigate stragglers in most cases.
>
> What I’d like to highlight for the Apache Beam (incubating) community is
> that Cloud Dataflow’s dynamic work rebalancing is implemented using
> *runner-specific* control logic on top of Beam’s *runner-independent*
> BoundedSource
> API
> <
> https://github.com/apache/incubator-beam/blob/9fa97fb2491bc784df53fb0f044409dbbc2af3d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
> >.
> Specifically, to steal work from a straggler, a runner need only call the
> reader’s splitAtFraction method. This will generate a new source containing
> leftover work, and then the runner can pass that source off to another idle
> worker. As Beam matures, I hope that other runners are interested in
> figuring out whether these APIs can help them improve performance,
> implementing dynamic work rebalancing, and collaborating on API changes
> that will help solve other pain points.
>
> Dan
>
> (Also posted on Beam blog:
>
> http://beam.incubator.apache.org/blog/2016/05/18/splitAtFraction-method.html
> )
>
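
A minimal sketch of the hook described above (illustrative only, not any
runner's actual code; the class and helper names are made up):

import org.apache.beam.sdk.io.BoundedSource;

class WorkStealingSketch {
  // Tries to steal the remaining work from a straggling reader by splitting
  // it at the given fraction of consumed input.
  static <T> BoundedSource<T> stealRemainder(
      BoundedSource.BoundedReader<T> straggler, double fraction) {
    // splitAtFraction returns a source describing the leftover work,
    // or null if the reader cannot split at that point.
    BoundedSource<T> residual = straggler.splitAtFraction(fraction);
    if (residual != null) {
      // A runner would now hand 'residual' to an idle worker (runner-specific).
    }
    return residual;
  }
}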


Failing Jenkins Runs

2016-05-19 Thread Aljoscha Krettek
Hi,
on all of the recent PRs Jenkins fails with this message:
https://builds.apache.org/job/beam_PreCommit_MavenVerify/1213/console

Does anyone have an idea what might be going on? Also, where is Jenkins
configured? With this I could take a look myself.

-Aljoscha


Re: add component tag to pull request title / commit comment

2016-05-11 Thread Aljoscha Krettek
This will, however, also take precious space in the Commit Title. And some
commits might not be about only one clear-cut component.

On Wed, 11 May 2016 at 11:43 Maximilian Michels  wrote:

> +1 I think it makes it easier to see at a glance to which part of Beam
> a commit belongs. We could use the Jira components as tags.
>
> On Wed, May 11, 2016 at 11:09 AM, Jean-Baptiste Onofré 
> wrote:
> > Hi Manu,
> >
> > good idea. Theoretically the component in the corresponding Jira should
> give
> > the information, but for convenience, we could add a "tag" in the commit
> > comment.
> >
> > Regards
> > JB
> >
> >
> > On 05/10/2016 06:27 AM, Manu Zhang wrote:
> >>
> >> Guys,
> >>
> >> As I've been developing the Gearpump runner for Beam, I've closely watched
> >> the pull request updates for API changes. Sometimes it turns out that the
> >> changes only apply to a specific runner, but I only see that after I go
> >> through the code.
> >> Could
> >> we add a tag in the pull request title / commit comment to mark whether
> >> it's API change or Runner specific change, or something else, like,
> >> [BEAM-XXX] [Direct-Java] Comments... ?
> >>
> >> Thanks,
> >> Manu Zhang
> >>
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
>


Re: [DISCUSS] Adding Some Sort of SideInputRunner

2016-05-03 Thread Aljoscha Krettek
Maybe, I'll try and figure something out. :-)

My problem was that the doc for StateInternals explicitly states that
access to state is always implicitly scoped to the key being processed. In
my understanding this was always the key of an element but it seems that it
can also be a more abstract key, such as the sharding key. The fact that
this could be the case was hidden away in code outside the SDK, it seems.

Thanks for your help!

On Tue, 3 May 2016 at 19:40 Kenneth Knowles <k...@google.com.invalid> wrote:

> I think the answer to your questions might be StateNamespace.
>
> The lowest level of state is always key-scoped, while the StateNamespace
> indicates whether it is global to the key, further scoped to a particular
> window, or even scoped to a particular trigger. When the DoFn needs a side
> input, the key might actually be gone from the user's point of view. It is
> up to the StepContext to provide an appropriately-scoped StateInternals,
> usually by some consistent sharding key such as the key from the upstream
> GBK.
>
> I don't want to go too much into state accessed in the DoFn as I haven't
> yet got a chance to prepare and publish the design doc for that, and I want
> everyone to have access to it for any discussion.
>
> Does this help?
>
> On Tue, May 3, 2016 at 1:58 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I'm afraid I have yet another question. What's the interplay between the
> > state that holds the buffered main-input elements and possible per-key
> > state that might be used by the DoFn. I guess I'm not seeing all the
> parts
> > but my problem is that one part (the buffering) requires a different type
> > of state scope than the other part (key-scoped state access in the DoFn)
> > while they both seem to be using the same StateInternals from the step
> > context. How does that work?
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 28 Apr 2016 at 20:05 Kenneth Knowles <k...@google.com.invalid>
> > wrote:
> >
> > > On Thu, Apr 28, 2016 at 10:19 AM, Aljoscha Krettek <
> aljos...@apache.org>
> > > wrote:
> > >
> > > > No worries :-) and thanks for the detailed answers!
> > > >
> > > > I still have one question, though: you wrote that "The side input is
> > > > considered ready when there has been any data output/added to the
> > > > PCollection that it is being read as a side input. So the upstream
> > > trigger
> > > > controls this." How does this work with side inputs that consist of
> > > > multiple elements, i.e. ListPCollectionView and MapPCollectionView.
> For
> > > > them, do we also consider the side input as ready once the first
> > element
> > > > arrives? That's why I was wondering about the triggers being
> > responsible
> > > > for deciding when a side input is ready.
> > > >
> > >
> > > Yes, just as you describe. The side input window becomes ready once it
> > has
> > > any data. So, combining your items 2.5 and 3, you have a situation
> where
> > > main input elements may be combined with only a speculative subset of
> the
> > > side input data. They will not be reprocessed once more up-to-date side
> > > input values become known. Beyond this initial period of waiting for
> the
> > > very first firing of the side input window, there are no consistency
> > > restrictions/guarantees on main input vs side input windows or
> > triggerings.
> > > It may be that for a given runner updating the side input with the new
> > > value happens at high latency so all the main input elements are
> > processed
> > > and gone before the update goes through. It is a bit of a dangerous
> area
> > > for users. I'm pretty interested in ideas in this space.
> > >
> > > Kenn
> > >
> >
>


Re: [DISCUSS] Adding Some Sort of SideInputRunner

2016-04-28 Thread Aljoscha Krettek
No worries :-) and thanks for the detailed answers!

I still have one question, though: you wrote that "The side input is
considered ready when there has been any data output/added to the
PCollection that it is being read as a side input. So the upstream trigger
controls this." How does this work with side inputs that consist of
multiple elements, i.e. ListPCollectionView and MapPCollectionView. For
them, do we also consider the side input as ready once the first element
arrives? That's why I was wondering about the triggers being responsible
for deciding when a side input is ready.

Cheers,
Aljoscha

On Thu, 28 Apr 2016 at 18:55 Kenneth Knowles <k...@google.com.invalid> wrote:

> On Thu, Apr 28, 2016 at 1:26 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > Bump.
> >
> > I'm afraid this might have gotten lost during the conferences/summits.
> >
> > On Thu, 21 Apr 2016 at 13:30 Aljoscha Krettek <aljos...@apache.org>
> wrote:
> >
> > > Ok, I'll try and start such a design. Before I can start, I have a few
> > > questions about how the side inputs actually work.  Some of it is in
> the
> > > docs, but some is also conjecture on my part about how
> MillWheel/Windmill
> > > (I don't know, is the first the system and the second a worker in
> there?)
> > > works.
> > >
> > > I'll write the questions as a set of assumptions and please correct me
> on
> > > those where I'm wrong:
> >
>
> Sorry for the delayed reply. I may have taken you too literally. Since
> nothing seemed wrong, I didn't say anything at all :-)
>
>
> >
> > > 1. Side inputs are always global/broadcast, they are never scoped to a
> > key.
> >
>
> True. A side input is a way of reading/viewing an entire PCollection.
>
>
> > 2. Mapping of main-input window to side-input window is done by the
> > > side-input WindowFn.
> >
>
> True.
>
>
> > If the side input has a larger window than the main
> > > input, processing of main-input elements will block until the side
> input
> > > becomes available. (Side inputs for a larger side-input window can
> become
> > > available early if using a speculative trigger)
> >
>
> True to this point: The main input element waits for the necessary side
> input window to be ready. It doesn't necessarily have to do with window
> size. It could just be a scheduling thing, or other runner-specific reason.
>
>
> >
> > > 2.5 If we update the side input because a speculative trigger fires
> again
> > > we don't reprocess the main-input elements that were already processed.
> > > Only new elements see the updated side input.
> >
>
> True. This is a place where the decision is due mostly to feasibility of
> implementation. It is easy to create a pipeline where this behavior is not
> ideal.
>
>
> > 3. The decision about whether side inputs are "available", i.e. complete
> > > in case of list/map side inputs is made by a Trigger. (This is also
> true
> > > for updates to side input caused by speculative triggers firing again.)
> > > This uses the continuation trigger feature, which is easy for time
> > triggers
> > > and interesting for the AfterPane.elementCountAtLeast() trigger which
> > > changes to AfterPane.elementCountAtLeast(1) on continuation and other
> > > speculative/late triggers.
> >
>
> True up to this point: The side input is considered ready when there has
> been any data output/added to the PCollection that it is being read as a
> side input. So the upstream trigger controls this. I don't know if the
> continuation trigger is important. The goal of the continuation trigger is
> to handle an implementation detail: Once an upstream trigger has already
> regulated the flow, we want downstream stuff to just proceed as fast as
> reasonable.
>
>
> > 4. About the StreamingSideInputDoFnRunner.java
> > > <
> >
> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/1e5524a7f5d0d774488cb0206ea6433085461775/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker/StreamingSideInputDoFnRunner.java
> >
> > posted
> > > by Kenneth. It uses StateInternals to buffer the elements of the main
> > input
> > > in a BagState.  I was under the assumption that state is always scoped
> > to a
> > > key but side inputs can also be used on non-keyed streams. In this
> case,
> > > the state is scoped to the key group (the unit/granularity that is used
> > to
> > > rebalance in case of rescaling) and when we access the state we get a

Re: [DISCUSS] Beam IO native IO

2016-04-28 Thread Aljoscha Krettek
+1

I agree with what Robert said and Davor laid out in more detail.
Portability is one of the primary concerns of Beam.

On Thu, 28 Apr 2016 at 18:27 Davor Bonaci  wrote:

> Generally speaking, the SDKs define all user APIs, including all IOs. We
> should strive that users never use any runner-specific APIs directly. As
> such, there should be no runner-provided IOs visible to user. ( Of course,
> some exceptions will have to apply, such as runner-specific configuration
> via PipelineOptions, etc. )
>
> All SDK-provided IO should be written in terms of Source / Sink API. All
> runners should support running pipelines that use this APIs. In that world,
> all IOs would run on all runners. However, neither of this is true
> currently:
>
>- We used to have all sources and sink implemented differently in a
>runner-native way. Over the last month, we have converted TextIO,
> AvroIO,
>and BigQueryIO.Write to follow this design. (Thanks Luke and Pei!)
>- BigQueryIO.Read and PubsubIO are the last pieces left in the old
>design, and there are pending PRs to address those. (Thanks Pei and
> Mark!)
>- Neither Flink or Spark fully support Source / Sink API, AFAIK. There
>are outstanding JIRA issues to address those -- those are high priority,
>IMO.
>
> At execution time, any runner is free to replace the SDK-provided IO with a
> runner-native one, as appropriate. For example, a runner may have a faster
> implementation than the SDK-provided one. That choice should be
> transparent, and made by the runner, not the user.
>
> (Aside, this is why the runner API is so important -- that runners have
> enough information to make the right choice on behalf of the user, without
> needing to delegate implementation details to users -- no user knobs!)
>
> This is the current design, which we believe addresses all scenarios we
> care about:
>
>- All IOs run on all runners.
>- Any runner can provide a better or faster runner-native implementation
>of any IO.
>- Users are abstracted away from all implementation details.
>- All pipelines are runner-portable, because users don't use any
>runner-specific APIs directly.
>
>
> On Thu, Apr 28, 2016 at 8:33 AM, Amit Sela  wrote:
>
> > From the Spark runner point of view, the implementation of the KafkaIO
> (for
> > example) is to define the "settings" required to read from Kafka and
> from a
> > quick look at the SDK's kafkaIO, it looks like it could be used instead
> of
> > the runner's implementation (and if not now, then probably once Spark
> > supports Kafka 0.9 connector API).
> >
> > As for the bigger picture here, as far as I can see, IO *translation* will
> > always be runner-specific, because IOs either create whatever PCollections
> > represent from an external source, or output whatever PCollections represent
> > to an external sink.
> > So I think translation will always be runner-specific for IOs.
> >
> > Back to the IOs themselves, the SDK should allow the runner to extend
> it's
> > implementation of the IO if and where needed, so if the KafkaIO is
> missing
> > Encoder/Decoder kafka serializer settings, it could just add those.
> >
> > Does this make sense ?
> >
> >
> > On Thu, Apr 28, 2016 at 3:45 PM Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi all,
> > >
> > > regarding the recent threads on the mailing list, I would like to start
> > > a format discussion around the IO.
> > > As we can expect the first contributions on this area (I already have
> > > some work in progress around this ;)), I think it's a fair discussion
> to
> > > have.
> > >
> > > Now, we have two kinds of IO: the one "generic" to Beam, the one
> "local"
> > > to the runners.
> > >
> > > For example, let's take Kafka: we have the KafkaIO (in IO), and for
> > > instance, we have the spark-streaming kafka connector (in Spark
> Runner).
> > >
> > > Right now, we have two approaches for the user:
> > > 1. In the pipeline, we use KafkaIO from Beam: it's the preferred
> > > approach for sure. However, the user may want to use the runner
> specific
> > > IO for two reasons:
> > > * Beam doesn't provide the IO yet (for instance, spark
> cassandra
> > > connector is available whereas we don't have yet any CassandraIO (I'm
> > > working on it anyway ;))
> > > * The runner native IO is optimized or contain more features
> that
> > > the
> > > Beam native IO
> > > 2. So, for the previous reasons, the user could want to use the native
> > > runner IO. The drawback of this approach is that the pipeline will be
> > > tied to a specific runner, which is completely against the Beam
> design.
> > >
> > > I wonder if it wouldn't make sense to add flag on the IO API (and
> > > related on Runner API) like .useNative().
> > >
> > > For instance, the user would be able to do:
> > >
> > >
> > >
> >
> pipeline.apply(KafkaIO.read().withBootstrapServers("...").withTopics("...").useNative(true);
> > >
> > > 

Re: [DISCUSS] Adding Some Sort of SideInputRunner

2016-04-28 Thread Aljoscha Krettek
Bump.

I'm afraid this might have gotten lost during the conferences/summits.

On Thu, 21 Apr 2016 at 13:30 Aljoscha Krettek <aljos...@apache.org> wrote:

> Ok, I'll try and start such a design. Before I can start, I have a few
> questions about how the side inputs actually work.  Some of it is in the
> docs, but some is also conjecture on my part about how MillWheel/Windmill
> (I don't know, is the first the system and the second a worker in there?)
> works.
>
> I'll write the questions as a set of assumptions and please correct me on
> those where I'm wrong:
>
> 1. Side inputs are always global/broadcast, they are never scoped to a key.
>
> 2. Mapping of main-input window to side-input window is done by the
> side-input WindowFn. If the side input has a larger window than the main
> input, processing of main-input elements will block until the side input
> becomes available. (Side inputs for a larger side-input window can become
> available early if using a speculative trigger)
>
> 2.5 If we update the side input because a speculative trigger fires again
> we don't reprocess the main-input elements that were already processed.
> Only new elements see the updated side input.
>
> 3. The decision about whether side inputs are "available", i.e. complete
> in case of list/map side inputs is made by a Trigger. (This is also true
> for updates to side input caused by speculative triggers firing again.)
> This uses the continuation trigger feature, which is easy for time triggers
> and interesting for the AfterPane.elementCountAtLeast() trigger which
> changes to AfterPane.elementCountAtLeast(1) on continuation and other
> speculative/late triggers.
>
> 4. About the StreamingSideInputDoFnRunner.java
> <https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/1e5524a7f5d0d774488cb0206ea6433085461775/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker/StreamingSideInputDoFnRunner.java>
>  posted
> by Kenneth. It uses StateInternals to buffer the elements of the main input
> in a BagState.  I was under the assumption that state is always scoped to a
> key but side inputs can also be used on non-keyed streams. In this case,
> the state is scoped to the key group (the unit/granularity that is used to
> rebalance in case of rescaling) and when we access the state we get all
> elements for the key groups for which our parallel worker instance is
> responsible. (This is the part where I am completely unsure about what is
> happening... :-O)
>
> These are the ones I can come up with for now. :-)
>
> On Wed, 20 Apr 2016 at 23:25 Davor Bonaci <da...@google.com.invalid>
> wrote:
>
>> If we come up with a general approach in the context of the Flink runner,
>> perhaps that piece can go back to the "runner-core" component and be
>> adopted more widely.
>>
>> On Wed, Apr 20, 2016 at 8:13 AM, Kenneth Knowles <k...@google.com.invalid>
>> wrote:
>>
>> > Hi Aljoscha,
>> >
>> > Great idea.
>> >
>> >  - The logic for matching up the windows is WindowFn#getSideInputWindow [1]
>> >  - The SDK used to have something along the lines of what you describe [2]
>> > but we thought it was too runner-specific, directly referencing Dataflow
>> > details, and with a particular model of buffering + timer. Perhaps it is
>> > a starting place for your design?
>> >
>> > Kenn
>> >
>> > [1]
>> >
>> >
>> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/windowing/WindowFn.java#L131
>> >
>> > [2]
>> >
>> >
>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/1e5524a7f5d0d774488cb0206ea6433085461775/sdk/src/main/java/com/google/cloud/dataflow/sdk/runners/worker/StreamingSideInputDoFnRunner.java
>> >
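
As a rough illustration of the buffering model mentioned above (and of the blocking behaviour in assumption 2), here is a self-contained sketch in which main-input elements are held in a per-window buffer until the side input for their window is declared ready, and are then flushed. All names are illustrative; this is not the Dataflow or Beam implementation.

// Buffer main-input elements until their side-input window is "ready".
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SideInputBufferingSketch {

  private final Map<Long, List<String>> bufferedByWindow = new HashMap<>();
  private final Map<Long, String> readySideInputs = new HashMap<>();

  // Called for every main-input element; windowStart identifies its side-input window.
  void processElement(long windowStart, String element) {
    String sideInput = readySideInputs.get(windowStart);
    if (sideInput == null) {
      // Side input not available yet: buffer the element instead of processing it.
      bufferedByWindow.computeIfAbsent(windowStart, w -> new ArrayList<>()).add(element);
    } else {
      emit(element, sideInput);
    }
  }

  // Called when a trigger declares the side input for a window "ready".
  void onSideInputReady(long windowStart, String sideInput) {
    readySideInputs.put(windowStart, sideInput);
    List<String> buffered = bufferedByWindow.remove(windowStart);
    if (buffered != null) {
      for (String element : buffered) {
        emit(element, sideInput);
      }
    }
  }

  private void emit(String element, String sideInput) {
    System.out.println(element + " processed with side input " + sideInput);
  }

  public static void main(String[] args) {
    SideInputBufferingSketch runner = new SideInputBufferingSketch();
    runner.processElement(0L, "a");           // buffered, side input not ready
    runner.onSideInputReady(0L, "lookup-v1"); // flushes "a"
    runner.processElement(0L, "b");           // processed immediately
  }
}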
>> > On Wed, Apr 20, 2016 at 4:25 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> > wrote:
>> >
>> > > Hi Aljoscha
>> > >
>> > > AFAIR, the Runner API Proposal document (from Kenneth) contains some
>> > > points about side input.
>> > >
>> > >
>> > >
>> >
>> https://drive.google.com/folderview?id=0B-IhJZh9Ab52OFBVZHpsNjc4eXc&usp=sharing
>> > >
>> > > I don't think it goes into the details of side inputs and windows, but
>> > > it's definitely the document we should extend.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > >
>> > >
>> > > On 04/20/2016 11:55 AM, Aljoscha Krettek wrote:
>> >

Re: PROPOSAL: Apache Beam (virtual) meeting: 05/11/2016 08:00 - 11:00 Pacific time

2016-04-13 Thread Aljoscha Krettek
Either works for me.

On Tue, 12 Apr 2016 at 22:29 Kenneth Knowles  wrote:

> Either works for me. Thanks James!
>
> On Tue, Apr 12, 2016 at 11:31 AM, Amit Sela  wrote:
>
> > Anytime works for me.
> >
> > On Tue, Apr 12, 2016, 21:24 Jean-Baptiste Onofré 
> wrote:
> >
> > > Hi James,
> > >
> > > 5/4 works for me !
> > >
> > > Thanks,
> > > Regards
> > > JB
> > >
> > > On 04/12/2016 05:05 PM, James Malone wrote:
> > > > Hey JB,
> > > >
> > > > Sorry for the late reply! That is a good point; apologies for not
> > > > noticing that conflict. For everyone in the community, how would one of the
> > > > following alternatives work?
> > > >
> > > > 5/4/2016 - 8:00 - 11:00 AM Pacific time
> > > > -or-
> > > > 5/18/2016 - 8:00 - 11:00 AM Pacific time
> > > >
> > > > Best,
> > > >
> > > > James
> > > >
> > > > On Mon, Apr 11, 2016 at 11:17 AM, Lukasz Cwik
>  > >
> > > > wrote:
> > > >
> > > >> That works for me.
> > > >> But it would be best if people just posted when they are available
> > > >> depending on the goal/scope of the meeting and then a date is
> chosen.
> > > >>
> > > >> On Sun, Apr 10, 2016 at 9:40 PM, Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> > > >> wrote:
> > > >>
> > > >>> OK, what about the week before ApacheCon ?
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>>
> > > >>> On 04/11/2016 04:22 AM, Lukasz Cwik wrote:
> > > >>>
> > >  I will be gone May 14th - 31st so would prefer a date before that.
> > > 
> > >  On Fri, Apr 8, 2016 at 10:23 PM, Jean-Baptiste Onofré <
> > > j...@nanthrax.net>
> > >  wrote:
> > > 
> > >  Hi James,
> > > >
> > > > May 11th is during the ApacheCon Vancouver.
> > > >
> > > > As some current and potential Beam contributors could be busy at
> > > > ApacheCon, maybe it's better to postpone to May 18th.
> > > >
> > > > WDYT ?
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 04/08/2016 10:37 PM, James Malone wrote:
> > > >
> > > > Hello everyone,
> > > >>
> > > >> I'd like to propose holding a meeting in May to discuss a few Apache
> > > >> Beam topics. This could be a good venue to discuss design proposals,
> > > >> gather technical feedback, and discuss the state of the Beam community.
> > > >> My thinking is we will be able to cover two or three Apache Beam topics
> > > >> in depth over the course of a few hours.
> > > >>
> > > >> To make the meeting accessible to the community, I propose a virtual
> > > >> meeting on:
> > > >>
> > > >> Wednesday May 11th (2016/05/11)
> > > >> 8:00 AM - 11:00 AM Pacific
> > > >>
> > > >> Since time may be limited, I propose agenda items recommended by the
> > > >> PPMC are given preference. Before the meeting we can finalize the
> > > >> method used for the virtual meeting (like Google Hangouts) and the
> > > >> agenda. I'm also happy to volunteer myself for taking notes and
> > > >> coordinating the event.
> > > >>
> > > >> Best,
> > > >>
> > > >> James
> > > >>
> > > >>
> > > >> --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > > >
> > > 
> > > >>> --
> > > >>> Jean-Baptiste Onofré
> > > >>> jbono...@apache.org
> > > >>> http://blog.nanthrax.net
> > > >>> Talend - http://www.talend.com
> > > >>>
> > > >>
> > > >
> > >
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: Apache Beam blog

2016-02-12 Thread Aljoscha Krettek
+1

> On 12 Feb 2016, at 18:58, Tyler Akidau  wrote:
> 
> +1
> 
> -Tyler
> 
> On Fri, Feb 12, 2016 at 9:57 AM Amit Sela  wrote:
> 
>> +1
>> 
>> I think we could also publish users' use-case examples and stories: "How we
>> are using Beam" or something like that.
>> 
>> On Fri, Feb 12, 2016, 19:49 James Malone 
>> wrote:
>> 
>>> Hello everyone,
>>> 
>>> Now that we have a skeleton website up (hooray!) I wanted to raise the idea
>>> of a "Beam Blog." I am thinking this blog is where we can show news, cool
>>> new Beam things, examples for Apache Beam, and so on. This blog would live
>>> under the project website (http://beam.incubator.apache.org).
>>> 
>>> To that end, I would like to poll how the larger community feels about
>>> this. Additionally, if people think it's a good idea, I'd like to start
>>> thinking about content which we would want to feature on the blog.
>>> 
>>> I happily volunteer to coordinate, review, and assist with the blog.
>>> 
>>> Cheers!
>>> 
>>> James
>>> 
>>