Re: [DISCUSS] Graduation to a top-level project

2016-11-24 Thread Maximilian Michels
+1

I see a healthy project which deserves to graduate.

On Wed, Nov 23, 2016 at 6:03 PM, Davor Bonaci  wrote:
> Thanks everyone for the enthusiastic support!
>
> Please keep the thread going, as we kick off the process on private@.
> Please don’t forget to bring up any data points that might help strengthen
> our case.
>
> Thanks!
>
> On Wed, Nov 23, 2016 at 8:45 AM, Scott Wegner 
> wrote:
>
>> +1 (beaming)
>>
>> On Wed, Nov 23, 2016 at 8:25 AM Robert Bradshaw
>> 
>> wrote:
>>
>> +1
>>
>> On Wed, Nov 23, 2016 at 7:36 AM, Lukasz Cwik 
>> wrote:
>> > +1
>> >
>> > On Wed, Nov 23, 2016 at 9:48 AM, Stephan Ewen  wrote:
>> >
>> >> +1
>> >> The community is doing very well and behaving very Apache
>> >>
>> >> On Wed, Nov 23, 2016 at 9:54 AM, Etienne Chauchot 
>> >> wrote:
>> >>
>> >> > A big +1 of course, very excited to go forward
>> >> >
>> >> > Etienne
>> >> >
>> >> >
>> >> >
>> >> > Le 22/11/2016 à 19:19, Davor Bonaci a écrit :
>> >> >
>> >> >> Hi everyone,
>> >> >> With all the progress we’ve had recently in Apache Beam, I think it
>> is
>> >> >> time
>> >> >> we start the discussion about graduation as a new top-level project
>> at
>> >> the
>> >> >> Apache Software Foundation.
>> >> >>
>> >> >> Graduation means we are a self-sustaining and self-governing
>> community,
>> >> >> and
>> >> >> ready to be a full participant in the Apache Software Foundation. It
>> >> does
>> >> >> not imply that our community growth is complete or that a particular
>> >> level
>> >> >> of technical maturity has been reached, rather that we are on a solid
>> >> >> trajectory in those areas. After graduation, we will still
>> periodically
>> >> >> report to, and be overseen by, the ASF Board to ensure continued
>> growth
>> >> of
>> >> >> a healthy community.
>> >> >>
>> >> >> Graduation is an important milestone for the project. It is also key
>> to
>> >> >> further grow the user community: many users (incorrectly) see
>> incubation
>> >> >> as
>> >> >> a sign of instability and are much less likely to consider us for
>> >> >> production use.
>> >> >>
>> >> >> A way to think about graduation readiness is through the Apache
>> Maturity
>> >> >> Model [1]. I think we clearly satisfy all the requirements [2]. It is
>> >> >> probably worth emphasizing the recent community growth: over each of
>> the
>> >> >> past three months, no single organization contributing to Beam has
>> had
>> >> >> more
>> >> >> than ~50% of the unique contributors per month [2, see assumptions].
>> >> >> That’s
>> >> >> a great statistic that shows how much we’ve grown our diversity!
>> >> >>
>> >> >> Process-wise, graduation consists of drafting a board resolution,
>> which
>> >> >> needs to identify the full Project Management Committee, and getting
>> it
>> >> >> approved by the community, the Incubator, and the Board. Within the
>> Beam
>> >> >> community, most of these discussions and votes have to be on the
>> >> private@
>> >> >> mailing list, but, as usual, we’ll try to keep dev@ updated as much
>> as
>> >> >> possible.
>> >> >>
>> >> >> With that in mind, let’s use this discussion on dev@ for two things:
>> >> >> * Collect additional data points on our progress that we may want to
>> >> >> present to the Incubator as a part of the proposal to accept our
>> >> >> graduation.
>> >> >> * Determine whether the community supports graduation. Please reply
>> >> +1/-1
>> >> >> with any additional comments, as appropriate. I’d encourage everyone
>> to
>> >> participate -- regardless of whether you are an occasional visitor or
>> have
>> >> a
>> >> >> specific role in the project -- we’d love to hear your perspective.
>> >> >>
>> >> >> Data points so far:
>> >> >> * Project’s maturity self-assessment [2].
>> >> >> * 1500 pull requests in incubation, which makes us one of the most
>> >> active
>> >> >> projects across all of ASF on this metric.
>> >> >> * 3 releases, each driven by a different release manager.
>> >> >> * 120+ individual contributors.
>> >> >> * 3 new committers added, 2 of which aren’t from the largest
>> >> organization.
>> >> >> * 1027 issues created, 515 resolved.
>> >> >> * 442 dev@ emails in October alone, sent by 51 individuals.
>> >> >> * 50 user@ emails in the last 30 days, sent by 22 individuals.
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >> Davor
>> >> >>
>> >> >> [1] http://community.apache.org/apache-way/apache-project-
>> >> >> maturity-model.html
>> >> >> [2] http://beam.incubator.apache.org/contribute/maturity-model/
>> >> >>
>> >> >>
>> >> >
>> >>
>>


Re: [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-26 Thread Maximilian Michels
+1 (binding)

Thanks for managing the release, Aljoscha!

-Max


On Wed, Oct 26, 2016 at 6:46 AM, Jean-Baptiste Onofré  wrote:
> Agree. We already discussed that on the mailing list. I mentioned this
> some weeks ago.
>
> Regards
> JB
>
>
> On Oct 26, 2016, 02:26, at 02:26, Dan Halperin  
> wrote:
>>My reading of the LEGAL threads is that since we are not including
>>(shading
>>or bundling) the ASL-licensed code we are fine to distribute kinesis-io
>>module. This was the original conclusion that LEGAL-198 got to, and
>>that
>>thread has not been resolved differently (even if Spark went ahead and
>>broke the assembly). The beam-sdks-java-io-kinesis module is an
>>optional
>>part (Beam materially works just fine without it).
>>
>>So I think we're fine to keep this vote open.
>>
>>+1 (binding) on the release
>>
>>Thanks Aljoscha!
>>
>>
>>On Tue, Oct 25, 2016 at 12:07 PM, Aljoscha Krettek
>>
>>wrote:
>>
>>> Yep, I was looking at those same threads when I was reviewing the
>>artefacts.
>>> The release was already close to being finished so I went through
>>with it
>>> but if we think it's not good to have them in we should quickly
>>cancel in
>>> favour of a new RC without a published Kinesis connector.
>>>
>>> On Tue, 25 Oct 2016 at 20:46 Dan Halperin
>>
>>> wrote:
>>>
>>> > I can't tell whether it is a problem that we are distributing the
>>> > beam-sdks-java-io-kinesis module [0].
>>> >
>>> > Here is the dev@ discussion thread [1] and the (unanswered)
>>relevant
>>> LEGAL
>>> > thread [2].
>>> > We linked through to a Spark-related discussion [3], and here is
>>how to
>>> > disable distribution of the KinesisIO module [4].
>>> >
>>> > [0]
>>> >
>>> > https://repository.apache.org/content/repositories/staging/
>>> org/apache/beam/beam-sdks-java-io-kinesis/
>>> > [1]
>>> >
>>> > https://lists.apache.org/thread.html/6784bc005f329d93fd59d0f8759ed4
>>> 745e72f105e39d869e094d9645@%3Cdev.beam.apache.org%3E
>>> > [2]
>>> >
>>> > https://issues.apache.org/jira/browse/LEGAL-198?
>>> focusedCommentId=15471529=com.atlassian.jira.
>>> plugin.system.issuetabpanels:comment-tabpanel#comment-15471529
>>> > [3] https://issues.apache.org/jira/browse/SPARK-17418
>>> > [4] https://github.com/apache/spark/pull/15167/files
>>> >
>>> > Dan
>>> >
>>> > On Tue, Oct 25, 2016 at 11:01 AM, Seetharam Venkatesh <
>>> > venkat...@innerzeal.com> wrote:
>>> >
>>> > > +1
>>> > >
>>> > > Thanks!
>>> > >
>>> > > On Mon, Oct 24, 2016 at 2:30 PM Aljoscha Krettek
>>
>>> > > wrote:
>>> > >
>>> > > > Hi Team!
>>> > > >
>>> > > > Please review and vote at your leisure on release candidate #1
>>for
>>> > > version
>>> > > > 0.3.0-incubating, as follows:
>>> > > > [ ] +1, Approve the release
>>> > > > [ ] -1, Do not approve the release (please provide specific
>>comments)
>>> > > >
>>> > > > The complete staging area is available for your review, which
>>> includes:
>>> > > > * JIRA release notes [1],
>>> > > > * the official Apache source release to be deployed to
>>> dist.apache.org
>>> > > > [2],
>>> > > > * all artifacts to be deployed to the Maven Central Repository
>>[3],
>>> > > > * source code tag "v0.3.0-incubating-RC1" [4],
>>> > > > * website pull request listing the release and publishing the
>>API
>>> > > reference
>>> > > > manual [5].
>>> > > >
>>> > > > Please keep in mind that this release is not focused on
>>providing new
>>> > > > functionality. We want to refine the release process and make
>>stable
>>> > > source
>>> > > > and binary artefacts available to our users.
>>> > > >
>>> > > > The vote will be open for at least 72 hours. It is adopted by
>>> majority
>>> > > > approval, with at least 3 PPMC affirmative votes.
>>> > > >
>>> > > > Cheers,
>>> > > > Aljoscha
>>> > > >
>>> > > > [1]
>>> > > >
>>> > > > https://issues.apache.org/jira/secure/ReleaseNote.jspa?
>>> > > projectId=12319527=12338051
>>> > > > [2]
>>> > > >
>>> >
>>https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
>>> > > > [3]
>>> > > > https://repository.apache.org/content/repositories/staging/
>>> > > org/apache/beam/
>>> > > > [4]
>>> > > >
>>> > > > https://git-wip-us.apache.org/repos/asf?p=incubator-beam.
>>> git;a=tag;h=
>>> > > 5d86ff7f04862444c266142b0d5acecb5a6b7144
>>> > > > [5] https://github.com/apache/incubator-beam-site/pull/52
>>> > > >
>>> > >
>>> >
>>>


Re: [ANNOUNCEMENT] New committers!

2016-10-24 Thread Maximilian Michels
Congrats and a warm welcome!

-Max


On Sun, Oct 23, 2016 at 6:02 AM, Robert Bradshaw
 wrote:
> Congrats and welcome to all three of you!
>
> On Sat, Oct 22, 2016 at 9:02 AM, Thomas Weise  wrote:
>> Thanks everyone!
>>
>>
>> On Sat, Oct 22, 2016 at 12:59 AM, Aljoscha Krettek 
>> wrote:
>>
>>> Welcome everyone! +3 :-)
>>>
>>> On Sat, 22 Oct 2016 at 06:43 Jean-Baptiste Onofré  wrote:
>>>
>>> > Just a small thing.
>>> >
>>> > If it's not already done, don't forget to sign a ICLA and let us know
>>> > your apache ID.
>>> >
>>> > Thanks,
>>> > Regards
>>> > JB
>>> >
>>> > On 10/22/2016 12:18 AM, Davor Bonaci wrote:
>>> > > Hi everyone,
>>> > > Please join me and the rest of Beam PPMC in welcoming the following
>>> > > contributors as our newest committers. They have significantly
>>> > contributed
>>> > > to the project in different ways, and we look forward to many more
>>> > > contributions in the future.
>>> > >
>>> > > * Thomas Weise
>>> > > Thomas authored the Apache Apex runner for Beam [1]. This is an
>>> exciting
>>> > > new runner that opens a new user base. It is a large contribution,
>>> which
>>> > > starts the whole new component with a great potential.
>>> > >
>>> > > * Jesse Anderson
>>> > > Jesse has contributed significantly by promoting Beam. He has
>>> > co-developed
>>> > > a Beam tutorial and delivered it at a top big data conference. He
>>> > published
>>> > > several blog posts positioning Beam, a Q&A with the Apache Beam team,
>>> and a
>>> > > demo video showing how to run Beam on multiple runners [2]. On the side, he has
>>> > > authored 7 pull requests and reported 6 JIRA issues.
>>> > >
>>> > > * Thomas Groh
>>> > > Since starting incubation, Thomas has contributed the most commits to
>>> the
>>> > > project [3], a total of 226 commits, which is more than anybody else.
>>> He
>>> > > has contributed broadly to the project, most significantly by
>>> developing
>>> > > from scratch the DirectRunner that supports the full model semantics.
>>> > > Additionally, he has contributed a new set of APIs for testing
>>> unbounded
>>> > > pipelines. He published a blog highlighting this work.
>>> > >
>>> > > Congratulations to all three! Welcome!
>>> > >
>>> > > Davor
>>> > >
>>> > > [1] https://github.com/apache/incubator-beam/tree/apex-runner
>>> > > [2] http://www.smokinghand.com/
>>> > > [3] https://github.com/apache/incubator-beam/graphs/contributors
>>> > > ?from=2016-02-01=2016-10-14=c
>>> > >
>>> >
>>> > --
>>> > Jean-Baptiste Onofré
>>> > jbono...@apache.org
>>> > http://blog.nanthrax.net
>>> > Talend - http://www.talend.com
>>> >
>>>


Re: Start of release 0.3.0-incubating

2016-10-21 Thread Maximilian Michels
+1 for the release. We have plenty of fixes in and users have already
asked for a new release.

-Max


On Fri, Oct 21, 2016 at 10:22 AM, Jean-Baptiste Onofré  
wrote:
> Hi Aljoscha,
>
> OK for me, you can go ahead ;)
>
> Thanks again for tackling this release!
>
> Regards
> JB
>
>
> On 10/21/2016 08:51 AM, Aljoscha Krettek wrote:
>>
>> +1 @JB
>>
>> We should definitely keep that in mind for the next releases. I think this
>> one is now sufficiently announced so I'll get started on the process.
>> (Which will take me a while since I have to do all the initial setup.)
>>
>>
>>
>> On Fri, 21 Oct 2016 at 06:32 Jean-Baptiste Onofré  wrote:
>>
>>> Hi Dan,
>>>
>>> No problem, MQTT and other IOs will be in the next release..
>>>
>>> IMHO, it would be great to have:
>>> 1. A release reminder couple of days before a release. Just to ask
>>> everyone if there's no objection (something like this:
>>>
>>>
>>> https://lists.apache.org/thread.html/80de75df0115940ca402132338b221e5dd5f669fd1bf915cd95e15c3@%3Cdev.karaf.apache.org%3E
>>> )
>>> 2. A roughly release schedule on the website (something like this:
>>> http://karaf.apache.org/download.html#container-schedule for instance).
>>>
>>> Just my $0.01 ;)
>>>
>>> Regards
>>> JB
>>>
>>> On 10/20/2016 06:30 PM, Dan Halperin wrote:

 Hi JB,

 This is a great discussion to have! IMO, there's no special
 functionality
 requirements for these pre-TLP releases. It's more important to make
 sure
 we keep the process going. (I think we should start the release as soon
>>>
>>> as

 possible, because it's been 2 months since the last one.)

 If we hold a release a week for MQTT, we'll hold it another week for
 some
 other new feature, and then hold it again for some other new feature.

 Can you make a strong argument for why MQTT in particular should be
>>>
>>> release

 blocking?

 Dan

 On Thu, Oct 20, 2016 at 9:26 AM, Jean-Baptiste Onofré 
 wrote:

> +1
>
> Thanks Aljosha !!
>
> Do you mind waiting until the weekend or Monday to start the release ? I
>>>
>>> would
>
> like to include MqttIO if possible.
>
> Thanks !
> Regards
> JB
>
>
> On Oct 20, 2016, 18:07, at 18:07, Dan Halperin
>>>
>>> 
>
> wrote:
>>
>> On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
>> 
>> wrote:
>>
>>> Hi,
>>> thanks for taking the time and writing this extensive doc!
>>>
>>> If no-one is against this I would like to be the release manager for
>>
>> the
>>>
>>> next (0.3.0-incubating) release. I would work with the guide and
>>
>> update it
>>>
>>> with anything that I learn along the way. Should I open a new thread
>>
>> for
>>>
>>> this or is it ok of nobody objects here?
>>>
>>> Cheers,
>>> Aljoscha
>>>
>>
>> Spinning this out as a separate thread.
>>
>> +1 -- Sounds great to me!
>>
>> Dan
>>
>> On Thu, Oct 20, 2016 at 12:37 AM, Aljoscha Krettek
>> 
>> wrote:
>>
>>> Hi,
>>> thanks for taking the time and writing this extensive doc!
>>>
>>> If no-one is against this I would like to be the release manager for
>>
>> the
>>>
>>> next (0.3.0-incubating) release. I would work with the guide and
>>
>> update it
>>>
>>> with anything that I learn along the way. Should I open a new thread
>>
>> for
>>>
>>> this or is it ok of nobody objects here?
>>>
>>> Cheers,
>>> Aljoscha
>>>
>>> On Thu, 20 Oct 2016 at 07:10 Jean-Baptiste Onofré 
>>
>> wrote:
>>>
>>>
 Hi,

 well done.

 As already discussed, it looks good to me ;)

 Regards
 JB

 On 10/20/2016 01:24 AM, Davor Bonaci wrote:
>
> Hi everybody,
> As a project, I think we should have a Release Guide to document
>>
>> the
>
> process, have consistent releases, on-board additional release
>>>
>>> managers,
>
> and generally share knowledge. It is also one of the project
>>
>> graduation
>
> guidelines.
>
> Dan and I wrote a draft version, documenting the process we did
>>
>> for the
>
> first two releases. It is currently in a pull request [1]. I'd
>>
>> invite
>
> everyone interested to take a peek and comment, either on the
>>
>> pull

 request
>
> itself or here on mailing list, as appropriate.
>
> Thanks,
> Davor
>
> [1] https://github.com/apache/incubator-beam-site/pull/49
>

 

Re: [KUDOS] Contributed runner: Apache Apex!

2016-10-18 Thread Maximilian Michels
Great to have another Runner on board! Congrats!

-Max


On Tue, Oct 18, 2016 at 8:10 AM, Jean-Baptiste Onofré  wrote:
> Awesome !
>
> Great job guys !
>
> Thanks to Thomas, Vlad, Gaurav and Ken for this.
>
> Regards
> JB
>
>
> On 10/17/2016 06:51 PM, Kenneth Knowles wrote:
>>
>> Hi all,
>>
>> I would to, once again, call attention to a great addition to Beam: a
>> runner for Apache Apex.
>>
>> After lots of review and much thoughtful revision, pull request #540 has
>> been merged to the apex-runner feature branch today. Please do take a
>> look,
>> and help us put the finishing touches on it to get it ready for the master
>> branch.
>>
>> And please also congratulate and thank Thomas Weise for this large
>> endeavor, Vlad Rosov who helped get the integration tests working, and
>> Gaurav Gupta who contributed review comments.
>>
>> Kenn
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [Proposal] Add waitToFinish(), cancel(), waitToRunning() to PipelineResult.

2016-10-13 Thread Maximilian Michels
The Flink runner currently only supports blocking execution. I'll open
a pull request to at least fix waitUntilFinish().

-Max


On Thu, Oct 13, 2016 at 11:10 AM, Amit Sela  wrote:
> Hi Pei,
>
> I have someone on my team who started to work on this; I'll follow up.
> Thanks for the bump ;-)
>
> Amit
>
> On Thu, Oct 13, 2016 at 8:38 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi Pei,
>>
>> good one !
>>
>> We now have to update the 'other' runners.
>>
>> Thanks.
>>
>> Regards
>> JB
>>
>> On 10/12/2016 10:48 PM, Pei He wrote:
>> > Hi,
>> > I just want to bump this thread, and brought it to attention.
>> >
>> > PipelineResult now have cancel() and waitUntilFinish(). However,
>> currently
>> > only DataflowRunner supports it in DataflowPipelineJob.
>> >
>> > We agreed that users should do "p.run().waitUntilFinish()" if they want
>> to
>> > block. But, if they do it now, direct, flink, spark runners will throw
>> > exceptions.
>> >
>> > I have following jira issues opened, I am wondering could any people help
>> > on them?
>> >
>> > https://issues.apache.org/jira/browse/BEAM-596
>> > https://issues.apache.org/jira/browse/BEAM-595
>> > https://issues.apache.org/jira/browse/BEAM-593
>> >
>> > Thanks
>> > --
>> > Pei
>> >
>> >
>> >
>> >
>> > On Tue, Jul 26, 2016 at 10:54 AM, Amit Sela 
>> wrote:
>> >
>> >> +1 and Thanks!
>> >>
>> >> On Tue, Jul 26, 2016 at 2:01 AM Robert Bradshaw
>> >> 
>> >> wrote:
>> >>
>> >>> +1, sounds great. Thanks Pei.
>> >>>
>> >>> On Mon, Jul 25, 2016 at 3:28 PM, Lukasz Cwik > >
>> >>> wrote:
>>  +1 for your proposal Pei
>> 
>>  On Mon, Jul 25, 2016 at 5:54 PM, Pei He 
>> >>> wrote:
>> 
>> > Looks to me that followings are agreed:
>> > (1). adding cancel() and waitUntilFinish() to PipelineResult.
>> > (In streaming mode, "all data watermarks reach to infinity" is
>> > considered as finished.)
>> > (2). PipelineRunner.run() should return relatively quickly as soon as
>> > the pipeline/job is started/running. The blocking logic should be
>> left
>> > to users' code to handle with PipelineResult.waitUntilFinish(). (Test
>> > runners that finish quickly can block run() until the execution is
>> > done. So, it is cleaner to verify test results after run())
>> >
>> > I will send out PR for (1), and create jira issues to improve runners
>> >>> for
>> > (2).
>> >
>> > waitToRunning() is controversial, and we have several half way agreed
>> > proposals.
>> > I will pull them out from this thread, so we can close this proposal
>> > with cancel() and waitUntilFinish(). And, i will create a jira issue
>> > to track how to support ''waiting until other states".
>> >
>> > Does that sound good with anyone?
>> >
>> > Thanks
>> > --
>> > Pei
>> >
>> > On Thu, Jul 21, 2016 at 4:32 PM, Robert Bradshaw
>> >  wrote:
>> >> On Thu, Jul 21, 2016 at 4:18 PM, Ben Chambers > >>>
>> > wrote:
>> >>> This health check seems redundant with just waiting a while and
>> >> then
>> >>> checking on the status, other than returning earlier in the case of
>> >>> reaching a terminal state. What about adding:
>> >>>
>> >>> /**
>> >>>  * Returns the state after waiting the specified duration. Will
>> >>> return
>> >>> earlier if the pipeline
>> >>>  * reaches a terminal state.
>> >>>  */
>> >>> State getStateAfter(Duration duration);
>> >>>
>> >>> This seems to be a useful building block, both for the user's
>> >>> pipeline
>> > (in
>> >>> case they wanted to build something like wait and then check
>> >> health)
>> >>> and
>> >>> also for the SDK (to implement waitUntilFinished, etc.)
>> >>
>> >> A generic waitFor(Duration) which may return early if a terminal
>> >> state
>> >> is entered seems useful. I don't know that we need a return value
>> >> here, given that we can then query the PipelineResult however we want
>> >> once this returns. waitUntilFinished is simply
>> >> waitFor(InfiniteDuration).
>> >>
>> >>> On Thu, Jul 21, 2016 at 4:11 PM Pei He 
>> > wrote:
>> >>>
>>  I am not in favor of supporting wait for every states or
>>  waitUntilState(...).
>>  One reason is PipelineResult.State is not well defined and is not
>>  agreed upon runners.
>>  Another reason is users might not want to wait for a particular
>> >>> state.
>>  For example,
>>  waitUntilFinish() is to wait for a terminal state.
>>  So, even runners have different states, we still can define shared
>>  properties, such as finished/terminal.
>> >>
>> >> +1. Running is an intermediate state that doesn't have an obvious
>> >> mapping onto 
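The `waitFor(Duration)` semantics discussed in this thread can be sketched as a small polling loop. This is a toy model only — all names below (`State`, `FakeResult`, `wait_for`) are hypothetical stand-ins, not the actual Beam API — showing how a wait can return early once a terminal state is reached:

```python
import time
from enum import Enum

class State(Enum):
    RUNNING = 1
    DONE = 2
    FAILED = 3
    CANCELLED = 4

TERMINAL_STATES = {State.DONE, State.FAILED, State.CANCELLED}

class FakeResult:
    """Stand-in for a pipeline result; reports DONE after `finish_after` polls."""
    def __init__(self, finish_after):
        self.finish_after = finish_after
        self.polls = 0

    def get_state(self):
        self.polls += 1
        return State.DONE if self.polls >= self.finish_after else State.RUNNING

def wait_for(result, duration_s, poll_interval_s=0.01):
    """Wait up to `duration_s`, returning early if a terminal state is reached.

    waitUntilFinish() is then simply wait_for(result, float("inf")).
    """
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        state = result.get_state()
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_interval_s)
    return result.get_state()
```

As noted above, no return value is strictly required, since the caller can query the result object again after the wait returns; returning the state is just a convenience.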

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Maximilian Michels
Just to add a comment from the Flink side and its
UnboundedSourceWrapper. We found that the only way to guarantee
deterministic splitting of the source was to generate the splits upon
creation of the source and then checkpoint the assignment during
runtime. When restoring from a checkpoint, the same reader
configuration is restored. It's not possible to change the splitting
after the initial splitting has taken place. However, Flink will soon
be able to repartition the operator state upon restart/rescaling of a
job.

Does Spark have a way to pass state of a previous mini batch to the
current mini batch? If so, you could restore the last configuration
and continue reading from the checkpointed offset. You just have to
checkpoint before the mini batch ends.

-Max
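The deterministic splitting described above can be illustrated with a small sketch (a hypothetical helper, not the actual KafkaIO or wrapper code): as long as the partition list is sorted before round-robin assignment, regenerating the splits on restore yields the same reader-to-partition mapping, so each reader can find its checkpointed offsets again.

```python
def assign_round_robin(partitions, num_readers):
    """Deterministically assign partitions to readers: sorted partition i
    goes to reader i % num_readers, independent of input order."""
    splits = [[] for _ in range(num_readers)]
    for i, partition in enumerate(sorted(partitions)):
        splits[i % num_readers].append(partition)
    return splits
```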

On Mon, Oct 10, 2016 at 10:38 AM, Jean-Baptiste Onofré  
wrote:
> Hi Amit,
>
> thanks for the explanation.
>
> For 4, you are right, it's slightly different from DataXchange (related to
> the elements in the PCollection). I think storing the "starting point" for a
> reader makes sense.
>
> Regards
> JB
>
>
> On 10/10/2016 10:33 AM, Amit Sela wrote:
>>
>> Inline, thanks JB!
>>
>> On Mon, Oct 10, 2016 at 9:01 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi Amit,
>>>
>>>
>>>
>>> For 1., the runner is responsible of the checkpoint storage (associated
>>>
>>> with the source). It's the way for the runner to retry and know the
>>>
>>> failed bundles.
>>>
>> True, this was a recap/summary of another, not-so-clear, thread.
>>
>>>
>>>
>>>
>>> For 4, are you proposing that KafkaRecord store additional metadata for
>>>
>>> that ? It sounds like what I proposed in the "Technical Vision" appendix
>>>
>>> document: there I proposed to introduce a DataXchange object that store
>>>
>>> some additional metadata (like offset) used by the runner. It would be
>>>
>>> the same with SDF as the tracker state should be persistent as well.
>>>
>> I think I was more focused on persisting the "starting point" for a
>> reader,
>> even if no records were read (yet), so that the next time the reader
attempts to read, it will pick up there. This has more to do with how the
>> CheckpointMark handles this.
>> I have to say that I'm not familiar with your DataXchange proposal, I will
>> take a look though.
>>
>>>
>>>
>>>
>>> Regards
>>>
>>> JB
>>>
>>>
>>>
>>> On 10/08/2016 01:55 AM, Amit Sela wrote:
>>>
 I started a thread about (suggesting) UnboundedSource splitId's and it
>>>
>>>
 turned into an UnboundedSource/KafkaIO discussion, and I think it's best
>>>
>>> to
>>>
 start over in a clear [DISCUSS] thread.
>>>
>>>

>>>
 When working on UnboundedSource support for the Spark runner, I've
 raised
>>>
>>>
 some questions, some were general-UnboundedSource, and others
>>>
>>>
 Kafka-specific.
>>>
>>>

>>>
 I'd like to recap them here, and maybe have a more productive and
>>>
>>>
 well-documented discussion for everyone.
>>>
>>>

>>>
1. UnboundedSource id's - I assume any runner persists the
>>>
>>>
UnboundedSources's CheckpointMark for fault-tolerance, but I wonder
>>>
>>> how it
>>>
matches the appropriate split (of the UnboundedSource) to it's
>>>
>>> previously
>>>
persisted CheckpointMark in any specific worker ?
>>>
>>>
*Thomas Groh* mentioned that Source splits have to have an
>>>
>>>
 associated identifier,
>>>
>>>
and so the runner gets to tag splits however it pleases, so long as
>>>
>>>
those tags don't allow splits to bleed into each other.
>>>
>>>
2. Consistent splitting - an UnboundedSource splitting seems to
>>>
>>> require
>>>
consistent splitting if it were to "pick-up where it left", correct ?
>>>
>>> this
>>>
is not mentioned as a requirement or a recommendation in
>>>
>>>
UnboundedSource#generateInitialSplits(), so is this a Kafka-only
>>>
>>> issue ?
>>>
*Raghu Angadi* mentioned that Kafka already does so by applying
>>>
>>>
partitions to readers in a round-robin manner.
>>>
>>>
*Thomas Groh* also added that while the UnboundedSource API doesn't
>>>
>>>
require deterministic splitting (although it's recommended), a
>>>
>>>
PipelineRunner
>>>
>>>
should keep track of the initially generated splits.
>>>
>>>
3. Support reading of Kafka partitions that were added to topic/s
>>>
>>> while
>>>
a Pipeline reads from them - BEAM-727
>>>
>>>
 was filed.
>>>
>>>
4. Reading/persisting Kafka start offsets - since Spark works in
>>>
>>>
micro-batches, if "latest" was applied on a fairly sparse topic each
>>>
>>> worker
>>>
would actually begin reading only after it saw a message during the
>>>
>>> time
>>>
window it had to read messages. This is because fetching the offsets
>>>
>>> is
>>>
done by the worker running the Reader. This means that 

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-07 Thread Maximilian Michels
Hi JB!

> 1. We create a new mailing list: rev...@beam.incubator.apache.org.
> 2. We configure github integration to send all pull request comments on 
> review mailing list. It would allow to track and simplify the way to read the 
> comments and to keep up to date.

I already have it organized that way through filters but having a
dedicated mailing list is a much better idea.

> 3. A technical discussion should be send on dev mailing list with the 
> [DISCUSS] keyword in the subject.
> 4. Once a discussion is open, the author should periodically send an update 
> on the discussion (once a week) containing a summary of the last exchanges
> happened on the Jira or github (quick and direct summary).

We can try that on a best-effort basis. Enforcing this seems to be
difficult and could also introduce verbosity on the mailing list.

> 5. Once we consider the discussion close (no update in the last two weeks), 
> the author send a [CLOSE] e-mail on the thread.

I think it is hard to decide when a discussion is closed. Two weeks
seems like a too short amount of time.

In general, +1 for an open development process.

-Max

On Fri, Oct 7, 2016 at 4:05 AM, Jungtaek Lim  wrote:
> +1 except [4] for me, too. [4] may be replaced with linking DISCUSSION mail
> thread archive to JIRA.
> Yes it doesn't update news on discussion to JIRA and/or Github, but at
> least someone needed to see can find out manually.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Fri, Oct 7, 2016 at 11:00 AM, Satish Duggana wrote:
>
>> +1 for proposal except for [4]. Agree with Raghu on [4] as it may be
>> burdensome to update with summaries and folks may start replying comments
>> on those summaries etc and conclusions are updated on respective design
>> docs. We may want to start without [4].
>>
>> Thanks,
>> Satish.
>>
>> On Fri, Oct 7, 2016 at 12:00 AM, Raghu Angadi 
>> wrote:
>>
>> > +1 for rev...@beam.incubator.apache.org. Open lists are critically
>> > important.
>> >
>> > My comment earlier was mainly about (4). Sorry about the not being clear.
>> >
>> > On Thu, Oct 6, 2016 at 11:00 AM, Lukasz Cwik 
>> > wrote:
>> >
>> > > +1 for supporting different working styles.
>> > >
>> > > On Thu, Oct 6, 2016 at 10:58 AM, Kenneth Knowles
>> > > >
>> > > wrote:
>> > >
>> > > > +1 to rev...@beam.incubator.apache.org if it is turnkey for infra to
>> > set
>> > > > up, aka points 1 and 2.
>> > > >
>> > > > Even though I would not personally read it via email, getting the
>> > > > information in yet another format and infrastructure (and
>> stewardship)
>> > is
>> > > > valuable for search, archival, and supporting diverse work styles.
>> The
>> > > > benefit might not be huge, but I think it will be enough to justify
>> the
>> > > > (hopefully negligible) cost.
>> > > >
>> > > > Kenn
>> > > >
>> > > > On Thu, Oct 6, 2016 at 4:54 AM Jean-Baptiste Onofré > >
>> > > > wrote:
>> > > >
>> > > > Hi team,
>> > > >
>> > > > following the discussion we had about technical discussion that
>> should
>> > > > happen on the mailing list, I would like to propose the following:
>> > > >
>> > > > 1. We create a new mailing list: rev...@beam.incubator.apache.org.
>> > > > 2. We configure github integration to send all pull request comments
>> on
>> > > > review mailing list. It would allow to track and simplify the way to
>> > > > read the comments and to keep up to date.
>> > > > 3. A technical discussion should be send on dev mailing list with the
>> > > > [DISCUSS] keyword in the subject.
>> > > > 4. Once a discussion is open, the author should periodically send an
>> > > > update on the discussion (once a week) containing a summary of the
>> last
>> > > > exchanges happened on the Jira or github (quick and direct summary).
>> > > > 5. Once we consider the discussion close (no update in the last two
>> > > > weeks), the author send a [CLOSE] e-mail on the thread.
>> > > >
>> > > > WDYT ?
>> > > >
>> > > > Regards
>> > > > JB
>> > > > --
>> > > > Jean-Baptiste Onofré
>> > > > jbono...@apache.org
>> > > > http://blog.nanthrax.net
>> > > > Talend - http://www.talend.com
>> > > >
>> > >
>> >
>>


Re: We've hit 1000 PRs!

2016-09-27 Thread Maximilian Michels
Indeed, that number is pretty impressive! For one, it shows Beam is a
very active project, but also it reflects Beam's rigorous PR policy.
+1 more of that!

On Tue, Sep 27, 2016 at 10:35 AM, Aljoscha Krettek  wrote:
> Sweet! :-)
>
> On Mon, 26 Sep 2016 at 23:47 Dan Halperin 
> wrote:
>
>> Hey folks!
>>
>> Just wanted to send out a note -- we've hit 1000 PRs in GitHub as of
>> Saturday! That's a tremendous amount of work for the 7 months since PR#1.
>>
>> I bet we hit 2000 in much fewer than 7 months ;)
>>
>> Dan
>>


Re: Support for Flink 1.1.0 in release-0.2.0-incubating

2016-09-20 Thread Maximilian Michels
That is totally understandable. However, upgrading to a new version of
Flink is also a big change that could require additional work that is
out of scope for a Beam minor release.

My advice, if you want to use the latest version but prevent changes coming
in constantly, you could use a fixed snapshot release. For example:


<dependency>
   <groupId>org.apache.beam</groupId>
   <artifactId>beam-runners-flink_2.10</artifactId>
   <version>0.3.0-incubating-20160920.071715-50</version>
</dependency>



You can derive this version number from the maven-metadata.xml in the
snapshot repository:
https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-runners-flink_2.10/0.3.0-incubating-SNAPSHOT/maven-metadata.xml
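As a sketch of how the pinned version string can be derived from that file (the metadata below is a trimmed, hypothetical sample — a real maven-metadata.xml also carries per-artifact `snapshotVersion` entries): combine the base version with the snapshot timestamp and build number.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical maven-metadata.xml for illustration only.
METADATA = """<metadata>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-flink_2.10</artifactId>
  <version>0.3.0-incubating-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20160920.071715</timestamp>
      <buildNumber>50</buildNumber>
    </snapshot>
  </versioning>
</metadata>"""

def pinned_snapshot_version(xml_text):
    """Build '<base>-<timestamp>-<buildNumber>' from snapshot metadata."""
    root = ET.fromstring(xml_text)
    base = root.findtext("version").replace("-SNAPSHOT", "")
    timestamp = root.findtext("versioning/snapshot/timestamp")
    build = root.findtext("versioning/snapshot/buildNumber")
    return f"{base}-{timestamp}-{build}"
```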

It would be worthwhile to discuss how we handle version upgrades of
backends in the Beam release cycle. Unfortunately, for the Runner to be
compatible across Flink versions, we need a bit more than just API
stability, because some parts also integrate with the Flink runtime. I can
see this becoming an issue once more runners are part of Beam.

Best,
Max

On Sun, Sep 18, 2016 at 11:04 PM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Hi Max
>
> Thanks for the information. I agree with you that 0.3.0 is the way ahead,
> but i am hesitant to use 0.3.0-SNAPSHOT due to its changing nature.
>
> Regards
> Sumit Chawla
>
>
> On Fri, Sep 16, 2016 at 5:51 AM, Maximilian Michels <m...@apache.org>
> wrote:
>
>> Hi Sumit,
>>
>> Thanks for the PR. Your changes look good. I think there are
>> currently no plans for a minor release 0.2.1-incubating. A lot of
>> issues were fixed on the latest master which should give you a better
>> experience than the 0.2.0-incubating release.
>>
>> These are the current issues which will be fixed in 0.3.0-incubating:
>> https://issues.apache.org/jira/browse/BEAM-102?jql=project%2
>> 0%3D%20BEAM%20AND%20fixVersion%20%3D%200.3.0-incubating%20AN
>> D%20component%20%3D%20runner-flink
>>
>> 1. Most notably, it fixes an issue with comparing keys unencoded in
>> streaming mode, which caused problems if you had not implemented
>> equals/hashCode for your objects.
>>
>> 2. Further, we support side inputs in streaming now. In the course, we
>> have also unified execution paths in streaming mode.
>>
>> 3. There was an issue with checkpointed sources that has been resolved.
>>
>> If you could try out the latest version, that would be great. If not,
>> we can probably merge your PR and think about a minor release.
>>
>> Best,
>> Max
>>
>> On Fri, Sep 16, 2016 at 6:10 AM, Chawla,Sumit <sumitkcha...@gmail.com>
>> wrote:
>> > Hi Max
>> >
>> > I have opened a PR - https://github.com/apache/incubator-beam/pull/963
>> for
>> > adding support of Flink 1.1.2 in Beam 0.2.0 release.
>> >
>> > Regards
>> > Sumit Chawla
>> >
>> >
>> > On Wed, Sep 14, 2016 at 1:32 PM, Chawla,Sumit <sumitkcha...@gmail.com>
>> > wrote:
>> >
>> >> Hi Max
>> >>
>> >> I was able to compile 0.2.0 with Flink 1.1.0 with small modification,
>> and
>> >> run a simple pipeline.
>> >>
>> >>@Override
>> >> -  public void restoreState(StreamTaskState taskState, long
>> >> recoveryTimestamp) throws Exception {
>> >> -super.restoreState(taskState, recoveryTimestamp);
>> >> +  public void restoreState(StreamTaskState taskState) throws
>> Exception {
>> >> +super.restoreState(taskState);
>> >>
>> >>
>> >> Can i get a sense of the changes that have happened in 0.3.0 for
>> Flink?  I
>> >> observed some classes completely reworked.  It will be crucial for me
>> to
>> >> understand the scope of change and impact before making a move to 0.3.0
>> >>
>> >>
>> >>
>> >> Regards
>> >> Sumit Chawla
>> >>
>> >>
>> >> On Wed, Sep 14, 2016 at 3:03 AM, Maximilian Michels <m...@apache.org>
>> >> wrote:
>> >>
>> >>> We support Flink 1.1.2 on the latest snapshot version
>> >>> 0.3.0-incubating-SNAPSHOT. Would it be possible for you to work with
>> >>> this version?
>> >>>
>> >>> On Tue, Sep 13, 2016 at 11:55 PM, Chawla,Sumit <
>> sumitkcha...@gmail.com>
>> >>> wrote:
>> >>> > When trying to use Beam 0.2.0 with Flink 1.1.0 jar, i am seeing
>> >>> following
>> >>> > error:
>> >>> >
>> >>> &

Re: Support for Flink 1.1.0 in release-0.2.0-incubating

2016-09-16 Thread Maximilian Michels
Hi Sumit,

Thanks for the PR. Your changes look good. I think there are
currently no plans for a minor release 0.2.1-incubating. A lot of
issues were fixed on the latest master which should give you a better
experience than the 0.2.0-incubating release.

These are the current issues which will be fixed in 0.3.0-incubating:
https://issues.apache.org/jira/browse/BEAM-102?jql=project%20%3D%20BEAM%20AND%20fixVersion%20%3D%200.3.0-incubating%20AND%20component%20%3D%20runner-flink

1. Most notably, it fixes an issue with comparing keys unencoded in
streaming mode, which caused problems if you had not implemented
equals/hashCode for your objects.

2. Further, we support side inputs in streaming now. In the course, we
have also unified execution paths in streaming mode.

3. There was an issue with checkpointed sources that has been resolved.
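[Editor's note: the equals/hashCode point in item 1 is easy to trip over.
When a runner compares keys as Java objects rather than via their encoded
bytes, a key type without value-based equality never groups matching keys
together. A minimal plain-Java illustration (the key class is hypothetical
and independent of the Beam API):]

```java
import java.util.Objects;

// A hypothetical key type. Without equals/hashCode, two instances with
// identical fields are distinct keys when compared as Java objects.
final class UserKey {
    final String id;
    final int shard;

    UserKey(String id, int shard) {
        this.id = id;
        this.shard = shard;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserKey)) return false;
        UserKey other = (UserKey) o;
        return shard == other.shard && Objects.equals(id, other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, shard);
    }
}

public class KeyEquality {
    public static void main(String[] args) {
        UserKey a = new UserKey("alice", 1);
        UserKey b = new UserKey("alice", 1);
        // With equals/hashCode implemented, the two keys group together.
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // prints: true
    }
}
```

Removing the two overrides makes the printed value false, which is exactly
the silent mis-grouping the fix above guards against.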

If you could try out the latest version, that would be great. If not,
we can probably merge your PR and think about a minor release.

Best,
Max

On Fri, Sep 16, 2016 at 6:10 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> Hi Max
>
> I have opened a PR - https://github.com/apache/incubator-beam/pull/963 for
> adding support of Flink 1.1.2 in Beam 0.2.0 release.
>
> Regards
> Sumit Chawla
>
>
> On Wed, Sep 14, 2016 at 1:32 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Hi Max
>>
>> I was able to compile 0.2.0 with Flink 1.1.0 with small modification, and
>> run a simple pipeline.
>>
>>@Override
>> -  public void restoreState(StreamTaskState taskState, long
>> recoveryTimestamp) throws Exception {
>> -super.restoreState(taskState, recoveryTimestamp);
>> +  public void restoreState(StreamTaskState taskState) throws Exception {
>> +super.restoreState(taskState);
>>
>>
>> Can i get a sense of the changes that have happened in 0.3.0 for Flink?  I
>> observed some classes completely reworked.  It will be crucial for me to
>> understand the scope of change and impact before making a move to 0.3.0
>>
>>
>>
>> Regards
>> Sumit Chawla
>>
>>
>> On Wed, Sep 14, 2016 at 3:03 AM, Maximilian Michels <m...@apache.org>
>> wrote:
>>
>>> We support Flink 1.1.2 on the latest snapshot version
>>> 0.3.0-incubating-SNAPSHOT. Would it be possible for you to work with
>>> this version?
>>>
>>> On Tue, Sep 13, 2016 at 11:55 PM, Chawla,Sumit <sumitkcha...@gmail.com>
>>> wrote:
>>> > When trying to use Beam 0.2.0 with Flink 1.1.0 jar, i am seeing
>>> following
>>> > error:
>>> >
>>> > java.lang.NoSuchMethodError:
>>> > org.apache.flink.streaming.api.operators.StreamingRuntimeCon
>>> text.registerTimer(JLorg/apache/flink/streaming/
>>> runtime/operators/Triggerable;)V
>>> > at org.apache.beam.runners.flink.translation.wrappers.streaming
>>> .io.UnboundedSourceWrapper.setNextWatermarkTimer(UnboundedSo
>>> urceWrapper.java:381)
>>> > at org.apache.beam.runners.flink.translation.wrappers.streaming
>>> .io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:233)
>>> > at org.apache.flink.streaming.api.operators.StreamSource.run(
>>> StreamSource.java:80)
>>> > at org.apache.flink.streaming.api.operators.StreamSource.run(
>>> StreamSource.java:53)
>>> > at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.
>>> run(SourceStreamTask.java:56)
>>> > at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(
>>> StreamTask.java:266)
>>> > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
>>> > at java.lang.Thread.run(Thread.java:745)
>>> >
>>> >
>>> > Regards
>>> > Sumit Chawla
>>> >
>>> >
>>> > On Tue, Sep 13, 2016 at 2:20 PM, Chawla,Sumit <sumitkcha...@gmail.com>
>>> > wrote:
>>> >
>>> >> Hi All
>>> >>
>>> >> The release-0.2.0-incubating supports Flink 1.0.3. With Flink 1.1.0
>>> out,
>>> >> is there a plan to support it with any 0.2.0 patch? I tried compiling
>>> 0.2.0
>>> >> with Flink 1.1.0,
>>> >> and got a couple of compilation errors in
>>> >> FlinkGroupAlsoByWindowWrapper.java.
>>> >> Going back to master i see lots of change in Flink translation
>>> wrappers,
>>> >> and
>>> >> FlinkGroupAlsoByWindowWrapper.java being removed.
>>> >>
>>> >> Just want to get a sense of things here, on what would it take to
>>> support Flink
>>> >> 1.1.0 with release-0.2.0. Would appreciate views of people who are
>>> already
>>> >> working on upgrading it to Flink 1.1.0
>>> >>
>>> >> Regards
>>> >> Sumit Chawla
>>> >>
>>> >>
>>>
>>
>>


Re: Support for Flink 1.1.0 in release-0.2.0-incubating

2016-09-14 Thread Maximilian Michels
We support Flink 1.1.2 on the latest snapshot version
0.3.0-incubating-SNAPSHOT. Would it be possible for you to work with
this version?

On Tue, Sep 13, 2016 at 11:55 PM, Chawla,Sumit  wrote:
> When trying to use Beam 0.2.0 with Flink 1.1.0 jar, i am seeing following
> error:
>
> java.lang.NoSuchMethodError:
> org.apache.flink.streaming.api.operators.StreamingRuntimeContext.registerTimer(JLorg/apache/flink/streaming/runtime/operators/Triggerable;)V
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.setNextWatermarkTimer(UnboundedSourceWrapper.java:381)
> at 
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:233)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:80)
> at 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:53)
> at 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
> at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:266)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
> at java.lang.Thread.run(Thread.java:745)
>
>
> Regards
> Sumit Chawla
>
>
> On Tue, Sep 13, 2016 at 2:20 PM, Chawla,Sumit 
> wrote:
>
>> Hi All
>>
>> The release-0.2.0-incubating supports Flink 1.0.3. With Flink 1.1.0 out,
>> is there a plan to support it with any 0.2.0 patch? I tried compiling 0.2.0
>> with Flink 1.1.0,
>> and got a couple of compilation errors in FlinkGroupAlsoByWindowWrapper.java.
>> Going back to master i see lots of change in Flink translation wrappers,
>> and
>> FlinkGroupAlsoByWindowWrapper.java being removed.
>>
>> Just want to get a sense of things here, on what would it take to support 
>> Flink
>> 1.1.0 with release-0.2.0. Would appreciate views of people who are already
>> working on upgrading it to Flink 1.1.0
>>
>> Regards
>> Sumit Chawla
>>
>>


Re: [MENTOR] Resigning as a Beam mentor

2016-09-08 Thread Maximilian Michels
Hi Bertrand,

Thanks for your work and hope to see you again some time here or elsewhere!

Best,
Max

On Thu, Sep 8, 2016 at 11:03 AM, Bertrand Delacretaz
 wrote:
> On Thu, Sep 8, 2016 at 10:59 AM, Jean-Baptiste Onofré  
> wrote:
>> ...thanks a lot for all the commitment and feedback you gave to
>> the Beam podling...
>
> You're welcome! I have removed my name from
> http://incubator.apache.org/projects/beam.html and will inform
> private@incubator.a.o
>
> -Bertrand


Re: Suggestion for Writing Sink Implementation

2016-08-17 Thread Maximilian Michels
Hi Kenneth,

The problem is that the Write transform is not supported in streaming
execution of the Flink Runner because the streaming execution doesn't
currently support side inputs. A PR is open to fix that.

Cheers,
Max

On Thu, Jul 28, 2016 at 8:56 PM, Kenneth Knowles  
wrote:
> Hi Sumit,
>
> I see what has happened here, from that snippet you pasted from the Flink
> runner's code [1]. Thanks for looking into it!
>
> The Flink runner today appears to reject Write.Bounded transforms in
> streaming mode if the sink is not an instance of UnboundedFlinkSink. The
> intent of that code, I believe, was to special case UnboundedFlinkSink to
> make it easy to use an existing Flink sink, not to disable all other Write
> transforms. What do you think, Max?
>
> Until we fix this issue, you should use ParDo transforms to do the writing.
> If you can share a little about your sink, we may be able to suggest
> patterns for implementing it. Like Eugene said, the Write.of(Sink)
> transform is just a specialized pattern of ParDo's, not a Beam primitive.
>
> Kenn
>
> [1]
> https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/FlinkStreamingTransformTranslators.java#L203
>
>
> On Wed, Jul 27, 2016 at 5:57 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
>> Thanks Sumit. Looks like your question is, indeed, specific to the Flink
>> runner, and I'll then defer to somebody familiar with it.
>>
>> On Wed, Jul 27, 2016 at 5:25 PM Chawla,Sumit 
>> wrote:
>>
>> > Thanks a lot Eugene.
>> >
>> > >>>My immediate requirement is to run this Sink on FlinkRunner. Which
>> > mandates that my implementation must also implement SinkFunction<>.  In
>> > that >>>case, none of the Sink<> methods get called anyway.
>> >
>> > I am using FlinkRunner. The Sink implementation that i was writing by
>> > extending Sink<> class had to implement Flink Specific SinkFunction for
>> the
>> > correct translation.
>> >
>> > private static class WriteSinkStreamingTranslator<T> implements
>> >     FlinkStreamingPipelineTranslator.StreamTransformTranslator<Write.Bound<T>> {
>> >
>> >   @Override
>> >   public void translateNode(Write.Bound<T> transform,
>> >       FlinkStreamingTranslationContext context) {
>> >     String name = transform.getName();
>> >     PValue input = context.getInput(transform);
>> >
>> >     Sink<T> sink = transform.getSink();
>> >     if (!(sink instanceof UnboundedFlinkSink)) {
>> >       throw new UnsupportedOperationException("At the time, only
>> > unbounded Flink sinks are supported.");
>> >     }
>> >
>> >     DataStream<WindowedValue<T>> inputDataSet =
>> >         context.getInputDataStream(input);
>> >
>> >     inputDataSet.flatMap(new FlatMapFunction<WindowedValue<T>, Object>() {
>> >       @Override
>> >       public void flatMap(WindowedValue<T> value, Collector<Object> out)
>> >           throws Exception {
>> >         out.collect(value.getValue());
>> >       }
>> >     }).addSink(((UnboundedFlinkSink<Object>)
>> >         sink).getFlinkSource()).name(name);
>> >   }
>> > }
>> >
>> >
>> >
>> >
>> > Regards
>> > Sumit Chawla
>> >
>> >
>> > On Wed, Jul 27, 2016 at 4:53 PM, Eugene Kirpichov <
>> > kirpic...@google.com.invalid> wrote:
>> >
>> > > Hi Sumit,
>> > >
>> > > All reusable parts of a pipeline, including connectors to storage
>> > systems,
>> > > should be packaged as PTransform's.
>> > >
>> > > Sink is an advanced API that you can use under the hood to implement
>> the
>> > > transform, if this particular connector benefits from this API - but
>> you
>> > > don't have to, and many connectors indeed don't need it, and are
>> simpler
>> > to
>> > > implement just as wrappers around a couple of ParDo's writing the data.
>> > >
>> > > Even if the connector is implemented using a Sink, packaging the
>> > connector
>> > > as a PTransform is important because it's easier to apply in a pipeline
>> > and
>> > > because it's more future-proof (the author of the connector may later
>> > > change it to use something else rather than Sink under the hood without
>> > > breaking existing users).
>> > >
>> > > Sink is, currently, useful in the following case:
>> > > - You're writing a bounded amount of data (we do not yet have an
>> > unbounded
>> > > Sink analogue)
>> > > - The location you're writing to is known at pipeline construction
>> time,
>> > > and does not depend on the data itself (support for "data-dependent"
>> > sinks
>> > > is on the radar https://issues.apache.org/jira/browse/BEAM-92)
>> > > - The storage system you're writing to has a distinct "initialization"
>> > and
>> > > "finalization" step, allowing the write operation to appear atomic
>> > (either
>> > > all data is written or none). This mostly applies to files (where
>> writing
>> > > is done by first writing to a temporary directory, and then renaming
>> all
>> > > files to their final location), but there can be other cases too.
>> > >
>> > > Here's an example GCP 

Re: Adding DoFn Setup and Teardown methods

2016-07-18 Thread Maximilian Michels
Hoping it becomes usual as soon as we have this useful addition :)

On Mon, Jul 18, 2016 at 1:53 PM, Aljoscha Krettek <aljos...@apache.org>
wrote:

> Did you mean "usual" or "useful"? ;-)
>
> On Mon, 18 Jul 2016 at 12:42 Maximilian Michels <m...@apache.org> wrote:
>
> > +1 for setup() and teardown() methods. Very usual for proper
> initialization
> > and cleanup of DoFn related data structures.
> >
> > On Wed, Jun 29, 2016 at 9:34 PM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> > > +1 I think some people might already mistake the
> > > startBundle()/finishBundle() methods for what the new methods are
> > supposed
> > > to be
> > >
> > > On Tue, 28 Jun 2016 at 19:38 Raghu Angadi <rang...@google.com.invalid>
> > > wrote:
> > >
> > > > This is terrific!
> > > > Thanks for the proposal.
> > > >
> > > > On Tue, Jun 28, 2016 at 9:06 AM, Thomas Groh
> <tg...@google.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hey Everyone:
> > > > >
> > > > > We've recently started to be permitted to reuse DoFn instances in
> > > > Beam[1].
> > > > > Beyond the efficiency gains from not having to deserialize new DoFn
> > > > > instances for every bundle, DoFn reuse also provides the ability to
> > > > > minimize expensive setup work done per-bundle, which hasn't
> formerly
> > > been
> > > > > possible. Additionally, it has also enabled more failure cases,
> where
> > > > > element-based state leaks improperly across bundles.
> > > > >
> > > > > I've written a document proposing that two methods are added to the
> > API
> > > > of
> > > > > DoFn, setup and teardown, which both provides hooks for users to
> > write
> > > > > efficient DoFns, as well as signals that DoFns will be reused.
> > > > >
> > > > > The document is located at
> > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#
> > > > > and committers have edit access
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Thomas
> > > > >
> > > > > [1] https://github.com/apache/incubator-beam/pull/419
> > > > >
> > > >
> > >
> >
>


Re: Adding DoFn Setup and Teardown methods

2016-07-18 Thread Maximilian Michels
+1 for setup() and teardown() methods. Very usual for proper initialization
and cleanup of DoFn related data structures.
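[Editor's note: the lifecycle under discussion can be illustrated with a
plain-Java sketch. Method names follow the proposal (setup/teardown around
the existing startBundle/finishBundle); this is not the actual Beam API,
just a toy harness showing how a reused instance would see each hook:]

```java
import java.util.ArrayList;
import java.util.List;

// A plain-Java sketch of the proposed DoFn lifecycle under instance reuse.
// Hook names follow the proposal; this is not the real Beam API.
class SketchDoFn {
    final List<String> log = new ArrayList<>();

    void setup()               { log.add("setup"); }        // once per instance
    void startBundle()         { log.add("startBundle"); }  // per bundle
    void processElement(int e) { log.add("process:" + e); }
    void finishBundle()        { log.add("finishBundle"); } // per bundle
    void teardown()            { log.add("teardown"); }     // once per instance
}

public class LifecycleSketch {
    public static void main(String[] args) {
        SketchDoFn fn = new SketchDoFn();
        fn.setup();                    // expensive init happens only once
        for (int bundle = 0; bundle < 2; bundle++) {
            fn.startBundle();
            fn.processElement(bundle);
            fn.finishBundle();         // per-bundle state must be cleared here,
                                       // or it leaks into the next bundle
        }
        fn.teardown();                 // cleanup once, when reuse ends
        System.out.println(fn.log);
    }
}
```

The point of the proposal is precisely the split visible in the log: setup
and teardown run once per instance, while the bundle hooks run per bundle.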

On Wed, Jun 29, 2016 at 9:34 PM, Aljoscha Krettek 
wrote:

> +1 I think some people might already mistake the
> startBundle()/finishBundle() methods for what the new methods are supposed
> to be
>
> On Tue, 28 Jun 2016 at 19:38 Raghu Angadi 
> wrote:
>
> > This is terrific!
> > Thanks for the proposal.
> >
> > On Tue, Jun 28, 2016 at 9:06 AM, Thomas Groh 
> > wrote:
> >
> > > Hey Everyone:
> > >
> > > We've recently started to be permitted to reuse DoFn instances in
> > Beam[1].
> > > Beyond the efficiency gains from not having to deserialize new DoFn
> > > instances for every bundle, DoFn reuse also provides the ability to
> > > minimize expensive setup work done per-bundle, which hasn't formerly
> been
> > > possible. Additionally, it has also enabled more failure cases, where
> > > element-based state leaks improperly across bundles.
> > >
> > > I've written a document proposing that two methods are added to the API
> > of
> > > DoFn, setup and teardown, which both provides hooks for users to write
> > > efficient DoFns, as well as signals that DoFns will be reused.
> > >
> > > The document is located at
> > >
> > >
> >
> https://docs.google.com/document/d/1LLQqggSePURt3XavKBGV7SZJYQ4NW8yCu63lBchzMRk/edit?ts=5771458f#
> > > and committers have edit access
> > >
> > > Thanks,
> > >
> > > Thomas
> > >
> > > [1] https://github.com/apache/incubator-beam/pull/419
> > >
> >
>


Re: add component tag to pull request title / commit comment

2016-05-12 Thread Maximilian Michels
@Aljoscha: Yes, less space for the commit message if it has to fit
into 70 chars. I'm not sure how important a 70 character limit is
nowadays.

@Davor I think your observation is correct. In Flink we also like to
do tagged commits but enforcing it is almost impossible. Automatic
tagging would be super cool.

On Wed, May 11, 2016 at 4:11 PM, Davor Bonaci <da...@google.com.invalid> wrote:
> In general, I think this is fine to try, but I think its reliability will
> likely end up being low. There's a human factor involved -- people will
> forget, pull request will evolve in unexpected ways -- all leading to a low
> reliability of this indicator.
>
> I think it would be really awesome to have automatic tagging. Not really on
> the priority list right now, but I'm hopeful we'll have something like that
> in the not-so-distant future.
>
> On Wed, May 11, 2016 at 3:00 AM, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
>> This will, however, also take precious space in the Commit Title. And some
>> commits might not be about only one clear-cut component.
>>
>> On Wed, 11 May 2016 at 11:43 Maximilian Michels <m...@apache.org> wrote:
>>
>> > +1 I think it makes it easier to see at a glance to which part of Beam
>> > a commit belongs. We could use the Jira components as tags.
>> >
>> > On Wed, May 11, 2016 at 11:09 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> > wrote:
>> > > Hi Manu,
>> > >
>> > > good idea. Theoretically the component in the corresponding Jira should
>> > give
>> > > the information, but for convenience, we could add a "tag" in the
>> commit
>> > > comment.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > >
>> > > On 05/10/2016 06:27 AM, Manu Zhang wrote:
>> > >>
>> > >> Guys,
>> > >>
>> > >> As I've been developing Gearpump runner for Beam, I've closely watched
>> > the
>> > >> pull requests updates for API changes. Sometimes, it turns out changes
>> > are
>> > >> only to be applied to a specific runner after I go through the codes.
>> > >> Could
>> > >> we add a tag in the pull request title / commit comment to mark
>> whether
>> > >> it's API change or Runner specific change, or something else, like,
>> > >> [BEAM-XXX] [Direct-Java] Comments... ?
>> > >>
>> > >> Thanks,
>> > >> Manu Zhang
>> > >>
>> > >
>> > > --
>> > > Jean-Baptiste Onofré
>> > > jbono...@apache.org
>> > > http://blog.nanthrax.net
>> > > Talend - http://www.talend.com
>> >
>>


Re: [DISCUSS] Beam IO native IO

2016-05-03 Thread Maximilian Michels
Correct, Kafka doesn't support rollbacks of the producer. In Flink
there is the RollingSink which supports transactional rolling files.
Admittedly, that is the only one. Still, checkpointing sinks in Beam
could be useful for users who are concerned about exactly once
semantics. I'm not sure whether we can implement something similar
with the bundle mechanism.
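[Editor's note: the transactional behavior RollingSink provides for files
boils down to "write somewhere invisible, then publish atomically". A toy
plain-Java illustration of that pattern using NIO (not Beam's Sink API, and
far simpler than Flink's actual RollingSink):]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Toy illustration of the write-to-temp-then-rename pattern that makes
// file output appear atomic: readers never observe a half-written file.
public class AtomicWrite {
    static void writeAtomically(Path target, String contents) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, contents.getBytes());       // invisible to readers
        Files.move(tmp, target,
                   StandardCopyOption.ATOMIC_MOVE);  // publish in one step
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sink");
        Path out = dir.resolve("part-0000");
        writeAtomically(out, "hello");
        System.out.println(new String(Files.readAllBytes(out))); // prints: hello
    }
}
```

On failure before the move, only the .tmp file is left behind, which is why
this style of sink can offer all-or-nothing semantics for its final output.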

On Mon, May 2, 2016 at 11:50 PM, Raghu Angadi
<rang...@google.com.invalid> wrote:
> What are good examples of streaming sinks that support checkpointing (or
> transactions/rollbacks)? I don't think Kafka supports a rollback.
>
> On Mon, May 2, 2016 at 2:54 AM, Maximilian Michels <m...@apache.org> wrote:
>
>> Yes, I would expect sinks to provide similar additional interfaces
>> like sources, e.g. checkpointing. We could also use the
>> startBundle/processElement/finishBundle lifecycle methods to implement
>> checkpointing. I just wonder, if we want to make it more explicit.
>> Also, does it make sense that sinks can return a PCollection? You can
>> return PDone but you don't have to.
>>
>> Since sinks are fundamental in streaming pipelines, it just seemed odd
>> to me that there is not dedicated interface. I understand a bit
>> clearer now that it is not viewed as crucial because we can use
>> existing primitives to create sinks. In a way, that might be elegant
>> but also less explicit.
>>
>> On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry <f...@google.com.invalid>
>> wrote:
>> >>
>> >> @Frances Sources are not simple DoFns. They add additional
>> >> functionality, e.g. checkpointing, watermark generation, creating
>> >> splits. If we want sinks to be portable, we should think about a
>> >> dedicated interface. At least for the checkpointing.
>> >>
>> >
>> > We might be mixing sources and sinks in this conversation. ;-) Sources
>> > definitely provide additional functionality as you mentioned. But at
>> least
>> > currently, sinks don't provide any new primitive functionality. Are you
>> > suggesting there needs to be a checkpointing interface for sinks beyond
>> > DoFn's bundle finalization? (Note that the existing Write for batch is
>> just
>> > a PTransform based around ParDo.)
>>


Re: IO timelines (Was: How to read/write avro data using FlinkKafka Consumer/Producer)

2016-05-03 Thread Maximilian Michels
Hi Davor, hi Dan,

The discussion was partly held in JB's thread about "useNative()" for
Beam transforms. I think we reached consensus that we prefer portable
out of the box Beam IO over Runner specific IO.

On Thu, Apr 28, 2016 at 10:03 AM, Davor Bonaci <da...@google.com.invalid>
wrote:
>
> Of course, there's a long way to go, but there should *not* be any users
> that are blocked or scenarios that are impossible.


Exactly, while we are transitioning, let's not make any scenarios
impossible. For example, if users want to use Kafka 8, they should be
able to use the Flink Consumer/Producer as long as there is no support
yet. Exchanging sources/sinks in Beam programs is relatively easy to
do; users still get to keep all the nice semantics. Once we have a
decent support for IO, these wrappers should go away.

>  - Complete conversion of existing IOs to the Source / Sink API. ETA: a
>   week or two for full completion.


Which ones are there which have not been converted? From a first
glance, I see AvroIO, BigQueryIO, and PubsubIO. Only sources should be
affected because they have a dedicated interface; sinks are ParDos.

>   - Make sure Spark & Flink runners fully support Source / Sink API, and
>   that ties into the new Runner / Fn API discussion.


Yes, it's not hard to fix those. We will fix those as soon as possible.

Cheers,
Max


On Tue, May 3, 2016 at 1:06 AM, Dan Halperin
<dhalp...@google.com.invalid> wrote:
>
> Coming back from vacation, sorry for delay.
>
> I agree with Davor. While it's nice to have a `UnboundedFlinkSource`
> <https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedFlinkSource.java>
> wrapper that can be used to convert a Flink-specific source into one that
> can be used in Beam, I'd hope this is a temporary stop-gap until we have
> Beam-API versions of these for all important connectors. If a
> Flink-specific source has major advantages over one that can be implemented
> in Beam, we should explore why and see whether we have gaps in our APIs.
>
> (And similar analogs for Spark and for Sinks, Transforms libraries, and
> everything else).
>
> Thanks!
> Dan
>
> On Thu, Apr 28, 2016 at 10:03 AM, Davor Bonaci <da...@google.com.invalid>
> wrote:
>
> > [ Moving over to the dev@ list ]
> >
> > I think we should be aiming a little higher than "trying out Beam" ;)
> >
> > Beam SDK currently has built-in IOs for Kafka, as well as for all important
> > Google Cloud Platform services. Additionally, there are pull requests for
> > Firebase and Cassandra. This is not bad, particularly talking into account
> > that we have APIs for user to develop their own IO connectors. Of course,
> > there's a long way to go, but there should *not* be any users that are
> > blocked or scenarios that are impossible.
> >
> > In terms of the runner support, Cloud Dataflow runner supports all IOs,
> > including any user-written ones. Other runners don't as extensively, but
> > this is a high priority item to address.
> >
> > In my mind, we should strive to address the following:
> >
> >- Complete conversion of existing IOs to the Source / Sink API. ETA: a
> >week or two for full completion.
> >- Make sure Spark & Flink runners fully support Source / Sink API, and
> >that ties into the new Runner / Fn API discussion.
> >- Increase the set of built-in IOs. No ETA; iterative process over time.
> >There are 2 pending pull requests, others in development.
> >
> > I'm hopeful we can address all of these items in a relatively short period
> > of time -- in a few months or so -- and likely before we can call any
> > release "stable". (This is why the new Runner / Fn API discussions are so
> > important.)
> >
> > In summary, in my mind, "long run" here means "< few months".
> >
> > -- Forwarded message --
> > From: Maximilian Michels <m...@apache.org>
> > Date: Thu, Apr 28, 2016 at 3:20 AM
> > Subject: Re: How to read/write avro data using FlinkKafka Consumer/Producer
> > (Beam Pipeline) ?
> > To: u...@beam.incubator.apache.org
> >
> > On Wed, Apr 27, 2016 at 11:12 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> > > generally speaking, we have to check that all runners work fine with the
> > provided IO. I don't think it's a good idea that the runners themselves
> > implement any IO: they should use "out of the box" IO.
> >
In the long run, big yes, and I'd like to help make it possible!
> > However, there is still a gap between what Beam and its Runners
> > provide and what users want to do. For the time being, I think the
> > solution we have is fine. It gives users the option to try out Beam
> > with sources and sinks that they expect to be available in streaming
> > systems.
> >


Re: [DISCUSS] Beam IO native IO

2016-05-02 Thread Maximilian Michels
Yes, I would expect sinks to provide similar additional interfaces
like sources, e.g. checkpointing. We could also use the
startBundle/processElement/finishBundle lifecycle methods to implement
checkpointing. I just wonder, if we want to make it more explicit.
Also, does it make sense that sinks can return a PCollection? You can
return PDone but you don't have to.

Since sinks are fundamental in streaming pipelines, it just seemed odd
to me that there is not dedicated interface. I understand a bit
clearer now that it is not viewed as crucial because we can use
existing primitives to create sinks. In a way, that might be elegant
but also less explicit.

On Fri, Apr 29, 2016 at 11:00 PM, Frances Perry  wrote:
>>
>> @Frances Sources are not simple DoFns. They add additional
>> functionality, e.g. checkpointing, watermark generation, creating
>> splits. If we want sinks to be portable, we should think about a
>> dedicated interface. At least for the checkpointing.
>>
>
> We might be mixing sources and sinks in this conversation. ;-) Sources
> definitely provide additional functionality as you mentioned. But at least
> currently, sinks don't provide any new primitive functionality. Are you
> suggesting there needs to be a checkpointing interface for sinks beyond
> DoFn's bundle finalization? (Note that the existing Write for batch is just
> a PTransform based around ParDo.)


Re: [PROPOSAL] Nightly builds by Jenkins

2016-04-05 Thread Maximilian Michels
Hey JB,

I would also propose three Jenkins jobs (apart from the Cloud Dataflow tests):

- Test coverage of pull requests (beam_PreCommit)
- Test coverage of the master and all other branches (beam_MavenVerify)
- A daily job that deploys artifacts to the snapshot repository (beam_Nightly)

Keeping the last two separate makes it easier to change the deployment
of snapshots. Daily deployment should be enough. If we ever want to
deploy more frequently, we can easily adjust the schedule of the
build. In addition, we can identify problems regarding the deployment
more easily and prevent breakage of the test execution of the
development branches.
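[Editor's note: in concrete terms, the three-job split described above might
look like the following trigger/goal pairs. The job names come from the
thread; the trigger wording is a hypothetical summary, not the project's
actual Jenkins configuration:]

```
beam_PreCommit     trigger: GitHub pull request    goal: mvn clean verify
beam_MavenVerify   trigger: SCM poll (hourly)      goal: mvn clean verify
beam_Nightly       trigger: cron, once per night   goal: mvn clean deploy
```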

Best,
Max

On Tue, Apr 5, 2016 at 9:57 AM, Jason Kuster
 wrote:
> Hey JB,
>
> Just want to clarify - do you mean that beam_nightly would continue to run
> on the schedule it currently has (SCM poll/hourly), plus one run at
> midnight?
>
> I think Dan's question centers around whether beam_nightly build would just
> run once every 24h. We want our postsubmit coverage to run more often than
> that is my impression. Doing a deploy every time the SCM poll returns some
> changes seems like an aggressive schedule to me, but I welcome your
> thoughts there. Otherwise we could keep beam_mavenverify running on the scm
> poll/hourly schedule and add the beam_nightly target which just does a
> single deploy every 24h.
>
> Thanks,
>
> Jason
>
> On Tue, Apr 5, 2016 at 12:41 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Hi Dan,
>>
>> you can have both mvn clean deploy and mvn clean verify, but IMHO, the
>> first already covers the second ;)
>>
>> So, I think mvn clean deploy is fine instead of clean verify.
>>
>> WDYT ?
>>
>> Regards
>> JB
>>
>>
>> On 04/05/2016 08:05 AM, Dan Halperin wrote:
>>
>>> I am completely behind producing nightly jars.
>>>
>>> But, I don't think that `beam_MavenVerify` is completely redundant -- I
>>> was
>>> under the impression it was our main post-submit coverage. Is that wrong?
>>>
>>> If I'm not wrong, then I think this should simply be a fourth Jenkins target
>>> that runs every 24h.
>>>
>>> On Mon, Apr 4, 2016 at 10:50 PM, Jean-Baptiste Onofré 
>>> wrote:
>>>
>>> Hi beamers,

 Now, on Jenkins, we have three jobs:

 - beam_PreCommit does a mvn clean verify for each opened PR
 - beam_MavenVerify does a mvn clean verify on master branch
 - beam_RunnableOnService_GoogleCloudDataflow does a mvn clean verify
 -PDataflowPipelineTests on master branch

 As discussed last week, Davor and I are working on the renaming (especially
 the packages).

 Once this renaming is done (it should take a week or so), I propose to
 change beam_MavenVerify to beam_Nightly: it will do a mvn clean deploy,
 deploying SNAPSHOTs to the Apache SNAPSHOT repo (the deploy phase includes
 verify and test, of course), scheduled every night and on SCM change.

 It will allow people to test and try beam without building.

 Thoughts ?

 Regards
 JB
 --
 Jean-Baptiste Onofré
 jbono...@apache.org
 http://blog.nanthrax.net
 Talend - http://www.talend.com
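
To let people try Beam without building it, consumers of the nightly SNAPSHOTs would add the Apache snapshot repository to their POM, along the lines of:

```
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```

(The URL is the standard Apache Nexus snapshot location; the exact Beam artifact coordinates were still being settled at this point.)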


>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>
> --
> ---
> Jason Kuster
> Apache Beam (Incubating) / Google Cloud Dataflow


Re: Draft Contribution Guide

2016-03-29 Thread Maximilian Michels
@Ben: Fixing a subtle bug that was introduced and that affects users.
Fixing unstable builds. Minor cosmetic changes. Changes to JavaDoc.
However, PRs make an excellent tool for code reviews most of the time. I
see that we probably can't stress that enough.

I'd actually think that my paragraph gave more reasons to use pull requests
than the current draft does. I'd be happy if we incorporated some of my
suggestions. I have removed the "Whenever possible" but kept my other
points about communication and follow-up discussions:

"Committers should never commit anything without going through a pull
request, since that would bypass test coverage and potentially cause the
build to fail due to checkstyle, etc. *In addition, pull requests ensure
that changes are communicated properly and potential flaws or improvements
can be spotted*. Always go through the pull request, even if you won’t wait
for the code review. *Even then, comments can be provided in the pull
request after it has been merged to work on follow-ups.*"

Best,
Max

On Wed, Mar 23, 2016 at 8:19 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
> Hi Max,
>
> I would keep a "stronger statement", something like:
>
> "Committers always provide a pull request. This pull request has to be
> merged cleanly, and doesn't break the build (including test, checkstyle,
> documentation). The pull request has to be reviewed and only pushed on the
> upstream when the reviewer gives the LGTM keyword as comment."
>
> All other situations where the committer doesn't/can't provide a PR should
> be approved on the dev mailing list.
>
> My $0.01
>
> Regards
> JB
>
>
> On 03/23/2016 07:22 PM, Maximilian Michels wrote:
>>
>> I didn't see this paragraph before:
>>
>> "Committers should never commit anything without going through a pull
>> request, since that would bypass test coverage and potentially cause
>> the build to fail due to checkstyle, etc. Always go through the pull
>> request, even if you won’t wait for the code review."
>>
>> How about:
>>
>> "Whenever possible, commits should be reviewed in a pull request. Pull
>> requests ensure that changes can be communicated properly with the
>> community and potential flaws or improvements can be spotted. In
>> addition, pull requests ensure proper test coverage and verification
>> of the build. Whenever possible, go through the pull request, even if
>> you won’t wait for the code review."
>>
>> - Max
>>
>>
>> On Wed, Mar 23, 2016 at 5:33 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> +1
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 03/23/2016 05:30 PM, Davor Bonaci wrote:
>>>>
>>>>
>>>> Thanks everyone for commenting!
>>>>
>>>> There were no new comments in the last several days, so we'll start
>>>> moving
>>>> the doc over to the Beam website.
>>>>
>>>> Of course, there's nothing here set in stone -- please reopen the
>>>> discussion about any particular point at any time in the future.
>>>>
>>>> On Fri, Mar 18, 2016 at 4:44 AM, Maximilian Michels <m...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Frances,
>>>>>
>>>>> Very nice comprehensive guide. I'll leave some comments in the doc.
>>>>>
>>>>> Cheers,
>>>>> Max
>>>>>
>>>>> On Fri, Mar 18, 2016 at 11:51 AM, Sandeep Deshmukh
>>>>> <sand...@datatorrent.com> wrote:
>>>>>>
>>>>>>
>>>>>> The document captures the process very well and has right amount of
>>>>>
>>>>>
>>>>> details
>>>>>>
>>>>>>
>>>>>> for newbies too.
>>>>>>
>>>>>> Great work!!!
>>>>>>
>>>>>> Regards,
>>>>>> Sandeep
>>>>>>
>>>>>> On Fri, Mar 18, 2016 at 10:46 AM, Siva Kalagarla <
>>>>>
>>>>>
>>>>> siva.kalaga...@gmail.com>
>>>>>>
>>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Frances,  This document is helpful for newbies like myself.
>>>>>>> Will
>>>>>>> follow these steps over this weekend.
>>>>>>>
>>>>>>> On Thu, Mar 17, 2016 at 2:19 PM, Frances Perry
>>>>>>> <f...@google.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Beamers!
>>>>>>>>
>>>>>>>> We've started a draft
>>>>>>>> <
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>>
https://docs.google.com/document/d/1syFyfqIsGOYDE_Hn3ZkRd8a6ylcc64Kud9YtrGHgU0E/comment
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> for the Beam contribution guide. Please take a look and provide
>>>>>
>>>>>
>>>>> feedback.
>>>>>>>>
>>>>>>>>
>>>>>>>> Once things settle, we'll get this moved over on to the Beam
>>>>>>>> website.
>>>>>>>>
>>>>>>>> Frances
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Siva Kalagarla
>>>>>>> @SivaKalagarla <https://twitter.com/SivaKalagarla>
>>>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Draft Contribution Guide

2016-03-23 Thread Maximilian Michels
I didn't see this paragraph before:

"Committers should never commit anything without going through a pull
request, since that would bypass test coverage and potentially cause
the build to fail due to checkstyle, etc. Always go through the pull
request, even if you won’t wait for the code review."

How about:

"Whenever possible, commits should be reviewed in a pull request. Pull
requests ensure that changes can be communicated properly with the
community and potential flaws or improvements can be spotted. In
addition, pull requests ensure proper test coverage and verification
of the build. Whenever possible, go through the pull request, even if
you won’t wait for the code review."

- Max


On Wed, Mar 23, 2016 at 5:33 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> +1
>
> Regards
> JB
>
>
> On 03/23/2016 05:30 PM, Davor Bonaci wrote:
>>
>> Thanks everyone for commenting!
>>
>> There were no new comments in the last several days, so we'll start moving
>> the doc over to the Beam website.
>>
>> Of course, there's nothing here set in stone -- please reopen the
>> discussion about any particular point at any time in the future.
>>
>> On Fri, Mar 18, 2016 at 4:44 AM, Maximilian Michels <m...@apache.org>
>> wrote:
>>
>>> Hi Frances,
>>>
>>> Very nice comprehensive guide. I'll leave some comments in the doc.
>>>
>>> Cheers,
>>> Max
>>>
>>> On Fri, Mar 18, 2016 at 11:51 AM, Sandeep Deshmukh
>>> <sand...@datatorrent.com> wrote:
>>>>
>>>> The document captures the process very well and has right amount of
>>>
>>> details
>>>>
>>>> for newbies too.
>>>>
>>>> Great work!!!
>>>>
>>>> Regards,
>>>> Sandeep
>>>>
>>>> On Fri, Mar 18, 2016 at 10:46 AM, Siva Kalagarla <
>>>
>>> siva.kalaga...@gmail.com>
>>>>
>>>> wrote:
>>>>
>>>>> Thanks Frances,  This document is helpful for newbies like myself.
>>>>> Will
>>>>> follow these steps over this weekend.
>>>>>
>>>>> On Thu, Mar 17, 2016 at 2:19 PM, Frances Perry <f...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi Beamers!
>>>>>>
>>>>>> We've started a draft
>>>>>> <
>>>>>>
>>>>>
>>>
>>> https://docs.google.com/document/d/1syFyfqIsGOYDE_Hn3ZkRd8a6ylcc64Kud9YtrGHgU0E/comment
>>>>>>>
>>>>>>>
>>>>>> for the Beam contribution guide. Please take a look and provide
>>>
>>> feedback.
>>>>>>
>>>>>> Once things settle, we'll get this moved over on to the Beam website.
>>>>>>
>>>>>> Frances
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Regards,
>>>>> Siva Kalagarla
>>>>> @SivaKalagarla <https://twitter.com/SivaKalagarla>
>>>>>
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Renaming process: first step Maven coordinates

2016-03-21 Thread Maximilian Michels
I would be in favor of one group id. For the developer, hierarchies
are really important. They are visible in the directory layout of the
Maven project and in the dependency tree. For the user, it shouldn't
matter how the project is structured; they simply pull in artifacts
from the "org.apache.beam" group. I think a fixed group id makes
outside interaction easier.

Cheers,
Max
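
Under the single-groupId scheme, the hierarchy moves into the artifactId. A sketch of what the coordinates might look like — the artifact names here are illustrative only, since the actual ids were still under discussion in PR #46:

```
<!-- single groupId; hierarchy expressed in the artifactIds -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>sdk-java</artifactId>
  <version>0.1.0-incubating-SNAPSHOT</version>
</dependency>
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>runner-flink</artifactId>
  <version>0.1.0-incubating-SNAPSHOT</version>
</dependency>
```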

On Mon, Mar 21, 2016 at 5:02 PM, Jean-Baptiste Onofré  wrote:
> Hi Lukasz,
>
> both are possible.
>
> Some projects use different groupId. It's the case for Karaf or Camel for
> instance:
>
> http://repo.maven.apache.org/maven2/org/apache/karaf/
>
> You can see there the different groupId, containing the different artifacts.
>
> On the other hand, other projects use an unique groupId and multiple
> artifactId. It's the case in Spark or Flink for instance.
>
> At first glance, I had a preference to groupId for "global" Beam kind of
> artifacts (like io, runner, etc). But, it would make sense to work more on
> the artifactId.
>
> Regards
> JB
>
>
> On 03/21/2016 04:50 PM, Lukasz Cwik wrote:
>>
>> I like the single groupId since it makes it simpler to find all related
>> components for a project.
>>
>> Is there a common practice in maven for multi-module vs inheritance
>> projects for choosing the groupId?
>>
>> On Mon, Mar 21, 2016 at 7:32 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi beamers,
>>>
>>> I updated the PR according to your comments.
>>>
>>> I have couple of points I want to discuss:
>>>
>>> 1. All modules use the same groupId (org.apache.beam). In order to have a
>>> cleaner structure on the Maven repo, I wonder if it's not better to have
>>> different groupIds depending on the artifacts. For instance,
>>> org.apache.beam.sdk, containing a module with java as artifactId (it will
>>> contain new artifacts with id python, scala, ...),
>>> org.apache.beam.runners
>>> containing modules with flink and spark as artifactId, etc. Thoughts ?
>>> 2. The version has been set to 0.1.0-incubating-SNAPSHOT for all
>>> artifacts, including the runners. It doesn't mean that the runners will
>>> have to use the same version as parent (they can have their own release
>>> cycle). However, as we "bootstrap" the project, I used the same version
>>> in
>>> all modules.
>>>
>>> Now, I'm starting two new commits:
>>> - renaming of the packages
>>> - folders re-organization
>>>
>>> Thanks !
>>> Regards
>>> JB
>>>
>>>
>>> On 03/21/2016 01:56 PM, Jean-Baptiste Onofré wrote:
>>>
 Hi Davor,

 thank you so much for your comments. I'm updating the PR accordingly
 (and will provide explanations for some changes).

 Thanks dude !

 Regards
 JB

 On 03/21/2016 06:29 AM, Davor Bonaci wrote:

> I left a few comments on PR #46.
>
> Thanks JB for doing this; a clear improvement.
>
> On Mon, Mar 14, 2016 at 6:04 PM, Jean-Baptiste Onofré 
> wrote:
>
> Hi all,
>>
>>
>> I started the renaming process from Dataflow to Beam.
>>
>> I submitted a first PR about the Maven coordinates:
>>
>> https://github.com/apache/incubator-beam/pull/46
>>
>> I will start the packages renaming (updating the same PR). For the
>> directories structure, I would like to talk with Frances, Dan, Tyler,
>> and
>> Davor first.
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>

>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Unable to serialize exception running KafkaWindowedWordCountExample

2016-03-19 Thread Maximilian Michels
@Dan: You're right that the PipelineOptions shouldn't be cached like
this. In this particular wrapper, it was not even necessary.

@Jiankang: I've pushed a fix to the repository with a few
improvements. Could you please try again? You will have to recompile.

Thanks,
Max
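
The failure mode described below — caching a non-serializable options object inside a Serializable wrapper — can be reproduced in isolation. This is a generic sketch, not Beam code: `Options` stands in for the proxy-backed PipelineOptions object, and `CachingSource` for the source wrapper that caches it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for a proxy-backed options object: deliberately NOT Serializable.
class Options {
    int parallelism = 4;
}

// Stand-in for a source that caches the options at graph-construction time.
class CachingSource implements Serializable {
    private final Options cachedOptions; // this field breaks Java serialization

    CachingSource(Options options) {
        this.cachedOptions = options;
    }
}

public class SerializationDemo {
    static void serialize(Object o) throws IOException {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            serialize(new CachingSource(new Options()));
            System.out.println("serialized OK");
        } catch (NotSerializableException e) {
            // Same exception class as in the reported stack trace.
            System.out.println("caught NotSerializableException: " + e.getMessage());
        }
    }
}
```

Dropping the cached field (or marking it transient and re-obtaining the options at execution time) makes the wrapper serializable again, which is essentially the fix applied here.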

On Thu, Mar 17, 2016 at 8:44 AM, Dan Halperin  wrote:
> +Max for the Flink Runner, and +Luke who wrote most of the initial code
> around PipelineOptions.
>
> The UnboundedFlinkSource is caching the `PipelineOptions` object, here:
> https://github.com/apache/incubator-beam/blob/071e4dd67021346b0cab2aafa0900ec7e34c4ef8/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedFlinkSource.java#L36
>
> I think this is a mismatch with how we intended them to be used. For
> example, the PipelineOptions may be changed by a Runner between graph
> construction time (when the UnboundedFlinkSource is created) and actual
> pipeline execution time. This is partially why, for example, PipelineOptions
> are provided by the Runner as an argument to functions like
> DoFn.startBundle, .processElement, and .finishBundle.
>
> PipelineOptions itself does not extend Serializable, and per the
> PipelineOptions documentation it looks like we intend for it to be
> serialized through Jackson rather than through Java serialization. I bet the
> Flink runner does this, and we probably just need to remove this cached
> PipelineOptions from the unbounded source.
>
> I'll let Luke and Max correct me on any or all of the above :)
>
> Thanks,
> Dan
>
> On Wed, Mar 16, 2016 at 10:57 PM, 刘见康  wrote:
>>
>> Hi guys,
>>
>> Failed to run KafkaWindowedWordCountExample with an "Unable to serialize"
>> exception; the stack trace is below:
>>
>> 16/03/17 13:49:09 INFO flink.FlinkPipelineRunner:
>> PipelineOptions.filesToStage was not specified. Defaulting to files from
>> the classpath: will stage 160 files. Enable logging at DEBUG level to see
>> which files will be staged.
>> 16/03/17 13:49:09 INFO flink.FlinkPipelineExecutionEnvironment: Creating
>> the required Streaming Environment.
>> 16/03/17 13:49:09 INFO kafka.FlinkKafkaConsumer08: Trying to get topic
>> metadata from broker localhost:9092 in try 0/3
>> 16/03/17 13:49:09 INFO kafka.FlinkKafkaConsumerBase: Consumer is going to
>> read the following topics (with number of partitions):
>> Exception in thread "main" java.lang.IllegalArgumentException: unable to
>> serialize
>>
>> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedFlinkSource@2d29b4ee
>> at
>>
>> com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:54)
>> at
>>
>> com.google.cloud.dataflow.sdk.util.SerializableUtils.ensureSerializable(SerializableUtils.java:84)
>> at com.google.cloud.dataflow.sdk.io.Read$Unbounded.(Read.java:194)
>> at com.google.cloud.dataflow.sdk.io.Read$Unbounded.(Read.java:189)
>> at com.google.cloud.dataflow.sdk.io.Read.from(Read.java:69)
>> at
>>
>> org.apache.beam.runners.flink.examples.streaming.KafkaWindowedWordCountExample.main(KafkaWindowedWordCountExample.java:129)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>>
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>>
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
>> Caused by: java.io.NotSerializableException:
>> com.google.cloud.dataflow.sdk.options.ProxyInvocationHandler
>> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
>> at
>>
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>> at
>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>> at
>>
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>> at
>>
>> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>> at
>> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>> at
>>
>> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>> at
>>
>> com.google.cloud.dataflow.sdk.util.SerializableUtils.serializeToByteArray(SerializableUtils.java:50)
>> ... 10 more
>>
>> I found there is a similar issue in flink-dataflow
>> https://github.com/dataArtisans/flink-dataflow/issues/8.
>>
>> Do you have an idea about this error?
>>
>> Thanks
>> Jiankang
>
>


Re: Unable to serialize exception running KafkaWindowedWordCountExample

2016-03-19 Thread Maximilian Michels
Hi Jiankang,

Thanks for reporting again. I'm sorry that you ran into another
problem. This example had been working but it has some small problems
with the new code base we just migrated to.

I've fixed and tested the example and would invite you to try again.

Thanks,
Max

On Thu, Mar 17, 2016 at 1:25 PM, 刘见康 <jkang...@gmail.com> wrote:
> @Max:
> Thanks for your quick fix; the serialization exception has been solved.
> However, it reported another one:
> 16/03/17 20:14:23 INFO flink.FlinkPipelineRunner:
> PipelineOptions.filesToStage was not specified. Defaulting to files from
> the classpath: will stage 158 files. Enable logging at DEBUG level to see
> which files will be staged.
> 16/03/17 20:14:23 INFO flink.FlinkPipelineExecutionEnvironment: Creating
> the required Streaming Environment.
> 16/03/17 20:14:23 INFO kafka.FlinkKafkaConsumer08: Trying to get topic
> metadata from broker localhost:9092 in try 0/3
> 16/03/17 20:14:23 INFO kafka.FlinkKafkaConsumerBase: Consumer is going to
> read the following topics (with number of partitions):
> Exception in thread "main" java.lang.RuntimeException: Flink Sources are
> supported only when running with the FlinkPipelineRunner.
> at
> org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedFlinkSource.getDefaultOutputCoder(UnboundedFlinkSource.java:71)
> at
> com.google.cloud.dataflow.sdk.io.Read$Unbounded.getDefaultOutputCoder(Read.java:230)
> at
> com.google.cloud.dataflow.sdk.transforms.PTransform.getDefaultOutputCoder(PTransform.java:294)
> at
> com.google.cloud.dataflow.sdk.transforms.PTransform.getDefaultOutputCoder(PTransform.java:309)
> at
> com.google.cloud.dataflow.sdk.values.TypedPValue.inferCoderOrFail(TypedPValue.java:167)
> at
> com.google.cloud.dataflow.sdk.values.TypedPValue.getCoder(TypedPValue.java:48)
> at
> com.google.cloud.dataflow.sdk.values.PCollection.getCoder(PCollection.java:137)
> at
> com.google.cloud.dataflow.sdk.values.TypedPValue.finishSpecifying(TypedPValue.java:88)
> at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:331)
> at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:274)
> at
> com.google.cloud.dataflow.sdk.values.PCollection.apply(PCollection.java:161)
> at
> org.apache.beam.runners.flink.examples.streaming.KafkaWindowedWordCountExample.main(KafkaWindowedWordCountExample.java:127)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
>
> Diving into the UnboundedFlinkSource class, it looks like a simple class
> implementing the UnboundedSource interface that throws RuntimeException.
> I just wonder whether this Kafka streaming example is runnable?
>
> Thanks
> Jiankang
>
>
> On Thu, Mar 17, 2016 at 7:35 PM, Maximilian Michels <m...@apache.org> wrote:
>
>> @Dan: You're right that the PipelineOptions shouldn't be cached like
>> this. In this particular wrapper, it was not even necessary.
>>
>> @Jiankang: I've pushed a fix to the repository with a few
>> improvements. Could you please try again? You will have to recompile.
>>
>> Thanks,
>> Max
>>
>> On Thu, Mar 17, 2016 at 8:44 AM, Dan Halperin <dhalp...@google.com> wrote:
>> > +Max for the Flink Runner, and +Luke who wrote most of the initial code
>> > around PipelineOptions.
>> >
>> > The UnboundedFlinkSource is caching the `PipelineOptions` object, here:
>> >
>> https://github.com/apache/incubator-beam/blob/071e4dd67021346b0cab2aafa0900ec7e34c4ef8/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedFlinkSource.java#L36
>> >
>> > I think this is a mismatch with how we intended them to be used. For
>> > example, the PipelineOptions may be changed by a Runner between graph
>> > construction time (when the UnboundedFlinkSource is created) and actual
>> > pipeline execution time. This is partially why, for example,
>> PipelineOptions
>> > are provided by the Runner as an argument to functions like
>> > DoFn.startBundle, .processElement, and .finishBundle.
>> >
>> > PipelineOptions itself does not extend Serializable, and per the
>> > PipelineOptions documentation it looks like we intend for it to be
>> > serialized through Jackson rather than through Java serialization. I bet
>> the
>> > Flink runner does this, and we probably just need to remove this cached
>> > PipelineOptions from the unbounded source.

Re: Capability Matrix

2016-03-19 Thread Maximilian Michels
Well done. The matrix provides a good basis for improving the existing
runners. Moreover, new backends can use it to evaluate capabilities
for creating a runner.

On Fri, Mar 18, 2016 at 1:15 AM, Jean-Baptiste Onofré  wrote:
> Gotcha, thanks!
>
> Regards
> JB
>
>
> On 03/18/2016 12:51 AM, Frances Perry wrote:
>>
>> That's "partially". Check out the full matrix for complete details:
>> http://beam.incubator.apache.org/capability-matrix/
>>
>> On Thu, Mar 17, 2016 at 4:50 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Great job !
>>>
>>> By the way, when you use ~ in the matrix, does it mean that it works only
>>> in some cases (depending on the pipeline or transform) or that it doesn't
>>> work as expected? Just curious about the Aggregators and their meaning in
>>> the Beam Model.
>>>
>>> Thanks,
>>> Regards
>>> JB
>>>
>>>
>>> On 03/18/2016 12:45 AM, Tyler Akidau wrote:
>>>
 Just pushed the capability matrix and an attendant blog post to the
 site:

  - Blog post:


 http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
  - Matrix: http://beam.incubator.apache.org/capability-matrix/

 For those of you that want to keep the matrix up to date as your runner
 evolves, you'll want to make updates in the _data/capability-matrix.yml
 file:


 https://github.com/apache/incubator-beam-site/blob/asf-site/_data/capability-matrix.yml

 Thanks to everyone for helping fill out the initial set of capabilities!
 Looking forward to updates as things progress. :-)

 And thanks also to Max for moving all the website stuff to git!

 -Tyler


 On Sat, Mar 12, 2016 at 9:37 AM Tyler Akidau  wrote:

 Thanks all! At this point, it looks like most all of the fields have
 been
>
> filled out. I'm in the process of migrating the spreadsheet contents to
> YAML within the website source, so I've revoked edit access from the
> doc
> to
> keep things from changing while I'm doing that. If you have further
> edits
> to make, feel free to leave a comment, and I'll incorporate it into the
> YAML.
>
> -Tyler
>
>
> On Thu, Mar 10, 2016 at 12:43 AM Jean-Baptiste Onofré 
> wrote:
>
> Hi Tyler,
>>
>>
>> good idea !
>>
>> I like it !
>>
>> Regards
>> JB
>>
>> On 03/09/2016 11:14 PM, Tyler Akidau wrote:
>>
>>> I just filed BEAM-104
>>> 
>>> regarding publishing a capability matrix on the Beam website. We've
>>>
>> seeded
>>
>>> the spreadsheet linked there (
>>>
>>>
>>
>> https://docs.google.com/spreadsheets/d/1OM077lZBARrtUi6g0X0O0PHaIbFKCD6v0djRefQRE1I/edit
>>
>>> )
>>> with an initial proposed set of capabilities, as well as descriptions
>>>
>> for
>>
>>> the model and Cloud Dataflow. If folks for other runners (currently
>>>
>> Flink
>>
>>> and Spark) could please make sure their columns are filled out as
>>> well,
>>> it'd be much appreciated. Also let us know if there are capabilities
>>> you
>>> think we've missed.
>>>
>>> Our hope is to get this up and published soon, since we've been
>>> getting
>>>
>> a
>>
>>> lot of questions regarding runner capabilities, portability, etc.
>>>
>>> -Tyler
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>>
>

>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: Sorry for un-fixed-up PR merge

2016-03-11 Thread Maximilian Michels
Hi Kenneth,

Thanks for the notice. It happens, we're all human :)

Cheers,
Max

On Fri, Mar 11, 2016 at 12:13 AM, Kenneth Knowles
 wrote:
> I want to apologize for leaving fixup commits in a PR merge I just
> performed. I'm leaving as-is rather than mess about with `git push -f` to
> rewrite a prettier history. Just don't want anyone to think that I would
> normally go about like that.
>
> Kenn


Re: Travis for pull requests

2016-03-10 Thread Maximilian Michels
Well done :)

About the Flink tests in Jenkins: I wondered why they wouldn't execute
and just had a look at the Jenkins job. They seem to run fine:
https://builds.apache.org/job/beam_MavenVerify/35/org.apache.beam$flink-runner/console

On Thu, Mar 10, 2016 at 7:40 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Awesome ! Thanks Davor.
>
> Regards
> JB
>
>
> On 03/10/2016 01:10 AM, Davor Bonaci wrote:
>>
>> I'm happy to announce that we now have both Travis and Jenkins set up in
>> Beam.
>>
>> Both systems are building our master branch. The most recent status is
>> incorporated into the top-level README.md file. Clicking the badge will
>> take you to the specific build results. Additionally, we have automatic
>> coverage for each pull request, with results integrated into the GitHub
>> pull request UI.
>>
>> Exciting!
>>
>> Low-level details:
>> The systems aren't exactly equal. Travis will run on any branch, while
>> Jenkins will run on master only. Travis will run multi-OS, multi-JDK
>> version, while Jenkins does just one combination. Notifications to Travis
>> are pushed, Jenkins periodically polls for changes. Flink tests may not be
>> running in Jenkins right now -- we need to investigate why.
>>
>> On Wed, Mar 9, 2016 at 8:57 AM, Davor Bonaci <da...@google.com> wrote:
>>
>>> Sounds like we are all in agreement. Great!
>>>
>>> On Wed, Mar 9, 2016 at 8:49 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>>
>>>> I agree, and it's what I mean (assuming the signing is OK).
>>>>
>>>> Basically, a release requires the following action:
>>>>
>>>> - mvn release:prepare && mvn release:perform (with pgp signing, etc): it
>>>> can be done by Jenkins, BUT it requires some credentials in
>>>> .m2/settings.xml (for signing and upload on nexus), etc. In lot of
>>>> Apache
>>>> projects, you have some guys dedicated for the releases, and a release
>>>> is
>>>> simply an unique command line to execute (or a procedure to follow)
>>>> - check the release content (human)
>>>> - close the staging repository on nexus (human)
>>>> - send the vote e-mail (human)
>>>> - once the vote passed:
>>>> -- promote the staging repo (human)
>>>> -- update Jira (human)
>>>> -- publish artifacts on dist.apache.org (human)
>>>> -- update reporter.apache.org (human)
>>>> -- send announcement e-mail on the mailing lists (human)
>>>>
>>>> Regards
>>>> JB
>>>>
>>>>
>>>> On 03/09/2016 05:38 PM, Davor Bonaci wrote:
>>>>
>>>>> I think a release manager (a person) should be driving it, but his/her
>>>>> actions can still be automated through Jenkins. For example, a Jenkins
>>>>> job
>>>>> that release manager manually triggers is often better than a set of
>>>>> manual
>>>>> command-line actions. Reasons: less error prone, repeatable, log of
>>>>> actions
>>>>> is kept and is visible to everyone, etc.
>>>>>
>>>>> On Wed, Mar 9, 2016 at 1:25 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>> wrote:
>>>>>
>>>>> Hi Max,
>>>>>>
>>>>>>
>>>>>> I agree to use Jenkins for snapshots, but I don't think it's a good
>>>>>> idea
>>>>>> for release (it's better that a release manager does it IMHO).
>>>>>>
>>>>>> Regards
>>>>>> JB
>>>>>>
>>>>>>
>>>>>> On 03/09/2016 10:12 AM, Maximilian Michels wrote:
>>>>>>
>>>>>> I'm in favor of Travis too. We use it very extensively at Flink. It is
>>>>>>>
>>>>>>> true that Jenkins can provide a much more sophisticated workflow.
>>>>>>> However, its UI is outdated and it is not as nicely integrated with
>>>>>>> GitHub. For outside contributions, IMHO Travis is the best CI system.
>>>>>>>
>>>>>>> We might actually use Jenkins for releases or snapshot deployment.
>>>>>>> Jenkins is very flexible and nicely integrated with the ASF
>>>>>>> infrastructure which makes some things like providing credentials a
>>>>>>> piece of cake.
>>>>>>>

Re: New beam website!

2016-03-07 Thread Maximilian Michels
Hi JB,

I've pushed the web site to the empty repository and I'll tell Infra
to switch to the new repository.

Cheers,
Max


On Fri, Mar 4, 2016 at 5:00 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> Hi Max,
>
> fair enough.
>
> Regards
> JB
>
>
> On 03/04/2016 03:52 PM, Maximilian Michels wrote:
>>
>> Hi JB,
>>
>> Good question. I didn't create the page, but one problem with relative
>> URLs is changing the root directory or switching to https. So instead
>> of relative URLs, we should use {{ site.baseurl }}/resource for all
>> links. That way, we can simply change baseurl in _config.yml and
>> we're good to go. For local testing, we set jekyll --baseurl "". That
>> approach has worked well for the Flink website which is also built
>> with Jekyll.
>>
>> Cheers,
>> Max
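
The baseurl convention Max describes looks like this in a Jekyll site (illustrative values, not the actual Beam site config):

```
# _config.yml
baseurl: /beam   # change here, or override with `jekyll serve --baseurl ""` for local testing
```

and in pages and layouts:

```
<img src="{{ site.baseurl }}/images/beam-logo.png">
<a href="{{ site.baseurl }}/capability-matrix/">Capability Matrix</a>
```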
>>
>> On Fri, Mar 4, 2016 at 1:48 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>>
>>> Hi Max,
>>>
>>> I just cloned your repo and built using jekyll.
>>>
>>> I just wonder why not using relative URL (for images and js location)
>>> instead of absolute ? It would allow us to directly open the website in a
>>> browser. WDYT ?
>>>
>>> Regards
>>> JB
>>>
>>> On 03/01/2016 01:59 PM, Maximilian Michels wrote:
>>>>
>>>>
>>>> As a summary from
>>>> https://issues.apache.org/jira/servicedesk/customer/portal/1/INFRA-11318
>>>> it would work as follows:
>>>>
>>>> We use the 'asf-site' branch of the repository. When we change the
>>>> website, we execute "jekyll build" followed by "jekyll serve" to
>>>> preview the site. Everything is generated in the 'content' directory.
>>>> We then push the changes and they are deployed.
>>>>
>>>> I've prepared everything in my fork:
>>>> https://github.com/mxm/beam-site/tree/asf-site
>>>>
>>>> Unfortunately, I couldn't push the changes to the new repository. The
>>>> permissions don't seem to be set up correctly.
>>>>
>>>> On Tue, Mar 1, 2016 at 10:24 AM, Maximilian Michels <m...@apache.org>
>>>> wrote:
>>>>>
>>>>>
>>>>> Quick update. The Git repository is ready under
>>>>> https://git-wip-us.apache.org/repos/asf/incubator-beam-site.git
>>>>>
>>>>> I'm sorting out the last things and will push the website thereafter.
>>>>> Infra can then proceed to do the pubsub switch to deploy the website
>>>>> from there.
>>>>>
>>>>> On Thu, Feb 25, 2016 at 11:31 PM, Maximilian Michels <m...@apache.org>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Hi JB,
>>>>>>
>>>>>> Greetings to Mexico! I was using Infra's "SVN to Git migration"
>>>>>> service desk. That seems like a standard way for migration to me. It
>>>>>> has also worked fine in the past.
>>>>>>
>>>>>> Could you explain the role of the SCM-Publish Maven plugin? What would
>>>>>> be different from just committing the changes of the website to the
>>>>>> Git/SVN repository? Is it necessary that people use another wrapper
>>>>>> around a version control system?
>>>>>>
>>>>>> After all, what counts is that we can use a Git repository to check in
>>>>>> website changes and use the GitHub mirror. That is not only much
>>>>>> faster (pushing in SVN takes a long time and there is no local
>>>>>> repository) but also more convenient for most people who use Git on a
>>>>>> daily basis. How the actual website is served shouldn't matter to the
>>>>>> developer.
>>>>>>
>>>>>> Cheers,
>>>>>> Max
>>>>>>
>>>>>> On Thu, Feb 25, 2016 at 11:23 PM, Jean-Baptiste Onofré
>>>>>> <j...@nanthrax.net>
>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I agree and it's what we have: the website sources are on my GitHub
>>>>>>> and we will move them to the Apache Git. Anyway, the source is not
>>>>>>> where you publish with the scm-publish
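
As a toy illustration of the {{ site.baseurl }} approach discussed above (plain Python, not Jekyll itself; render_link is a hypothetical helper): every internal link is built from one configurable prefix, so switching the root directory or moving to https becomes a single config change rather than a sweep over all links.

```python
# Toy model (not Jekyll) of what "{{ site.baseurl }}/resource" buys you:
# one configurable prefix for every internal link.

def render_link(baseurl: str, resource: str) -> str:
    """Mimic the '{{ site.baseurl }}/resource' expansion."""
    return f"{baseurl}/{resource.lstrip('/')}"

# Deployed site: baseurl set in _config.yml (hypothetical value):
prod = "https://beam.incubator.apache.org"
# Local preview: the equivalent of running jekyll with --baseurl "":
local = ""

print(render_link(prod, "images/logo.png"))   # https://beam.incubator.apache.org/images/logo.png
print(render_link(local, "images/logo.png"))  # /images/logo.png
```

Changing `prod` here corresponds to editing one line of `_config.yml`, which is the portability argument Max makes for the Flink website setup.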

Re: New beam website!

2016-02-25 Thread Maximilian Michels
Alright, I've asked the Infra team to migrate the repository to Git
and set up a GitHub sync.

Git migration: 
https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-11318
GitHub sync: 
https://issues.apache.org/jira/servicedesk/agent/INFRA/issue/INFRA-11319

Cheers,
Max

On Wed, Feb 24, 2016 at 9:00 PM, James Malone
<jamesmal...@google.com.invalid> wrote:
> That would be awesome. Full disclosure - I 
> Happy to help however I can and any experience anyone has for making that
> happen is welcomed. :)
>
> On Wed, Feb 24, 2016 at 9:05 AM, Maximilian Michels <m...@apache.org> wrote:
>
>> Hi James,
>>
>> The updated website looks good to me. I agree that it will simplify
>> work for a lot of people who are used to Jekyll.
>>
>> The website is already in the SVN repository. After we have sorted out
>> the CCLAs, the only thing that is left is a GitHub sync to make it
>> more accessible for contributors. The prerequisite for that is that we
>> change it to Git. Apache Infra can do that (we have done likewise for
>> the Flink website repository and it works great).
>>
>> Cheers,
>> Max
>>
>> On Wed, Feb 24, 2016 at 7:04 AM, James Malone
>> <jamesmal...@google.com.invalid> wrote:
>> > Hello everyone,
>> >
>> > Since we're in the process of getting bootstrapped, I wanted to improve
>> > on our website a bit, especially since we have a new logo.
>> >
>> > The first site was built using Maven sites. To be honest, it presented a
>> > bit of a problem because many of the Beam committers have used GitHub
>> > docs to date and wanted a system which was somewhat similar. Since GH docs is
>> > built on Jekyll (as far as I know) I figured that might actually be an
>> > easier way to build the website anyway.
>> >
>> > So, I just updated the incubator website with an exact copy of the old
>> > site but using Jekyll + Bootstrap 3. That should hopefully make it easier
>> > for anyone and everyone to work against (and to move existing docs.)
>> >
>> > The repository for the current site is here:
>> >
>> > https://github.com/evilsoapbox/beam-site
>> >
>> > I also have a tgz of the old site in case there's any concern or
>> > disagreement. Needless to say, when we sort out the CCLAs (which should
>> > be very soon) I'd like to get this in a project repo somewhere so we
>> > can have better version control and review.
>> >
>> > Finally, the theme is basic Bootstrap with a few changes (like using
>> > Roboto as the font to match the logo.) I figure if/when there's
>> > interest in detailed design, we can cover that as a separate discussion.
>> >
>> > Cheers!
>> >
>> > James
>>


Re: Flink Runner - Current State & Roadmap

2016-02-15 Thread Maximilian Michels
Just saw there is already a JIRA for including the Flink Runner code:
https://issues.apache.org/jira/browse/BEAM-5

On Mon, Feb 15, 2016 at 11:42 AM, Maximilian Michels <m...@apache.org> wrote:
> Hi,
>
> Thank you all for the positive feedback!
>
> @Mark: Yes, the current GitHub version relies on 1.0.0 of the
> DataflowSDK. Naturally, some things have changed since that release,
> but we figured we'd freeze on this release until more users request a
> newer version. Thanks a lot for the offer. I think it would be great
> if you tried to adapt the Flink Runner while doing the refactoring of
> the final to-be-contributed Beam code. As long as the code is not out
> yet, this is totally fine. Afterwards, we'll continue the development
> within the Beam community.
>
> We deliberately chose the low-level approach of translating the API
> because we think that semantics and correctness are the top priority.
> An optimization in terms of a more direct translation (and possibly
> improved performance) would be the next step. The question is whether
> Beam Runners have to implement more of their own functionality (like
> windows or triggers), or whether we leave this as an optional
> optimization that the Runner chooses to do. That'll be one of the
> things we will have to figure out.
>
> Best,
> Max
>
>
>
> On Sat, Feb 13, 2016 at 5:28 PM, bakey pan <bakey1...@gmail.com> wrote:
>> Hi Max,
>> I am reading the source code of Beam and Flink.
>> I am also interested in contributing code to the Flink runner. Maybe we
>> can talk more about which features are most suitable for me.
>>
>> 2016-02-13 15:17 GMT+08:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>>
>>> Hi Max,
>>>
>>> it sounds good to me !
>>>
>>> Thanks,
>>> Regards
>>> JB
>>>
>>> On 02/12/2016 08:06 PM, Maximilian Michels wrote:
>>>
>>>> Hi Beamers,
>>>>
>>>> Now that things are getting started and we discuss the technical
>>>> vision of Beam, we would like to contribute the Flink runner and start
>>>> by sharing some details about the status and the roadmap.
>>>>
>>>> The Flink Runner integrates deeply and naturally with the Dataflow SDK
>>>> (the Beam precursor), because the Flink DataStream API shares many
>>>> concepts with the Dataflow model.
>>>> Based on whether the program input is bounded or unbounded, the
>>>> program runs against Flink's DataStream or DataSet API.
>>>>
>>>> A quick preview of some of the nice features of the runner:
>>>>
>>>>   - Support for stream transformations, event time, watermarks
>>>>   - The full Dataflow windowing semantics, including fixed/sliding
>>>> time windows, and session windows
>>>>   - Integration with Flink's streaming sources (Kafka, RabbitMQ, ...)
>>>>
>>>>   - Batch (bounded sources) integrates fully with Flink's managed
>>>> memory techniques and out-of-core algorithms, supporting huge joins
>>>> and aggregations.
>>>>   - Integration with Flink's batch API sources (plain text, CSV, Avro,
>>>> JDBC, HBase, ...)
>>>>
>>>>   - Integration with Flink's fault tolerance - both batch and
>>>> streaming programs recover from failures
>>>>   - After upgrading the dependency to Flink 1.0, one could even use
>>>> the Flink Savepoints feature (save streaming state for later resuming)
>>>> with the Dataflow programs.
>>>>
>>>> Attached you can find the document we drew up with more information
>>>> about the current state of the Runner and the roadmap for its upcoming
>>>> features:
>>>>
>>>>
>>>> https://docs.google.com/document/d/1QM_X70VvxWksAQ5C114MoAKb1d9Vzl2dLxEZM4WYogo/edit?usp=sharing
>>>>
>>>> The Runner executes the quasi-complete Beam streaming model (well,
>>>> Dataflow, actually, because Beam is not there, yet).
>>>>
>>>> Given the current excitement and buzz around Beam, we could add this
>>>> runner to the Beam repository and link it as a "preview" for the
>>>> people that want to get a feeling of what it will be like to write and
>>>> run streaming (unbounded) Beam programs. That would give people
>>>> something tangible until the actual Beam code is available.
>>>>
>>>> What do you think?
>>>>
>>>> Best,
>>>> Max
>>>>
>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
>>
>> --
>>  Best Regards
>>BakeyPan


Re: Flink Runner - Current State & Roadmap

2016-02-15 Thread Maximilian Michels
Hi,

Thank you all for the positive feedback!

@Mark: Yes, the current GitHub version relies on 1.0.0 of the
DataflowSDK. Naturally, some things have changed since that release,
but we figured we'd freeze on this release until more users request a
newer version. Thanks a lot for the offer. I think it would be great
if you tried to adapt the Flink Runner while doing the refactoring of
the final to-be-contributed Beam code. As long as the code is not out
yet, this is totally fine. Afterwards, we'll continue the development
within the Beam community.

We deliberately chose the low-level approach of translating the API
because we think that semantics and correctness are the top priority.
An optimization in terms of a more direct translation (and possibly
improved performance) would be the next step. The question is whether
Beam Runners have to implement more of their own functionality (like
windows or triggers), or whether we leave this as an optional
optimization that the Runner chooses to do. That'll be one of the
things we will have to figure out.

Best,
Max



On Sat, Feb 13, 2016 at 5:28 PM, bakey pan <bakey1...@gmail.com> wrote:
> Hi Max,
> I am reading the source code of Beam and Flink.
> I am also interested in contributing code to the Flink runner. Maybe we
> can talk more about which features are most suitable for me.
>
> 2016-02-13 15:17 GMT+08:00 Jean-Baptiste Onofré <j...@nanthrax.net>:
>
>> Hi Max,
>>
>> it sounds good to me !
>>
>> Thanks,
>> Regards
>> JB
>>
>> On 02/12/2016 08:06 PM, Maximilian Michels wrote:
>>
>>> Hi Beamers,
>>>
>>> Now that things are getting started and we discuss the technical
>>> vision of Beam, we would like to contribute the Flink runner and start
>>> by sharing some details about the status and the roadmap.
>>>
>>> The Flink Runner integrates deeply and naturally with the Dataflow SDK
>>> (the Beam precursor), because the Flink DataStream API shares many
>>> concepts with the Dataflow model.
>>> Based on whether the program input is bounded or unbounded, the
>>> program runs against Flink's DataStream or DataSet API.
>>>
>>> A quick preview of some of the nice features of the runner:
>>>
>>>   - Support for stream transformations, event time, watermarks
>>>   - The full Dataflow windowing semantics, including fixed/sliding
>>> time windows, and session windows
>>>   - Integration with Flink's streaming sources (Kafka, RabbitMQ, ...)
>>>
>>>   - Batch (bounded sources) integrates fully with Flink's managed
>>> memory techniques and out-of-core algorithms, supporting huge joins
>>> and aggregations.
>>>   - Integration with Flink's batch API sources (plain text, CSV, Avro,
>>> JDBC, HBase, ...)
>>>
>>>   - Integration with Flink's fault tolerance - both batch and
>>> streaming programs recover from failures
>>>   - After upgrading the dependency to Flink 1.0, one could even use
>>> the Flink Savepoints feature (save streaming state for later resuming)
>>> with the Dataflow programs.
>>>
>>> Attached you can find the document we drew up with more information
>>> about the current state of the Runner and the roadmap for its upcoming
>>> features:
>>>
>>>
>>> https://docs.google.com/document/d/1QM_X70VvxWksAQ5C114MoAKb1d9Vzl2dLxEZM4WYogo/edit?usp=sharing
>>>
>>> The Runner executes the quasi-complete Beam streaming model (well,
>>> Dataflow, actually, because Beam is not there, yet).
>>>
>>> Given the current excitement and buzz around Beam, we could add this
>>> runner to the Beam repository and link it as a "preview" for the
>>> people that want to get a feeling of what it will be like to write and
>>> run streaming (unbounded) Beam programs. That would give people
>>> something tangible until the actual Beam code is available.
>>>
>>> What do you think?
>>>
>>> Best,
>>> Max
>>>
>>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>
> --
>  Best Regards
>BakeyPan


Re: Apache Beam blog

2016-02-12 Thread Maximilian Michels
+1 Looks nice.

I'm sure we'll iterate over the design :)

On Fri, Feb 12, 2016 at 6:58 PM, Tyler Akidau  wrote:
> +1
>
> -Tyler
>
> On Fri, Feb 12, 2016 at 9:57 AM Amit Sela  wrote:
>
>> +1
>>
>> I think we could also publish users' use-case examples and stories. "How we
>> are using Beam" or something like that.
>>
>> On Fri, Feb 12, 2016, 19:49 James Malone 
>> wrote:
>>
>> > Hello everyone,
>> >
>> > Now that we have a skeleton website up (hooray!) I wanted to raise the
>> > idea of a "Beam Blog." I am thinking this blog is where we can show
>> > news, cool new Beam-things, examples for Apache Beam, and so on. This
>> > blog would live under the project website (http://beam.incubator.apache.org).
>> >
>> > To that end, I would like to poll how the larger community feels about
>> > this. Additionally, if people think it's a good idea, I'd like to start
>> > thinking about content which we would want to feature on the blog.
>> >
>> > I happily volunteer to coordinate, review, and assist with the blog.
>> >
>> > Cheers!
>> >
>> > James
>> >
>>


Flink Runner - Current State & Roadmap

2016-02-12 Thread Maximilian Michels
Hi Beamers,

Now that things are getting started and we discuss the technical
vision of Beam, we would like to contribute the Flink runner and start
by sharing some details about the status and the roadmap.

The Flink Runner integrates deeply and naturally with the Dataflow SDK
(the Beam precursor), because the Flink DataStream API shares many
concepts with the Dataflow model.
Based on whether the program input is bounded or unbounded, the
program runs against Flink's DataStream or DataSet API.
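
The bounded/unbounded dispatch described above can be sketched as follows (a hypothetical Python model; IsBounded and choose_flink_api are illustrative names only, not actual Runner code):

```python
# Hypothetical sketch of the runner's dispatch: bounded inputs are
# translated against Flink's batch (DataSet) API, unbounded inputs
# against the streaming (DataStream) API. Names are illustrative.

from enum import Enum

class IsBounded(Enum):
    BOUNDED = "bounded"
    UNBOUNDED = "unbounded"

def choose_flink_api(boundedness: IsBounded) -> str:
    if boundedness is IsBounded.BOUNDED:
        return "DataSet"      # batch path: managed memory, out-of-core algorithms
    return "DataStream"       # streaming path: event time, watermarks, windows

print(choose_flink_api(IsBounded.BOUNDED))    # -> DataSet
print(choose_flink_api(IsBounded.UNBOUNDED))  # -> DataStream
```

The point of the sketch is only that the decision is made once, from the boundedness of the input, and the rest of the translation follows from it.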

A quick preview of some of the nice features of the runner:

  - Support for stream transformations, event time, watermarks
  - The full Dataflow windowing semantics, including fixed/sliding
time windows, and session windows
  - Integration with Flink's streaming sources (Kafka, RabbitMQ, ...)

  - Batch (bounded sources) integrates fully with Flink's managed
memory techniques and out-of-core algorithms, supporting huge joins
and aggregations.
  - Integration with Flink's batch API sources (plain text, CSV, Avro,
JDBC, HBase, ...)

  - Integration with Flink's fault tolerance - both batch and
streaming programs recover from failures
  - After upgrading the dependency to Flink 1.0, one could even use
the Flink Savepoints feature (save streaming state for later resuming)
with the Dataflow programs.
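
As a toy model of the fixed and sliding window semantics listed above (illustrative Python, not the Dataflow/Beam implementation; timestamps in milliseconds):

```python
# Toy model of window assignment: a fixed (tumbling) window maps each
# event timestamp to exactly one interval [start, start + size); a
# sliding window of size s and period p maps it to every window of
# size s, spaced p apart, that covers it.

def fixed_window(timestamp_ms: int, size_ms: int) -> tuple:
    start = timestamp_ms - (timestamp_ms % size_ms)
    return (start, start + size_ms)

def sliding_windows(timestamp_ms: int, size_ms: int, period_ms: int):
    """All [start, start + size) windows containing the timestamp."""
    last_start = timestamp_ms - (timestamp_ms % period_ms)
    starts = range(last_start, timestamp_ms - size_ms, -period_ms)
    return [(s, s + size_ms) for s in starts]

print(fixed_window(3_500, 1_000))            # (3000, 4000)
print(sliding_windows(3_500, 2_000, 1_000))  # [(3000, 5000), (2000, 4000)]
```

Session windows, also mentioned above, depend on gaps between elements rather than a fixed grid, so they do not fit this simple per-element formula.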

Attached you can find the document we drew up with more information
about the current state of the Runner and the roadmap for its upcoming
features:

https://docs.google.com/document/d/1QM_X70VvxWksAQ5C114MoAKb1d9Vzl2dLxEZM4WYogo/edit?usp=sharing

The Runner executes the quasi-complete Beam streaming model (well,
Dataflow, actually, because Beam is not there, yet).

Given the current excitement and buzz around Beam, we could add this
runner to the Beam repository and link it as a "preview" for the
people that want to get a feeling of what it will be like to write and
run streaming (unbounded) Beam programs. That would give people
something tangible until the actual Beam code is available.

What do you think?

Best,
Max


Re: PPMC

2016-02-05 Thread Maximilian Michels
Hello Tyler,

Thanks for summing it up for the newly registered. Thanks to everyone
at Google for being so straightforward on this matter.

Best,
Max

On Thu, Feb 4, 2016 at 11:18 PM, Tyler Akidau  wrote:
> Hello Beamers!
>
>
> To summarize a discussion that started while infrastructure was being set
> up:
>
>   - Google folks proposed a PPMC composed of a subset of the initial
>     committers.
>   - Multiple folks pointed out that the rules say the PPMC is to be
>     initial committers + mentors.
>   - Multiple folks also noted they would be fine with the more limited
>     PPMC.
>   - Google folks agreed a PPMC of initial committers + mentors is what the
>     rules state, so we should go with that.
>
> Please feel free to jump in and correct anything I've misrepresented in my
> summary, or if you feel there's anything further to discuss. Given that
> we're simply going with what is stated in the rules, it seems there is no need
> for a vote of any kind.
>
> -Tyler