Re: [VOTE] Go SDK

2018-05-21 Thread Jean-Baptiste Onofré

Hi Henning,

SGA has been filed for the entire project during the incubation period.

Here, we have to check if SGA/IP donation is clean for the Go SDK.

We don't have a lot to do, just check that we are clean on this front.

Regards
JB

On 22/05/2018 06:42, Henning Rohde wrote:

Thanks everyone!

Davor -- regarding your two comments:
   * Robert mentioned that "SGA should have probably already been filed" 
in the previous thread. I got the impression that nothing further was 
needed. I'll follow up.
   * The standard Go tooling basically always pulls directly from 
github, so there is no real urgency here.


Thanks,
  Henning


On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré wrote:


+1 (binding)

I just want to check about SGA/IP/Headers.

Thanks !
Regards
JB

On 22/05/2018 03:02, Henning Rohde wrote:
 > Hi everyone,
 >
 > Now that the remaining issues have been resolved as discussed,
I'd like
 > to propose a formal vote on accepting the Go SDK into master. The
main
 > practical difference is that the Go SDK would be part of the
Apache Beam
 > release going forward.
 >
 > Highlights of the Go SDK:
 >   * Go user experience with natively-typed DoFns with (simulated)
 > generic types
 >   * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten,
Combine,
 > Windowing, ..
 >   * Includes several IO connectors: Datastore, BigQuery, PubSub,
 > extensible textio.
 >   * Supports the portability framework for both batch and streaming,
 > notably the upcoming portable Flink runner
 >   * Supports a direct runner for small batch workloads and testing.
 >   * Includes pre-commit tests and post-commit integration tests.
 >
 > And last but not least
 >   *  includes contributions from several independent users and
 > developers, notably an IO connector for Datastore!
 >
 > Website: https://beam.apache.org/documentation/sdks/go/
 > Code: https://github.com/apache/beam/tree/master/sdks/go
 > Design: https://s.apache.org/beam-go-sdk-design-rfc
 >
 > Please vote:
 > [ ] +1, Approve that the Go SDK becomes an official part of Beam
 > [ ] -1, Do not approve (please provide specific comments)
 >
 > Thanks,
 >   The Gophers of Apache Beam
 >
 >



Re: [VOTE] Go SDK

2018-05-21 Thread Reuven Lax
+1 (binding)

While this is the third language for Beam, I believe it's the first new one
since Beam began. Including this in Beam will be a nice milestone for the
project.

Reuven

On Mon, May 21, 2018 at 9:43 PM Henning Rohde  wrote:

> Thanks everyone!
>
> Davor -- regarding your two comments:
>   * Robert mentioned that "SGA should have probably already been filed"
> in the previous thread. I got the impression that nothing further was
> needed. I'll follow up.
>   * The standard Go tooling basically always pulls directly from github,
> so there is no real urgency here.
>
> Thanks,
>  Henning
>
>
> On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré 
> wrote:
>
>> +1 (binding)
>>
>> I just want to check about SGA/IP/Headers.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 22/05/2018 03:02, Henning Rohde wrote:
>> > Hi everyone,
>> >
>> > Now that the remaining issues have been resolved as discussed, I'd like
>> > to propose a formal vote on accepting the Go SDK into master. The main
>> > practical difference is that the Go SDK would be part of the Apache
>> Beam
>> > release going forward.
>> >
>> > Highlights of the Go SDK:
>> >   * Go user experience with natively-typed DoFns with (simulated)
>> > generic types
>> >   * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
>> > Windowing, ..
>> >   * Includes several IO connectors: Datastore, BigQuery, PubSub,
>> > extensible textio.
>> >   * Supports the portability framework for both batch and streaming,
>> > notably the upcoming portable Flink runner
>> >   * Supports a direct runner for small batch workloads and testing.
>> >   * Includes pre-commit tests and post-commit integration tests.
>> >
>> > And last but not least
>> >   *  includes contributions from several independent users and
>> > developers, notably an IO connector for Datastore!
>> >
>> > Website: https://beam.apache.org/documentation/sdks/go/
>> > Code: https://github.com/apache/beam/tree/master/sdks/go
>> > Design: https://s.apache.org/beam-go-sdk-design-rfc
>> >
>> > Please vote:
>> > [ ] +1, Approve that the Go SDK becomes an official part of Beam
>> > [ ] -1, Do not approve (please provide specific comments)
>> >
>> > Thanks,
>> >   The Gophers of Apache Beam
>> >
>> >
>>
>


Re: [VOTE] Go SDK

2018-05-21 Thread Henning Rohde
Thanks everyone!

Davor -- regarding your two comments:
  * Robert mentioned that "SGA should have probably already been filed" in
the previous thread. I got the impression that nothing further was needed.
I'll follow up.
  * The standard Go tooling basically always pulls directly from github, so
there is no real urgency here.

Thanks,
 Henning


On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré 
wrote:

> +1 (binding)
>
> I just want to check about SGA/IP/Headers.
>
> Thanks !
> Regards
> JB
>
> On 22/05/2018 03:02, Henning Rohde wrote:
> > Hi everyone,
> >
> > Now that the remaining issues have been resolved as discussed, I'd like
> > to propose a formal vote on accepting the Go SDK into master. The main
> > practical difference is that the Go SDK would be part of the Apache Beam
> > release going forward.
> >
> > Highlights of the Go SDK:
> >   * Go user experience with natively-typed DoFns with (simulated)
> > generic types
> >   * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
> > Windowing, ..
> >   * Includes several IO connectors: Datastore, BigQuery, PubSub,
> > extensible textio.
> >   * Supports the portability framework for both batch and streaming,
> > notably the upcoming portable Flink runner
> >   * Supports a direct runner for small batch workloads and testing.
> >   * Includes pre-commit tests and post-commit integration tests.
> >
> > And last but not least
> >   *  includes contributions from several independent users and
> > developers, notably an IO connector for Datastore!
> >
> > Website: https://beam.apache.org/documentation/sdks/go/
> > Code: https://github.com/apache/beam/tree/master/sdks/go
> > Design: https://s.apache.org/beam-go-sdk-design-rfc
> >
> > Please vote:
> > [ ] +1, Approve that the Go SDK becomes an official part of Beam
> > [ ] -1, Do not approve (please provide specific comments)
> >
> > Thanks,
> >   The Gophers of Apache Beam
> >
> >
>


Re: [VOTE] Go SDK

2018-05-21 Thread Jean-Baptiste Onofré

+1 (binding)

I just want to check about SGA/IP/Headers.

Thanks !
Regards
JB

On 22/05/2018 03:02, Henning Rohde wrote:

Hi everyone,

Now that the remaining issues have been resolved as discussed, I'd like 
to propose a formal vote on accepting the Go SDK into master. The main 
practical difference is that the Go SDK would be part of the Apache Beam 
release going forward.


Highlights of the Go SDK:
  * Go user experience with natively-typed DoFns with (simulated) 
generic types
  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine, 
Windowing, ..
  * Includes several IO connectors: Datastore, BigQuery, PubSub, 
extensible textio.
  * Supports the portability framework for both batch and streaming, 
notably the upcoming portable Flink runner

  * Supports a direct runner for small batch workloads and testing.
  * Includes pre-commit tests and post-commit integration tests.

And last but not least
  *  includes contributions from several independent users and 
developers, notably an IO connector for Datastore!


Website: https://beam.apache.org/documentation/sdks/go/
Code: https://github.com/apache/beam/tree/master/sdks/go
Design: https://s.apache.org/beam-go-sdk-design-rfc

Please vote:
[ ] +1, Approve that the Go SDK becomes an official part of Beam
[ ] -1, Do not approve (please provide specific comments)

Thanks,
  The Gophers of Apache Beam




Re: [VOTE] Go SDK

2018-05-21 Thread Davor Bonaci
+1 (binding), with the following caveats:

* [before closing the vote] Completion of IP clearance process, as we've
been requested. It is easier to do it than having to argue why it is not
necessary.
* [at any time, possibly later] Figuring out the release mechanics.

Great work; across the board!

On Mon, May 21, 2018 at 6:29 PM, Jason Kuster 
wrote:

> +1! So excited to have gotten to this point -- congrats to all. I've been
> excited to do some reviews of the Go SDK since becoming a committer; really
> happy about this.
>
> On Mon, May 21, 2018 at 6:03 PM Henning Rohde  wrote:
>
>> Hi everyone,
>>
>> Now that the remaining issues have been resolved as discussed, I'd like
>> to propose a formal vote on accepting the Go SDK into master. The main
>> practical difference is that the Go SDK would be part of the Apache Beam
>> release going forward.
>>
>> Highlights of the Go SDK:
>>  * Go user experience with natively-typed DoFns with (simulated) generic
>> types
>>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
>> Windowing, ..
>>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
>> extensible textio.
>>  * Supports the portability framework for both batch and streaming,
>> notably the upcoming portable Flink runner
>>  * Supports a direct runner for small batch workloads and testing.
>>  * Includes pre-commit tests and post-commit integration tests.
>>
>> And last but not least
>>  *  includes contributions from several independent users and developers,
>> notably an IO connector for Datastore!
>>
>> Website: https://beam.apache.org/documentation/sdks/go/
>> Code: https://github.com/apache/beam/tree/master/sdks/go
>> Design: https://s.apache.org/beam-go-sdk-design-rfc
>>
>> Please vote:
>> [ ] +1, Approve that the Go SDK becomes an official part of Beam
>> [ ] -1, Do not approve (please provide specific comments)
>>
>> Thanks,
>>  The Gophers of Apache Beam
>>
>>
>>
>
> --
> ---
> Jason Kuster
> Apache Beam / Google Cloud Dataflow
>
> See something? Say something. go/jasonkuster-feedback
>


Re: [VOTE] Go SDK

2018-05-21 Thread Jason Kuster
+1! So excited to have gotten to this point -- congrats to all. I've been
excited to do some reviews of the Go SDK since becoming a committer; really
happy about this.

On Mon, May 21, 2018 at 6:03 PM Henning Rohde  wrote:

> Hi everyone,
>
> Now that the remaining issues have been resolved as discussed, I'd like to
> propose a formal vote on accepting the Go SDK into master. The main
> practical difference is that the Go SDK would be part of the Apache Beam
> release going forward.
>
> Highlights of the Go SDK:
>  * Go user experience with natively-typed DoFns with (simulated) generic
> types
>  * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
> Windowing, ..
>  * Includes several IO connectors: Datastore, BigQuery, PubSub,
> extensible textio.
>  * Supports the portability framework for both batch and streaming,
> notably the upcoming portable Flink runner
>  * Supports a direct runner for small batch workloads and testing.
>  * Includes pre-commit tests and post-commit integration tests.
>
> And last but not least
>  *  includes contributions from several independent users and developers,
> notably an IO connector for Datastore!
>
> Website: https://beam.apache.org/documentation/sdks/go/
> Code: https://github.com/apache/beam/tree/master/sdks/go
> Design: https://s.apache.org/beam-go-sdk-design-rfc
>
> Please vote:
> [ ] +1, Approve that the Go SDK becomes an official part of Beam
> [ ] -1, Do not approve (please provide specific comments)
>
> Thanks,
>  The Gophers of Apache Beam
>
>
>

-- 
---
Jason Kuster
Apache Beam / Google Cloud Dataflow

See something? Say something. go/jasonkuster-feedback


Jenkins build is back to normal : beam_SeedJob #1753

2018-05-21 Thread Apache Jenkins Server
See 



[VOTE] Go SDK

2018-05-21 Thread Henning Rohde
Hi everyone,

Now that the remaining issues have been resolved as discussed, I'd like to
propose a formal vote on accepting the Go SDK into master. The main
practical difference is that the Go SDK would be part of the Apache Beam
release going forward.

Highlights of the Go SDK:
 * Go user experience with natively-typed DoFns with (simulated) generic
types
 * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten, Combine,
Windowing, ..
 * Includes several IO connectors: Datastore, BigQuery, PubSub, extensible
textio.
 * Supports the portability framework for both batch and streaming, notably
the upcoming portable Flink runner
 * Supports a direct runner for small batch workloads and testing.
 * Includes pre-commit tests and post-commit integration tests.

And last but not least
 *  includes contributions from several independent users and developers,
notably an IO connector for Datastore!

Website: https://beam.apache.org/documentation/sdks/go/
Code: https://github.com/apache/beam/tree/master/sdks/go
Design: https://s.apache.org/beam-go-sdk-design-rfc

Please vote:
[ ] +1, Approve that the Go SDK becomes an official part of Beam
[ ] -1, Do not approve (please provide specific comments)

Thanks,
 The Gophers of Apache Beam


Build failed in Jenkins: beam_SeedJob #1752

2018-05-21 Thread Apache Jenkins Server
See 

--
GitHub pull request #5406 of commit 93fc6ef85c449a912affece91e26919a75041756, 
no merge conflicts.
Setting status of 93fc6ef85c449a912affece91e26919a75041756 to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/1752/ and message: 'Build started 
sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam14 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/5406/*:refs/remotes/origin/pr/5406/*
 > git rev-parse refs/remotes/origin/pr/5406/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/5406/merge^{commit} # timeout=10
Checking out Revision 152a25533693dd3b81682c172f43381669e1c560 
(refs/remotes/origin/pr/5406/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 152a25533693dd3b81682c172f43381669e1c560
Commit message: "Merge 93fc6ef85c449a912affece91e26919a75041756 into 
9faf6a994e8347aa7aba185cef55862f883a8544"
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
Processing DSL script job_00_seed.groovy
Processing DSL script job_Dependency_Check.groovy
ERROR: (job_Dependency_Check.groovy, line 40) No signature of method: 
javaposse.jobdsl.dsl.helpers.step.GradleContext.switches() is applicable for 
argument types: (java.lang.String, java.lang.String) values: 
[-Drevision=release, -DreportfileName=java_dependency_report]
Possible solutions: switches(java.lang.String), getSwitches()



Proposal for Beam Python User State and Timer APIs

2018-05-21 Thread Charles Chen
I want to share a proposal for adding user state and timer support to the
Beam Python SDK and get the community's thoughts on how such an API should
look: https://s.apache.org/beam-python-user-state-and-timers

Let me know what you think and please add any comments and suggestions you
may have.

Best,
Charles


Re: The Go SDK got accidentally merged - options to deal with the pain

2018-05-21 Thread Henning Rohde
Hi everyone,

 Thanks again for your patience. The last remaining Go SDK items are now
resolved and the beam website has been updated! I'll start a separate
thread for the formal vote shortly.

Thanks,
 Henning


On Thu, Apr 19, 2018 at 5:42 PM Henning Rohde  wrote:

> Hi everyone,
>
>  Thank you all for your patience. The last major identified feature (Go
> windowing) is now in review: https://github.com/apache/beam/pull/5179.
> The remaining work listed under
>
>  https://issues.apache.org/jira/browse/BEAM-2083
>
> is integration tests and documentation (quickstart, etc). I expect that
> will take a few weeks after which we should be in a position to do a vote
> about making the Go SDK an official Beam SDK. To this end, please do take a
> look at the listed tasks and let me know if there is anything missing.
>
> Lastly, I have a practical question: how should we order the PRs to the
> beam site documentation wrt the vote? Should we get PRs accepted, but not
> committed before a vote? Or just commit them as they are ready to avoid
> potential merge conflicts?
>
> Thanks!
>
> Henning
>
>
>
>
> On Sat, Mar 10, 2018 at 10:45 AM Henning Rohde  wrote:
>
>> Thank you all! I've added the remaining work -- as I understand it -- as
>> dependencies to the overall Go SDK issue (tracking the "official" merge to
>> master):
>>
>> https://issues.apache.org/jira/browse/BEAM-2083
>>
>> Please feel free to add to this list or expand the items, if there is
>> anything I overlooked. If the presence of the Go SDK in master causes
>> issues for other modules, please simply file a bug against me and I'll take
>> care of it.
>>
>> Robert - I understand your last reply as addressing Davor's points.
>> Please let me know if there is anything I need to do in that regard.
>>
>> Henning
>>
>>
>>
>> On Fri, Mar 9, 2018 at 8:39 AM, Ismaël Mejía  wrote:
>>
>>> +1 to let it evolve in master (+Davor points), having ongoing work on
>>> master makes sense given the state of advance + the hope that this
>>> won't add any issue for the other modules.
>>>
>>> On Thu, Mar 8, 2018 at 7:30 PM, Robert Bradshaw 
>>> wrote:
>>> > +1 to both of these points. SGA should have probably already been
>>> filed, and
>>> > excising this from releases should be easy, but I added a line item to
>>> the
>>> > validation checklist template to make sure we don't forget.
>>> >
>>> > On Thu, Mar 8, 2018 at 7:13 AM Davor Bonaci  wrote:
>>> >>
>>> >> I support leaving things as they stand now -- thanks for finding a
>>> good
>>> >> way out of an uncomfortable situation.
>>> >>
>>> >> That said, two things need to happen:
>>> >> (1) SGA needs to be filed asap, per Board feedback in the last
>>> report, and
>>> >> (2) releases cannot contain any code from the Go SDK before formally
>>> voted
>>> >> on the new component and accepted. This includes source releases that
>>> are
>>> >> created through "assembly", so manual exclusion in the configuration
>>> is
>>> >> likely needed.
>>> >>
>>> >> On Wed, Mar 7, 2018 at 1:54 PM, Kenneth Knowles 
>>> wrote:
>>> >>>
>>> >>> Re-reading the old thread, I see these desirata:
>>> >>>
>>> >>>  - "enough IO to write end-to-end examples such as WordCount and
>>> >>> demonstrate what IOs would look like"
>>> >>>  - "accounting and tracking the fact that each element has an
>>> associated
>>> >>> window and timestamp"
>>> >>>  - "test suites and test utilities"
>>> >>>
>>> >>> Browsing the code, it looks like these each exist to some level of
>>> >>> completion.
>>> >>>
>>> >>> Kenn
>>> >>>
>>> >>>
>>> >>> On Wed, Mar 7, 2018 at 1:38 PM Robert Bradshaw 
>>> >>> wrote:
>>> 
>>>  I was actually thinking along the same lines: what was yet lacking
>>> to
>>>  "officially" merge the Go branch in? The thread we started on this
>>> seems to
>>>  have fizzled out over the holidays, but windowing support is the
>>> only
>>>  must-have missing technical feature in my book (assuming
>>> documentation and
>>>  testing are, or are brought up to snuff).
>>> 
>>> 
>>>  On Wed, Mar 7, 2018 at 1:35 PM Henning Rohde 
>>> wrote:
>>> >
>>> > One thought: the Go SDK is actually not that far away from
>>> satisfying
>>> > the guidelines for merging to master anyway (as discussed here
>>> [1]). If we
>>> > decide to simply leave the code in master -- which seems to be
>>> what this
>>> > thread is leaning towards -- I'll gladly sign up to do the
>>> remaining aspects
>>> > (I believe it's only windowing, validation tests and documentation)
>>> > reasonably quickly to get to an official vote for accepting it and
>>> in turn
>>> > get master into a sound state. It would seem like the path of
>>> least hassle.
>>> > Of course, I'm happy to go with whatever the community is
>>> comfortable with
>>> > -- just trying to make 

Re: Proposal: keeping precommit times fast

2018-05-21 Thread Mikhail Gryzykhin
What we can do here is estimate how much effort we want to put in and set
a remote target.
Such as:
Third quarter 2018 -- 1hr SLO
Fourth quarter 2018 -- 30min SLO,
etc.

Combined with policy for newly added tests, this can give us some goal to
aim for.

--Mikhail

Have feedback?


On Mon, May 21, 2018 at 2:06 PM Scott Wegner  wrote:

> Thanks for the proposal, I left comments in the doc. Overall I think it's
> a great idea.
>
> I've seen other projects with much faster pre-commits, and it requires
> strict guidelines on unit test design and keeping tests isolated in-memory
> as much as possible. That's not currently the case in Java; we have
> pre-commits which submit pipelines to Dataflow service.
>
> I don't know if it's feasible to get Java down to 15-20 mins in the short
> term, but a good starting point would be to document the requirements for a
> test to run as pre-commit, and start enforcing it for new tests.
>
>
> On Fri, May 18, 2018 at 3:25 PM Henning Rohde  wrote:
>
>> Good proposal. I think it should be considered in tandem with the "No
>> commit on red post-commit" proposal and could be far more ambitious than 2
>> hours. For example, something in the <15-20 mins range, say, would be much
>> less of an inconvenience to the development effort. Go takes ~3 mins, which
>> means that it is practical to wait until a PR is green before asking anyone
>> to look at it. If I need to wait for a Java or Python pre-commit, I task
>> switch and come back later. If the post-commits are enforced to be green,
>> we could possibly gain a much more productive flow at the cost of the
>> occasional post-commit break, compared to now. Maybe IOs can be less
>> extensively tested pre-commit, for example, or only if actually changed?
>>
>> I also like Robert's suggestion of splitting up pre-commits into something
>> more fine-grained to get a clear partial signal quicker. If we have an
>> adequate number of Jenkins slots, it might also speed things up overall.
>>
>> Thanks,
>>  Henning
>>
>> On Fri, May 18, 2018 at 12:30 PM Scott Wegner  wrote:
>>
>>> re: intelligently skipping tests for code that doesn't change (i.e. Java
>>> tests on Python PR): this should be possible. We already have build-caching
>>> enabled in Gradle, but I believe it is local to the git workspace and
>>> doesn't persist between Jenkins runs.
>>>
>>> With a quick search, I see there is a Jenkins Build Cacher Plugin [1]
>>> that hooks into Gradle build cache and does exactly what we need. Does
>>> anybody know whether we could get this enabled on our Jenkins?
>>>
>>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>>
>>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw 
>>> wrote:
>>>
 [somehow  my email got garbled...]

 Now that we're using gradle, perhaps we could be more intelligent about
 only running the affected tests? E.g. when you touch Python (or Go) you
 shouldn't need to run the Java precommit at all, which would reduce the
 latency for those PRs and also the time spent in queue. Presumably this
 could even be applied per-module for the Java tests. (Maybe a large, shared
 build cache could help here as well...)

 I also wouldn't be opposed to a quicker immediate signal, plus more
 extensive tests before actually merging. It's also nice to not have to wait
 an hour to see that you have a lint error; quick stuff like that could be
 signaled quickly before a contributor loses context.

 - Robert



 On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:

> I like the idea. I think it is a good time for the project to start
> tracking this and keeping it usable.
>
> Certainly 2 hours is more than enough, is that not so? The Java
> precommit seems to take <=40 minutes while Python takes ~20 and Go is so
> fast it doesn't matter. Do we have enough stragglers that we don't
> make it in the 95th percentile? Is the time spent in the Jenkins queue?
>
> For our current coverage, I'd be willing to go for:
>
>  - 1 hr hard cap (someone better at stats could choose %ile)
>  - roll back or remove test from precommit if fix looks like more than
> 1 week (roll back if it is perf degradation, remove test from precommit if
> it is additional coverage that just doesn't fit in the time)
>
> There's a longer-term issue that doing a full build each time is
> expected to linearly scale up with the size of our repo (it is the 
> monorepo
> problem but for a minirepo) so there is no cap that is feasible until we
> have effective cross-build caching. And my long-term goal would be <30
> minutes. At the latency of opening a pull request and then checking your
> email that's not burdensome, but an hour is.
>
> Kenn
>
> On Thu, May 17, 2018 at 

Re: Proposal: keeping post-commit tests green

2018-05-21 Thread Mikhail Gryzykhin
Hi Everyone,

I've updated design doc according to comments.
https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME

In general, ideas proposed seem to be appreciated. Still, some of sections
require more discussion.

Changes highlight:
* Added roll-back first policy to best practices. This includes process on
how to handle roll-back.
* Marked topics that I'd like to have more input on. [cyan color]

--Mikhail

Have feedback?


On Fri, May 18, 2018 at 10:56 AM Andrew Pilloud  wrote:

> Blocking commits to master on test flaps seems critical here. The test
> flaps won't get the attention they deserve as long as people are just
> spamming their PRs with 'Run Java Precommit' until they turn green. I'm
> guilty of this behavior and I know it masks new flaky tests.
>
> I added a comment to your doc about detecting flaky tests. This can easily
> be done by rerunning the postcommits during times when Jenkins would
> otherwise be idle. You'll easily get a few dozen runs every weekend, you
> just need a process to triage all the flakes and ensure there are bugs. I
> worked on a project that did this along with blocking master on any post
> commit failure. It was painful for the first few weeks, but things got
> significantly better once most of the bugs were fixed.
>
> Andrew
>
> On Fri, May 18, 2018 at 10:39 AM Kenneth Knowles  wrote:
>
>> Love it. I would pull out from the doc also the key point: make the
>> postcommit status constantly visible to everyone.
>>
>> Kenn
>>
>> On Fri, May 18, 2018 at 10:17 AM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm Mikhail and started working on Google Dataflow several months ago.
>>> I'm really excited to work with Beam opensource community.
>>>
>>> I have a proposal to improve contributor experience by keeping
>>> post-commit tests green.
>>>
>>> I'm looking to get community consensus and approval about the process
>>> for keeping post-commit tests green and addressing post-commit test
>>> failures.
>>>
>>> Find full list of ideas brought in for discussion in this document:
>>>
>>> https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME
>>>
>>> Key points are:
>>> 1. Add explicit tracking of failures via JIRA
>>> 2. No-Commit policy when post-commit tests are red
>>>
>>> --Mikhail
>>>
>>>


Re: Proposal: keeping precommit times fast

2018-05-21 Thread Scott Wegner
Thanks for the proposal, I left comments in the doc. Overall I think it's a
great idea.

I've seen other projects with much faster pre-commits, and it requires
strict guidelines on unit test design and keeping tests isolated in-memory
as much as possible. That's not currently the case in Java; we have
pre-commits which submit pipelines to Dataflow service.

I don't know if it's feasible to get Java down to 15-20 mins in the short
term, but a good starting point would be to document the requirements for a
test to run as pre-commit, and start enforcing it for new tests.
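[Editor's note] The "only run affected tests" idea discussed downthread can be sketched simply: map each changed path to the suites it touches, and treat shared files as affecting everything. The prefixes below are assumptions loosely modeled on the Beam repo layout, not an actual Jenkins or Gradle configuration:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// affectedSuites maps changed file paths to the precommit suites that
// need to run, by path-prefix matching. Files outside any SDK directory
// (build scripts, shared config) conservatively trigger every suite.
func affectedSuites(changed []string) []string {
	prefixes := map[string]string{
		"sdks/go/":     "Go",
		"sdks/java/":   "Java",
		"sdks/python/": "Python",
	}
	seen := map[string]bool{}
	for _, f := range changed {
		matched := false
		for prefix, suite := range prefixes {
			if strings.HasPrefix(f, prefix) {
				seen[suite] = true
				matched = true
			}
		}
		if !matched {
			// Shared files affect everything.
			for _, suite := range prefixes {
				seen[suite] = true
			}
		}
	}
	out := make([]string, 0, len(seen))
	for s := range seen {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	// A Go-only change should trigger only the Go suite.
	fmt.Println(affectedSuites([]string{"sdks/go/pkg/beam/pardo.go"})) // [Go]
}
```

A real implementation would derive the mapping from the build graph rather than hard-coded prefixes, but even this crude version shows why a Python-only PR need not wait on the Java precommit.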


On Fri, May 18, 2018 at 3:25 PM Henning Rohde  wrote:

> Good proposal. I think it should be considered in tandem with the "No
> commit on red post-commit" proposal and could be far more ambitious than 2
> hours. For example, something in the <15-20 mins range, say, would be much
> less of an inconvenience to the development effort. Go takes ~3 mins, which
> means that it is practical to wait until a PR is green before asking anyone
> to look at it. If I need to wait for a Java or Python pre-commit, I task
> switch and come back later. If the post-commits are enforced to be green,
> we could possibly gain a much more productive flow at the cost of the
> occasional post-commit break, compared to now. Maybe IOs can be less
> extensively tested pre-commit, for example, or only if actually changed?
>
> I also like Robert's suggestion of splitting up pre-commits into something
> more fine-grained to get a clear partial signal quicker. If we have an
> adequate number of Jenkins slots, it might also speed things up overall.
>
> Thanks,
>  Henning
>
> On Fri, May 18, 2018 at 12:30 PM Scott Wegner  wrote:
>
>> re: intelligently skipping tests for code that doesn't change (i.e. Java
>> tests on Python PR): this should be possible. We already have build-caching
>> enabled in Gradle, but I believe it is local to the git workspace and
>> doesn't persist between Jenkins runs.
>>
>> With a quick search, I see there is a Jenkins Build Cacher Plugin [1]
>> that hooks into Gradle build cache and does exactly what we need. Does
>> anybody know whether we could get this enabled on our Jenkins?
>>
>> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>>
>> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw 
>> wrote:
>>
>>> [somehow  my email got garbled...]
>>>
>>> Now that we're using gradle, perhaps we could be more intelligent about
>>> only running the affected tests? E.g. when you touch Python (or Go) you
>>> shouldn't need to run the Java precommit at all, which would reduce the
>>> latency for those PRs and also the time spent in queue. Presumably this
>>> could even be applied per-module for the Java tests. (Maybe a large, shared
>>> build cache could help here as well...)
>>>
>>> I also wouldn't be opposed to a quicker immediate signal, plus more
>>> extensive tests before actually merging. It's also nice to not have to wait
>>> an hour to see that you have a lint error; quick stuff like that could be
>>> signaled quickly before a contributor loses context.
>>>
>>> - Robert
>>>
>>>
>>>
>>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:
>>>
 I like the idea. I think it is a good time for the project to start
 tracking this and keeping it usable.

 Certainly 2 hours is more than enough, is that not so? The Java
 precommit seems to take <=40 minutes while Python takes ~20 and Go is so
 fast it doesn't matter. Do we have enough stragglers that we don't
 make it in the 95th percentile? Is the time spent in the Jenkins queue?

 For our current coverage, I'd be willing to go for:

  - 1 hr hard cap (someone better at stats could choose %ile)
  - roll back or remove test from precommit if fix looks like more than
 1 week (roll back if it is perf degradation, remove test from precommit if
 it is additional coverage that just doesn't fit in the time)

 There's a longer-term issue that doing a full build each time is
 expected to linearly scale up with the size of our repo (it is the monorepo
 problem but for a minirepo) so there is no cap that is feasible until we
 have effective cross-build caching. And my long-term goal would be <30
 minutes. At the latency of opening a pull request and then checking your
 email that's not burdensome, but an hour is.

 Kenn

 On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:

> HI,
> I have a proposal to improve contributor experience by keeping
> precommit times low.
>
> I'm looking to get community consensus and approval about:
> 1. How long should precommits take. 2 hours @95th percentile over the
> past 4 weeks is the current proposal.
> 2. The process for dealing with slowness. Do we: fix, roll back,
> remove a test from precommit?
> Rolling back if a fix is 

Re: [SQL] Cross Join Operation

2018-05-21 Thread Kenneth Knowles
That seems very useful. For anything that is an equi-join where we can encode
the things being compared, we should do an inner join that compiles to
CoGroupByKey. We will probably need some cost estimates to make this work
right. There might be some changes needed in the join library to make this
happen too, or we could implement it directly in the Beam SQL codebase.
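To make the compilation target concrete, here is a hedged plain-Java sketch (not Beam code; the names are invented for illustration) of how an equi-join maps onto CoGroupByKey: both sides are grouped by the join key, and for each key the inner join emits the cross product of the two per-key value lists.

```java
import java.util.*;

public class CoGbkJoinSketch {
    // Conceptual model of an inner equi-join compiled to CoGroupByKey:
    // group each input by key, then emit the per-key cross product.
    static List<String> innerJoin(List<Map.Entry<String, String>> left,
                                  List<Map.Entry<String, String>> right) {
        Map<String, List<String>> leftByKey = new HashMap<>();
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (Map.Entry<String, String> e : left) {
            leftByKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        for (Map.Entry<String, String> e : right) {
            rightByKey.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        List<String> out = new ArrayList<>();
        for (String key : leftByKey.keySet()) {
            // Inner join: the key must appear on both sides.
            if (!rightByKey.containsKey(key)) continue;
            for (String l : leftByKey.get(key)) {
                for (String r : rightByKey.get(key)) {
                    out.add(key + ":" + l + "," + r);
                }
            }
        }
        return out;
    }
}
```

In Beam itself this shape corresponds to a CoGroupByKey over two keyed PCollections; the sketch only shows the grouping-then-cross-product semantics.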

Kenn

On Mon, May 21, 2018 at 9:02 AM Kai Jiang  wrote:

> Andrew, Kenn, thanks for your comments!
>
> This query potentially uses a COMMA (cross) join. Since most of the TPC-DS
> queries use comma joins, I took a look at their logical plans. Calcite parses
> them into an inner join whose join condition is true, and the inner join is
> wrapped in a filter whose condition is parsed from the WHERE clause.
>
> I think one approach is to push the filter (the WHERE clause's condition)
> into the join condition. I also took a look at the Flink SQL code. After
> parsing a query, Flink optimizes its logical plan by getting rid of the
> filter and moving its condition into the join condition. It seems Flink
> uses a Calcite optimizer rule for this.
>
> Should we move forward to see how this applies to Beam SQL?
>
> Best,
> Kai
>
>
> On Wed, May 16, 2018 at 2:46 PM Kenneth Knowles  wrote:
>
>> It sounds like a good way to proceed would be to replace the template
>> predicates with something that induces a real join.
>>
>> Kenn
>>
>> On Tue, May 15, 2018 at 9:33 AM Andrew Pilloud 
>> wrote:
>>
>>> Calcite does not have the concept of a "CROSS JOIN". It shows up in the
>>> plan as a LogicalJoin with condition=[true]. We could try rejecting the
>>> cross join at the planning stage by returning null for them
>>> in BeamJoinRule.convert(), which might result in a different plan. But
>>> looking at your query, you have a cross join unless the WHERE clause on the
>>> inner select references a row from the outer select.
>>>
>>> Andrew
>>>
>>> On Tue, May 15, 2018 at 9:15 AM Kenneth Knowles  wrote:
>>>
 The logical plan should show you where the cross join is needed. Here
 is where it is logged:
 https://github.com/apache/beam/blob/master/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/planner/BeamQueryPlanner.java#L150

 (It should probably be put to DEBUG level)

 If I look at the original template, like
 https://github.com/gregrahn/tpcds-kit/blob/master/query_templates/query9.tpl
 I see conditions "[RC.1]". Are those templates expected to be filled with
 references to the `reason` table, perhaps? How does that change things?

 I still think it would be good to support CROSS JOIN if we can - the
 problem of course is huge data size, but when one side is small it would be
 good for it to work simply.
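To illustrate the small-side case, here is a plain-Java sketch (names invented; in Beam this would likely be expressed with the small side as a side input rather than a shuffle): keep the small side in memory and pair it with every element of the large side.

```java
import java.util.ArrayList;
import java.util.List;

public class BroadcastCrossJoin {
    // Nested-loop cross join: every element of the large side is paired
    // with every element of the (broadcast, in-memory) small side. This
    // is only feasible when the small side fits in memory.
    static List<String> crossJoin(List<String> big, List<String> small) {
        List<String> out = new ArrayList<>();
        for (String b : big) {
            for (String s : small) {
                out.add(b + "|" + s);
            }
        }
        return out;
    }
}
```

The output size is |big| × |small|, which is why a general cross join over two large inputs is so expensive.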

 Kenn

 On Tue, May 15, 2018 at 7:41 AM Kai Jiang  wrote:

> Hi everyone,
>
> To prove the idea of GSoC project, I was working on some simple TPC-DS
> queries running with given generated data on direct runner. query
> example
> 
>
> The example is executed with TPC-DS query 9
> .
> Briefly, Query 9 uses CASE WHEN clauses to select 5 counting numbers
> from store_sales (table 1). In order to show those result numbers, each
> CASE WHEN clause sits inside one SELECT clause. In short, it looks like:
> SELECT
>   CASE WHEN ( SELECT count(*) FROM table 1 WHERE ... )
>     THEN condition 1
>     ELSE condition 2,
>   ...
>   CASE WHEN ...
> FROM table 2
>
> IIUC, this query doesn't need a join operation on table 1 and table 2, since
> the outer select clause doesn't need to interfere with table 1.
> But the program shows it does, and it throws an error message saying
> "java.lang.UnsupportedOperationException: CROSS JOIN is not supported".
> (error message detail
> )
>
> To make the query work, I am wondering where I can start:
> 1. See the logical plan?
> Will the logical plan explain why the query needs a CROSS JOIN?
>
> 2. Cross join support?
> I checked all the queries in the TPC-DS benchmark. Almost every query uses a
> cross join. It is an important feature that needs to be implemented. Unlike
> other joins, it consumes a lot of computing resources. But I think we need
> cross join in the future. Should we also support it in the join library? I
> noticed James has opened BEAM-2194
>  for supporting
> cross join.
>
> Looking forward to comments!
>
> Best,
> Kai
>
>



Re: SQL shaded jars don't work. How to test?

2018-05-21 Thread Lukasz Cwik
Shading requires two pieces of information:
1) Which dependencies should be part of the shaded jar (controlled by
includes/excludes)
2) How to relocate code within those dependencies (controlled by
relocations)

The reason the exclude(".*") exists is that it is typically an error to
produce a shaded package with dependencies which are not relocated. When
libraries do this, it causes a lot of NoClassDefFoundError/NoSuchMethodError
failures for users, since a user can't know which version of a dependency
they are actually getting (the one that was bundled as part of your jar or
the one they depend on as a library). Only applications should ever really
do this; libraries should always relocate all the code they bundle to
prevent such errors.

Note that in the SQL package, you can provide your own shadowClosure to the
applyJavaNature() which means that the default won't apply. For example:
https://github.com/apache/beam/blob/a3ba6a0e8de3ae72b8fc6fc6038eb9dc725f092e/sdks/java/harness/build.gradle#L20
and remove the 'DEFAULT_SHADOW_CLOSURE <<'
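As an illustration only (the coordinates and relocation prefixes below are invented, not Beam's actual build configuration): under an exclude-by-default rule, explicitly including a dependency in the shadow jar also requires including its transitive dependencies, and everything included should be relocated.

```groovy
shadowJar {
  dependencies {
    // Explicitly including calcite is not enough under exclude(".*"):
    // its transitive dependencies must be listed too.
    include(dependency('org.apache.calcite:calcite-core'))
    include(dependency('org.apache.calcite.avatica:avatica-core'))
  }
  // Relocate everything that was included so users never see bundled classes.
  relocate 'org.apache.calcite', 'org.apache.beam.repackaged.sql.org.apache.calcite'
  relocate 'org.apache.calcite.avatica', 'org.apache.beam.repackaged.sql.org.apache.calcite.avatica'
}
```

Forgetting either half — the transitive includes or the relocations — produces exactly the NoClassDefFoundError symptoms described above.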


On Mon, May 21, 2018 at 10:26 AM Andrew Pilloud  wrote:

> The issue SQL is seeing is caused by a default dependency of exclude(".*")
> added in build_rules.gradle. This breaks the normal method of building
> shadow jars as everything must be explicitly included. SQL explicitly added
> calcite to the jar, but not calcite's dependencies. I've been told this is
> the desired behavior as we want to ensure everything included is relocated.
>
> I don't know much about gradle, but this seems fragile. Is it possible to
> have all dependencies automatically relocated so we don't need the
> exclude(".*") rule?
>
> Andrew
>
> On Thu, May 17, 2018 at 7:41 PM Andrew Pilloud 
> wrote:
>
>> Yep, I added the issue as a blocker.
>> https://issues.apache.org/jira/projects/BEAM/issues/BEAM-4357
>>
>> On Thu, May 17, 2018, 6:05 PM Kenneth Knowles  wrote:
>>
>>> This sounds like a release blocker. Can you add it to the list? (Assign
>>> fix version on jira)
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018, 17:30 Lukasz Cwik  wrote:
>>>
 Typically we have a test block which uses a configuration that has the
 shadow/shadowTest configurations on the classpath instead of the
 compile/testCompile configurations. The most common examples are validates
 runner/integration tests for example:

 https://github.com/apache/beam/blob/0c5ebc449554a02cae5e4fd01afb07ecdb0bbaea/runners/direct-java/build.gradle#L84

 On Thu, May 17, 2018 at 3:59 PM Andrew Pilloud 
 wrote:

> I decided to try our new JDBC support with sqlline and discovered that
> our SQL shaded jar is completely broken. As
> in java.lang.NoClassDefFoundError all over the place. How are we testing
> the output jars from other beam packages? Is there an example I can follow
> to make our integration tests run against the release artifacts?
>
> Andrew
>



Re: Java PreCommit seems broken

2018-05-21 Thread Lukasz Cwik
The archetype projects are coupled to their parents but don't have any
meaningful dependencies so a significantly simpler archetype could be used.
The dependency that exists right now in the archetype is to provide
specific build ordering which can easily be moved to Gradle directly.

An alternative would be to migrate the Maven archetype build to use Gradle.
Assembling the Maven archetype jar is easy as no compilation is required;
the issue is running/validating that the archetype can be built.

On Fri, May 18, 2018 at 12:20 PM Scott Wegner  wrote:

> +1 to Lukasz's proposed solution. Depending on artifacts published
> from a previous build is fragile and will add flakiness to our test runs.
> We should make pre-commits as hermetic as possible.
>
> Depending on the transitive set of publishToMavenLocal tasks seems
> cumbersome, but also necessary.
>
> On a related note: The archetype projects are shelling out to mvn for the
> build, which uses the existing pom.xml files. This places a build
> dependency on the pom.xml files down to the project root due to parent
> relationships. Has there been any investigation on whether we can decouple
> archetype generation from our Maven pom.xml files?
>
> On Fri, May 18, 2018 at 10:47 AM Lukasz Cwik  wrote:
>
>> We would need the archetype task to depend on all the dependencies
>> publishToMavenLocal tasks transitively and then be configured to use
>> whatever that maven local is on Jenkins / dev machine. It would be best if
>> it was an ephemeral folder because it would be annoying to have stuff
>> installed underneath a dev's .m2/ directory that would need cleaning up.
>>
>> On Fri, May 18, 2018 at 10:41 AM Kenneth Knowles  wrote:
>>
>>> Is this just a build tweak, or are there costly steps that we'd have to
>>> add that would slow down presubmit? (with mvn I know that `test` and
>>> `install` did very different amounts of work - because mvn test didn't test
>>> the right artifacts, but maybe with Gradle not so much?)
>>>
>>> On Fri, May 18, 2018 at 9:14 AM Lukasz Cwik  wrote:
>>>
 The problem with the way that the archetypes tests are run (now with
 Gradle and in the past with Maven) is that they run against the nightly
 snapshot and not against artifacts from the current build. To get them to
 work, we would need to publish the dependent Maven modules to a temporary
 repo and instruct the archetype project to use it for building/testing
 purposes.

 On Fri, May 18, 2018 at 5:38 AM Kenneth Knowles  wrote:

> Maybe something has changed, but the snapshots used to pull from the
> public snapshot repo. We got failures for a while every time we cut a
> release branch, but once there was a nightly snapshot they cleared up.
>
> Kenn
>
> On Thu, May 17, 2018 at 9:50 PM Scott Wegner 
> wrote:
>
>> I noticed that these tests simply run "mvn clean install" on the
>> archetype project. But I don't see any dependent task which installs built
>> artifacts into the local Maven repo. Is that an oversight?
>>
>> If that's the case, perhaps the tests are failing sporadically when
>> there are no previously installed snapshot artifacts cached on the 
>> machine.
>>
>> On Thu, May 17, 2018, 2:45 PM Pablo Estrada 
>> wrote:
>>
>>> I'm seeing failures on Maven Archetype-related tests.
>>>
>>> Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk
>>>
>>> And the failure is here specifically:
>>> https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116
>>>
>>>
>>> Does anyone know why this might be happening?
>>> Best
>>> -P.
>>> --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>


Re: What is the future of Reshuffle?

2018-05-21 Thread Raghu Angadi
Filed https://issues.apache.org/jira/browse/BEAM-4372 (unassigned).

On Mon, May 21, 2018 at 10:22 AM Raghu Angadi  wrote:

>
>
> On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw 
> wrote:
>
>> We should probably keep the warning and all the caveats until we
>> introduce the alternative (and migrate to it for the non-parallelism uses
>> of reshuffle). I was just proposing we do this via a separate transform
>> that just calls Reshuffle until we have the new story fully fleshed out (I
>> don't know if any runner supports RequiresStableInput, and it isn't
>> translated in the Fn API) to avoid being in this intermediate state for
>> yet another year.
>>
>
> I think all warnings and caveats really belong in the JavaDoc for
> GroupByKey, since that is the actual transform whose semantics we are
> concerned about users misusing. A reshuffle transform can refer users there.
>
> In addition, I think Reshuffle is a really good name for the reshuffle
> transform meant for most users.
>
> That said, any reorganization is much better than deprecation.
>
> Raghu.
>
>
>>
>> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>>
>>>
>>>
>>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>>> wrote:
>>>
 On Sat, May 19, 2018 at 6:27 PM Raghu Angadi 
 wrote:

> [...]
>
> I think it would be much more user friendly to un-deprecate it and add a
> warning for advanced users about non-portability of durability/replay
> guarantees/stable input assumptions.
>
>>
 Yes, I think everyone in this thread is in agreement here. We should
 provide a *different* transform that provides the durability guarantees
 (with caveats). In the meantime, this delegating to a reshuffle would be
 better than using a reshuffle directly.

>>>
>>> Great. Sent a PR to undeprecate Reshuffle :
>>> https://github.com/apache/beam/pull/5432
>>> The wording there for the JavaDoc is just a proposal...
>>>
>>> Raghu.
>>>
>>>
 We tend to put in reshuffles in order to "commit" these random values
>> and make them stable for the next stage, to be used to provide the 
>> needed
>> idempotency for sinks.
>>
>
> In such cases, I think the author should error out on the runners
> that don't provide that guarantee. That is what ExactlyOnceSink in 
> KafkaIO
> does [1].
>
> [1]
> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>

 We're moving to a world where the runner may not be known at
 pipeline construction time. However, explicitly using a (distinct)
 make-input-stable transform when that's the intent (which could be a
 primitive that runners should implement, possibly by swapping in 
 Reshuffle,
 or reject) would allow for this. That being said, the exact semantics 
 of
 this transform is a bit of a rabbit hole which is why we never 
 finished the
 job of deprecating Reshuffle. This is a case where doing something is
 better than doing nothing, and our use of URNs for this kind of thing 
 is
 flexible enough that we can deprecate old ones if/when we have time to
 pound out the right solution.


>
>> Kenn
>>
>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
>> wrote:
>>
>>>
>>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>>> rober...@google.com> wrote:
>>>
 On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
 rang...@google.com> wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
> k...@google.com> wrote:
>
>> The fact that its usage has grown probably indicates that we
>> have a large number of transforms that can easily cause data 
>> loss /
>> duplication.
>>
>
> Is this specific to Reshuffle or is it true for any
> GroupByKey? I see Reshuffle as just a wrapper around GBK.
>
 The issue is when it's used in such a way that data corruption
 can occur when the underlying GBK output is not stable.

>>>
>>> Could you describe this breakage in a bit more detail or give an
>>> example? Apologies in advance, I know this came up in multiple contexts in
>>> the past, but I haven't grokked the issue well. Is it the window rewrite
>>> that Reshuffle does that causes misuse of GBK?
>>>
>>> Thanks.
>>>
>>


Re: SQL shaded jars don't work. How to test?

2018-05-21 Thread Andrew Pilloud
The issue SQL is seeing is caused by a default dependency of exclude(".*")
added in build_rules.gradle. This breaks the normal method of building
shadow jars as everything must be explicitly included. SQL explicitly added
calcite to the jar, but not calcite's dependencies. I've been told this is
the desired behavior as we want to ensure everything included is relocated.

I don't know much about gradle, but this seems fragile. Is it possible to
have all dependencies automatically relocated so we don't need the
exclude(".*") rule?

Andrew

On Thu, May 17, 2018 at 7:41 PM Andrew Pilloud  wrote:

> Yep, I added the issue as a blocker.
> https://issues.apache.org/jira/projects/BEAM/issues/BEAM-4357
>
> On Thu, May 17, 2018, 6:05 PM Kenneth Knowles  wrote:
>
>> This sounds like a release blocker. Can you add it to the list? (Assign
>> fix version on jira)
>>
>> Kenn
>>
>> On Thu, May 17, 2018, 17:30 Lukasz Cwik  wrote:
>>
>>> Typically we have a test block which uses a configuration that has the
>>> shadow/shadowTest configurations on the classpath instead of the
>>> compile/testCompile configurations. The most common examples are validates
>>> runner/integration tests for example:
>>>
>>> https://github.com/apache/beam/blob/0c5ebc449554a02cae5e4fd01afb07ecdb0bbaea/runners/direct-java/build.gradle#L84
>>>
>>> On Thu, May 17, 2018 at 3:59 PM Andrew Pilloud 
>>> wrote:
>>>
 I decided to try our new JDBC support with sqlline and discovered that
 our SQL shaded jar is completely broken. As
 in java.lang.NoClassDefFoundError all over the place. How are we testing
 the output jars from other beam packages? Is there an example I can follow
 to make our integration tests run against the release artifacts?

 Andrew

>>>


Re: What is the future of Reshuffle?

2018-05-21 Thread Raghu Angadi
On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw  wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequiresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>

I think all warnings and caveats really belong in the JavaDoc for GroupByKey,
since that is the actual transform whose semantics we are concerned about
users misusing. A reshuffle transform can refer users there.

In addition, I think Reshuffle is a really good name for the reshuffle
transform meant for most users.

That said, any reorganization is much better than deprecation.

Raghu.


>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>>
 [...]

>>> I think it would be much more user friendly to un-deprecate it and add a
>>> warning for advanced users about non-portability of durability/replay
>>> guarantees/stable input assumptions.

>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for the JavaDoc is just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
> and make them stable for the next stage, to be used to provide the 
> needed
> idempotency for sinks.
>

 In such cases, I think the author should error out on the runners
 that don't provide that guarantee. That is what ExactlyOnceSink in 
 KafkaIO
 does [1].

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049

>>>
>>> We're moving to a world where the runner may not be known at
>>> pipeline construction time. However, explicitly using a (distinct)
>>> make-input-stable transform when that's the intent (which could be a
>>> primitive that runners should implement, possibly by swapping in 
>>> Reshuffle,
>>> or reject) would allow for this. That being said, the exact semantics of
>>> this transform is a bit of a rabbit hole which is why we never finished 
>>> the
>>> job of deprecating Reshuffle. This is a case where doing something is
>>> better than doing nothing, and our use of URNs for this kind of thing is
>>> flexible enough that we can deprecate old ones if/when we have time to
>>> pound out the right solution.
>>>
>>>

> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>> rang...@google.com> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
 k...@google.com> wrote:

> The fact that its usage has grown probably indicates that we
> have a large number of transforms that can easily cause data loss 
> /
> duplication.
>

 Is this specific to Reshuffle or is it true for any GroupByKey?
 I see Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption
>>> can occur when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage in a bit more detail or give an
>> example? Apologies in advance, I know this came up in multiple contexts in
>> the past, but I haven't grokked the issue well. Is it the window rewrite
>> that Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>


Re: What is the future of Reshuffle?

2018-05-21 Thread Ben Chambers
+1 to introducing alternative transforms even if they wrap Reshuffle

The benefit of making them distinct is that we can put appropriate Javadoc
in place, and runners can figure out what the user is intending and whether
Reshuffle or some other implementation is appropriate. We can also see
which of these use cases is most common by inspecting transform usage.
And, when better options are available, we can just introduce the
appropriate transform to get the appropriate special behavior.

I think there were three use cases that I can remember:

1. Splitting up retry domains (e.g., if one transform is expensive,
preventing failures in nearby transforms from causing it to be retried)
2. Redistributing elements to improve parallelism
3. Checkpointing/snapshotting a PCollection before side-effecting the
outside world (e.g., RequiresStableInput)
-- Ben

On Mon, May 21, 2018 at 9:56 AM Robert Bradshaw  wrote:

> We should probably keep the warning and all the caveats until we introduce
> the alternative (and migrate to it for the non-parallelism uses of
> reshuffle). I was just proposing we do this via a separate transform that
> just calls Reshuffle until we have the new story fully fleshed out (I don't
> know if any runner supports RequiresStableInput, and it isn't translated
> in the Fn API) to avoid being in this intermediate state for yet another
> year.
>
> On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:
>
>>
>>
>> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
>> wrote:
>>
>>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>>
 [...]

>>> I think it would be much more user friendly to un-deprecate it and add a
>>> warning for advanced users about non-portability of durability/replay
>>> guarantees/stable input assumptions.

>
>>> Yes, I think everyone in this thread is in agreement here. We should
>>> provide a *different* transform that provides the durability guarantees
>>> (with caveats). In the meantime, this delegating to a reshuffle would be
>>> better than using a reshuffle directly.
>>>
>>
>> Great. Sent a PR to undeprecate Reshuffle :
>> https://github.com/apache/beam/pull/5432
>> The wording there for the JavaDoc is just a proposal...
>>
>> Raghu.
>>
>>
>>> We tend to put in reshuffles in order to "commit" these random values
> and make them stable for the next stage, to be used to provide the 
> needed
> idempotency for sinks.
>

 In such cases, I think the author should error out on the runners
 that don't provide that guarantee. That is what ExactlyOnceSink in 
 KafkaIO
 does [1].

 [1]
 https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049

>>>
>>> We're moving to a world where the runner may not be known at
>>> pipeline construction time. However, explicitly using a (distinct)
>>> make-input-stable transform when that's the intent (which could be a
>>> primitive that runners should implement, possibly by swapping in 
>>> Reshuffle,
>>> or reject) would allow for this. That being said, the exact semantics of
>>> this transform is a bit of a rabbit hole which is why we never finished 
>>> the
>>> job of deprecating Reshuffle. This is a case where doing something is
>>> better than doing nothing, and our use of URNs for this kind of thing is
>>> flexible enough that we can deprecate old ones if/when we have time to
>>> pound out the right solution.
>>>
>>>

> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
> wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
>> rober...@google.com> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi <
>>> rang...@google.com> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles <
 k...@google.com> wrote:

> The fact that its usage has grown probably indicates that we
> have a large number of transforms that can easily cause data loss 
> /
> duplication.
>

 Is this specific to Reshuffle or is it true for any GroupByKey?
 I see Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption
>>> can occur when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage in a bit more detail or give an
>> example? Apologies in advance, I know this came up in multiple contexts in
>> the past, but I haven't grokked the issue well. Is it the window 
>> 

Re: What is the future of Reshuffle?

2018-05-21 Thread Robert Bradshaw
We should probably keep the warning and all the caveats until we introduce
the alternative (and migrate to it for the non-parallelism uses of
reshuffle). I was just proposing we do this via a separate transform that
just calls Reshuffle until we have the new story fully fleshed out (I don't
know if any runner supports RequiresStableInput, and it isn't translated in
the Fn API) to avoid being in this intermediate state for yet another year.
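For readers following along, a conceptual sketch of what Reshuffle does (plain Java for illustration, not the Beam implementation): pair each element with an arbitrary key, group by that key — which forces most runners to materialize and redistribute the data — then drop the keys. The "stable input" behavior some pipelines rely on is a side effect of that materialization, not a guarantee of the model, which is exactly why this thread wants a distinct, explicitly specified transform.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class ReshuffleSketch {
    // Pair each element with an arbitrary key, group by key, drop the keys.
    // Only the grouping (materialization/redistribution) matters; the keys
    // themselves carry no meaning.
    static <T> List<T> reshuffle(List<T> input, int numKeys, long seed) {
        Random random = new Random(seed);
        Map<Integer, List<T>> grouped = new HashMap<>();
        for (T element : input) {
            int key = random.nextInt(numKeys);
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        List<T> output = new ArrayList<>();
        for (List<T> bundle : grouped.values()) {
            output.addAll(bundle);
        }
        return output;
    }
}
```

The output is a permutation of the input: same elements, redistributed across (simulated) bundles.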

On Sun, May 20, 2018 at 6:38 PM Raghu Angadi  wrote:

>
>
> On Sat, May 19, 2018 at 10:55 PM Robert Bradshaw 
> wrote:
>
>> On Sat, May 19, 2018 at 6:27 PM Raghu Angadi  wrote:
>>
>>> [...]
>>>
>>> I think it would be much more user friendly to un-deprecate it and add a
>>> warning for advanced users about non-portability of durability/replay
>>> guarantees/stable input assumptions.
>>>

>> Yes, I think everyone in this thread is in agreement here. We should
>> provide a *different* transform that provides the durability guarantees
>> (with caveats). In the meantime, this delegating to a reshuffle would be
>> better than using a reshuffle directly.
>>
>
> Great. Sent a PR to undeprecate Reshuffle :
> https://github.com/apache/beam/pull/5432
> The wording there for the JavaDoc is just a proposal...
>
> Raghu.
>
>
>> We tend to put in reshuffles in order to "commit" these random values and
 make them stable for the next stage, to be used to provide the needed
 idempotency for sinks.

>>>
>>> In such cases, I think the author should error out on the runners
>>> that don't provide that guarantee. That is what ExactlyOnceSink in 
>>> KafkaIO
>>> does [1].
>>>
>>> [1]
>>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>>
>>
>> We're moving to a world where the runner may not be known at pipeline
>> construction time. However, explicitly using a (distinct) 
>> make-input-stable
>> transform when that's the intent (which could be a primitive that runners
>> should implement, possibly by swapping in Reshuffle, or reject) would 
>> allow
>> for this. That being said, the exact semantics of this transform is a bit
>> of a rabbit hole which is why we never finished the job of deprecating
>> Reshuffle. This is a case where doing something is better than doing
>> nothing, and our use of URNs for this kind of thing is flexible enough 
>> that
>> we can deprecate old ones if/when we have time to pound out the right
>> solution.
>>
>>
>>>
 Kenn

 On Fri, May 18, 2018 at 4:05 PM Raghu Angadi 
 wrote:

>
> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw <
> rober...@google.com> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>> wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>>> wrote:
>>>
 The fact that its usage has grown probably indicates that we
 have a large number of transforms that can easily cause data loss /
 duplication.

>>>
>>> Is this specific to Reshuffle or is it true for any GroupByKey?
>>> I see Reshuffle as just a wrapper around GBK.
>>>
>> The issue is when it's used in such a way that data corruption
>> can occur when the underlying GBK output is not stable.
>>
>
> Could you describe this breakage in a bit more detail or give an
> example? Apologies in advance, I know this came up in multiple contexts in
> the past, but I haven't grokked the issue well. Is it the window rewrite
> that Reshuffle does that causes misuse of GBK?
>
> Thanks.
>



Re: The full list of proposals / prototype documents

2018-05-21 Thread Joseph PENG
Alexey,

I do not know where you can find all design docs, but I know a blog that
has collected some of the major design docs. Hope it helps.

https://wtanaka.com/beam/design-doc

https://drive.google.com/drive/folders/0B-IhJZh9Ab52OFBVZHpsNjc4eXc

On Mon, May 21, 2018 at 9:28 AM Alexey Romanenko 
wrote:

> Hi all,
>
> Is it possible to obtain somewhere a list of all proposals / prototype
> documents that have been published as technical / design documents for
> new features? I have links to only some of them (found in mailing list
> discussions by chance) but I’m not aware of others.
>
> If yes, could someone share it or point me to where it is located, in case
> I missed it?
>
> If not, don’t you think it would make sense to have such an index of these
> documents? I believe it could be useful for Beam contributors, since these
> proposals contain information which is absent or not so detailed on the
> Beam website documentation.
>
> WBR,
> Alexey


The full list of proposals / prototype documents

2018-05-21 Thread Alexey Romanenko
Hi all,

Is it possible to obtain somewhere a list of all proposals / prototype 
documents that have been published as technical / design documents for new 
features? I have links to only some of them (found in mailing list discussions 
by chance) but I’m not aware of others.

If yes, could someone share it or point me to where it is located, in case I 
missed it?

If not, don’t you think it would make sense to have such an index of these 
documents? I believe it could be useful for Beam contributors, since these 
proposals contain information which is absent or not so detailed on the Beam 
website documentation.

WBR,
Alexey

Re: Phrase triggering - is there a better way?

2018-05-21 Thread Kenneth Knowles
I don't recall any discussions about this issue - it has always been kind
of messy, like you say. I have definitely broken jobs, broken the seed job,
etc :-). That is why I added the backup seed job that does not refer to
common_job_properties, so that editing that file can never break the whole
system. It is a real hack.

The best approach IMO is to have the job definitions checked in to source
control in a way that what runs on a PR is just what is in that PR. It is
the model Travis uses and Jenkins supports it via this:
https://jenkins.io/doc/book/pipeline/jenkinsfile/. We could also support it
ourselves by making every job a one-liner that invokes a Gradle task or
shell script. It loses some Jenkins integrations, but now that we have the
Build Scan UI there's not that much we need from the Jenkins UI except job
status and timing.
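To make the Jenkinsfile idea concrete: the pipeline definition lives in the repository, so whatever runs on a PR is exactly what that PR contains. A minimal sketch of such a file (the stage name and Gradle task below are made up for illustration, not Beam's actual setup):

```groovy
// Hypothetical minimal declarative Jenkinsfile, checked in next to the code.
// Jenkins stays a thin shell; all real logic lives in the build itself.
pipeline {
    agent any
    stages {
        stage('PreCommit') {
            steps {
                // Delegate everything to Gradle so the Jenkins side never
                // needs editing when the job's behavior changes.
                sh './gradlew :examplePreCommit'
            }
        }
    }
}
```

With this model there is no seed job to race on: changing a job is just changing a file in the PR.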

But it sounds like you are talking about a shorter term improvement. I
think a seed job that only adds jobs and does not overwrite them will be
quite limited, even as you repeatedly work on a job in a PR.

Kenn

On Mon, May 21, 2018 at 4:56 AM Łukasz Gajowy 
wrote:

> Hello!
>
> Beam makes it possible to "phrase trigger" Jenkins jobs from GitHub. If a
> job has a special phrase defined in it, a contributor can type the phrase
> in a PR comment and the job will run on Jenkins. A very popular use is to
> run PreCommits this way (e.g. by typing "Run JavaPreCommit" in a PR
> comment), and almost all jobs seem to have some phrase defined.
>
> For jobs that exist on the master branch (in .test_infra/jenkins), it is
> enough to just type the phrase. But if we want to run a job that is freshly
> defined in our pull request, we need to run the seed job first to overwrite
> the existing Jenkins configuration (by typing "Run seed job" in the comment).
>
> This feature, however useful and easy to use, has some disadvantages in my
> view:
>
>    - we overwrite the whole Jenkins configuration. This means that when
>    two developers run the seed job at the same time, they interrupt each
>    other's work without even noticing,
>    - there is no way to "clean up" after such phrase triggering. Job
>    definitions from the PR will remain on Jenkins until the seed job is run
>    again from master (this happens at least every 6 hours). This, in turn,
>    means that if we define a job that is also time-triggered, it will start
>    again, but using the source from the master branch. It will fail if the
>    code invoked by the job is not on master yet. This is, in fact, a very
>    common scenario when adding new performance tests.
>
> Do you think we could improve this a bit? I realize that it may be hard
> not to overwrite all Jenkins definitions (or maybe I'm wrong?), but maybe
> we can add some small improvements, like:
>
>    - define a special seed(ish) job that only adds new jobs and does not
>    overwrite the others. It would reduce the risk of interrupting each
>    other's work,
>    - define a special phrase trigger that runs the seed job on master
>    (not on the PR branch) on demand. It would make the "clean up" possible.
>
>
> WDYT? Do you have other ideas/suggestions? Did you face the
> above-mentioned issues before?
>
> Best regards,
> Łukasz
>


Phrase triggering - is there a better way?

2018-05-21 Thread Łukasz Gajowy
Hello!

Beam makes it possible to "phrase trigger" Jenkins jobs from GitHub. If a
job has a special phrase defined in it, a contributor can type the phrase
in a PR comment and the job will run on Jenkins. A very popular use is to
run PreCommits this way (e.g. by typing "Run JavaPreCommit" in a PR
comment), and almost all jobs seem to have some phrase defined.

For jobs that exist on the master branch (in .test_infra/jenkins), it is
enough to just type the phrase. But if we want to run a job that is freshly
defined in our pull request, we need to run the seed job first to overwrite
the existing Jenkins configuration (by typing "Run seed job" in the comment).
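For readers unfamiliar with how a phrase gets attached to a job: in Jenkins Job DSL this is typically wired through the GitHub pull request builder trigger. A rough sketch (the job name and phrase below are invented, and this greatly simplifies the common_job_properties helpers Beam actually uses):

```groovy
// Illustrative Job DSL fragment for a phrase-triggered job.
// Names and the phrase are hypothetical, not Beam's real definitions.
job('beam_Example_PreCommit') {
  triggers {
    githubPullRequest {
      triggerPhrase('Run Example PreCommit')
      onlyTriggerPhrase()   // build only when the phrase is commented, not on every push
      useGitHubHooks()      // react to GitHub webhooks instead of polling
    }
  }
}
```

The seed job's role is simply to (re)apply definitions like this one to the Jenkins master, which is why running it from a PR replaces whatever was there before.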

This feature, however useful and easy to use, has some disadvantages in my
view:

   - we overwrite the whole Jenkins configuration. This means that when
   two developers run the seed job at the same time, they interrupt each
   other's work without even noticing,
   - there is no way to "clean up" after such phrase triggering. Job
   definitions from the PR will remain on Jenkins until the seed job is run
   again from master (this happens at least every 6 hours). This, in turn,
   means that if we define a job that is also time-triggered, it will start
   again, but using the source from the master branch. It will fail if the
   code invoked by the job is not on master yet. This is, in fact, a very
   common scenario when adding new performance tests.

Do you think we could improve this a bit? I realize that it may be hard not
to overwrite all Jenkins definitions (or maybe I'm wrong?), but maybe we can
add some small improvements, like:

   - define a special seed(ish) job that only adds new jobs and does not
   overwrite the others. It would reduce the risk of interrupting each
   other's work,
   - define a special phrase trigger that runs the seed job on master
   (not on the PR branch) on demand. It would make the "clean up" possible.
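The first idea, an additive seed job, can be modeled in a few lines. This is only an illustrative sketch of the proposed semantics, with dictionaries standing in for Jenkins' job registry; none of it is real Jenkins API:

```python
def additive_seed(existing_jobs, pr_jobs):
    """Model of an 'additive' seed run: register jobs that are new in a PR,
    but never overwrite definitions already present on the Jenkins master.

    existing_jobs: dict mapping job name -> definition currently on Jenkins.
    pr_jobs: dict mapping job name -> definition found in the PR branch.
    """
    merged = dict(existing_jobs)          # start from what Jenkins already has
    for name, definition in pr_jobs.items():
        if name not in merged:            # add only genuinely new jobs
            merged[name] = definition     # existing jobs are left untouched
    return merged

# Two developers' PRs can now seed concurrently without clobbering each other:
jenkins = {"beam_PreCommit_Java": "master-def"}
jenkins = additive_seed(jenkins, {"beam_PerfTest_New": "pr-123-def"})
jenkins = additive_seed(jenkins, {"beam_PreCommit_Java": "pr-456-def"})
assert jenkins["beam_PreCommit_Java"] == "master-def"  # not overwritten
assert jenkins["beam_PerfTest_New"] == "pr-123-def"    # new job was added
```

The second idea is complementary: the "clean up" phrase would simply re-run today's overwriting seed job from master, restoring this registry to the master-branch definitions.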


WDYT? Do you have other ideas/suggestions? Did you face the above-mentioned
issues before?

Best regards,
Łukasz