Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw wrote: > Ah, thanks, that makes sense. That implies to me Reshuffle is no more >> broken than GBK itself. Maybe Reshuffle.viaRandomKey() could have a clear >> caveat. Reshuffle's JavaDoc could add a caveat too about

Re: Current progress on Portable runners

2018-05-18 Thread Thomas Weise
- Flink JobService: in review That's TODO (above PR was merged, but it doesn't contain the Flink job service). Discussion about it is here: https://docs.google.com/document/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit?ts=5afa1238 Thanks, Thomas

Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles wrote: > It isn't any particular logic in Reshuffle - it is, semantically, an > identity transform. It is the fact that other runners are perfectly able to > re-run transform prior to a GBK. So, for example, randomly generated IDs

Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
It isn't any particular logic in Reshuffle - it is, semantically, an identity transform. It is the fact that other runners are perfectly able to re-run transform prior to a GBK. So, for example, randomly generated IDs will be re-generated. We tend to put in reshuffles in order to "commit" these
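
The re-execution hazard described in this message can be illustrated without any Beam dependency. What follows is a minimal plain-Python sketch, not Beam code: `assign_random_id`, `run_with_retry`, and `run_with_checkpoint` are hypothetical names, and the "retry" only models a runner re-running the step that sits upstream of a grouping operation.

```python
import uuid


def assign_random_id(element):
    """Non-deterministic step: attaches a fresh random ID on every call."""
    return (str(uuid.uuid4()), element)


def run_with_retry(elements):
    """Model a runner that re-runs the upstream step after a failure.

    The step is executed twice, so every element receives a different
    random ID on the second attempt.
    """
    first_attempt = [assign_random_id(e) for e in elements]
    second_attempt = [assign_random_id(e) for e in elements]  # re-execution
    return first_attempt, second_attempt


def run_with_checkpoint(elements):
    """Model "committing" the output once before downstream processing.

    A downstream retry re-reads the materialized list instead of
    re-running the random step, so it observes the same IDs.
    """
    committed = [assign_random_id(e) for e in elements]  # materialized once
    first_read = list(committed)
    second_read = list(committed)  # retry re-reads, does not regenerate
    return first_read, second_read


if __name__ == "__main__":
    a, b = run_with_retry(["x", "y"])
    print("IDs stable across retry:", [i for i, _ in a] == [i for i, _ in b])
    c, d = run_with_checkpoint(["x", "y"])
    print("IDs stable after commit:", c == d)
```

The sketch only shows why a materialization point matters for non-deterministic steps; whether a given runner actually provides that guarantee at a Reshuffle or GBK is the question the thread is debating.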

Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 12:22 PM Robert Bradshaw wrote: > [resending] > Agreed that keeping this deprecated without a clear replacement for so long > is not ideal. > > I would at least break this into two separate transforms, the > parallelism-breaking one (which seems OK)

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Henning Rohde
Good proposal. I think it should be considered in tandem with the "No commit on red post-commit" proposal and could be far more ambitious than 2 hours. For example, something in the <15-20 mins range, say, would be much less of an inconvenience to the development effort. Go takes ~3 mins, which

Re: Launching a Portable Pipeline

2018-05-18 Thread Ankur Goenka
Thanks for all the input. I have summarized the discussions at the bottom of the document ( here ). Please feel free to provide comments. Once we agree, I will publish the conclusion on

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Scott Wegner
re: intelligently skipping tests for code that doesn't change (i.e. Java tests on Python PR): this should be possible. We already have build-caching enabled in Gradle, but I believe it is local to the git workspace and doesn't persist between Jenkins runs. With a quick search, I see there is a

Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
[resending] Agreed that keeping this deprecated without a clear replacement for so long is not ideal. I would at least break this into two separate transforms, the parallelism-breaking one (which seems OK) and the stable input one (which may just call the parallelism-breaking one, but should be

Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi wrote: > Thanks Kenn. > > On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > >> The fact that its usage has grown probably indicates that we have a large >> number of transforms that can easily cause data

Re: Java PreCommit seems broken

2018-05-18 Thread Scott Wegner
+1 to Lukasz's proposed solution. Depending on artifacts published from a previous build is fragile and will add flakiness to our test runs. We should make pre-commits as hermetic as possible. Depending on the transitive set of publishToMavenLocal tasks seems cumbersome, but also necessary.

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Robert Bradshaw
Now that we're using Gradle, perhaps we could be more intelligent about only running the affected tests? E.g. when you touch Python (or Go) you shouldn't need to run the Java precommit at all, which would reduce the latency for those PRs and also the time spent in queue. Presumably this could even
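
The idea of running only the affected precommits can be sketched as a mapping from changed paths to test suites. This is an illustrative plain-Python sketch under stated assumptions: the path prefixes, suite names, and the `suites_for_change` helper are hypothetical and do not reflect Beam's actual Gradle or Jenkins configuration.

```python
# Hypothetical mapping from repository path prefixes to precommit suites.
SUITES_BY_PREFIX = {
    "sdks/python/": {"python"},
    "sdks/go/": {"go"},
    "sdks/java/": {"java"},
    "runners/": {"java"},
}

# Fallback when a change touches a path with no known owner.
ALL_SUITES = {"java", "python", "go"}


def suites_for_change(changed_paths):
    """Return the set of precommit suites to run for a list of changed paths."""
    suites = set()
    for path in changed_paths:
        matched = [
            owners
            for prefix, owners in SUITES_BY_PREFIX.items()
            if path.startswith(prefix)
        ]
        if not matched:
            # Unknown path (for example a shared build file): be
            # conservative and run everything.
            return set(ALL_SUITES)
        for owners in matched:
            suites |= owners
    return suites


if __name__ == "__main__":
    print(sorted(suites_for_change(["sdks/python/setup.py"])))
    print(sorted(suites_for_change(["build.gradle"])))
```

Falling back to every suite for unrecognized paths keeps the sketch safe by default: a missing mapping entry costs time rather than skipping a test that should have run.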

Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi rang...@google.com wrote: Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles k...@google.com wrote: The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication.

Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
Thanks Kenn. On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles wrote: > The fact that its usage has grown probably indicates that we have a large > number of transforms that can easily cause data loss / duplication. > Is this specific to Reshuffle or it is true for any

Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
The fact that its usage has grown probably indicates that we have a large number of transforms that can easily cause data loss / duplication. Yes, it is deprecated because it is primarily used as a Dataflow-specific way to ensure stable input. My understanding is that the SparkRunner also

Re: Java code under main depends on junit?

2018-05-18 Thread Lukasz Cwik
I agree with separating it out as a separate sub-project for the same reason as you specify, just wanted to point out that it was just less bad with Gradle for internal use as we are doing it right now. On Fri, May 18, 2018 at 10:35 AM Kenneth Knowles wrote: > Ah, nice. That

Re: Proposal: keeping post-commit tests green

2018-05-18 Thread Andrew Pilloud
Blocking commits to master on test flaps seems critical here. The test flaps won't get the attention they deserve as long as people are just spamming their PRs with 'Run Java Precommit' until they turn green. I'm guilty of this behavior and I know it masks new flaky tests. I added a comment to

Re: What is the Impulse and why do we need it?

2018-05-18 Thread Kenneth Knowles
I think it makes a lot of sense to move it to the Beam web site. There's already a good landing point: https://beam.apache.org/contribute/runner-guide/ That page is a collection of advice for legacy-style runners on how to use runners-core, etc, and just general stuff about how to write one,

Re: Java PreCommit seems broken

2018-05-18 Thread Lukasz Cwik
We would need the archetype task to depend on all the dependencies publishToMavenLocal tasks transitively and then be configured to use whatever that maven local is on Jenkins / dev machine. It would be best if it was an ephemeral folder because it would be annoying to have stuff installed

Re: Java PreCommit seems broken

2018-05-18 Thread Kenneth Knowles
Is this just a build tweak, or are there costly steps that we'd have to add that would slow down presubmit? (with mvn I know that `test` and `install` did very different amounts of work - because mvn test didn't test the right artifacts, but maybe with Gradle not so much?) On Fri, May 18, 2018 at

Re: Proposal: keeping post-commit tests green

2018-05-18 Thread Kenneth Knowles
Love it. I would pull out from the doc also the key point: make the postcommit status constantly visible to everyone. Kenn On Fri, May 18, 2018 at 10:17 AM Mikhail Gryzykhin wrote: > Hi everyone, > > I'm Mikhail and started working on Google Dataflow several months ago. I'm

Re: Java code under main depends on junit?

2018-05-18 Thread Kenneth Knowles
Ah, nice. That means you can actually declare a dependency on test suites and get their dependencies in order to run them successfully. It does mean I can't just argue "it doesn't work" but have to go back to arguing that it is a design problem :-) A "test jar" is a jar containing a bunch of

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-18 Thread Lukasz Cwik
I believe JB is referring to https://issues.apache.org/jira/browse/BEAM-4060 On Fri, May 18, 2018 at 10:16 AM Scott Wegner wrote: > J.B., can you give any context on what metadata is missing? Is there a > JIRA? > > On Thu, May 17, 2018 at 9:30 PM Jean-Baptiste Onofré

Proposal: keeping post-commit tests green

2018-05-18 Thread Mikhail Gryzykhin
Hi everyone, I'm Mikhail and I started working on Google Dataflow several months ago. I'm really excited to work with the Beam open source community. I have a proposal to improve contributor experience by keeping post-commit tests green. I'm looking to get community consensus and approval about the

Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-18 Thread Scott Wegner
J.B., can you give any context on what metadata is missing? Is there a JIRA? On Thu, May 17, 2018 at 9:30 PM Jean-Baptiste Onofré wrote: > Hi, > > The build was OK yesterday but the maven-metadata is still missing. > > That's the point to fix before being able to move

Re: Java PreCommit seems broken

2018-05-18 Thread Lukasz Cwik
The problem with the way that the archetypes tests are run (now with Gradle and in the past with Maven) is that they run against the nightly snapshot and not against artifacts from the current build. To get them to work, we would need to publish the dependent Maven modules to a temporary repo and

Re: What is the Impulse and why do we need it?

2018-05-18 Thread Lukasz Cwik
The Beam Runner API doc needs a lot of updating to discuss Impulse and SDF (and deprecate / remove Read): https://s.apache.org/beam-runner-api It could also use examples from the Go/Python code bases. Alternatively we could start to codify this information on the Apache Beam website as the

Re: What is the future of Reshuffle?

2018-05-18 Thread Eugene Kirpichov
Agreed that it should be undeprecated, many users are getting confused by this. I know that some people are working on a replacement for at least one of its use cases (RequiresStableInput), but the use case of breaking fusion is, as of yet, unaddressed, and there's not much to be gained by keeping

Re: What is the Impulse and why do we need it?

2018-05-18 Thread Eugene Kirpichov
Hi Ismael, Impulse is a primitive necessary for the Portability world, where sources do not exist. Impulse is the only possible root of the pipeline, it emits a single empty byte array, and it's all DoFn's and SDF's from there. E.g. when using Fn API, Read.from(BoundedSource) is translated into:
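
The message is cut off before the actual translation of Read.from(BoundedSource), so the sketch below does not reproduce it. It only illustrates the general shape described above: a pipeline rooted in a single empty byte string, with ordinary functions producing all the data from there. This is plain Python, not Beam code, and `impulse`, `flat_map`, `list_shards`, and `read_shard` are hypothetical names chosen for illustration.

```python
def impulse():
    """Root of the pipeline: yields exactly one empty byte string."""
    yield b""


def flat_map(fn, elements):
    """Apply fn to each element and flatten the results (a DoFn stand-in)."""
    for element in elements:
        yield from fn(element)


def list_shards(_seed):
    """Hypothetical step: ignore the seed and emit descriptions of work."""
    return [range(0, 3), range(3, 6)]


def read_shard(shard):
    """Hypothetical step: turn one shard description into records."""
    return [f"record-{i}" for i in shard]


def run_pipeline():
    """Chain the steps: one seed element fans out into shards, then records."""
    seed = impulse()
    shards = flat_map(list_shards, seed)
    records = flat_map(read_shard, shards)
    return list(records)


if __name__ == "__main__":
    print(run_pipeline())
```

The seed element carries no information; its only role is to trigger the first function exactly once, which is why no dedicated source primitive is needed in this model.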

Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
I am interested in more clarity on this as well. It has been deprecated for a long time without a replacement, and its usage has only grown, both within the Beam code base as well as in user applications. If we are certain that it will not be removed before there is a good replacement for it, can we

Re: What is the Impulse and why do we need it?

2018-05-18 Thread Jean-Baptiste Onofré
Fully agree. I already started to take a look. Regards JB On 18/05/2018 16:12, Ismaël Mejía wrote: I have seen multiple mentions of 'Impulse' in JIRAs and some on other discussions, but have not seen any document or concrete explanation on what's Impulse and why we need it. This seems like an

What is the Impulse and why do we need it?

2018-05-18 Thread Ismaël Mejía
I have seen multiple mentions of 'Impulse' in JIRAs and some on other discussions, but have not seen any document or concrete explanation on what's Impulse and why we need it. This seems like an internal implementation detail but it is probably a good idea to explain it somewhere (my excuses if

What is the future of Reshuffle?

2018-05-18 Thread Ismaël Mejía
I saw in a recent thread that the use of the Reshuffle transform was recommended to solve a user issue: https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E I can see why it may fix the reported issue. I am just curious about

Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-18 Thread Ismaël Mejía
As part of the error-prone effort Tim has also been cleaning up other static analysis warnings as reported by IntelliJ's Inspect -> Analyze code. I think this is a good moment to grok some of those too e.g. scoping, unused variables, redundancies, etc. So I hope the others taking part in this work try

Re: Current progress on Portable runners

2018-05-18 Thread Thomas Weise
Most of it should probably go to https://beam.apache.org/contribute/portability/ Also for reference, here is the prototype doc: https://s.apache.org/beam-portability-team-doc Thomas On Fri, May 18, 2018 at 5:35 AM, Kenneth Knowles wrote: > This is awesome. Would you be up

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Kenneth Knowles
I like the idea. I think it is a good time for the project to start tracking this and keeping it usable. Certainly 2 hours is more than enough, is that not so? The Java precommit seems to take <=40 minutes while Python takes ~20 and Go is so fast it doesn't matter. Do we have enough stragglers

Re: Java PreCommit seems broken

2018-05-18 Thread Kenneth Knowles
Maybe something has changed, but the snapshots used to pull from the public snapshot repo. We got failures for a while every time we cut a release branch, but once there was a nightly snapshot they cleared up. Kenn On Thu, May 17, 2018 at 9:50 PM Scott Wegner wrote: > I

Re: Current progress on Portable runners

2018-05-18 Thread Kenneth Knowles
This is awesome. Would you be up for adding a brief description at https://beam.apache.org/contribute/#works-in-progress and maybe a pointer to a gdoc with something like the contents of this email? (my reasoning is (a) keep the contribution guide concise but (b) all this detail is helpful yet (c)

Re: Current progress on Portable runners

2018-05-18 Thread Robert Bradshaw
On Thu, May 17, 2018 at 10:25 PM Thomas Weise wrote: > Hi Eugene, > Thanks for putting this together, this is a very nice update and brings > much needed visibility to those hoping to make use of the portability > features or contribute to them. +1, this is a great summary. >