Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 5:34 PM Robert Bradshaw  wrote:

> Ah, thanks, that makes sense. That implies to me Reshuffle is no more
>> broken than GBK itself. Maybe Reshuffle.viaRandomKey() could have a clear
>> caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
>> keys and retries (though it applies to GroupByKey in general).
>>
>
> The "randomness" of Reshuffle.viaRandomKey() is fine, as the randomly
> generated key is never exposed to the users (so it doesn't matter if it
> changes).
>

Agreed.


> Reshuffle is broken if you are using it to get stable input on a runner
> that doesn't always have stable input as an implementation detail of GBKs.
>

True. I am still failing to see what is broken about Reshuffle that is not
also broken with the GroupByKey transform. If someone depends on GroupByKey
to get stable input, isn't that equally incorrect/unportable?

Raghu.

>
> We tend to put in reshuffles in order to "commit" these random values and
>>> make them stable for the next stage, to be used to provide the needed
>>> idempotency for sinks.
>>>
>>
>> In such cases, I think the author should error out on runners that
>> don't provide that guarantee. That is what ExactlyOnceSink in KafkaIO does
>> [1].
>>
>> [1]
>> https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
>>
>
> We're moving to a world where the runner may not be known at pipeline
> construction time. However, explicitly using a (distinct) make-input-stable
> transform when that's the intent (which could be a primitive that runners
> should implement, possibly by swapping in Reshuffle, or reject) would allow
> for this. That being said, the exact semantics of this transform is a bit
> of a rabbit hole which is why we never finished the job of deprecating
> Reshuffle. This is a case where doing something is better than doing
> nothing, and our use of URNs for this kind of thing is flexible enough that
> we can deprecate old ones if/when we have time to pound out the right
> solution.
>
>
>>
>>> Kenn
>>>
>>> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:
>>>

 On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
 wrote:

> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
> wrote:
>
>> Thanks Kenn.
>>
>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
>> wrote:
>>
>>> The fact that its usage has grown probably indicates that we have a
>>> large number of transforms that can easily cause data loss / 
>>> duplication.
>>>
>>
>> Is this specific to Reshuffle, or is it true for any GroupByKey? I see
>> Reshuffle as just a wrapper around GBK.
>>
> The issue is when it's used in such a way that data corruption can
> occur when the underlying GBK output is not stable.
>

 Could you describe this breakage in a bit more detail or give an example?
 Apologies in advance; I know this came up in multiple contexts in the past,
 but I haven't grokked the issue well. Is it the window rewrite that
 Reshuffle does that causes misuse of GBK?

 Thanks.

>>>


Re: Current progress on Portable runners

2018-05-18 Thread Thomas Weise
- Flink JobService: in review 

That's still TODO (the above PR was merged, but it doesn't contain the Flink
job service).

Discussion about it is here:
https://docs.google.com/document/d/1xOaEEJrMmiSHprd-WiYABegfT129qqF-idUBINjxz8s/edit?ts=5afa1238

Thanks,
Thomas



On Fri, May 18, 2018 at 7:01 AM, Thomas Weise  wrote:

> Most of it should probably go to
> https://beam.apache.org/contribute/portability/
>
> Also for reference, here is the prototype doc:
> https://s.apache.org/beam-portability-team-doc
>
> Thomas
>
> On Fri, May 18, 2018 at 5:35 AM, Kenneth Knowles  wrote:
>
>> This is awesome. Would you be up for adding a brief description at
>> https://beam.apache.org/contribute/#works-in-progress and maybe a
>> pointer to a gdoc with something like the contents of this email? (my
>> reasoning is (a) keep the contribution guide concise but (b) all this
>> detail is helpful yet (c) the detail may be ever-changing so making a
>> separate web page is not the best format)
>>
>> Kenn
>>
>> On Thu, May 17, 2018 at 3:13 PM Eugene Kirpichov 
>> wrote:
>>
>>> Hi all,
>>>
>>> A little over a month ago, a large group of Beam community members began
>>> working on a prototype of a portable Flink runner - that is, a runner
>>> that can execute Beam pipelines on Flink via the Portability API
>>> . The prototype was developed in
>>> a separate branch
>>>  and was
>>> successfully demonstrated at Flink Forward, where it ran Python and Go
>>> pipelines in a limited setting.
>>>
>>> Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
>>> Sidhom and myself) have been working on productionizing the prototype to
>>> address its limitations and do things "the right way", preparing to reuse
>>> this work for developing other portable runners (e.g. Spark). This involves
>>> a surprising amount of work, since many important design and implementation
>>> concerns could be ignored for the purposes of a prototype. I wanted to give
>>> an update on where we stand now.
>>>
>>> Our immediate milestone in sight is *Run Java and Python batch
>>> WordCount examples against a distributed remote Flink cluster*. That
>>> involves a few moving parts, roughly in order of appearance:
>>>
>>> *Job submission:*
>>> - The SDK is configured to use a "portable runner", whose responsibility
>>> is to run the pipeline against a given JobService endpoint.
>>> - The portable runner converts the pipeline to a portable Pipeline proto
>>> - The runner finds out which artifacts it needs to stage, and stages
>>> them against an ArtifactStagingService
>>> - A Flink-specific JobService receives the Pipeline proto, performs some
>>> optimizations (e.g. fusion) and translates it to Flink datasets and
>>> functions
>>>
>>> *Job execution:*
>>> - A Flink function executes a fused chain of Beam transforms (an
>>> "executable stage") by converting the input and the stage to bundles and
>>> executing them against an SDK harness
>>> - The function starts the proper SDK harness, auxiliary services (e.g.
>>> artifact retrieval, side input handling) and wires them together
>>> - The function feeds the data to the harness and receives data back.
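[Editor's sketch] The "fusion" optimization mentioned above can be pictured with a toy model: consecutive element-wise transforms chain into one executable stage, and a GroupByKey forces a stage boundary. The transform names and the linear-pipeline restriction below are illustrative assumptions, not Beam's actual fusion algorithm.

```python
# Toy sketch of "fusion": consecutive element-wise transforms are fused
# into a single executable stage, and each GroupByKey forms a boundary.
# This handles only linear pipelines; real fusion works on a DAG.
def fuse(transforms):
    """transforms: list of (name, kind) where kind is 'ParDo' or 'GBK'."""
    stages, current = [], []
    for name, kind in transforms:
        if kind == "GBK":
            if current:
                stages.append(current)
            stages.append([name])  # a GBK is its own stage boundary
            current = []
        else:
            current.append(name)
    if current:
        stages.append(current)
    return stages

pipeline = [("Read", "ParDo"), ("Parse", "ParDo"),
            ("GroupWords", "GBK"), ("Count", "ParDo"), ("Format", "ParDo")]
print(fuse(pipeline))  # [['Read', 'Parse'], ['GroupWords'], ['Count', 'Format']]
```

Each inner list would become one "executable stage" handed to an SDK harness as bundles.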
>>>
>>> *And here is our status of implementation for these parts:* basically,
>>> almost everything is either done or in review.
>>>
>>> *Job submission:*
>>> - General-purpose portable runner in the Python SDK: done
>>> ; Java SDK: also done
>>> 
>>> - Artifact staging from the Python SDK: in review (PR
>>> , PR
>>> ); in Java, it's also done
>>> - Flink JobService: in review 
>>> - Translation from a Pipeline proto to Flink datasets and functions:
>>> done 
>>> - ArtifactStagingService implementation that stages artifacts to a
>>> location on a distributed filesystem: in development (design is clear)
>>>
>>> *Job execution:*
>>> - Flink function for executing via an SDK harness: done
>>> 
>>> - APIs for managing lifecycle of an SDK harness: done
>>> 
>>> - Specific implementation of those APIs using Docker: part done
>>> , part in review
>>> 
>>> - ArtifactRetrievalService that retrieves artifacts from the location
>>> where ArtifactStagingService staged them: in development.
>>>
>>> We expect that the in-review parts will be done, and the in-development
>>> parts completed, within the next 2-3 weeks. We will, of course, update the
>>> community when this important milestone is reached.
>>>
>>> *After that, the 

Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 4:07 PM Kenneth Knowles  wrote:

> It isn't any particular logic in Reshuffle - it is, semantically, an
> identity transform. It is the fact that other runners are perfectly able to
> re-run transforms prior to a GBK. So, for example, randomly generated IDs
> will be re-generated.
>

Ah, thanks, that makes sense. That implies to me Reshuffle is no more
broken than GBK itself. Maybe Reshuffle.viaRandomKey() could have a clear
caveat. Reshuffle's JavaDoc could add a caveat too about non-deterministic
keys and retries (though it applies to GroupByKey in general).

We tend to put in reshuffles in order to "commit" these random values and
> make them stable for the next stage, to be used to provide the needed
> idempotency for sinks.
>

In such cases, I think the author should error out on runners that don't
provide that guarantee. That is what ExactlyOnceSink in KafkaIO does [1].

[1]
https://github.com/apache/beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1049
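[Editor's sketch] The fail-fast pattern described above — a transform that needs a guarantee refusing to run on runners that don't provide it — can be outlined as follows. The runner names and the validation function are illustrative assumptions, not KafkaIO's actual code.

```python
# Sketch of the fail-fast check: a sink that requires stable input
# refuses to construct on runners not known to provide it. The set of
# runner names here is a made-up placeholder, not a real Beam registry.
RUNNERS_WITH_STABLE_GBK_OUTPUT = {"DataflowRunner"}

def validate_runner(runner_name):
    if runner_name not in RUNNERS_WITH_STABLE_GBK_OUTPUT:
        raise ValueError(
            f"{runner_name} does not guarantee stable input; "
            "refusing to run an exactly-once sink on it.")
    return True

print(validate_runner("DataflowRunner"))  # True
```

Failing at construction time surfaces the incompatibility immediately instead of as silent duplicates at runtime.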


> Kenn
>
> On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:
>
>>
>> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
>> wrote:
>>
>>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi 
>>> wrote:
>>>
 Thanks Kenn.

 On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles 
 wrote:

> The fact that its usage has grown probably indicates that we have a
> large number of transforms that can easily cause data loss / duplication.
>

 Is this specific to Reshuffle, or is it true for any GroupByKey? I see
 Reshuffle as just a wrapper around GBK.

>>> The issue is when it's used in such a way that data corruption can occur
>>> when the underlying GBK output is not stable.
>>>
>>
>> Could you describe this breakage in a bit more detail or give an example?
>> Apologies in advance; I know this came up in multiple contexts in the past,
>> but I haven't grokked the issue well. Is it the window rewrite that
>> Reshuffle does that causes misuse of GBK?
>>
>> Thanks.
>>
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
It isn't any particular logic in Reshuffle - it is, semantically, an
identity transform. It is the fact that other runners are perfectly able to
re-run transforms prior to a GBK. So, for example, randomly generated IDs
will be re-generated. We tend to put in reshuffles in order to "commit"
these random values and make them stable for the next stage, to be used to
provide the needed idempotency for sinks.
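[Editor's sketch] The failure mode described above can be illustrated with a toy model (pure Python, no Beam; attempt-dependent IDs stand in for random IDs):

```python
# Toy model of the retry hazard: a step generates non-deterministic IDs
# (modeled here as IDs that depend on the attempt number), and a stage
# retry re-runs it. Because the IDs are not "committed" before the sink,
# an ID-deduplicating sink still ends up with duplicates.
def assign_ids(records, attempt):
    # Stand-in for random ID generation: a retry produces different IDs.
    return {f"id-{attempt}-{n}": r for n, r in enumerate(records)}

records = ["a", "b", "c"]
sink = {}  # deduplicates by ID key
sink.update(assign_ids(records, attempt=1))  # first attempt
sink.update(assign_ids(records, attempt=2))  # the stage is retried

# The IDs changed across attempts, so dedup by ID fails:
print(len(sink))  # 6 entries for 3 records, i.e. every record duplicated
```

A reshuffle (on runners that materialize GBK output) would commit the attempt-1 IDs, so the retry would replay the same IDs and the sink would stay at 3 entries.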

Kenn

On Fri, May 18, 2018 at 4:05 PM Raghu Angadi  wrote:

>
> On Fri, May 18, 2018 at 12:21 PM Robert Bradshaw 
> wrote:
>
>> On Fri, May 18, 2018 at 11:46 AM Raghu Angadi  wrote:
>>
>>> Thanks Kenn.
>>>
>>> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>>>
 The fact that its usage has grown probably indicates that we have a
 large number of transforms that can easily cause data loss / duplication.

>>>
>>> Is this specific to Reshuffle, or is it true for any GroupByKey? I see
>>> Reshuffle as just a wrapper around GBK.
>>>
>> The issue is when it's used in such a way that data corruption can occur
>> when the underlying GBK output is not stable.
>>
>
> Could you describe this breakage in a bit more detail or give an example?
> Apologies in advance; I know this came up in multiple contexts in the past,
> but I haven't grokked the issue well. Is it the window rewrite that
> Reshuffle does that causes misuse of GBK?
>
> Thanks.
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
On Fri, May 18, 2018 at 12:22 PM Robert Bradshaw 
wrote:

> [resending]
>
> Agreed that keeping this deprecated without a clear replacement for so long
> is not ideal.
>
> I would at least break this into two separate transforms, the
> parallelism-breaking one (which seems OK) and the stable input one (which
> may just call the parallelism-breaking one, but should be decorated with
> lots of caveats and maybe even still have the deprecated annotation).
>

+1. The parallelism-breaking one is the most relevant to many users. Would
love to see that part undeprecated, ideally keeping the name Reshuffle.

Raghu.


>
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>
>> The fact that its usage has grown probably indicates that we have a large
>> number of transforms that can easily cause data loss / duplication.
>>
>> Yes, it is deprecated because it is primarily used as a Dataflow-specific
>> way to ensure stable input. My understanding is that the SparkRunner also
>> materializes at every GBK so it works there too (is this still the case?).
>> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
>> because there is a replacement, but because it is kind of dangerous to use. 
>> Beam
>> could just say "GBK must ensure stable output" and "a composite containing
>> a GBK has to ensure stable output even if replaced" and that would solve
>> the issue, but I think it would make Beam on Flink impossibly slow - I
>> could be wrong about that. Generally stable input is tied to the
>> durability model, which is a key design point for engines.
>>
>> True that it isn't the only use, and I know you have been trying to nail
>> down what the uses actually are. Ben wrote up various uses in a portable
>> manner at https://beam.apache.org/documentation/execution-model.
>>
>>  - Coupled failure is the use where Reshuffle is to provide stable input
>>  - Breaking dependent parallelism is more portable - but since it is the
>> identity transform a runner may just elide it; it is a hint, basically, and
>> that seems OK (but can we do it more directly?)
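[Editor's sketch] The "breaking dependent parallelism" use can be pictured with a toy redistribution, pure Python rather than Beam's implementation; the split and worker counts are made-up illustration values.

```python
# Toy sketch of breaking dependent parallelism: without a reshuffle,
# downstream work inherits the source's partitioning (here, 2 splits);
# a reshuffle redistributes elements round-robin across more workers.
def reshuffle(partitions, num_workers):
    out = [[] for _ in range(num_workers)]
    i = 0
    for part in partitions:
        for elem in part:
            out[i % num_workers].append(elem)
            i += 1
    return out

source_splits = [list(range(0, 8)), list(range(8, 16))]  # only 2-way parallel
redistributed = reshuffle(source_splits, num_workers=4)
print([len(p) for p in redistributed])  # [4, 4, 4, 4]
```

Since the operation is semantically an identity, a runner is free to elide it, which is exactly why it is only a hint.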
>>
>> What I don't want is to build something where the implementation details
>> are the spec, and not fundamental, which is sort of where Reshuffle lies.
>> This thread highlights that this is a pretty urgent problem with our SDKs
>> and runners that it would be very helpful to work on.
>>
>> Kenn
>>
>>
>>
>> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
>> wrote:
>>
>>> Agreed that it should be undeprecated, many users are getting confused
>>> by this.
>>> I know that some people are working on a replacement for at least one of
>>> its use cases (RequiresStableInput), but the use case of breaking fusion
>>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>>> deprecated.
>>>
>>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>>
 I am interested in more clarity on this as well. It has been deprecated
 for a long time without a replacement, and its usage has only grown, both
 within Beam code base as well as in user applications.

 If we are certain that it will not be removed before there is a good
 replacement for it, can we undeprecate it until there are proper plans for
 replacement?

 On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:

> I saw in a recent thread that the use of the Reshuffle transform was
> recommended to solve a user issue:
>
>
> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>
> I can see why it may fix the reported issue. I am just curious about
> the fact that the Reshuffle transform is marked as both @Internal and
> @Deprecated in Beam's SDK.
>
> Do we have some alternative? So far the class documentation does not
> recommend any replacement.
>



Re: Proposal: keeping precommit times fast

2018-05-18 Thread Henning Rohde
Good proposal. I think it should be considered in tandem with the "No
commit on red post-commit" proposal and could be far more ambitious than 2
hours. For example, something in the <15-20 mins range, say, would be much
less of an inconvenience to the development effort. Go takes ~3 mins, which
means that it is practical to wait until a PR is green before asking anyone
to look at it. If I need to wait for a Java or Python pre-commit, I task
switch and come back later. If the post-commits are enforced to be green,
we could possibly gain a much more productive flow at the cost of the
occasional post-commit break, compared to now. Maybe IOs can be less
extensively tested pre-commit, for example, or only if actually changed?

I also like Robert's suggestion of splitting up pre-commits into something
more fine-grained to get a clear partial signal quicker. If we have an
adequate number of Jenkins slots, it might also speed things up overall.

Thanks,
 Henning

On Fri, May 18, 2018 at 12:30 PM Scott Wegner  wrote:

> re: intelligently skipping tests for code that doesn't change (i.e. Java
> tests on Python PR): this should be possible. We already have build-caching
> enabled in Gradle, but I believe it is local to the git workspace and
> doesn't persist between Jenkins runs.
>
> With a quick search, I see there is a Jenkins Build Cacher Plugin [1] that
> hooks into Gradle build cache and does exactly what we need. Does anybody
> know whether we could get this enabled on our Jenkins?
>
> [1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
>
> On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw 
> wrote:
>
>> [somehow  my email got garbled...]
>>
>> Now that we're using gradle, perhaps we could be more intelligent about
>> only running the affected tests? E.g. when you touch Python (or Go) you
>> shouldn't need to run the Java precommit at all, which would reduce the
>> latency for those PRs and also the time spent in queue. Presumably this
>> could even be applied per-module for the Java tests. (Maybe a large, shared
>> build cache could help here as well...)
>>
>> I also wouldn't be opposed to a quicker immediate signal, plus more
>> extensive tests before actually merging. It's also nice to not have to wait
>> an hour to see that you have a lint error; quick stuff like that could be
>> signaled quickly before a contributor loses context.
>>
>> - Robert
>>
>>
>>
>> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:
>>
>>> I like the idea. I think it is a good time for the project to start
>>> tracking this and keeping it usable.
>>>
>>> Certainly 2 hours is more than enough, is that not so? The Java
>>> precommit seems to take <=40 minutes while Python takes ~20 and Go is so
>>> fast it doesn't matter. Do we have enough stragglers that we don't make
>>> it in the 95th percentile? Is the time spent in the Jenkins queue?
>>>
>>> For our current coverage, I'd be willing to go for:
>>>
>>>  - 1 hr hard cap (someone better at stats could choose %ile)
>>>  - roll back or remove test from precommit if fix looks like more than 1
>>> week (roll back if it is perf degradation, remove test from precommit if it
>>> is additional coverage that just doesn't fit in the time)
>>>
>>> There's a longer-term issue that doing a full build each time is
>>> expected to linearly scale up with the size of our repo (it is the monorepo
>>> problem but for a minirepo) so there is no cap that is feasible until we
>>> have effective cross-build caching. And my long-term goal would be <30
>>> minutes. At the latency of opening a pull request and then checking your
>>> email that's not burdensome, but an hour is.
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:
>>>
 HI,
 I have a proposal to improve contributor experience by keeping
 precommit times low.

 I'm looking to get community consensus and approval about:
 1. How long should precommits take. 2 hours @95th percentile over the
 past 4 weeks is the current proposal.
 2. The process for dealing with slowness. Do we: fix, roll back, remove
 a test from precommit?
 Rolling back if a fix is estimated to take longer than 2 weeks is the
 current proposal.


 https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing

>>>
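[Editor's sketch] The proposed "2 hours @95th percentile over the past 4 weeks" gate could be computed along these lines (nearest-rank percentile; the durations are made-up minutes, not real Jenkins data):

```python
# Sketch of the proposed gate: flag precommits as too slow when the 95th
# percentile of run durations over the trailing window exceeds the budget.
def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(0, -(-pct * len(ordered) // 100) - 1)  # ceil(pct/100 * n) - 1
    return ordered[rank]

durations_min = [35, 40, 42, 45, 50, 55, 60, 70, 90, 150]  # fictional data
budget_min = 120  # the proposed 2-hour cap
p95 = percentile(durations_min, 95)
print(p95, p95 <= budget_min)  # 150 False -> this window violates the gate
```

A single straggler (the 150-minute run) is enough to trip a 95th-percentile gate over a small sample, which is why the choice of percentile and window matters.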


Re: Launching a Portable Pipeline

2018-05-18 Thread Ankur Goenka
Thanks for all the input.
I have summarized the discussions at the bottom of the document (here).
Please feel free to provide comments.
Once we agree, I will publish the conclusion on the mailing list.

On Mon, May 14, 2018 at 1:51 PM Eugene Kirpichov 
wrote:

> Thanks Ankur, this document clarifies a few points and raises some very
> important questions. I encourage everybody with a stake in Portability to
> take a look and chime in.
>
> +Aljoscha Krettek  +Thomas Weise
>  +Henning Rohde 
>
> On Mon, May 14, 2018 at 12:34 PM Ankur Goenka  wrote:
>
>> Updated link
>> 
>>  to
>> the document as the previous link was not working for some people.
>>
>>
>> On Fri, May 11, 2018 at 7:56 PM Ankur Goenka  wrote:
>>
>>> Hi,
>>>
>>> Recent work on portability has introduced a JobService and an
>>> ArtifactService to the Beam stack alongside the SDKs. This has opened up
>>> a few questions around how we start a pipeline in a portable setup (with
>>> JobService).
>>> I am trying to document our approach to launching a portable pipeline
>>> and take binding decisions based on the discussion.
>>> Please review the document and provide your feedback.
>>>
>>> Thanks,
>>> Ankur
>>>
>>


Re: Proposal: keeping precommit times fast

2018-05-18 Thread Scott Wegner
re: intelligently skipping tests for code that doesn't change (i.e. Java
tests on Python PR): this should be possible. We already have build-caching
enabled in Gradle, but I believe it is local to the git workspace and
doesn't persist between Jenkins runs.

With a quick search, I see there is a Jenkins Build Cacher Plugin [1] that
hooks into Gradle build cache and does exactly what we need. Does anybody
know whether we could get this enabled on our Jenkins?

[1] https://wiki.jenkins.io/display/JENKINS/Job+Cacher+Plugin
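[Editor's sketch] Wiring Gradle's build cache to a shared remote backend, so results persist across Jenkins runs as discussed above, generally looks like the `settings.gradle` fragment below. The cache URL is a placeholder assumption, and enabling the Jenkins-side plugin is a separate step.

```groovy
// settings.gradle - sketch of a shared remote build cache so task outputs
// persist across Jenkins runs. The URL is a placeholder; credentials and
// the Jenkins plugin setup are omitted.
buildCache {
    local {
        enabled = true
    }
    remote(HttpBuildCache) {
        url = 'https://builds.example.org/gradle-cache/'  // placeholder
        // Only CI populates the cache; developer machines just read it.
        push = System.getenv('JENKINS_URL') != null
    }
}
```

With this in place, a PR touching only Python could reuse cached Java compile/test outputs instead of rebuilding them.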

On Fri, May 18, 2018 at 12:08 PM Robert Bradshaw 
wrote:

> [somehow  my email got garbled...]
>
> Now that we're using gradle, perhaps we could be more intelligent about
> only running the affected tests? E.g. when you touch Python (or Go) you
> shouldn't need to run the Java precommit at all, which would reduce the
> latency for those PRs and also the time spent in queue. Presumably this
> could even be applied per-module for the Java tests. (Maybe a large, shared
> build cache could help here as well...)
>
> I also wouldn't be opposed to a quicker immediate signal, plus more
> extensive tests before actually merging. It's also nice to not have to wait
> an hour to see that you have a lint error; quick stuff like that could be
> signaled quickly before a contributor loses context.
>
> - Robert
>
>
>
> On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:
>
>> I like the idea. I think it is a good time for the project to start
>> tracking this and keeping it usable.
>>
>> Certainly 2 hours is more than enough, is that not so? The Java
>> precommit seems to take <=40 minutes while Python takes ~20 and Go is so
>> fast it doesn't matter. Do we have enough stragglers that we don't make
>> it in the 95th percentile? Is the time spent in the Jenkins queue?
>>
>> For our current coverage, I'd be willing to go for:
>>
>>  - 1 hr hard cap (someone better at stats could choose %ile)
>>  - roll back or remove test from precommit if fix looks like more than 1
>> week (roll back if it is perf degradation, remove test from precommit if it
>> is additional coverage that just doesn't fit in the time)
>>
>> There's a longer-term issue that doing a full build each time is expected
>> to linearly scale up with the size of our repo (it is the monorepo problem
>> but for a minirepo) so there is no cap that is feasible until we have
>> effective cross-build caching. And my long-term goal would be <30 minutes.
>> At the latency of opening a pull request and then checking your email
>> that's not burdensome, but an hour is.
>>
>> Kenn
>>
>> On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:
>>
>>> HI,
>>> I have a proposal to improve contributor experience by keeping precommit
>>> times low.
>>>
>>> I'm looking to get community consensus and approval about:
>>> 1. How long should precommits take. 2 hours @95th percentile over the
>>> past 4 weeks is the current proposal.
>>> 2. The process for dealing with slowness. Do we: fix, roll back, remove
>>> a test from precommit?
>>> Rolling back if a fix is estimated to take longer than 2 weeks is the
>>> current proposal.
>>>
>>>
>>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>>>
>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
[resending]

Agreed that keeping this deprecated without a clear replacement for so long
is not ideal.

I would at least break this into two separate transforms, the
parallelism-breaking one (which seems OK) and the stable input one (which
may just call the parallelism-breaking one, but should be decorated with
lots of caveats and maybe even still have the deprecated annotation).

On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:

> The fact that its usage has grown probably indicates that we have a large
> number of transforms that can easily cause data loss / duplication.
>
> Yes, it is deprecated because it is primarily used as a Dataflow-specific
> way to ensure stable input. My understanding is that the SparkRunner also
> materializes at every GBK so it works there too (is this still the case?).
> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
> because there is a replacement, but because it is kind of dangerous to use. 
> Beam
> could just say "GBK must ensure stable output" and "a composite containing
> a GBK has to ensure stable output even if replaced" and that would solve
> the issue, but I think it would make Beam on Flink impossibly slow - I
> could be wrong about that. Generally stable input is tied to durability
> model which is a key design point for engines.
>
> True that it isn't the only use, and I know you have been trying to nail
> down what the uses actually are. Ben wrote up various uses in a portable
> manner at https://beam.apache.org/documentation/execution-model.
>
>  - Coupled failure is the use where Reshuffle is to provide stable input
>  - Breaking dependent parallelism is more portable - but since it is the
> identity transform a runner may just elide it; it is a hint, basically, and
> that seems OK (but can we do it more directly?)
>
> What I don't want is to build something where the implementation details
> are the spec, and not fundamental, which is sort of where Reshuffle lies.
> This thread highlights that this is a pretty urgent problem with our SDKs
> and runners that it would be very helpful to work on.
>
> Kenn
>
>
>
> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
> wrote:
>
>> Agreed that it should be undeprecated, many users are getting confused by
>> this.
>> I know that some people are working on a replacement for at least one of
>> its use cases (RequiresStableInput), but the use case of breaking fusion
>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>> deprecated.
>>
>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>
>>> I am interested in more clarity on this as well. It has been deprecated
>>> for a long time without a replacement, and its usage has only grown, both
>>> within Beam code base as well as in user applications.
>>>
>>> If we are certain that it will not be removed before there is a good
>>> replacement for it, can we undeprecate it until there are proper plans for
>>> replacement?
>>>
>>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>>
 I saw in a recent thread that the use of the Reshuffle transform was
 recommended to solve a user issue:


 https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

 I can see why it may fix the reported issue. I am just curious about
 the fact that the Reshuffle transform is marked as both @Internal and
 @Deprecated in Beam's SDK.

 Do we have some alternative? So far the class documentation does not
 recommend any replacement.

>>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi  wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:
>
>> The fact that its usage has grown probably indicates that we have a large
>> number of transforms that can easily cause data loss / duplication.
>>
>
> Is this specific to Reshuffle, or is it true for any GroupByKey? I see
> Reshuffle as just a wrapper around GBK.
>

The issue is when it's used in such a way that data corruption can occur
when the underlying GBK output is not stable.


Re: Java PreCommit seems broken

2018-05-18 Thread Scott Wegner
+1 to Lukasz's proposed solution. Depending on artifacts published from
a previous build is fragile and will add flakiness to our test runs. We
should make pre-commits as hermetic as possible.

Depending on the transitive set of publishToMavenLocal tasks seems
cumbersome, but also necessary.

On a related note: The archetype projects are shelling out to mvn for the
build, which uses the existing pom.xml files. This places a build
dependency on the pom.xml files down to the project root due to parent
relationships. Has there been any investigation on whether we can decouple
archetype generation from our Maven pom.xml files?

On Fri, May 18, 2018 at 10:47 AM Lukasz Cwik  wrote:

> We would need the archetype task to depend on all the dependencies
> publishToMavenLocal tasks transitively and then be configured to use
> whatever that maven local is on Jenkins / dev machine. It would be best if
> it was an ephemeral folder because it would be annoying to have stuff
> installed underneath a devs .m2/ directory that would need cleaning up.
>
> On Fri, May 18, 2018 at 10:41 AM Kenneth Knowles  wrote:
>
>> Is this just a build tweak, or are there costly steps that we'd have to
>> add that would slow down presubmit? (with mvn I know that `test` and
>> `install` did very different amounts of work - because mvn test didn't test
>> the right artifacts, but maybe with Gradle not so much?)
>>
>> On Fri, May 18, 2018 at 9:14 AM Lukasz Cwik  wrote:
>>
>>> The problem with the way that the archetypes tests are run (now with
>>> Gradle and in the past with Maven) is that they run against the nightly
>>> snapshot and not against artifacts from the current build. To get them to
>>> work, we would need to publish the dependent Maven modules to a temporary
>>> repo and instruct the archetype project to use it for building/testing
>>> purposes.
>>>
>>> On Fri, May 18, 2018 at 5:38 AM Kenneth Knowles  wrote:
>>>
 Maybe something has changed, but the snapshots used to pull from the
 public snapshot repo. We got failures for a while every time we cut a
 release branch, but once there was a nightly snapshot they cleared up.

 Kenn

 On Thu, May 17, 2018 at 9:50 PM Scott Wegner 
 wrote:

> I noticed that these tests simply run "mvn clean install" on the
> archetype project. But I don't see any dependent task which installs built
> artifacts into the local Maven repo. Is that an oversight?
>
> If that's the case, perhaps the tests are failing sporadically when
> there are no previously installed snapshot artifacts cached on the 
> machine.
>
> On Thu, May 17, 2018, 2:45 PM Pablo Estrada 
> wrote:
>
>> I'm seeing failures on Maven Archetype-related tests.
>>
>> Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk
>>
>> And the failure is here specifically:
>> https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116
>>
>>
>> Does anyone know why this might be happening?
>> Best
>> -P.
>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>


Re: Proposal: keeping precommit times fast

2018-05-18 Thread Robert Bradshaw
Now that we're using gradle, perhaps we could be more intelligent about
only running the affected tests? E.g. when you touch Python (or Go) you
shouldn't need to run the Java precommit at all, which would reduce the
latency for those PRs and also the time spent in queue. Presumably this
could even be applied per-module for the Java tests. (Maybe a large, shared
build cache could help here as well...)

I also wouldn't be opposed to a quicker immediate signal, plus more
extensive tests before actually merging. It's also nice to not have to wait
an hour to see that you have a lint error; quick stuff like that could be
signaled quickly before a contributor loses context.

On Fri, May 18, 2018 at 5:55 AM Kenneth Knowles  wrote:

> I like the idea. I think it is a good time for the project to start
> tracking this and keeping it usable.
>
> Certainly 2 hours is more than enough, is that not so? The Java precommit
> seems to take <=40 minutes while Python takes ~20 and Go is so fast it
> doesn't matter. Do we have enough stragglers that we don't make it in the
> 95th percentile? Is the time spent in the Jenkins queue?
>
> For our current coverage, I'd be willing to go for:
>
>  - 1 hr hard cap (someone better at stats could choose %ile)
>  - roll back or remove test from precommit if fix looks like more than 1
> week (roll back if it is perf degradation, remove test from precommit if it
> is additional coverage that just doesn't fit in the time)
>
> There's a longer-term issue that doing a full build each time is expected
> to linearly scale up with the size of our repo (it is the monorepo problem
> but for a minirepo) so there is no cap that is feasible until we have
> effective cross-build caching. And my long-term goal would be <30 minutes.
> At the latency of opening a pull request and then checking your email
> that's not burdensome, but an hour is.
>
> Kenn
>
> On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:
>
>> HI,
>> I have a proposal to improve contributor experience by keeping precommit
>> times low.
>>
>> I'm looking to get community consensus and approval about:
>> 1. How long should precommits take. 2 hours @95th percentile over the
>> past 4 weeks is the current proposal.
>> 2. The process for dealing with slowness. Do we: fix, roll back, remove
>> a test from precommit?
>> Rolling back if a fix is estimated to take longer than 2 weeks is the
>> current proposal.
>>
>> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing


Re: What is the future of Reshuffle?

2018-05-18 Thread Robert Bradshaw
On Fri, May 18, 2018 at 11:46 AM Raghu Angadi rang...@google.com wrote:

> Thanks Kenn.
>
> On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles k...@google.com wrote:
>
>> The fact that its usage has grown probably indicates that we have a
>> large number of transforms that can easily cause data loss /
>> duplication.
>
> Is this specific to Reshuffle or is it true for any GroupByKey? I see
> Reshuffle as just a wrapper around GBK.

The issue is when it's used in such a way that data corruption can occur
when the underlying GBK output is not stable.


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
Thanks Kenn.

On Fri, May 18, 2018 at 11:02 AM Kenneth Knowles  wrote:

> The fact that its usage has grown probably indicates that we have a large
> number of transforms that can easily cause data loss / duplication.
>

Is this specific to Reshuffle or is it true for any GroupByKey? I see
Reshuffle as just a wrapper around GBK.

Raghu.

>
>
> Yes, it is deprecated because it is primarily used as a Dataflow-specific
> way to ensure stable input. My understanding is that the SparkRunner also
> materializes at every GBK so it works there too (is this still the case?).
> It doesn't work at all for other runners AFAIK. So it is @Deprecated not
> because there is a replacement, but because it is kind of dangerous to use. 
> Beam
> could just say "GBK must ensure stable output" and "a composite containing
> a GBK has to ensure stable output even if replaced" and that would solve
> the issue, but I think it would make Beam on Flink impossibly slow - I
> could be wrong about that. Generally stable input is tied to durability
> model which is a key design point for engines.
>
> True that it isn't the only use, and I know you have been trying to nail
> down what the uses actually are. Ben wrote up various uses in a portable
> manner at https://beam.apache.org/documentation/execution-model.
>
>  - Coupled failure is the use where Reshuffle is to provide stable input
>  - Breaking dependent parallelism is more portable - but since it is the
> identity transform a runner may just elide it; it is a hint, basically, and
> that seems OK (but can we do it more directly?)
>
> What I don't want is to build something where the implementation details
> are the spec, and not fundamental, which is sort of where Reshuffle lies.
> This thread highlights that this is a pretty urgent problem with our SDKs
> and runners that it would be very helpful to work on.
>
> Kenn
>
>
>
> On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
> wrote:
>
>> Agreed that it should be undeprecated, many users are getting confused by
>> this.
>> I know that some people are working on a replacement for at least one of
>> its use cases (RequiresStableInput), but the use case of breaking fusion
>> is, as of yet, unaddressed, and there's not much to be gained by keeping it
>> deprecated.
>>
>> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>>
>>> I am interested in more clarity on this as well. It has been deprecated
>>> for a long time without a replacement, and its usage has only grown, both
>>> within Beam code base as well as in user applications.
>>>
>>> If we are certain that it will not be removed before there is a good
>>> replacement for it, can we undeprecate it until there are proper plans for
>>> replacement?
>>>
>>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>>
 I saw in a recent thread that the use of the Reshuffle transform was
 recommended to solve a user issue:


 https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

 I can see why it may fix the reported issue. I am just curious about
 the fact that the Reshuffle transform is marked as both @Internal and
 @Deprecated in Beam's SDK.

 Do we have some alternative? So far the class documentation does not
 recommend any replacement.

>>>


Re: What is the future of Reshuffle?

2018-05-18 Thread Kenneth Knowles
The fact that its usage has grown probably indicates that we have a large
number of transforms that can easily cause data loss / duplication.

Yes, it is deprecated because it is primarily used as a Dataflow-specific
way to ensure stable input. My understanding is that the SparkRunner also
materializes at every GBK so it works there too (is this still the case?).
It doesn't work at all for other runners AFAIK. So it is @Deprecated not
because there is a replacement, but because it is kind of dangerous to
use. Beam
could just say "GBK must ensure stable output" and "a composite containing
a GBK has to ensure stable output even if replaced" and that would solve
the issue, but I think it would make Beam on Flink impossibly slow - I
could be wrong about that. Generally stable input is tied to durability
model which is a key design point for engines.

True that it isn't the only use, and I know you have been trying to nail
down what the uses actually are. Ben wrote up various uses in a portable
manner at https://beam.apache.org/documentation/execution-model.

 - Coupled failure is the use where Reshuffle is to provide stable input
 - Breaking dependent parallelism is more portable - but since it is the
identity transform a runner may just elide it; it is a hint, basically, and
that seems OK (but can we do it more directly?)
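
The "coupled failure" case is the dangerous one. Below is a minimal plain-Java sketch, with no Beam APIs, of why a non-stable shuffle key plus a bundle retry defeats a key-deduplicating sink. Everything here is invented for illustration: the "sink", the key scheme, and using the attempt number as a stand-in for fresh random keys drawn on each re-execution.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StableInputSketch {
    // Invented "exactly-once sink": deduplicates on the shuffle key it was
    // handed, the way a sink might rely on a committed GBK output.
    static Set<String> written = new HashSet<>();

    static void writeOnce(int key, String element) {
        if (written.add(key + ":" + element)) {
            System.out.println("wrote " + element + " under key " + key);
        }
    }

    // One (re)execution of the upstream bundle. A non-stable upstream draws
    // fresh keys on every attempt; we model that with the attempt number.
    static void processBundle(List<String> bundle, int attempt) {
        int i = 0;
        for (String e : bundle) {
            writeOnce(attempt * 1000 + i++, e);
        }
    }

    public static void main(String[] args) {
        List<String> bundle = Arrays.asList("a", "b");
        processBundle(bundle, 1);  // first attempt: two writes
        processBundle(bundle, 1);  // retry with STABLE keys: dedup holds
        processBundle(bundle, 2);  // retry with fresh keys: duplicate writes
        System.out.println("distinct writes for 2 elements: " + written.size());
    }
}
```

If the runner replays the same committed keys (the stable-input guarantee), the retry is a no-op; with fresh keys the sink double-writes, which is exactly the corruption being discussed.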

What I don't want is to build something where the implementation details
are the spec, and not fundamental, which is sort of where Reshuffle lies.
This thread highlights that this is a pretty urgent problem with our SDKs
and runners that it would be very helpful to work on.

Kenn



On Fri, May 18, 2018 at 7:50 AM Eugene Kirpichov 
wrote:

> Agreed that it should be undeprecated, many users are getting confused by
> this.
> I know that some people are working on a replacement for at least one of
> its use cases (RequiresStableInput), but the use case of breaking fusion
> is, as of yet, unaddressed, and there's not much to be gained by keeping it
> deprecated.
>
> On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:
>
>> I am interested in more clarity on this as well. It has been deprecated
>> for a long time without a replacement, and its usage has only grown, both
>> within Beam code base as well as in user applications.
>>
>> If we are certain that it will not be removed before there is a good
>> replacement for it, can we undeprecate it until there are proper plans for
>> replacement?
>>
>> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>>
>>> I saw in a recent thread that the use of the Reshuffle transform was
>>> recommended to solve a user issue:
>>>
>>>
>>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>>>
>>> I can see why it may fix the reported issue. I am just curious about
>>> the fact that the Reshuffle transform is marked as both @Internal and
>>> @Deprecated in Beam's SDK.
>>>
>>> Do we have some alternative? So far the class documentation does not
>>> recommend any replacement.
>>>
>>


Re: Java code under main depends on junit?

2018-05-18 Thread Lukasz Cwik
I agree with separating it out as a separate sub-project, for the same
reason you specify; I just wanted to point out that it is less bad with
Gradle for internal use, as we are doing right now.

On Fri, May 18, 2018 at 10:35 AM Kenneth Knowles  wrote:

> Ah, nice. That means you can actually declare a dependency on test suites
> and get their dependencies in order to run them successfully.
>
> It does mean I can't just argue "it doesn't work" but have to go back to
> arguing that it is a design problem :-)
>
> A "test jar" is a jar containing a bunch of tests you can run, akin to an
> executable. Shouldn't depend on someone's tests in order to use their test
> utilities any more than you would depend on someone's main() program to
> call into a library, and shouldn't design your distributed artifacts to
> encourage this.
>
> Kenn
>
> On Fri, May 18, 2018 at 10:22 AM Lukasz Cwik  wrote:
>
>> Note that transitive dependencies in Gradle do work correctly for test
>> configurations so the issue with Maven not handling transitive test
>> dependencies is less important within the project. If we expect end users
>> to use these utilities (in which case they could be using Maven), then as
>> suggested many times before a separate test artifact is needed.
>>
>> On Thu, May 17, 2018 at 5:16 PM Thomas Weise  wrote:
>>
>>> Thanks!
>>>
>>> IMO we should at least run "mvn verify -DskipTests" in precommit until
>>> the maven build can be retired (== deleted from master).
>>>
>>>
>>> On Thu, May 17, 2018 at 5:00 PM, Anton Kedin  wrote:
>>>
 Opened PR  to fix the
 current build issue, opened BEAM-4358
  to extract test
 dependencies.

 Should we keep maven precommits running for now if we have to fix the
 issues like these? In the PR I had to fix another issue in the same
 project, and I suspect other projects are broken for me for similar 
 reasons.

 Regards,
 Anton

 On Thu, May 17, 2018 at 4:52 PM Kenneth Knowles  wrote:

> I know what you mean. But indeed, test artifacts are unsuitable to
> depend on since transitive deps don't work correctly. I think it makes
> sense to have a separate test utility. For the core, one reason we didn't
> was to have PAssert available in main. But now that we have Gradle we
> actually can do that because it is not a true cycle but a false cycle
> introduced by maven.
>
> For GCP it is even easier.
>
> Kenn
>
>
> On Thu, May 17, 2018, 16:28 Thomas Weise  wrote:
>
>> It is possible to depend on a test artifact to achieve the same, but
>> unfortunately not transitively.
>>
>> Mixing test utilities into the main artifacts seems undesirable,
>> since they are only needed for tests. It may give more food to the 
>> shading
>> monster also..
>>
>> So it is probably better to create a dedicated test tools artifact
>> that qualifies as transitive dependency?
>>
>> Thanks
>>
>>
>> On Thu, May 17, 2018 at 4:17 PM, Kenneth Knowles 
>> wrote:
>>
>>> This seems correct. Test jars are for tests. Utilities to be used
>>> for tests need to be in main jars. (If for no other reason, this is how
>>> transitive deps work)
>>>
>>> We've considered putting these things in a separate package (still
>>> in main). Just no one has done it.
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018, 16:04 Thomas Weise  wrote:
>>>
 Hi,

 Is the following dependency intended or an oversight?


 https://github.com/apache/beam/blob/06c70bdf871c5da8a115011b43f8072916cd79e8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/TestPubsub.java#L32

 It appears that dependent code is in test scope.

 Should the build flag this (the maven build fails)?

 Thanks


>>
>>>


Re: Proposal: keeping post-commit tests green

2018-05-18 Thread Andrew Pilloud
Blocking commits to master on test flaps seems critical here. The test
flaps won't get the attention they deserve as long as people are just
spamming their PRs with 'Run Java Precommit' until they turn green. I'm
guilty of this behavior and I know it masks new flaky tests.

I added a comment to your doc about detecting flaky tests. This can easily
be done by rerunning the postcommits during times when Jenkins would
otherwise be idle. You'll easily get a few dozen runs every weekend, you
just need a process to triage all the flakes and ensure there are bugs. I
worked on a project that did this along with blocking master on any post
commit failure. It was painful for the first few weeks, but things got
significantly better once most of the bugs were fixed.
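
The masking effect is easy to quantify. Here is a small sketch; the 10% per-run flake rate and the run counts are assumptions for illustration, not measurements:

```java
public class FlakeMath {
    // Probability that at least one of n independent runs shows an event
    // (a failure, or a pass) that occurs with the given per-run rate.
    static double atLeastOnce(double perRunRate, int runs) {
        return 1 - Math.pow(1 - perRunRate, runs);
    }

    public static void main(String[] args) {
        double p = 0.10;  // assumed per-run failure rate of one flaky test
        // A PR author who re-triggers the precommit a few times almost
        // always gets a green run, so the flake goes unreported:
        System.out.printf("green within 4 attempts: %.4f%n",
            atLeastOnce(1 - p, 4));
        // A weekend of ~48 otherwise-idle reruns almost always catches it:
        System.out.printf("flake surfaced in 48 runs: %.4f%n",
            atLeastOnce(p, 48));
    }
}
```

Under these assumptions, retry-spamming hides the flake almost every time, while scheduled idle-time reruns surface it with near certainty — which is the argument for triaging weekend runs rather than relying on PR retries.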

Andrew

On Fri, May 18, 2018 at 10:39 AM Kenneth Knowles  wrote:

> Love it. I would pull out from the doc also the key point: make the
> postcommit status constantly visible to everyone.
>
> Kenn
>
> On Fri, May 18, 2018 at 10:17 AM Mikhail Gryzykhin 
> wrote:
>
>> Hi everyone,
>>
>> I'm Mikhail and started working on Google Dataflow several months ago.
>> I'm really excited to work with the Beam open source community.
>>
>> I have a proposal to improve contributor experience by keeping
>> post-commit tests green.
>>
>> I'm looking to get community consensus and approval about the process for
>> keeping post-commit tests green and addressing post-commit test failures.
>>
>> Find full list of ideas brought in for discussion in this document:
>>
>> https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME
>>
>> Key points are:
>> 1. Add explicit tracking of failures via JIRA
>> 2. No-Commit policy when post-commit tests are red
>>
>> --Mikhail
>>
>>


Re: What is the Impulse and why do we need it?

2018-05-18 Thread Kenneth Knowles
I think it makes a lot of sense to move it to the Beam web site. There's
already a good landing point:
https://beam.apache.org/contribute/runner-guide/

That page is a collection of advice for legacy-style runners on how to use
runners-core, etc, and just general stuff about how to write one, gathered
from lots of emails I've written in response to questions from runner
authors.

The places I think need update specifically for impulse are:

 - conceptual list of primitives:
https://beam.apache.org/contribute/runner-guide/#ptransforms (it also does
not say CoGroupByKey because AFAIK the work to switch to it is not done)
 - how to implement the primitives:
https://beam.apache.org/contribute/runner-guide/#implementing-the-beam-primitives

The how-to-implement part needs a rewrite to just talk about how to use the
new utilities to do fusion and implement the runners side of the Fn API. It
is deliberately more of a tutorial and code walk than a reference doc, so
it is not redundant with the existing docs.

As for SDF, I think that ParDo is useful to talk about as "elementwise"
processing only at the high level, but needs to immediately be split into
"vanilla", "stateful" and "SDF" which are really different primitive modes
of computation.

And bigger picture, the last bit on
https://beam.apache.org/contribute/runner-guide/#writing-an-sdk-independent-runner
should be pulled up to be the main topic.

Kenn

On Fri, May 18, 2018 at 8:14 AM Lukasz Cwik  wrote:

> The Beam Runner API doc needs a lot of updating to discuss impulse and
> SDF (and deprecate / remove Read): https://s.apache.org/beam-runner-api
> It could also use examples from the Go/Python
> code bases.
>
> Alternatively we could start to codify this information on the Apache Beam
> website as the definitions/contracts are less in flux.
>
> On Fri, May 18, 2018 at 7:49 AM Eugene Kirpichov 
> wrote:
>
>> Hi Ismael,
>> Impulse is a primitive necessary for the Portability world, where sources
>> do not exist. Impulse is the only possible root of the pipeline, it emits a
>> single empty byte array, and it's all DoFn's and SDF's from there. E.g.
>> when using Fn API, Read.from(BoundedSource) is translated into: Impulse +
>> ParDo(emit source) + ParDo(call .split()) + reshuffle + ParDo(call
>> .createReader() and read from it).
>> Agree that it makes sense to document it somewhere on the portability
>> page.
>>
>> On Fri, May 18, 2018 at 7:21 AM Jean-Baptiste Onofré 
>> wrote:
>>
>>> Fully agree.
>>>
>>> I already started to take a look.
>>>
>>> Regards
>>> JB
>>>
>>> On 18/05/2018 16:12, Ismaël Mejía wrote:
>>> > I have seen multiple mentions of 'Impulse' in JIRAs and some on other
>>> > discussions, but have not seen any document or concrete explanation on
>>> > what's Impulse and why we need it. This seems like an internal
>>> > implementation detail but it is probably a good idea to explain it
>>> > somewhere (my apologies if this is in some document and I missed it).
>>> >
>>>
>>


Re: Java PreCommit seems broken

2018-05-18 Thread Lukasz Cwik
We would need the archetype task to depend on all the dependencies
publishToMavenLocal tasks transitively and then be configured to use
whatever that maven local is on Jenkins / dev machine. It would be best if
it was an ephemeral folder because it would be annoying to have stuff
installed underneath a dev's .m2/ directory that would need cleaning up.

On Fri, May 18, 2018 at 10:41 AM Kenneth Knowles  wrote:

> Is this just a build tweak, or are there costly steps that we'd have to
> add that would slow down presubmit? (with mvn I know that `test` and
> `install` did very different amounts of work - because mvn test didn't test
> the right artifacts, but maybe with Gradle not so much?)
>
> On Fri, May 18, 2018 at 9:14 AM Lukasz Cwik  wrote:
>
>> The problem with the way that the archetypes tests are run (now with
>> Gradle and in the past with Maven) is that they run against the nightly
>> snapshot and not against artifacts from the current build. To get them to
>> work, we would need to publish the dependent Maven modules to a temporary
>> repo and instruct the archetype project to use it for building/testing
>> purposes.
>>
>> On Fri, May 18, 2018 at 5:38 AM Kenneth Knowles  wrote:
>>
>>> Maybe something has changed, but the snapshots used to pull from the
>>> public snapshot repo. We got failures for a while every time we cut a
>>> release branch, but once there was a nightly snapshot they cleared up.
>>>
>>> Kenn
>>>
>>> On Thu, May 17, 2018 at 9:50 PM Scott Wegner  wrote:
>>>
 I noticed that the tests simply run "mvn clean install" on the
 archetype project. But I don't see any dependent task which installs built
 artifacts into the local Maven repo. Is that an oversight?

 If that's the case, perhaps the tests are failing sporadically when
 there are no previously installed snapshot artifacts cached on the machine.

 On Thu, May 17, 2018, 2:45 PM Pablo Estrada  wrote:

> I'm seeing failures on Maven Archetype-related tests.
>
> Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk
>
> And the failure is here specifically:
> https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116
>
>
> Does anyone know why this might be happening?
> Best
> -P.
> --
> Got feedback? go/pabloem-feedback
> 
>



Re: Java PreCommit seems broken

2018-05-18 Thread Kenneth Knowles
Is this just a build tweak, or are there costly steps that we'd have to add
that would slow down presubmit? (with mvn I know that `test` and `install`
did very different amounts of work - because mvn test didn't test the right
artifacts, but maybe with Gradle not so much?)

On Fri, May 18, 2018 at 9:14 AM Lukasz Cwik  wrote:

> The problem with the way that the archetypes tests are run (now with
> Gradle and in the past with Maven) is that they run against the nightly
> snapshot and not against artifacts from the current build. To get them to
> work, we would need to publish the dependent Maven modules to a temporary
> repo and instruct the archetype project to use it for building/testing
> purposes.
>
> On Fri, May 18, 2018 at 5:38 AM Kenneth Knowles  wrote:
>
>> Maybe something has changed, but the snapshots used to pull from the
>> public snapshot repo. We got failures for a while every time we cut a
>> release branch, but once there was a nightly snapshot they cleared up.
>>
>> Kenn
>>
>> On Thu, May 17, 2018 at 9:50 PM Scott Wegner  wrote:
>>
>>> I noticed that the tests simply run "mvn clean install" on the
>>> archetype project. But I don't see any dependent task which installs built
>>> artifacts into the local Maven repo. Is that an oversight?
>>>
>>> If that's the case, perhaps the tests are failing sporadically when
>>> there are no previously installed snapshot artifacts cached on the machine.
>>>
>>> On Thu, May 17, 2018, 2:45 PM Pablo Estrada  wrote:
>>>
 I'm seeing failures on Maven Archetype-related tests.

 Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk

 And the failure is here specifically:
 https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116


 Does anyone know why this might be happening?
 Best
 -P.
 --
 Got feedback? go/pabloem-feedback
 

>>>


Re: Proposal: keeping post-commit tests green

2018-05-18 Thread Kenneth Knowles
Love it. I would pull out from the doc also the key point: make the
postcommit status constantly visible to everyone.

Kenn

On Fri, May 18, 2018 at 10:17 AM Mikhail Gryzykhin 
wrote:

> Hi everyone,
>
> I'm Mikhail and started working on Google Dataflow several months ago. I'm
> really excited to work with the Beam open source community.
>
> I have a proposal to improve contributor experience by keeping post-commit
> tests green.
>
> I'm looking to get community consensus and approval about the process for
> keeping post-commit tests green and addressing post-commit test failures.
>
> Find full list of ideas brought in for discussion in this document:
>
> https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME
>
> Key points are:
> 1. Add explicit tracking of failures via JIRA
> 2. No-Commit policy when post-commit tests are red
>
> --Mikhail
>
>


Re: Java code under main depends on junit?

2018-05-18 Thread Kenneth Knowles
Ah, nice. That means you can actually declare a dependency on test suites
and get their dependencies in order to run them successfully.

It does mean I can't just argue "it doesn't work" but have to go back to
arguing that it is a design problem :-)

A "test jar" is a jar containing a bunch of tests you can run, akin to an
executable. Shouldn't depend on someone's tests in order to use their test
utilities any more than you would depend on someone's main() program to
call into a library, and shouldn't design your distributed artifacts to
encourage this.

Kenn

On Fri, May 18, 2018 at 10:22 AM Lukasz Cwik  wrote:

> Note that transitive dependencies in Gradle do work correctly for test
> configurations so the issue with Maven not handling transitive test
> dependencies is less important within the project. If we expect end users
> to use these utilities (in which case they could be using Maven), then as
> suggested many times before a separate test artifact is needed.
>
> On Thu, May 17, 2018 at 5:16 PM Thomas Weise  wrote:
>
>> Thanks!
>>
>> IMO we should at least run "mvn verify -DskipTests" in precommit until
>> the maven build can be retired (== deleted from master).
>>
>>
>> On Thu, May 17, 2018 at 5:00 PM, Anton Kedin  wrote:
>>
>>> Opened PR  to fix the current
>>> build issue, opened BEAM-4358
>>>  to extract test
>>> dependencies.
>>>
>>> Should we keep maven precommits running for now if we have to fix the
>>> issues like these? In the PR I had to fix another issue in the same
>>> project, and I suspect other projects are broken for me for similar reasons.
>>>
>>> Regards,
>>> Anton
>>>
>>> On Thu, May 17, 2018 at 4:52 PM Kenneth Knowles  wrote:
>>>
 I know what you mean. But indeed, test artifacts are unsuitable to
 depend on since transitive deps don't work correctly. I think it makes
 sense to have a separate test utility. For the core, one reason we didn't
 was to have PAssert available in main. But now that we have Gradle we
 actually can do that because it is not a true cycle but a false cycle
 introduced by maven.

 For GCP it is even easier.

 Kenn


 On Thu, May 17, 2018, 16:28 Thomas Weise  wrote:

> It is possible to depend on a test artifact to achieve the same, but
> unfortunately not transitively.
>
> Mixing test utilities into the main artifacts seems undesirable, since
> they are only needed for tests. It may give more food to the shading
> monster also..
>
> So it is probably better to create a dedicated test tools artifact
> that qualifies as transitive dependency?
>
> Thanks
>
>
> On Thu, May 17, 2018 at 4:17 PM, Kenneth Knowles 
> wrote:
>
>> This seems correct. Test jars are for tests. Utilities to be used for
>> tests need to be in main jars. (If for no other reason, this is how
>> transitive deps work)
>>
>> We've considered putting these things in a separate package (still in
>> main). Just no one has done it.
>>
>> Kenn
>>
>> On Thu, May 17, 2018, 16:04 Thomas Weise  wrote:
>>
>>> Hi,
>>>
>>> Is the following dependency intended or an oversight?
>>>
>>>
>>> https://github.com/apache/beam/blob/06c70bdf871c5da8a115011b43f8072916cd79e8/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/TestPubsub.java#L32
>>>
>>> It appears that dependent code is in test scope.
>>>
>>> Should the build flag this (the maven build fails)?
>>>
>>> Thanks
>>>
>>>
>
>>


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-18 Thread Lukasz Cwik
I believe JB is referring to https://issues.apache.org/jira/browse/BEAM-4060

On Fri, May 18, 2018 at 10:16 AM Scott Wegner  wrote:

> J.B., can you give any context on what metadata is missing? Is there a
> JIRA?
>
> On Thu, May 17, 2018 at 9:30 PM Jean-Baptiste Onofré 
> wrote:
>
>> Hi,
>>
>> The build was OK yesterday but the maven-metadata is still missing.
>>
>> That's the point to fix before being able to move forward on the
>> release.
>>
>> I'm gonna tackle this later today.
>>
>> Regards
>> JB
>>
>> On 05/18/2018 02:41 AM, Ahmet Altay wrote:
>> > Hi JB and all,
>> >
>> > I wanted to follow up on my previous email. The python streaming issue I
>> > mentioned is resolved and removed from the blocker list. Blocker list
>> is empty
>> > now. You can go ahead with the release branch cut when you are ready.
>> >
>> > Thank you,
>> > Ahmet
>> >
>> >
>> > On Sun, May 13, 2018 at 8:43 AM, Jean-Baptiste Onofré wrote:
>> >
>> > Hi guys,
>> >
>> > just to let you know that the build fully passed on my box.
>> >
>> > I'm testing the artifacts right now.
>> >
>> > Regards
>> > JB
>> >
>> > On 06/04/2018 10:48, Jean-Baptiste Onofré wrote:
>> >
>> > Hi guys,
>> >
>> > Apache Beam 2.4.0 has been released on March 20th.
>> >
>> > According to our cycle of release (roughly 6 weeks), we should
>> think
>> > about 2.5.0.
>> >
>> > I'm volunteer to tackle this release.
>> >
>> > I'm proposing the following items:
>> >
>> > 1. We start the Jira triage now, up to Tuesday
>> > 2. I would like to cut the release on Tuesday night (Europe
>> time)
>> > 2bis. I think it's wiser to still use Maven for this release.
>> Do you
>> > think we
>> > will be ready to try a release with Gradle ?
>> >
>> > After this release, I would like a discussion about:
>> > 1. Gradle release (if we release 2.5.0 with Maven)
>> > 2. Isolate release cycle per Beam part. I think it would be
>> interesting
>> > to have
>> > different release cycle: SDKs, DSLs, Runners, IOs. That's
>> another
>> > discussion, I
>> > will start a thread about that.
>> >
>> > Thoughts ?
>> >
>> > Regards
>> > JB
>> >
>> >
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>


Proposal: keeping post-commit tests green

2018-05-18 Thread Mikhail Gryzykhin
Hi everyone,

I'm Mikhail and started working on Google Dataflow several months ago. I'm
really excited to work with the Beam open source community.

I have a proposal to improve contributor experience by keeping post-commit
tests green.

I'm looking to get community consensus and approval about the process for
keeping post-commit tests green and addressing post-commit test failures.

Find full list of ideas brought in for discussion in this document:
https://docs.google.com/document/d/1sczGwnCvdHiboVajGVdnZL0rfnr7ViXXAebBAf_uQME

Key points are:
1. Add explicit tracking of failures via JIRA
2. No-Commit policy when post-commit tests are red

--Mikhail


Re: [PROPOSAL] Preparing 2.5.0 release next week

2018-05-18 Thread Scott Wegner
J.B., can you give any context on what metadata is missing? Is there a JIRA?

On Thu, May 17, 2018 at 9:30 PM Jean-Baptiste Onofré 
wrote:

> Hi,
>
> The build was OK yesterday but the maven-metadata is still missing.
>
> That's the point to fix before being able to move forward on the release.
>
> I'm gonna tackle this later today.
>
> Regards
> JB
>
> On 05/18/2018 02:41 AM, Ahmet Altay wrote:
> > Hi JB and all,
> >
> > I wanted to follow up on my previous email. The python streaming issue I
> > mentioned is resolved and removed from the blocker list. Blocker list is
> empty
> > now. You can go ahead with the release branch cut when you are ready.
> >
> > Thank you,
> > Ahmet
> >
> >
> > On Sun, May 13, 2018 at 8:43 AM, Jean-Baptiste Onofré wrote:
> >
> > Hi guys,
> >
> > just to let you know that the build fully passed on my box.
> >
> > I'm testing the artifacts right now.
> >
> > Regards
> > JB
> >
> > On 06/04/2018 10:48, Jean-Baptiste Onofré wrote:
> >
> > Hi guys,
> >
> > Apache Beam 2.4.0 has been released on March 20th.
> >
> > According to our cycle of release (roughly 6 weeks), we should
> think
> > about 2.5.0.
> >
> > I'm volunteer to tackle this release.
> >
> > I'm proposing the following items:
> >
> > 1. We start the Jira triage now, up to Tuesday
> > 2. I would like to cut the release on Tuesday night (Europe time)
> > 2bis. I think it's wiser to still use Maven for this release. Do
> you
> > think we
> > will be ready to try a release with Gradle ?
> >
> > After this release, I would like a discussion about:
> > 1. Gradle release (if we release 2.5.0 with Maven)
> > 2. Isolate release cycle per Beam part. I think it would be
> interesting
> > to have
> > different release cycle: SDKs, DSLs, Runners, IOs. That's another
> > discussion, I
> > will start a thread about that.
> >
> > Thoughts ?
> >
> > Regards
> > JB
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: Java PreCommit seems broken

2018-05-18 Thread Lukasz Cwik
The problem with the way that the archetypes tests are run (now with Gradle
and in the past with Maven) is that they run against the nightly snapshot
and not against artifacts from the current build. To get them to work, we
would need to publish the dependent Maven modules to a temporary repo and
instruct the archetype project to use it for building/testing purposes.

On Fri, May 18, 2018 at 5:38 AM Kenneth Knowles  wrote:

> Maybe something has changed, but the snapshots used to pull from the
> public snapshot repo. We got failures for a while every time we cut a
> release branch, but once there was a nightly snapshot they cleared up.
>
> Kenn
>
> On Thu, May 17, 2018 at 9:50 PM Scott Wegner  wrote:
>
>> I noticed that these tests simply run "mvn clean install" on the
>> archetype project. But I don't see any dependent task which installs built
>> artifacts into the local Maven repo. Is that an oversight?
>>
>> If that's the case, perhaps the tests are failing sporadically when there
>> are no previously installed snapshot artifacts cached on the machine.
>>
>> On Thu, May 17, 2018, 2:45 PM Pablo Estrada  wrote:
>>
>>> I'm seeing failures on Maven Archetype-related tests.
>>>
>>> Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk
>>>
>>> And the failure is here specifically:
>>> https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116
>>>
>>>
>>> Does anyone know why this might be happening?
>>> Best
>>> -P.
>>> --
>>> Got feedback? go/pabloem-feedback
>>> 
>>>
>>


Re: What is the Impulse and why do we need it?

2018-05-18 Thread Lukasz Cwik
The Beam Runner API doc needs a lot of updating to discuss Impulse and SDF
(and to deprecate/remove Read): https://s.apache.org/beam-runner-api
It could also use examples from the Go/Python code bases.

Alternatively we could start to codify this information on the Apache Beam
website as the definitions/contracts are less in flux.

On Fri, May 18, 2018 at 7:49 AM Eugene Kirpichov 
wrote:

> Hi Ismael,
> Impulse is a primitive necessary for the Portability world, where sources
> do not exist. Impulse is the only possible root of the pipeline, it emits a
> single empty byte array, and it's all DoFn's and SDF's from there. E.g.
> when using Fn API, Read.from(BoundedSource) is translated into: Impulse +
> ParDo(emit source) + ParDo(call .split()) + reshuffle + ParDo(call
> .createReader() and read from it).
> Agree that it makes sense to document it somewhere on the portability page.
>
> On Fri, May 18, 2018 at 7:21 AM Jean-Baptiste Onofré 
> wrote:
>
>> Fully agree.
>>
>> I already started to take a look.
>>
>> Regards
>> JB
>>
>> On 18/05/2018 16:12, Ismaël Mejía wrote:
>> > I have seen multiple mentions of 'Impulse' in JIRAs and some on other
>> > discussions, but have not seen any document or concrete explanation on
>> > what's Impulse and why we need it. This seems like an internal
>> > implementation detail but it is probably a good idea to explain it
>> > somewhere (my excuses if this is in some document and I missed it).
>> >
>>
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Eugene Kirpichov
Agreed that it should be undeprecated; many users are getting confused by
this.
I know that some people are working on a replacement for at least one of
its use cases (RequiresStableInput), but the use case of breaking fusion
is, as of yet, unaddressed, and there's not much to be gained by keeping it
deprecated.
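As a concrete illustration of why the random key is harmless, here is a toy
sketch in plain Python (not Beam's actual implementation) of what
Reshuffle.viaRandomKey() does conceptually: attach a random key, group by it
to force a shuffle, then drop it, leaving the element values untouched:

```python
import random
from collections import defaultdict

def reshuffle_via_random_key(elements, num_shards=4):
    """Illustrative sketch (not Beam code) of Reshuffle.viaRandomKey():
    tag each element with a synthetic random key, group by key (in a real
    runner this is the shuffle that breaks fusion and redistributes work),
    then discard the keys again."""
    # Step 1: pair each element with a random shard key.
    keyed = [(random.randrange(num_shards), e) for e in elements]
    # Step 2: "GroupByKey" - the materialization point in a real runner.
    grouped = defaultdict(list)
    for k, v in keyed:
        grouped[k].append(v)
    # Step 3: drop the synthetic keys; element values are unchanged.
    return [v for vs in grouped.values() for v in vs]

out = reshuffle_via_random_key(list(range(10)))
```

Because the key is discarded before any user code sees it, a retry that picks
different random keys changes only how elements are distributed, not what
they are.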

On Fri, May 18, 2018 at 7:45 AM Raghu Angadi  wrote:

> I am interested in more clarity on this as well. It has been deprecated
> for a long time without a replacement, and its usage has only grown, both
> within Beam code base as well as in user applications.
>
> If we are certain that it will not be removed before there is a good
> replacement for it, can we undeprecate it until there are proper plans for
> replacement?
>
> On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:
>
>> I saw in a recent thread that the use of the Reshuffle transform was
> recommended to solve a user issue:
>>
>>
>> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>>
>> I can see why it may fix the reported issue. I am just curious about
>> the fact that the Reshuffle transform is marked as both @Internal and
>> @Deprecated in Beam's SDK.
>>
>> Do we have some alternative? So far the class documentation does not
>> recommend any replacement.
>>
>


Re: What is the Impulse and why do we need it?

2018-05-18 Thread Eugene Kirpichov
Hi Ismael,
Impulse is a primitive necessary for the Portability world, where sources
do not exist. Impulse is the only possible root of the pipeline, it emits a
single empty byte array, and it's all DoFn's and SDF's from there. E.g.
when using Fn API, Read.from(BoundedSource) is translated into: Impulse +
ParDo(emit source) + ParDo(call .split()) + reshuffle + ParDo(call
.createReader() and read from it).
Agree that it makes sense to document it somewhere on the portability page.
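A toy sketch of this expansion in plain Python (the helper names and the
(lo, hi) "source" representation are made up for illustration; real Beam
expresses each step as a DoFn over the Fn API):

```python
# Illustrative decomposition of Read.from(BoundedSource) under portability:
# Impulse emits one empty byte string, and everything else is DoFns.

def impulse():
    # The only root primitive: a single empty byte array.
    yield b""

def emit_source(_impulse_element, source):
    # ParDo(emit source): ignore the impulse payload, output the source.
    yield source

def split(source):
    # ParDo(call .split()): here we fake splitting a range into two chunks.
    lo, hi = source
    mid = (lo + hi) // 2
    yield (lo, mid)
    yield (mid, hi)

def read(source_split):
    # ParDo(create a reader and read): fake "reading" a numeric range.
    lo, hi = source_split
    for i in range(lo, hi):
        yield i

records = []
for imp in impulse():
    for src in emit_source(imp, (0, 10)):
        for s in split(src):          # a reshuffle would sit between
            records.extend(read(s))   # split and read in a real runner
```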

On Fri, May 18, 2018 at 7:21 AM Jean-Baptiste Onofré 
wrote:

> Fully agree.
>
> I already started to take a look.
>
> Regards
> JB
>
> On 18/05/2018 16:12, Ismaël Mejía wrote:
> > I have seen multiple mentions of 'Impulse' in JIRAs and some on other
> > discussions, but have not seen any document or concrete explanation on
> > what's Impulse and why we need it. This seems like an internal
> > implementation detail but it is probably a good idea to explain it
> > somewhere (my excuses if this is in some document and I missed it).
> >
>


Re: What is the future of Reshuffle?

2018-05-18 Thread Raghu Angadi
I am interested in more clarity on this as well. It has been deprecated for
a long time without a replacement, and its usage has only grown, both
within Beam code base as well as in user applications.

If we are certain that it will not be removed before there is a good
replacement for it, can we undeprecate it until there are proper plans for
replacement?

On Fri, May 18, 2018 at 7:12 AM Ismaël Mejía  wrote:

> I saw in a recent thread that the use of the Reshuffle transform was
> recommended to solve an user issue:
>
>
> https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E
>
> I can see why it may fix the reported issue. I am just curious about
> the fact that the Reshuffle transform is marked as both @Internal and
> @Deprecated in Beam's SDK.
>
> Do we have some alternative? So far the class documentation does not
> recommend any replacement.
>


Re: What is the Impulse and why do we need it?

2018-05-18 Thread Jean-Baptiste Onofré

Fully agree.

I already started to take a look.

Regards
JB

On 18/05/2018 16:12, Ismaël Mejía wrote:

I have seen multiple mentions of 'Impulse' in JIRAs and some on other
discussions, but have not seen any document or concrete explanation on
what's Impulse and why we need it. This seems like an internal
implementation detail but it is probably a good idea to explain it
somewhere (my excuses if this is in some document and I missed it).



What is the Impulse and why do we need it?

2018-05-18 Thread Ismaël Mejía
I have seen multiple mentions of 'Impulse' in JIRAs and some on other
discussions, but have not seen any document or concrete explanation on
what's Impulse and why we need it. This seems like an internal
implementation detail but it is probably a good idea to explain it
somewhere (my excuses if this is in some document and I missed it).


What is the future of Reshuffle?

2018-05-18 Thread Ismaël Mejía
I saw in a recent thread that the use of the Reshuffle transform was
recommended to solve a user issue:

https://lists.apache.org/thread.html/87ef575ac67948868648e0a8110be242f811bfff8fdaa7f9b758b933@%3Cdev.beam.apache.org%3E

I can see why it may fix the reported issue. I am just curious about
the fact that the Reshuffle transform is marked as both @Internal and
@Deprecated in Beam's SDK.

Do we have some alternative? So far the class documentation does not
recommend any replacement.


Re: [DISCUSS] Remove findbugs from sdks/java

2018-05-18 Thread Ismaël Mejía
As part of the error-prone effort Tim has also been cleaning up other static
analysis warnings as reported by IntelliJ's Inspect -> Analyze code. I
think this is a good moment to tackle some of those too, e.g. scoping, unused
variables, redundancies, etc. So I hope the others taking part in this work
try to address a chunk of those as well.

Extra note: of course IntelliJ's code analysis should be judged a bit, as
there are always false positives or undesirable changes.

On Fri, May 18, 2018 at 7:56 AM Jean-Baptiste Onofré 
wrote:

> Thanks Tim.

> I think we will be able to remove findbugs after some runs/checks using
ErrorProne to see the gaps.

> Regards
> JB
> On 18 May 2018 at 07:49, Tim Robertson  wrote:

>> Thank you all.

>> I think this is clear.  Removing findbugs can happen at a future point.

>> @Scott - I've been working through the Java IO error-prone issues (some
already merged, some with open PRs now) so will take those IO JIRAs. I will
enable failOnWarning, address dependency issues for findbugs, and tackle the
error-prone warnings.


>> On Fri, May 18, 2018 at 1:07 AM, Scott Wegner  wrote:

>>> +0.02173913

>>> I'm happy to replace FindBugs with ErrorProne, but we need to first
upgrade ErrorProne analyzer warnings to errors. Currently the codebase is
full of warning spam, and there's no enforcement preventing future
violations from being added.

>>> I've done the work for enforcing ErrorProne analysis on java-sdk-core
[1], and I've sharded out the rest of the Java components in JIRA issues
[2] (45 total).  Fixing the issues is relatively straightforward, and I've
tried to provide enough guidance to make them suitable as starter tasks (example:
[3]). Teng Peng has already started on Spark [4] (thanks!)

>>> [1] https://github.com/apache/beam/pull/5319
>>> [2]
https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20%3D%20Open%20AND%20labels%20%3D%20errorprone
>>> [3] https://issues.apache.org/jira/browse/BEAM-4347
>>> [4] https://issues.apache.org/jira/browse/BEAM-4318

>>> On Thu, May 17, 2018 at 2:00 PM Ismaël Mejía  wrote:

 +0.7 also. Findbugs support for more recent versions of Java is
lacking and
 the maintenance seems frozen in time.

 As a possible plan B, can we identify the missing important validations
 to see how much we lose? If it is considerable, maybe we can create a
 minimal configuration for those, and eventually migrate from findbugs to
 spotbugs (https://github.com/spotbugs/spotbugs/), which seems at least to
 be maintained and is the most active findbugs fork.


 On Thu, May 17, 2018 at 9:31 PM Kenneth Knowles  wrote:

 > +0.7 I think we should work to remove findbugs. Errorprone covers
most of
 the same stuff but better and faster.

 > The one thing I'm not sure about is nullness analysis. Findbugs has
some
 serious limitations there but it really improves code quality and
prevents
 blunders. I'm not sure errorprone covers that. I know the Checker
analyzer
 has a full solution that makes NPE impossible as in most modern
languages.
 Maybe that is easy to plug in. The core Java SDK is a good candidate
for
 the first place to do it since it is affects everything else.

 > On Thu, May 17, 2018 at 12:02 PM Tim Robertson <
timrobertson...@gmail.com>
 wrote:

 >> Hi all,
 >> [bringing a side thread discussion from slack to here]

 >> We're tackling error-prone warnings now and we aim to fail the
build on
 warnings raised [1].

 >> Enabling failOnWarning also fails the build on findbugs warnings.
 Currently I see places where these arise from missing a dependency on
 findbugs_annotations, and I asked on Slack the best way to introduce this
 globally in Gradle.

 >> In that discussion the idea was floated to consider removing
findbugs
 completely given it is older, has licensing considerations and is not
 released regularly.

 >> What do people think about this idea please?

 >> Thanks,
 >> Tim
 >> [1]

https://lists.apache.org/thread.html/95aae2785c3cd728c2d3378cbdff2a7ba19caffcd4faa2049d2e2f46@%3Cdev.beam.apache.org%3E


Re: Current progress on Portable runners

2018-05-18 Thread Thomas Weise
Most of it should probably go to
https://beam.apache.org/contribute/portability/

Also for reference, here is the prototype doc:
https://s.apache.org/beam-portability-team-doc

Thomas

On Fri, May 18, 2018 at 5:35 AM, Kenneth Knowles  wrote:

> This is awesome. Would you be up for adding a brief description at
> https://beam.apache.org/contribute/#works-in-progress and maybe a pointer
> to a gdoc with something like the contents of this email? (my reasoning is
> (a) keep the contribution guide concise but (b) all this detail is helpful
> yet (c) the detail may be ever-changing so making a separate web page is
> not the best format)
>
> Kenn
>
> On Thu, May 17, 2018 at 3:13 PM Eugene Kirpichov 
> wrote:
>
>> Hi all,
>>
>> A little over a month ago, a large group of Beam community members has
>> been working on a prototype of a portable Flink runner - that is, a runner
>> that can execute Beam pipelines on Flink via the Portability API.
>> The prototype was developed in a separate branch and was
>> successfully demonstrated at Flink Forward, where it ran Python and Go
>> pipelines in a limited setting.
>>
>> Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
>> Sidhom and myself) have been working on productionizing the prototype to
>> address its limitations and do things "the right way", preparing to reuse
>> this work for developing other portable runners (e.g. Spark). This involves
>> a surprising amount of work, since many important design and implementation
>> concerns could be ignored for the purposes of a prototype. I wanted to give
>> an update on where we stand now.
>>
>> Our immediate milestone in sight is *Run Java and Python batch WordCount
>> examples against a distributed remote Flink cluster*. That involves a
>> few moving parts, roughly in order of appearance:
>>
>> *Job submission:*
>> - The SDK is configured to use a "portable runner", whose responsibility
>> is to run the pipeline against a given JobService endpoint.
>> - The portable runner converts the pipeline to a portable Pipeline proto
>> - The runner finds out which artifacts it needs to stage, and stages
>> them against an ArtifactStagingService
>> - A Flink-specific JobService receives the Pipeline proto, performs some
>> optimizations (e.g. fusion) and translates it to Flink datasets and
>> functions
>>
>> *Job execution:*
>> - A Flink function executes a fused chain of Beam transforms (an
>> "executable stage") by converting the input and the stage to bundles and
>> executing them against an SDK harness
>> - The function starts the proper SDK harness, auxiliary services (e.g.
>> artifact retrieval, side input handling) and wires them together
>> - The function feeds the data to the harness and receives data back.
>>
>> *And here is our status of implementation for these parts:* basically,
>> almost everything is either done or in review.
>>
>> *Job submission:*
>> - General-purpose portable runner in the Python SDK: done; Java SDK:
>> also done
>> - Artifact staging from the Python SDK: in review (two PRs); in Java,
>> it's also done
>> - Flink JobService: in review
>> - Translation from a Pipeline proto to Flink datasets and functions: done
>> - ArtifactStagingService implementation that stages artifacts to a
>> location on a distributed filesystem: in development (design is clear)
>>
>> *Job execution:*
>> - Flink function for executing via an SDK harness: done
>> - APIs for managing lifecycle of an SDK harness: done
>> - Specific implementation of those APIs using Docker: part done, part in
>> review
>> - ArtifactRetrievalService that retrieves artifacts from the location
>> where ArtifactStagingService staged them: in development.
>>
>> We expect that the in-review parts will be done, and the in-development
>> parts be developed, in the next 2-3 weeks. We will, of course, update the
>> community when this important milestone is reached.
>>
>> *After that, the next milestones include:*
>> - Set up Java, Python and Go ValidatesRunner tests to run against the
>> portable Flink runner, and get them to pass
>> - Expand Python and Go to parity in terms of such test coverage
>> - Implement the portable Spark runner, with a similar lifecycle but
>> reusing almost all of the Flink work
>> - Add support for streaming to both (which requires SDF - that work is
>> progressing in parallel and by this point should be in a suitable 

Re: Proposal: keeping precommit times fast

2018-05-18 Thread Kenneth Knowles
I like the idea. I think it is a good time for the project to start
tracking this and keeping it usable.

Certainly 2 hours is more than enough, is that not so? The Java precommit
seems to take <=40 minutes while Python takes ~20 and Go is so fast it
doesn't matter. Do we have enough stragglers that we don't make it in the
95th percentile? Is the time spent in the Jenkins queue?

For our current coverage, I'd be willing to go for:

 - 1 hr hard cap (someone better at stats could choose %ile)
 - roll back or remove test from precommit if fix looks like more than 1
week (roll back if it is perf degradation, remove test from precommit if it
is additional coverage that just doesn't fit in the time)

There's a longer-term issue that doing a full build each time is expected
to linearly scale up with the size of our repo (it is the monorepo problem
but for a minirepo) so there is no cap that is feasible until we have
effective cross-build caching. And my long-term goal would be <30 minutes.
At the latency of opening a pull request and then checking your email
that's not burdensome, but an hour is.
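For what it's worth, the proposed "95th percentile over the past 4 weeks"
check is cheap to compute; a sketch using the nearest-rank method on made-up
precommit durations (in minutes):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that pct% of the
    samples are <= it. Enough for a capacity check like 'precommit time at
    the 95th percentile over the past 4 weeks'."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical precommit durations in minutes over recent runs.
durations = [22, 35, 38, 40, 41, 44, 45, 47, 52, 130]
p95 = percentile(durations, 95)
within_cap = p95 <= 120  # the proposed 2-hour cap
```

Note that with only ten samples the nearest-rank 95th percentile is just the
maximum, so a single straggler run dominates the number; that is worth keeping
in mind when choosing the percentile and the window.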

Kenn

On Thu, May 17, 2018 at 6:54 PM Udi Meiri  wrote:

> HI,
> I have a proposal to improve contributor experience by keeping precommit
> times low.
>
> I'm looking to get community consensus and approval about:
> 1. How long should precommits take. 2 hours @95th percentile over the past
> 4 weeks is the current proposal.
> 2. The process for dealing with slowness. Do we: fix, roll back, remove a
> test from precommit?
> Rolling back if a fix is estimated to take longer than 2 weeks is the
> current proposal.
>
>
> https://docs.google.com/document/d/1udtvggmS2LTMmdwjEtZCcUQy6aQAiYTI3OrTP8CLfJM/edit?usp=sharing
>


Re: Java PreCommit seems broken

2018-05-18 Thread Kenneth Knowles
Maybe something has changed, but the snapshots used to pull from the public
snapshot repo. We got failures for a while every time we cut a release
branch, but once there was a nightly snapshot they cleared up.

Kenn

On Thu, May 17, 2018 at 9:50 PM Scott Wegner  wrote:

> I noticed that these tests simply run "mvn clean install" on the archetype
> project. But I don't see any dependent task which installs built artifacts
> into the local Maven repo. Is that an oversight?
>
> If that's the case, perhaps the tests are failing sporadically when there
> are no previously installed snapshot artifacts cached on the machine.
>
> On Thu, May 17, 2018, 2:45 PM Pablo Estrada  wrote:
>
>> I'm seeing failures on Maven Archetype-related tests.
>>
>> Build Scan of a sample run: https://scans.gradle.com/s/kr23q43mh6fmk
>>
>> And the failure is here specifically:
>> https://scans.gradle.com/s/kr23q43mh6fmk/console-log?task=:beam-sdks-java-maven-archetypes-examples:generateAndBuildArchetypeTest#L116
>>
>>
>> Does anyone know why this might be happening?
>> Best
>> -P.
>> --
>> Got feedback? go/pabloem-feedback
>> 
>>
>


Re: Current progress on Portable runners

2018-05-18 Thread Kenneth Knowles
This is awesome. Would you be up for adding a brief description at
https://beam.apache.org/contribute/#works-in-progress and maybe a pointer
to a gdoc with something like the contents of this email? (my reasoning is
(a) keep the contribution guide concise but (b) all this detail is helpful
yet (c) the detail may be ever-changing so making a separate web page is
not the best format)

Kenn

On Thu, May 17, 2018 at 3:13 PM Eugene Kirpichov 
wrote:

> Hi all,
>
> A little over a month ago, a large group of Beam community members has
> been working on a prototype of a portable Flink runner - that is, a runner
> that can execute Beam pipelines on Flink via the Portability API.
> The prototype was developed in a separate branch and was
> successfully demonstrated at Flink Forward, where it ran Python and Go
> pipelines in a limited setting.
>
> Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
> Sidhom and myself) have been working on productionizing the prototype to
> address its limitations and do things "the right way", preparing to reuse
> this work for developing other portable runners (e.g. Spark). This involves
> a surprising amount of work, since many important design and implementation
> concerns could be ignored for the purposes of a prototype. I wanted to give
> an update on where we stand now.
>
> Our immediate milestone in sight is *Run Java and Python batch WordCount
> examples against a distributed remote Flink cluster*. That involves a few
> moving parts, roughly in order of appearance:
>
> *Job submission:*
> - The SDK is configured to use a "portable runner", whose responsibility
> is to run the pipeline against a given JobService endpoint.
> - The portable runner converts the pipeline to a portable Pipeline proto
> - The runner finds out which artifacts it needs to stage, and stages them
> against an ArtifactStagingService
> - A Flink-specific JobService receives the Pipeline proto, performs some
> optimizations (e.g. fusion) and translates it to Flink datasets and
> functions
>
> *Job execution:*
> - A Flink function executes a fused chain of Beam transforms (an
> "executable stage") by converting the input and the stage to bundles and
> executing them against an SDK harness
> - The function starts the proper SDK harness, auxiliary services (e.g.
> artifact retrieval, side input handling) and wires them together
> - The function feeds the data to the harness and receives data back.
>
> *And here is our status of implementation for these parts:* basically,
> almost everything is either done or in review.
>
> *Job submission:*
> - General-purpose portable runner in the Python SDK: done; Java SDK:
> also done
> - Artifact staging from the Python SDK: in review (two PRs); in Java,
> it's also done
> - Flink JobService: in review
> - Translation from a Pipeline proto to Flink datasets and functions: done
> - ArtifactStagingService implementation that stages artifacts to a
> location on a distributed filesystem: in development (design is clear)
>
> *Job execution:*
> - Flink function for executing via an SDK harness: done
> - APIs for managing lifecycle of an SDK harness: done
> - Specific implementation of those APIs using Docker: part done, part in
> review
> - ArtifactRetrievalService that retrieves artifacts from the location
> where ArtifactStagingService staged them: in development.
>
> We expect that the in-review parts will be done, and the in-development
> parts be developed, in the next 2-3 weeks. We will, of course, update the
> community when this important milestone is reached.
>
> *After that, the next milestones include:*
> - Set up Java, Python and Go ValidatesRunner tests to run against the
> portable Flink runner, and get them to pass
> - Expand Python and Go to parity in terms of such test coverage
> - Implement the portable Spark runner, with a similar lifecycle but
> reusing almost all of the Flink work
> - Add support for streaming to both (which requires SDF - that work is
> progressing in parallel and by this point should be in a suitable place)
>
> *For people who would like to get involved in this effort: *You can
> already help out by improving ValidatesRunner test coverage in Python and
> Go. Java has >300 such tests, Python has only a handful. There'll be a
> large amount of parallelizable work once we get the VR test suites running
> - stay tuned. SDF+Portability is also expected to produce a 

Re: Current progress on Portable runners

2018-05-18 Thread Robert Bradshaw
On Thu, May 17, 2018 at 10:25 PM Thomas Weise  wrote:

> Hi Eugene,

> Thanks for putting this together, this is a very nice update and brings
> much needed visibility to those hoping to make use of the portability
> features or contribute to them.

+1, this is a great summary.

> Since the P1 (MVP) milestone is "wordcount" and some of the next things
> listed are more contributor oriented, perhaps we can get more detailed on
> what functionality users can expect?

The way things are structured, once we can run wordcount, we'll be able to
run almost all batch pipelines.

> The next P2 milestone is basically everything and that is a lot. It might
> actually help to break this down a bit more. A couple of things that I'm
> specifically interested in for Python on Flink:

> AFAIK state and timer support in Python are not being worked on yet, is
> anyone planning to and any idea by when SDK and portable runner might
> support it?

IIRC, Luke's starting to investigate portable support for state (and
timers). The Python SDK does not offer them yet; this work is ripe for
someone to pick up. I'd be more than happy to talk to anyone about design
here. (The Python SDK already has some notion of state in the side input
implementation, but how exactly timers will be plumbed back is still a bit
less clear.)

> Session windows are supported in the Python SDK, but will they (and all
other windowing features) work equally well on the portable Flink runner?
We know that custom window functions will need work.

The "standard" known window fn URNs (global, fixed, sliding, and sessions)
will work equally well on the portable Flink runner. WindowFns with URNs
that are not built into the runner core (e.g. "PickledPythonWindowFn") are
a bit harder, and there's still design work to do for how best to support
merging in that case.

Windowing also impacts side inputs, and this should "just work" as long as
the runner can understand the window coder (currently we only have Global
and IntervalWindows) and probably wouldn't be too hard to augment with
length prefixing for arbitrary window types (given that the window mapping
happens in the SDK).
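For intuition, the "standard" fixed and sliding window fns mentioned above
are just timestamp arithmetic; a sketch with integer timestamps (not Beam's
actual WindowFn API):

```python
def fixed_windows(timestamp, size):
    """Assign a timestamp to its single fixed window [start, start + size)."""
    start = timestamp - (timestamp % size)
    return [(start, start + size)]

def sliding_windows(timestamp, size, period):
    """Assign a timestamp to every sliding window of the given size that
    contains it; window starts are spaced `period` apart."""
    last_start = timestamp - (timestamp % period)
    windows = []
    start = last_start
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= period
    return sorted(windows)

fw = fixed_windows(7, size=5)               # -> [(5, 10)]
sw = sliding_windows(7, size=10, period=5)  # -> [(0, 10), (5, 15)]
```

Global windowing is the degenerate case (every element lands in one window
covering all time), and sessions are the merging case that the thread notes
still needs design work for custom fns.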

> BTW can you clarify the dependency between streaming support (which I'm
working on) and SDF. It refers to new connectors?

It's pretty straightforward to translate a bounded source into an impulse, a
DoFn that outputs splits, and then a DoFn that reads the splits, none of
which requires the S part of SDF (until one wants to do liquid sharding of
course). We'll actually need the full splittable protocol over the Fn API
to wrap existing unbounded sources (optimistically hoping there's a clean
map) as (a series of) SDFs or write new ones.

- Robert


> On Thu, May 17, 2018 at 3:12 PM, Eugene Kirpichov 
wrote:

>> Hi all,

>> A little over a month ago, a large group of Beam community members has
been working on a prototype of a portable Flink runner - that is, a runner
that can execute Beam pipelines on Flink via the Portability API. The
prototype was developed in a separate branch and was successfully
demonstrated at Flink Forward, where it ran Python and Go pipelines in a
limited setting.

>> Since then, a smaller group of people (Ankur Goenka, Axel Magnuson, Ben
Sidhom and myself) have been working on productionizing the prototype to
address its limitations and do things "the right way", preparing to reuse
this work for developing other portable runners (e.g. Spark). This involves
a surprising amount of work, since many important design and implementation
concerns could be ignored for the purposes of a prototype. I wanted to give
an update on where we stand now.

>> Our immediate milestone in sight is Run Java and Python batch WordCount
examples against a distributed remote Flink cluster. That involves a few
moving parts, roughly in order of appearance:

>> Job submission:
>> - The SDK is configured to use a "portable runner", whose responsibility
is to run the pipeline against a given JobService endpoint.
>> - The portable runner converts the pipeline to a portable Pipeline proto
>> - The runner finds out which artifacts it needs to stage, and stages
them against an ArtifactStagingService
>> - A Flink-specific JobService receives the Pipeline proto, performs some
optimizations (e.g. fusion) and translates it to Flink datasets and
functions

>> Job execution:
>> - A Flink function executes a fused chain of Beam transforms (an
"executable stage") by converting the input and the stage to bundles and
executing them against an SDK harness
>> - The function starts the proper SDK harness, auxiliary services (e.g.
artifact retrieval, side input handling) and wires them together
>> - The function feeds the data to the harness and receives data back.

>> And here is our status of implementation for these parts: basically,
almost everything is either done or in review.

>> Job submission:
>> - General-purpose portable runner in the Python SDK: done; Java SDK: