Re: beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Eugene Kirpichov
Indeed beam-site is at https://gitbox.apache.org/repos/asf/beam-site.git now. However, Mergebot still appears not to be working. https://github.com/apache/beam-site/pull/283 fixes the dead link and it passes the Jenkins precommit test

Re: beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Jean-Baptiste Onofré
Beam site is no longer on git-wip-us; it moved to gitbox, AFAIR. Regards JB On 08/09/2017 10:08 PM, Eugene Kirpichov wrote: Hello, I've been trying to merge a PR https://github.com/apache/beam-site/pull/278 and ran into the following issues: 1) When I do "git fetch --all" on beam-site, I get

Re: https://builds.apache.org/view/Beam/ gives 404

2017-08-09 Thread Ahmet Altay
The new link is https://builds.apache.org/view/A-D/view/Beam/ . Pei has a PR (https://github.com/apache/beam-site/pull/283) to update it. On Wed, Aug 9, 2017 at 1:10 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > I suppose this is not supposed to be the case? I think I saw the page

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Reuven Lax
Yes - I don't think we should try and make any deterministic guarantees about what is in a bundle. Stability guarantees are per element only. On Wed, Aug 9, 2017 at 1:30 PM, Thomas Groh wrote: > +1 to the annotation-on-ProcessElement approach. ProcessElement is the > minimum implementation requi

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Thomas Groh
As I said, a minor concern; we should be explicit in our documentation that it is only the input _elements_ that are deterministic/stable/replayable/etc, and not operational concerns surrounding them (such as bundling). I'd generally avoid making the actual annotation more verbose. On Wed, Aug 9,

Re: beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Kenneth Knowles
>> The second failure seems legit - https://builds.apache.org/view/Beam/ is >> actually 404 right now (I'll send a separate email about this) Filed https://issues.apache.org/jira/browse/BEAM-2757 about this earlier today. Kenn

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Kenneth Knowles
On Wed, Aug 9, 2017 at 1:30 PM, Thomas Groh wrote: > I have a minor concern that this may not work as expected for users that > try to batch remote calls in `FinishBundle` - we should make sure we > document that it is explicitly the input elements that will be replayed, > and bundles and other o

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Thomas Groh
+1 to the annotation-on-ProcessElement approach. ProcessElement is the minimum implementation requirement of a DoFn, and should be where the processing logic that depends on characteristics of the inputs lies. It's a good way of signalling the requirements of the Fn, and letting the runner decide.
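
To illustrate the annotation-on-ProcessElement idea, here is a minimal Python sketch, assuming a hypothetical marker named `requires_stable_input` and toy classes that stand in for a DoFn and a runner. None of these names come from the thread or from the Beam API; the sketch only shows a processing method declaring a requirement and a runner deciding how to meet it.

```python
def requires_stable_input(method):
    """Hypothetical marker: tag a processing method as needing replayable input."""
    method.requires_stable_input = True
    return method


class WriteToExternalSystem:
    """Toy stand-in for a DoFn whose processing has external side effects."""

    @requires_stable_input
    def process_element(self, element):
        return f"wrote {element}"


class PureTransform:
    """Toy stand-in for a DoFn that declares no requirement."""

    def process_element(self, element):
        return element * 2


def plan_for(fn):
    """Toy runner decision: stabilize the input only when the fn asks for it."""
    needs_stable = getattr(fn.process_element, "requires_stable_input", False)
    return "checkpoint-then-process" if needs_stable else "process-directly"
```

Here the requirement lives on the processing method itself, so the toy runner can inspect it without the user inserting any extra step into the pipeline.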

Re: beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Ted Yu
However, the following is accessible: https://github.com/apache/beam-site.git Last commit was 13 days ago. On Wed, Aug 9, 2017 at 1:12 PM, Ted Yu wrote: > For #1, under https://git-wip-us.apache.org/repos/asf , I don't see > beam-site > > FYI > > On Wed, Aug 9, 2017 at 1:08 PM, Eugene Kirpicho

Re: beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Ted Yu
For #1, under https://git-wip-us.apache.org/repos/asf , I don't see beam-site FYI On Wed, Aug 9, 2017 at 1:08 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > Hello, > > I've been trying to merge a PR https://github.com/apache/ > beam-site/pull/278 > and ran into the following issue

https://builds.apache.org/view/Beam/ gives 404

2017-08-09 Thread Eugene Kirpichov
I suppose this is not supposed to be the case? I think I saw the page before, and it's linked from https://beam.apache.org/contribute/testing/ under "Precommit": For precommit testing, Beam uses Jenkins and a code coverage tool called Coveralls

beam-site issues with Jenkins and MergeBot

2017-08-09 Thread Eugene Kirpichov
Hello, I've been trying to merge a PR https://github.com/apache/beam-site/pull/278 and ran into the following issues: 1) When I do "git fetch --all" on beam-site, I get an error "fatal: repository 'https://git-wip-us.apache.org/repos/asf/beam-site.git/' not found". Has the git address of the apac

Re: is it ok to have a discussion without subscribing to the list

2017-08-09 Thread Ted Yu
Derek: You can periodically visit: http://search-hadoop.com/Beam where it is easy to find the thread(s) you're interested in. The latency of indexing is very low. FYI On Wed, Aug 9, 2017 at 1:28 AM, derek wrote: > On Tue, Aug 8, 2017 at 1:23 PM, Jason Kuster > wrote: > > Hi Derek, > > > > I

Re: is it ok to have a discussion without subscribing to the list

2017-08-09 Thread derek
On Tue, Aug 8, 2017 at 1:23 PM, Jason Kuster wrote: > Hi Derek, > > If you aren't subscribed to the list then people have to manually add you > back into the to: line in order for you to receive replies (I always do > anyway). Subscribing (and unsubscribing) to the list is fairly > straightforward

Re: Exactly-once Kafka sink

2017-08-09 Thread Raghu Angadi
Yep, an option to ensure replays see identical input would be pretty useful. It might be challenging on horizontally checkpointing runners like Flink (the only way I see is to buffer all the input in state and replay it after a checkpoint). On Wed, Aug 9, 2017 at 10:21 AM, Reuven Lax wrote: > Please see

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Reuven Lax
I think deterministic here means deterministically replayable, i.e., no matter how many times the element is retried, it will always be the same. I think we should also allow specifying this on processTimer. This would mean that any keyed state written in a previous processElement must be guarantee

Re: Exactly-once Kafka sink

2017-08-09 Thread Reuven Lax
Please see Kenn's proposal. This is a generic thing that is lacking in the Beam model, and only works today for specific runners. We should fix this at the Beam level, but I don't think that should block your PR. On Wed, Aug 9, 2017 at 10:10 AM, Raghu Angadi wrote: > There are quite a few custo

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Kenneth Knowles
I like "Stable" too. I can try to make up other scenarios to try out different vocabulary. Here are a couple: - redundant processing to mitigate stragglers - duplication in the course of optimizations* This expands the scope of the feature to be not just agreement on the PCollection contents b

Re: [VOTE] Release 2.1.0, release candidate #3

2017-08-09 Thread Ahmet Altay
+1, Thank you JB! - I verified the hashes for apache-beam-2.1.0-python.zip, apache-beam-2.1.0-source-release.zip files - Unzipped apache-beam-2.1.0-source-release.zip and ran python packaging and unittests using tox - Ran python wordcount and mobile gaming examples with DirectRunner and DataflowRu
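
As a rough illustration of the hash-verification step, here is a minimal Python sketch. The digest algorithm (SHA-256 by default) and the helper names `file_digest` and `matches_published` are assumptions for illustration; the message does not say which algorithm or tool was used.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex, algorithm="sha256"):
    """Compare a local file's digest with a published hex string."""
    return file_digest(path, algorithm) == published_hex.strip().lower()
```

For a release artifact, `published_hex` would be the value from the checksum file published next to the archive.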

Re: Exactly-once Kafka sink

2017-08-09 Thread Raghu Angadi
There are quite a few customers using KafkaIO with Dataflow. All of them are potential users of exactly-once sink. Dataflow Pubsub sink does not support EOS yet. Even among those customers, I expect the fraction of applications requiring EOS to be pretty small; that's why I don't think extra shuf

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Ben Chambers
I strongly agree with this proposal. I think moving away from "just insert a GroupByKey for one of the 3 different reasons you may want it" towards APIs that allow code to express the requirements they have and the runner to choose the best way to meet this is a major step forwards in terms of port

Re: [PROPOSAL] "Requires deterministic input"

2017-08-09 Thread Kenneth Knowles
This came up again, so I wanted to push it along by proposing a specific API for Java that could have a derived API in Python. I am writing this quickly to get something out there, so I welcome suggestions for revision. Today a DoFn has a @ProcessElement annotated method with various automated par

Re: Exactly-once Kafka sink

2017-08-09 Thread Reuven Lax
I assume this holds for side inputs as well. On Wed, Aug 9, 2017 at 9:41 AM, Kenneth Knowles wrote: > Yea, exactly. > > On Wed, Aug 9, 2017 at 9:40 AM, Reuven Lax > wrote: > > > Oh, I understand now. This DoFn is saying "make my input > deterministically > > replayable." If it turns out the inp

Re: Exactly-once Kafka sink

2017-08-09 Thread Kenneth Knowles
Yea, exactly. On Wed, Aug 9, 2017 at 9:40 AM, Reuven Lax wrote: > Oh, I understand now. This DoFn is saying "make my input deterministically > replayable." If it turns out the input already is deterministically > replayable, then nothing needs to be done. > > > > On Wed, Aug 9, 2017 at 9:10 AM,

Re: Exactly-once Kafka sink

2017-08-09 Thread Reuven Lax
Oh, I understand now. This DoFn is saying "make my input deterministically replayable." If it turns out the input already is deterministically replayable, then nothing needs to be done. On Wed, Aug 9, 2017 at 9:10 AM, Kenneth Knowles wrote: > The term "determinism" refers to a property of the

Re: Exactly-once Kafka sink

2017-08-09 Thread Kenneth Knowles
The term "determinism" refers to a property of the input PCollection, not any transform or DoFn. What we mean by it is that the PCollection has well-defined contents, so any transform consuming it will see consistent PCollection contents across retries. Illustrated, I think we are talking about th
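
The notion of consistent contents across retries can be sketched in plain Python. This is a toy simulation, not Beam code: `unstable_read`, `StableInput`, and `process_with_retries` are hypothetical names, and the snapshot stands in for whatever mechanism a runner might use to make input replayable.

```python
import itertools

_counter = itertools.count()


def unstable_read():
    """Toy source whose contents differ on every read."""
    return [next(_counter) for _ in range(3)]


class StableInput:
    """Materialize the input once, then replay the same contents on each retry."""

    def __init__(self, read):
        self._read = read
        self._snapshot = None

    def read(self):
        if self._snapshot is None:
            self._snapshot = list(self._read())
        return list(self._snapshot)


def process_with_retries(read, attempts):
    """Simulate a transform that is retried and re-reads its input each time."""
    return [read() for _ in range(attempts)]
```

Retrying against `unstable_read` directly yields different contents per attempt, while retrying against `StableInput(unstable_read).read` yields the same list every time, which is the property a side-effecting consumer would rely on.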

Re: Exactly-once Kafka sink

2017-08-09 Thread Reuven Lax
Is determinism the right thing for this? One thing to keep in mind is that most inputs will not be deterministic. If any upstream aggregation is done and allowed_lateness > 0, then that aggregation is non-deterministic (basically, if it is retried it might get a slightly different set of input ele

Re: Exactly-once Kafka sink

2017-08-09 Thread Kenneth Knowles
We've had a few threads related to this. There was one proposal that seemed to achieve consensus [1]. The TL;DR is that we have to assume any DoFn might have side effects (in the broadest sense of the term where anything other than a pure mathematical function is a side effect) and when we want det

Re: Exactly-once Kafka sink

2017-08-09 Thread Aljoscha Krettek
Yes, I think making this explicit would be good. Having a transformation that makes assumptions about how the runner implements certain things is not optimal. Also, I think that most people probably don't use Kafka with the Dataflow Runner (because GCE has Pubsub, but I'm just guessing here). T