Re: Introducing a Redistribute transform

2016-10-10 Thread Eugene Kirpichov
Hi Amit, The transform, the way it's implemented, actually does several things at the same time and that's why it's tricky to document it. Redistribute.arbitrarily(): - Introduces a fusion barrier (in runners that have it), making sure that the runner can fully parallelize processing the output P

Re: Introducing a Redistribute transform

2016-10-10 Thread Robert Bradshaw
On Mon, Oct 10, 2016 at 12:57 PM, Amit Sela wrote: >> > So this is basically a "FanOut" transformation which will depend on the >> >> > available resources of the runner (and the uniqueness of the assigned >> keys) >> >> > ? >> >> > >> >> > Would we want to Redistribute into a user-defined number

Re: Introducing a Redistribute transform

2016-10-10 Thread Amit Sela
On Mon, Oct 10, 2016 at 9:21 PM Robert Bradshaw wrote: > On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela wrote: > > > Hi Eugene, > > > > > > This is very interesting. > > > Let me see if I get this right, the "Redistribute" transformation > assigns > > > a "running id" key (per-bundle) , calls "Redis

Re: Jenkins build is still unstable: beam_PostCommit_MavenVerify » Apache Beam :: Examples :: Java #1489

2016-10-10 Thread Pei He
Looking at the broken tests. On Mon, Oct 10, 2016 at 10:05 AM, Apache Jenkins Server < jenk...@builds.apache.org> wrote: > See MavenVerify/org.apache.beam$beam-examples-java/1489/> > >

Re: Introducing a Redistribute transform

2016-10-10 Thread Robert Bradshaw
On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela wrote: > Hi Eugene, > > This is very interesting. > Let me see if I get this right, the "Redistribute" transformation assigns > a "running id" key (per-bundle) , calls "Redistribute.byKey", and extracts > back the values, correct ? The keys are (pseudora

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
That's a good question Robert, and I did. First of all, an UnboundedSource is split into splits that implement a sort of "BoundedReadFromUnboundedSource", with Restrictions on time and (optional) number of records - this seems to fit nicely into the *SDF* language. Taking a look at the diagram i

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-10 Thread Jean-Baptiste Onofré
Thanks for the update Frances. I will ping my infra contact to move forward quickly. Regards JB On 10/10/2016 07:27 PM, Frances Perry wrote: Related to #3-5: Also, as we discussed earlier [1], there will be an additional level of tracking in jira for deeper proposal-style conversations to help

Re: [PROPOSAL] Introduce review mailing list and provide update on open discussion

2016-10-10 Thread Frances Perry
Related to #3-5: Also, as we discussed earlier [1], there will be an additional level of tracking in jira for deeper proposal-style conversations to help us keep track of which ones are still under discussion on the dev@ list (which, as usual, remains the source of truth). The details are still in

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Robert Bradshaw
Just looking to the future, have you given any thought on how well this would work on https://s.apache.org/splittable-do-fn? On Mon, Oct 10, 2016 at 6:35 AM, Amit Sela wrote: > Thanks Max! > > I'll try to explain Spark's stateful operators and how/why I used them with > UnboundedSource. > > Spark

Jenkins build is still unstable: beam_Release_NightlySnapshot #195

2016-10-10 Thread Apache Jenkins Server
See

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
Thanks Max! I'll try to explain Spark's stateful operators and how/why I used them with UnboundedSource. Spark has two stateful operators: *updateStateByKey* and *mapWithState*. Since updateStateByKey is bound to output the (updated) state itself - the CheckpointMark in our case - we're left with

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Jean-Baptiste Onofré
Hi Max, thanks for the explanation and it makes lot of sense. Not sure it will be so simple to store a previous state from one micro-batch to another. Let me take a look with Amit. Regards JB On 10/10/2016 03:02 PM, Maximilian Michels wrote: Just to add a comment from the Flink side and its

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Maximilian Michels
Just to add a comment from the Flink side and its UnboundedSourceWrapper. We experienced the only way to guarantee deterministic splitting of the source, was to generate the splits upon creation of the source and then checkpoint the assignment during runtime. When restoring from a checkpoint, the s

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Jean-Baptiste Onofré
Hi Amit, thanks for the explanation. For 4, you are right, it's slightly different from DataXchange (related to the elements in the PCollection). I think storing the "starting point" for a reader makes sense. Regards JB On 10/10/2016 10:33 AM, Amit Sela wrote: Inline, thanks JB! On Mon, O

Re: [DISCUSS] UnboundedSource and the KafkaIO.

2016-10-10 Thread Amit Sela
Inline, thanks JB! On Mon, Oct 10, 2016 at 9:01 AM Jean-Baptiste Onofré wrote: > Hi Amit, > > > > For 1., the runner is responsible of the checkpoint storage (associated > > with the source). It's the way for the runner to retry and know the > > failed bundles. > True, this was a recap/summary o