Re: [DISCUSS] [DOC] Triggering is for sinks!

2017-11-30 Thread Jean-Baptiste Onofré
Hi Kenn, very interesting idea. It sounds more usable and "logic". Regards JB On 11/30/2017 09:06 PM, Kenneth Knowles wrote: Hi all, Triggers are one of the more novel aspects of Beam's support for unbounded data. They are also one of the most challenging aspects of the model. Ben & I

[PROPOSAL] Beam Go SDK feature branch

2017-11-30 Thread Henning Rohde
Hi everyone, We have been prototyping a Go SDK for Beam for some time and have reached a point, where we think this effort might be of interest to the wider Beam community and would benefit from being developed in a proper feature branch. We have prepared a PR to that end:

Re: Schema-Aware PCollections

2017-11-30 Thread Reuven Lax
Thanks! On Thu, Nov 30, 2017 at 11:25 AM, Holden Karau wrote: > Rocking, I'll start leaving some comments on this. I'm excited to see work > being done in this area as well :) > > On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote: > >> On Wed, Nov

[DISCUSS] [DOC] Triggering is for sinks!

2017-11-30 Thread Kenneth Knowles
Hi all, Triggers are one of the more novel aspects of Beam's support for unbounded data. They are also one of the most challenging aspects of the model. Ben & I have been working on a major new idea for how triggers could work in the Beam model. We think it will make triggers much more usable,

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
2017-11-30 20:36 GMT+01:00 Kenneth Knowles : > On Thu, Nov 30, 2017 at 11:28 AM, Romain Manni-Bucau > wrote: >> >> This is my short term concern yes. Note that the opposite is not sane >> neither (too big) cause it forces eager flushes in all IO (so instead

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
This is my short term concern yes. Note that the opposite is not sane neither (too big) cause it forces eager flushes in all IO (so instead of fixing it once in a single code location you impact everyone N times). However it is not blocking as the small bundle size issue. Next concenr is commit

Re: Schema-Aware PCollections

2017-11-30 Thread Holden Karau
Rocking, I'll start leaving some comments on this. I'm excited to see work being done in this area as well :) On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote: > On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote: > >> There has been a lot of conversation

Re: makes bundle concept usable?

2017-11-30 Thread Eugene Kirpichov
I mean: if these runners have some limitation that forces them into only supporting tiny bundles, there's a good chance that this limitation will also apply to whatever beam model API you propose as a fix, and they won't be able to implement it. On Thu, Nov 30, 2017, 11:19 AM Eugene Kirpichov

Re: makes bundle concept usable?

2017-11-30 Thread Eugene Kirpichov
So is your main concern potential poor performance on runners that choose to use a very small bundle size? (Currently an IO can trivially handle too large bundles, simply by flushing when enough data accumulates, which is what all IOs do - but indeed working around having unreasonably small

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
Le 30 nov. 2017 19:23, "Kenneth Knowles" a écrit : On Thu, Nov 30, 2017 at 10:03 AM, Romain Manni-Bucau wrote: > Hmm, > > ESIO: https://github.com/apache/beam/blob/master/sdks/java/io/elas > ticsearch/src/main/java/org/apache/beam/sdk/io/elasticsear >

Re: makes bundle concept usable?

2017-11-30 Thread Kenneth Knowles
On Thu, Nov 30, 2017 at 10:03 AM, Romain Manni-Bucau wrote: > Hmm, > > ESIO: https://github.com/apache/beam/blob/master/sdks/java/io/ > elasticsearch/src/main/java/org/apache/beam/sdk/io/ > elasticsearch/ElasticsearchIO.java#L847 > JDBCIO:

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
Hmm, ESIO: https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L847 JDBCIO: https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L592 MongoIO:

Re: makes bundle concept usable?

2017-11-30 Thread Jean-Baptiste Onofré
Agree, but maybe we can inform the runner if wanted no ? Honestly, from my side, I'm fine with the current situation as it's runner specific. Regards JB On 11/30/2017 06:12 PM, Reuven Lax wrote: I don't think it belongs in PIpelineOptions, as bundle size is always a runner thing. We could

Re: makes bundle concept usable?

2017-11-30 Thread Ben Chambers
I think both concepts likely need to co-exist: As described in the execution model [1] bundling is a runner-specific choice about how to execute a pipeline. This affects how frequently it may need to checkpoint during process, how much communication overhead there is between workers, the scope of

Re: makes bundle concept usable?

2017-11-30 Thread Kenneth Knowles
Bundles: they are not for user-controlled batching; they are for runner-controlled amortization across elements. Trying to use them in another way is a misunderstanding of the model. IOs: It is perfectly fine for an IO to use bundle boundaries. But that is just one tool that an IO can use to

Re: makes bundle concept usable?

2017-11-30 Thread Eugene Kirpichov
"First immediately blocking issue is how to batch records reliably and *portably* (the biggest beam added-value IMHO). Since bundles are "flush" trigger for most IO it means ensuring the bundle size is somehow controllable or at least not set to a very small value OOTB." Please cite an existing

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
@Ben: would all IO be rewritten to use that and the bundle concept dropped from the API to avoid any ambiguity and misleading usage like in current IOs? Romain Manni-Bucau @rmannibucau | Blog | Old Blog | Github | LinkedIn 2017-11-30 18:43 GMT+01:00 Ben Chambers : > Beam

Re: makes bundle concept usable?

2017-11-30 Thread Ben Chambers
Beam includes a GroupIntoBatches transform (see https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java) which I believe was intended to be used as part of such a portable IO. It can be used to request that elements are divided

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
2017-11-30 18:11 GMT+01:00 Eugene Kirpichov : > Very strong -1 from me: > - Having a pipeline-global parameter is bad because it will apply to all > transforms, with no ability to control it for individual transforms. This > can go especially poorly because it means that when

Re: Schema-Aware PCollections

2017-11-30 Thread Tyler Akidau
On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote: > There has been a lot of conversation about schemas on PCollections > recently. There are a number of reasons for this. Schemas as first-class > objects in Beam provide a nice base for building BeamSQL. Spark has > provided

Re: makes bundle concept usable?

2017-11-30 Thread Reuven Lax
I don't think it belongs in PIpelineOptions, as bundle size is always a runner thing. We could consider adding a new generic RunnerOptions, however I'm not convinced all runners can actually support this. On Thu, Nov 30, 2017 at 6:09 AM, Romain Manni-Bucau wrote: > Guys,

Re: makes bundle concept usable?

2017-11-30 Thread Eugene Kirpichov
Very strong -1 from me: - Having a pipeline-global parameter is bad because it will apply to all transforms, with no ability to control it for individual transforms. This can go especially poorly because it means that when I write a transform, I don't know whether a user will set this parameter in

Re: makes bundle concept usable?

2017-11-30 Thread Jean-Baptiste Onofré
It sounds reasonable to me. And agree for Spark, I would like to merge Spark 2 update first. Regards JB On 11/30/2017 03:09 PM, Romain Manni-Bucau wrote: Guys, what about moving getMaxBundleSize from flink options to pipeline options. I think all runners can support it right? Spark code

Re: makes bundle concept usable?

2017-11-30 Thread Romain Manni-Bucau
Guys, what about moving getMaxBundleSize from flink options to pipeline options. I think all runners can support it right? Spark code needs the merge of the v2 before being able to be implemented probably but I don't see any blocker. wdyt? Romain Manni-Bucau @rmannibucau | Blog | Old Blog |