Hi Kenn,
Very interesting idea. It sounds more usable and "logical".
Regards
JB
On 11/30/2017 09:06 PM, Kenneth Knowles wrote:
Hi all,
Triggers are one of the more novel aspects of Beam's support for unbounded data.
They are also one of the most challenging aspects of the model.
Ben & I
Hi everyone,
We have been prototyping a Go SDK for Beam for some time and have reached
a point where we think this effort might be of interest to the wider Beam
community and would benefit from being developed in a proper feature
branch. We have prepared a PR to that end:
Thanks!
On Thu, Nov 30, 2017 at 11:25 AM, Holden Karau wrote:
> Rocking, I'll start leaving some comments on this. I'm excited to see work
> being done in this area as well :)
>
> On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote:
>
>> On Wed, Nov
Hi all,
Triggers are one of the more novel aspects of Beam's support for unbounded
data. They are also one of the most challenging aspects of the model.
Ben & I have been working on a major new idea for how triggers could work
in the Beam model. We think it will make triggers much more usable,
2017-11-30 20:36 GMT+01:00 Kenneth Knowles:
> On Thu, Nov 30, 2017 at 11:28 AM, Romain Manni-Bucau
> wrote:
>>
>> This is my short term concern, yes. Note that the opposite (too big) is
>> not sane either, because it forces eager flushes in all IOs (so instead
This is my short term concern, yes. Note that the opposite (too big) is
not sane either, because it forces eager flushes in all IOs (so instead
of fixing it once in a single code location you impact everyone N
times). However it is not as blocking as the small bundle size issue.
Next concern is commit
Rocking, I'll start leaving some comments on this. I'm excited to see work
being done in this area as well :)
On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau wrote:
> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote:
>
>> There has been a lot of conversation
I mean: if these runners have some limitation that forces them into only
supporting tiny bundles, there's a good chance that this limitation will
also apply to whatever Beam model API you propose as a fix, and they won't
be able to implement it.
On Thu, Nov 30, 2017, 11:19 AM Eugene Kirpichov
So is your main concern potential poor performance on runners that choose
to use a very small bundle size? (Currently an IO can trivially handle too
large bundles, simply by flushing when enough data accumulates, which is
what all IOs do - but indeed working around having unreasonably small
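The flush-when-enough-accumulates pattern Eugene describes can be sketched in plain Java. This is a hypothetical `BatchingWriter`, not the actual IO code, but it mirrors what the Elasticsearch and JDBC writers do: buffer elements and flush whenever a maximum batch size is reached, so an arbitrarily large bundle never produces an oversized request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the size-capped flush pattern used by IO writers:
// buffer writes, flush on a size threshold, flush again at bundle end.
public class BatchingWriter<T> {
  private final int maxBatchSize;
  private final Consumer<List<T>> flushFn; // e.g. a bulk-request sender
  private final List<T> buffer = new ArrayList<>();

  public BatchingWriter(int maxBatchSize, Consumer<List<T>> flushFn) {
    this.maxBatchSize = maxBatchSize;
    this.flushFn = flushFn;
  }

  public void write(T element) {
    buffer.add(element);
    if (buffer.size() >= maxBatchSize) {
      flush();
    }
  }

  // Called when the threshold is hit, and once more at the end of a bundle.
  public void flush() {
    if (!buffer.isEmpty()) {
      flushFn.accept(new ArrayList<>(buffer));
      buffer.clear();
    }
  }
}
```

Note this handles only the "bundle too large" direction; a tiny bundle still forces a flush of a near-empty buffer at every bundle boundary, which is the concern raised in the thread.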
On 30 Nov 2017 at 19:23, "Kenneth Knowles" wrote:
On Thu, Nov 30, 2017 at 10:03 AM, Romain Manni-Bucau
wrote:
> Hmm,
>
> ESIO: https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsear
>
On Thu, Nov 30, 2017 at 10:03 AM, Romain Manni-Bucau
wrote:
> Hmm,
>
> ESIO: https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L847
> JDBCIO:
Hmm,
ESIO:
https://github.com/apache/beam/blob/master/sdks/java/io/elasticsearch/src/main/java/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.java#L847
JDBCIO:
https://github.com/apache/beam/blob/master/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java#L592
MongoIO:
Agreed, but maybe we can inform the runner if wanted, no?
Honestly, from my side, I'm fine with the current situation as it's runner
specific.
Regards
JB
On 11/30/2017 06:12 PM, Reuven Lax wrote:
I don't think it belongs in PipelineOptions, as bundle size is always a runner
thing.
We could
I think both concepts likely need to co-exist:
As described in the execution model [1], bundling is a runner-specific
choice about how to execute a pipeline. This affects how frequently it may
need to checkpoint during processing, how much communication overhead there is
between workers, the scope of
Bundles: they are not for user-controlled batching; they are for
runner-controlled amortization across elements. Trying to use them in
another way is a misunderstanding of the model.
IOs: It is perfectly fine for an IO to use bundle boundaries. But that is
just one tool that an IO can use to
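The "amortization across elements" role of bundles can be sketched in plain Java. This is not the Beam `DoFn` API itself, just a hypothetical class whose methods stand in for the `@StartBundle` / `@ProcessElement` / `@FinishBundle` lifecycle: per-bundle setup and teardown let an IO reuse one buffer (or connection) for every element in the bundle, flushing only at the boundary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the DoFn bundle lifecycle (not the Beam API):
// work is amortized across the bundle's elements and flushed at its end.
public class BundleLifecycleSketch {
  private final List<String> buffer = new ArrayList<>();
  private final List<List<String>> sent = new ArrayList<>();

  public void startBundle() {            // ~ @StartBundle: reset per-bundle state
    buffer.clear();
  }

  public void processElement(String e) { // ~ @ProcessElement: cheap per-element work
    buffer.add(e);
  }

  public void finishBundle() {           // ~ @FinishBundle: flush what accumulated
    if (!buffer.isEmpty()) {
      sent.add(new ArrayList<>(buffer));
      buffer.clear();
    }
  }

  public List<List<String>> sent() {
    return sent;
  }
}
```

The sketch also makes the thread's complaint concrete: with one-element bundles, `finishBundle` fires after every element and the amortization disappears.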
"First immediately blocking issue is how to batch records reliably and
*portably* (the biggest beam added-value IMHO).
Since bundles are the "flush" trigger for most IOs, it means ensuring the
bundle size is somehow controllable, or at least not set to a very
small value OOTB."
Please cite an existing
@Ben: would all IOs be rewritten to use that, and the bundle concept
dropped from the API to avoid any ambiguity and misleading usage like
in current IOs?
Romain Manni-Bucau
2017-11-30 18:43 GMT+01:00 Ben Chambers:
> Beam
Beam includes a GroupIntoBatches transform (see
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java)
which I believe was intended to be used as part of such a portable IO. It
can be used to request that elements are divided
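As a semantic sketch only (plain Java, not the Beam transform itself), `GroupIntoBatches` over a single key's values behaves roughly like partitioning them into lists of at most the requested size; the real transform is windowed, stateful, and timer-driven.

```java
import java.util.ArrayList;
import java.util.List;

public class GroupIntoBatchesSketch {
  // Rough per-key contract of GroupIntoBatches: emit batches of at most
  // batchSize elements. This ignores windowing, state, and timers, which
  // the actual transform uses to make batching portable across runners.
  public static <T> List<List<T>> batch(List<T> values, int batchSize) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < values.size(); i += batchSize) {
      int end = Math.min(i + batchSize, values.size());
      batches.add(new ArrayList<>(values.subList(i, end)));
    }
    return batches;
  }
}
```

Because the batch size here is a transform parameter rather than a bundle size, it is user-controlled and independent of how the runner chooses to bundle.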
2017-11-30 18:11 GMT+01:00 Eugene Kirpichov:
> Very strong -1 from me:
> - Having a pipeline-global parameter is bad because it will apply to all
> transforms, with no ability to control it for individual transforms. This
> can go especially poorly because it means that when
On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax wrote:
> There has been a lot of conversation about schemas on PCollections
> recently. There are a number of reasons for this. Schemas as first-class
> objects in Beam provide a nice base for building BeamSQL. Spark has
> provided
I don't think it belongs in PipelineOptions, as bundle size is always a
runner thing.
We could consider adding a new generic RunnerOptions, however I'm not
convinced all runners can actually support this.
On Thu, Nov 30, 2017 at 6:09 AM, Romain Manni-Bucau
wrote:
> Guys,
Very strong -1 from me:
- Having a pipeline-global parameter is bad because it will apply to all
transforms, with no ability to control it for individual transforms. This
can go especially poorly because it means that when I write a transform, I
don't know whether a user will set this parameter in
It sounds reasonable to me.
And agreed for Spark; I would like to merge the Spark 2 update first.
Regards
JB
On 11/30/2017 03:09 PM, Romain Manni-Bucau wrote:
Guys,
what about moving getMaxBundleSize from the Flink options to the pipeline
options? I think all runners can support it, right? Spark code
Guys,
what about moving getMaxBundleSize from the Flink options to the pipeline
options? I think all runners can support it, right? The Spark code probably
needs the v2 merge before this can be implemented, but I
don't see any blocker.
WDYT?
Romain Manni-Bucau