the Generative AI world at the
>> "Generative AI Meetup" Wednesday afternoon - if the doc Ben linked to (or
>> GenAI) is interesting to you and you'll be at the conference I'd love to
>> touch base in person!
>>
>> -Ryan
>>
>> On
Hello Beam!
Kaskada has created a query language for expressing temporal queries,
making it easy to work with multiple streams and perform temporally
correct joins. We’re looking at taking our native, columnar execution
engine and making it available as a PTransform and FnHarness for use
with Apac
On Wed, Oct 3, 2018 at 12:16 PM Jean-Baptiste Onofré
wrote:
> Hi Anton,
>
> jackson is the json extension as we have XML. Agree that it should be
> documented.
>
> Agree about join-library.
>
> sketching is some statistic extensions providing ready to use stats
> CombineFn.
>
> Regards
> JB
>
> O
I think there is a confusion of terminology here. Let me attempt to clarify
as a (mostly) outsider.
I think that Etienne is correct, in that the SDK harness only reports the
difference associated with a bundle. So, if we have a metric that measures
execution time, the SDK harness reports the time
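The per-bundle reporting described above can be sketched in plain Java (this is an illustration of the idea, not actual SDK-harness code): a counter that remembers what it last reported and hands back only the difference.

```java
/**
 * Sketch (not Beam code): a counter whose report() returns only the
 * delta accumulated since the previous report, mirroring an SDK
 * harness that reports per-bundle differences rather than totals.
 */
class DeltaCounter {
    private long total;
    private long lastReported;

    void add(long value) { total += value; }

    /** Returns the change since the last call to report(). */
    long report() {
        long delta = total - lastReported;
        lastReported = total;
        return delta;
    }

    long total() { return total; }
}
```

The runner would then sum the reported deltas to recover the cumulative value.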
I think Kenn's second option accurately reflects my memory of the original
intentions:
1. I remember we considered either using the Future interface or calling
the ReadableState interface a future, and explicitly said "no, future
implies asynchrony and that the value returned by `get` won't cha
+1 to introducing alternative transforms even if they wrap Reshuffle
The benefit of making them distinct is that we can put appropriate Javadoc
in place and runners can figure out what the user is intending and whether
Reshuffle or some other implementation is appropriate. We can also see
which o
It looks like the problem is that your project uses beam-examples-parent as
its parent. This was the suggested way of creating a Maven project because
we published a POM and this provided an easy way to make sure that derived
projects inherited the same options and versions as Beam used.
With the
That sounds like a very reasonable choice -- given the discussion seemed to
be focusing on the differences between these two categories, separating
them will allow the proposal (and implementation) to address each category
in the best way possible without needing to make compromises.
Looking forwa
On 12 Apr 2018 at 08:06, "Robert Bradshaw" wrote:
>
>> By "type" of metric, I mean both the data types (including their
>> encoding) and accumulator strategy. So sumint would be a type, as would
>> double-distribution.
>>
>> On Wed, Apr 11, 2018 a
> feature
> that I was proposing we kick down the road (and may never get to). What I'm
> suggesting is that we support custom metrics of standard type.
>
> On Wed, Apr 11, 2018 at 5:52 PM Ben Chambers wrote:
>
>> The metric api is designed to prevent user defined metric
The metric api is designed to prevent user defined metric types based on
the fact that they just weren't used enough to justify support.
Is there a reason we are bringing that complexity back? Shouldn't we just
need the ability for the standard set plus any special system metrics?
On Wed, Apr 11, 2018
I believe it doesn't need to be stable across refactoring, only across all
workers executing a specific version of the code. Specifically, it is used
as follows:
1. Create a pipeline on the user's machine. It walks the stack until the
static initializer block, which provides an ID.
2. Send the pip
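The stack-walking step mentioned above can be sketched in plain Java (an assumption about the mechanism, not the actual Beam implementation; `fromStack` is an illustrative name): derive an ID from the calling frame, which is stable across workers running the same build but not across refactorings that move the call site.

```java
/**
 * Sketch: derive an identifier from the caller's stack frame.
 * Stable for all workers executing the same version of the code,
 * but changes if the call site is refactored to a different
 * class or method.
 */
class CallSiteId {
    static String fromStack() {
        // Index 0 is getStackTrace itself, 1 is this method,
        // 2 is our caller, which supplies the ID.
        StackTraceElement caller = Thread.currentThread().getStackTrace()[2];
        return caller.getClassName() + "#" + caller.getMethodName();
    }
}
```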
tric? What
> parameter combos are supported?
> 2. Should the SDK support different Metric variables contributing to the
> same Metric value? If so, how does the SDK communicate that?
> 3. Should runtime-only scopes be exposed to the user? Useful for
> Distributions?
> 4. What sho
Beam should support some sort of label / tag / worker-id for
>> aggregation of Gauges (maybe other metrics?)
>>
>> -P.
>>
>> On Fri, Apr 6, 2018 at 11:21 AM Ben Chambers
>> wrote:
>>
>>> Gauges are incredibly useful for exposing the current st
both are the same to the user.
>>>>
>>>> On Fri, Apr 6, 2018 at 9:31 AM Pablo Estrada
>>>> wrote:
>>>>
>>>>> Hi Ben : D
>>>>>
>>>>> Sure, that's reasonable. And perhaps I started the discussion in the
See for instance how gauge metrics are handled in Prometheus, Datadog and
Stackdriver monitoring. Gauges are perfect for use in distributed systems,
they just need to be properly labeled. Perhaps we should apply a default
tag or allow users to specify one.
On Fri, Apr 6, 2018, 9:14 AM Ben
Some metrics backends label the value, for instance with the worker that
sent it. Then the aggregation is latest-per-label. This makes it useful for
holding values such as "memory usage" that need to reflect the current value.
On Fri, Apr 6, 2018, 9:00 AM Scott Wegner wrote:
> +1 on the proposal to supp
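The latest-per-label aggregation described above can be sketched in plain Java (a conceptual illustration, not a Beam or Prometheus API): each label, such as a worker id, keeps its own most recent reading, and cross-label aggregations are computed over those current values.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch: a gauge aggregated in a distributed setting by keeping
 * the latest value per label (e.g., per worker id), rather than
 * collapsing all workers into a single last-write-wins value.
 */
class LabeledGauge {
    private final Map<String, Long> latestByLabel = new HashMap<>();

    /** Record a reading; the latest value wins per label. */
    void set(String label, long value) { latestByLabel.put(label, value); }

    long get(String label) { return latestByLabel.get(label); }

    /** Example cross-label aggregation: total current memory usage. */
    long sumAcrossLabels() {
        return latestByLabel.values().stream().mapToLong(Long::longValue).sum();
    }
}
```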
e need a very custom
> thing and we are done for me.
>
> On 13 Mar 2018 at 19:26, "Ben Chambers" wrote:
>
>> I think the existing rationale (not introducing lots of special fluent
>> methods) makes sense. However, if we look at the Java Stream API, we
>> p
I think the existing rationale (not introducing lots of special fluent
methods) makes sense. However, if we look at the Java Stream API, we
probably wouldn't need to introduce *a lot* of fluent builders to get most
of the functionality. Specifically, if we focus on map, flatMap, and
collect from th
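The point about the Java Stream API can be made concrete with a small stdlib-only example: map, flatMap, and collect cover most transformation needs without a large surface of special-purpose fluent methods.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Most pipelines can be expressed with just these three operations.
class StreamExample {
    static List<Integer> wordLengths(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split(" ")))  // split into words
            .map(String::length)                              // word -> length
            .collect(Collectors.toList());                    // materialize
    }
}
```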
s.
>
> On 18 Feb 2018 at 20:07, "Reuven Lax" wrote:
>
>>
>>
>> On Sun, Feb 18, 2018 at 10:50 AM, Romain Manni-Bucau <
>> rmannibu...@gmail.com> wrote:
>>
>>>
>>>
>>> On 18 Feb 2018 at 19:28, "Ben Chambers" wrote
It feels like this thread may be a bit off-track. Rather than focusing on
the semantics of the existing methods -- which have been noted to meet
many existing use cases -- it would be helpful to focus more on the
reason you are looking for something with different semantics.
Some possibilitie
It sounds like in your specific case you're saying that the same encoding
can be viewed by the Java type system two different ways. For instance, if
you have an object Person that is convertible to JSON using Jackson, then
that JSON encoding can be viewed as either a Person or a Map looking at the
should focus on is making
>>>> simple things simple. Beam is very powerful, but it doesn't always
>>>> make easy things easy. Features like schema'd PCollections could go a
>>>> long way here. Also fully fleshing out/smoothing our runner
>>>> p
ity story is part of this too.
>>>
>>> For beam 3.x we could also reason about if there's any complexity that
>>> doesn't hold its weight (e.g. side inputs on CombineFns).
>>>
>>> On Mon, Jan 22, 2018 at 9:20 PM, Jean-Baptiste Onofré
>>>
+1. Release notes from jira may be hard to digest for people outside the
project. A short summary may be really helpful for those considering
Beam or loosely following the project to get an understanding of what is
happening.
It also presents a good opportunity to share progress externally, which
Thanks Davor for starting the state of the project discussions [1].
In this fork of the state of the project discussion, I’d like to start the
discussion of the feature roadmap for 2018 (and beyond).
To kick off the discussion, I think the features could be divided into
several areas, as follows:
en its last pane has fired. I could see this be a property on the View
> transform itself. In terms of implementation - I tried to figure out how
> side input readiness is determined, in the direct runner and Dataflow
> runner, and I'm completely lost and would appreciate some help.
This would be absolutely great! It seems somewhat similar to the changes
that were made to the BigQuery sink to support WriteResult (
https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.java
).
I find it helpfu
log | Github | LinkedIn
>
>
> 2017-11-30 18:43 GMT+01:00 Ben Chambers :
> > Beam includes a GroupIntoBatches transform (see
> >
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java
> )
> > which I be
Beam includes a GroupIntoBatches transform (see
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java)
which I believe was intended to be used as part of such a portable IO. It
can be used to request that elements are divided in
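The buffering at the heart of such a batching transform can be sketched in plain Java (this is the core logic only, not the Beam `GroupIntoBatches` class, which additionally operates per key and window and uses state and timers):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch: buffer elements and emit them in batches of at most
 * batchSize, flushing any partial batch at the end.
 */
class Batcher<T> {
    private final int batchSize;
    private final List<T> buffer = new ArrayList<>();
    private final List<List<T>> batches = new ArrayList<>();

    Batcher(int batchSize) { this.batchSize = batchSize; }

    void add(T element) {
        buffer.add(element);
        if (buffer.size() >= batchSize) {
            batches.add(new ArrayList<>(buffer));  // emit a full batch
            buffer.clear();
        }
    }

    /** Flush any partial batch, e.g. at end of bundle/window. */
    List<List<T>> finish() {
        if (!buffer.isEmpty()) {
            batches.add(new ArrayList<>(buffer));
            buffer.clear();
        }
        return batches;
    }
}
```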
One risk to "squash and merge" is that it may lead to commits that don't
have clean descriptions -- for instance, commits like "Fixing review
comments" will show up. If we use (a) these would also show up as separate
commits. It seems like there are two cases of multiple commits in a PR:
1. Multip
Strong +1 to both increasing the frequency of minor releases and also
putting together a road map for the next major release or two.
I think it would be great to communicate to the community the direction
Beam is taking in the future -- what things will users be able to do with
3.0 or 4.0 that the
I think discussing a runner agnostic way of configuring how metrics are
extracted is a great idea -- thanks for bringing it up Etienne!
Using a thread that polls the pipeline result relies on the program that
created and submitted the pipeline continuing to run (e.g., no machine
faults, network pro
This seems to be the second thread entitled "[VOTE] Release 2.2.0, release
candidate #2". The subject and description refer to release candidate #2,
however the artifacts mention v2.2.0-RC3. Which release candidate is this
vote thread for?
On Wed, Nov 8, 2017 at 12:52 PM Jean-Baptiste Onofré
wrot
I think both Gradle and Bazel are worth exploring. Gradle is definitely
more common in the wild, but Bazel may be a better fit for the large
mixture of languages being developed in one codebase within Beam. It might
be helpful for us to list what functionality we want from such a tool, and
then hav
I believe license errors are detected by Apache RAT, which are part of the
release profile. I believe you need to run "mvn clean verify -Prelease" to
enable anything associated with that profile.
On Mon, Oct 23, 2017 at 11:46 AM Valentyn Tymofieiev
wrote:
> Hi Beam-Dev,
>
> It's been >5 days sin
It looks like ReadableFile#open does currently decompress the stream, but
it seems like we could add a ReadableFile#openRaw(...) or something like
that which didn't implicitly decompress. Then libraries such as Tika which
want the *actual* file content could use that method. Would that address
your
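The distinction being proposed can be illustrated with stdlib Java (method names here are illustrative, not Beam's actual API): one accessor that implicitly decompresses, like `ReadableFile#open`, and one that returns the stored bytes untouched, like the suggested `openRaw`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch of "open" (decompressing) vs "openRaw" (actual bytes). */
class RawVsDecompressed {
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Like ReadableFile#open: implicitly decompresses. */
    static byte[] open(byte[] stored) {
        try (InputStream in =
                new GZIPInputStream(new ByteArrayInputStream(stored))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Like the proposed openRaw: the bytes exactly as stored. */
    static byte[] openRaw(byte[] stored) {
        return stored;
    }
}
```

A library like Tika, which wants to detect and handle compression itself, would use the raw accessor.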
PCollection that is then processed in some other way.
On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B.
wrote:
> Do tell...
>
> Interesting. Any pointers?
>
> -Original Message-----
> From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> Sent: Friday, September 2
Regarding specifically elements that are failing -- I believe some other IO
has used the concept of a "Dead Letter" side-output, where documents that
failed to process are side-output so the user can handle them appropriately.
On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
wrote:
> Hi Tim,
>
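The dead-letter pattern described above can be sketched in plain Java (in Beam proper this would be a ParDo with an additional `TupleTag` output; the class and field names here are illustrative): failing elements are routed to a side collection instead of failing the whole bundle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch: route elements that fail to parse to a dead-letter list. */
class DeadLetterDemo {
    final List<Integer> output = new ArrayList<>();
    final List<String> deadLetters = new ArrayList<>();

    void process(List<String> input, Function<String, Integer> parse) {
        for (String element : input) {
            try {
                output.add(parse.apply(element));
            } catch (RuntimeException e) {
                deadLetters.add(element);  // user handles these downstream
            }
        }
    }
}
```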
Any elaboration or jira issues describing what is broken? Any proposal for
what changes need to happen to fix it?
On Tue, Sep 19, 2017, 5:49 PM Chamikara Jayalath
wrote:
> +1 for cutting 2.1.1 for Python SDK only.
>
> Thanks,
> Cham
>
> On Tue, Sep 19, 2017 at 5:43 PM Robert Bradshaw
>
> wrote:
One possible issue with this is that updating a thread local is likely to
be much more expensive than passing an additional argument. Also, not all
code called from within the DoFn will necessarily be in the same thread
(eg., sometimes we create a pool of threads for doing work). It may be
*more* c
If the user classes are not Serializable, how does adding Serializable to
the WindowedValue help? The user class, which is stored in a field in the
WindowedValue will still be non-serializable, and thus cause problems.
On Thu, Aug 24, 2017 at 11:59 AM Kobi Salant wrote:
> right, if it is not kry
I think as Luke pointed out, one problem with Serializable is that it isn't
guaranteed to be stable across different JVM versions.
The only drawback I see here from the user perspective is that if I have
already written a custom coder to handle my type in some reasonable way,
this is going to igno
d make sure
> we
> > > > >> document that it is explicitly the input elements that will be
> > > replayed,
> > > > >> and bundles and other operational are still arbitrary.
> > > > >>
> > > > >>
> > > >
an annotation or change
> a type to indicate that the input PCollection should be
> well-defined/deterministically-replayable. I don't have a really strong
> opinion.
>
> Kenn
>
> On Tue, Mar 21, 2017 at 4:53 PM, Ben Chambers >
> wrote:
>
> > Allowing an
I think with GraphViz you can put a label on the "subgraph" that represents
each composite. This would allow each leaf transform to be just the name of
that leaf within the composite, rather than full name. Is that what you
were suggesting Robert?
On Thu, Aug 3, 2017 at 10:06 PM Robert Bradshaw
w
I also reported something similar to this as
https://issues.apache.org/jira/browse/BEAM-2577. That issue was reported
because we don't have any tests that use a runner and attempt to pass
ValueProviders in. This means that we've found bugs such as
NestedValueProviders used with non-serializable ano
Regarding changing the coder -- keep in mind that there may be persisted
state somewhere, so we can't just change the coder once this is used.
If the processing of scanning for modified and new files reported the
last-modified-time, could we use that and have the SDF report KV with the last-modifi
I think the distinction between reporting metrics (the API we give
users to create Metrics) and getting metrics out (currently either
using the PipelineResult to query results or connecting to an external
metric store) is important.
There is an additional distinction to be made in data pr
There hasn't been a need for user defined triggers and we've found it is
really hard to create triggers with proper behavior.
Can you elaborate on why you are trying to use triggers to understand
watermarks in this way? It's not clear how this trigger would be useful to
that understanding.
On Wed
n not express it.
>
> Best,JingsongLee
> --
> From:Ben Chambers
> Time:2017 Jun 2 (Fri) 21:46
> To:dev ; JingsongLee
> Cc:Aviem Zur ; Ben Chambers
>
> Subject:Re: [DISCUSS] Source Watermark Metrics
> I think havi
2. Jun 2017, at 03:52, JingsongLee wrote:
> >
> > @Aviem Zur @Ben Chambers What do you think about the value of
> METRIC_MAX_SPLITS?
> >
> >
> ------ From: JingsongLee
> Time: 2017 May 11 (Thu) 16:37
> To: dev@beam.apache.
Exposing the CombineFn is necessary for use with composed combine or
combining value state. There may be other reasons we made this visible, but
these continue to justify it.
On Sun, May 14, 2017, 1:00 PM Reuven Lax wrote:
> I believe the reason why this is called Top.largest, is that originally
Correction: autovalue coder.
On Wed, Apr 5, 2017, 2:24 PM Ben Chambers wrote:
> Serializable coder had a separate set of issues - often larger and less
> efficient. Ideally, we would have an avrocoder.
>
> On Wed, Apr 5, 2017, 2:15 PM Pablo Estrada
> wrote:
>
> As
Serializable coder had a separate set of issues - often larger and less
efficient. Ideally, we would have an avrocoder.
On Wed, Apr 5, 2017, 2:15 PM Pablo Estrada
wrote:
> As a note, it seems that SerializableCoder does the trick in this case, as
> it does not require a no-arg constructor for th
When you say instance do you mean a different named PTransform class? Or
just a separate instance of whatever "HandleAssertResult" PTransform is
introduced? If the latter, I believe that makes sense. It also addresses
the problems with committed vs. attempted metrics. Specifically:
Since we want t
Allowing an annotation on DoFn's that produce deterministic output could be
added in the future, but doesn't seem like a great option.
1. It is a correctness issue to assume a DoFn is deterministic and be
wrong, so we would need to assume all transform outputs are
non-deterministic unless annotate
Following up on what Kenn said, it seems like the idea of a per-key
watermark is logically consistent with Beam model. Each runner already
chooses how to approximate the model-level concept of the per-key watermark.
The list Kenn had could be extended with things like:
4. A runner that assumes the
r using a "pull-like" would use an adapter ?
> >>>
> >>> On Sat, Feb 18, 2017 at 6:27 PM Jean-Baptiste Onofré
> >>> wrote:
> >>>
> >>>> Hi Ben,
> >>>>
> >>>> ok it's what I thought. Thanks for the clarification.
>>> their
>>> own metrics reporting system - but that's for the runner author to
>> decide.
>>> Stas did this for the Spark runner because Spark doesn't report back
>>> user-defined Accumulators (Spark's Aggregators) to it's Metrics system.
&
cs I think we should be fine, though this could also change between
>> runners as well.
>>
>> Finally, considering "split-metrics" is interesting because on one hand
it
>> affects the pipeline directly (unbalanced partitions in Kafka that may
>> cause backlog) b
On Tue, Feb 14, 2017 at 9:07 AM Stephen Sisk
wrote:
> hi!
>
> (ben just sent his mail and he covered some similar topics to me, but I'll
> keep my comments intact since they are slightly different)
>
> * I think there are a lot of metrics that should be exposed for all
> transforms - everything f
Thanks for starting this conversation Ismael! I too have been thinking
we'll need some general approach to metrics for IO in the near future.
Two general thoughts:
1. Before making the metrics configurable, I think it would be worthwhile
to see if we can find the right set of metrics that provide
Going back to the beginning, why not "Combine.perKey(Count.fn())" or some
such?
We have a lot of boilerplate already to support "Count.perKey()" and that
will only become worse when we try to have a Count.PerKey class that has
the functionality of Combine.PerKey, why not just get rid of the
boiler
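The shape of what a `Count.fn()` handed to `Combine.perKey` could look like can be sketched in plain Java, following the CombineFn accumulator contract (create/add/merge/extract). This is an illustrative class, not the actual Beam implementation.

```java
/**
 * Sketch: a count combiner in the CombineFn style. The accumulator
 * is just a running long; merging supports parallel combining.
 */
class CountFn {
    long createAccumulator() { return 0L; }

    /** Each input, whatever its value, contributes one to the count. */
    long addInput(long accumulator, Object input) { return accumulator + 1; }

    /** Combine partial counts from parallel workers. */
    long mergeAccumulators(long a, long b) { return a + b; }

    long extractOutput(long accumulator) { return accumulator; }
}
```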
ing are:
> > - A transform that takes elements and produces batches, like Robert said
> > - A simple Beam-agnostic library that takes Java objects and produces
> > batches of Java objects, with an API that makes it convenient to use in a
> > typical batching DoFn
>
>
The third option for batching:
- Functionality within the DoFn and DoFnRunner built as part of the SDK.
I haven't thought through Batching, but at least for the
IntraBundleParallelization use case this actually does make sense to expose
as a part of the model. Knowing that a DoFn supports paralle
> >> >>> meaningful and potentially less expensive to implement in the
> absence
> >> of
> >> >>> state (this is why it needs a design discussion at all, really).
> >> >>>
> >> >>> Caveat: these APIs are new and not supported in e
> A) Should we create a util for filtering MetricsResults based on a query,
> using the suggested filtering implementation in this thread (Substring
> match) ?
> B) Should direct runner be changed to use such a util, and conform with the
> results filtering suggested.
>
> On Tue, Jan
needs to be discussed/nailed down
to help with your PR?
On Thu, Jan 19, 2017 at 3:57 PM Ben Chambers wrote:
> On Thu, Jan 19, 2017 at 3:28 PM Amit Sela wrote:
>
> On Fri, Jan 20, 2017 at 1:18 AM Ben Chambers >
> wrote:
>
> > Part of the problem here is that whether attempted
On Thu, Jan 19, 2017 at 3:28 PM Amit Sela wrote:
> On Fri, Jan 20, 2017 at 1:18 AM Ben Chambers >
> wrote:
>
> > Part of the problem here is that whether attempted or committed is "what
> > matters" depends on the metric. If you're counting the number of
t
> tricky since a runner is expected to keep unique naming/ids for steps, but
> users are supposed to be aware of this here and I'd suspect users might not
> follow and if they use the same ParDo in a couple of places they'll query
> it and it might be confusing for them to see &qu
Thanks for starting the discussion! I'm going to hold off saying what I
think and instead just provide some background and additional questions,
because I want to see where the discussion goes.
When I first suggested the API for querying metrics I was adding it for
parity with aggregators. A good
We should start by understanding the goals. If elements are in different
windows can they be put in the same batch? If they have different
timestamps what timestamp should the batch have?
As a composite transform this will likely require a group by key which may
affect performance. Maybe within a
We thought about it when this was added. We decided against it because
these are overrides-in-spirit. Letting them be private would be misleading
because they are called from outside the class and should be thought of in
that way.
Also, this seems similar to others in the Java ecosystem: JUnit tes
The Metrics API in Beam is proposed to support both committed metrics (only
from the successfully committed bundles) and attempted metrics (from all
attempts at processing bundles). I think any mechanism based on the workers
reporting metrics to a monitoring system will only be able to support
atte
Dan's proposal to move forward with a simple (future-proofed) version of
the ToString transform and Javadoc, and add specific features via follow-up
PRs.
On Thu, Dec 29, 2016 at 3:53 PM Jesse Anderson
wrote:
> @Ben which idea do you like?
>
> On Thu, Dec 29, 2016 at 3:20 P
I like that idea, with the caveat that we should probably come up with a
better name. Perhaps "ToString.elements()" and ToString.Elements or
something? Calling one the "default" and using "create" for it seems
moderately non-future proof.
On Thu, Dec 29, 2016 at 3:17 PM Dan Halperin
wrote:
> On
+1 to Dan's point that MapElements.via is not that hard to use and going
too far down this path leads to significant complexity.
Pipelines should generally prefer to deal with structured data as much as
possible. As has been discussed on this thread, though, sometimes it is
necessary to convert th
Don't they need to be visible for use with composed combine and combining
value state?
On Thu, Dec 22, 2016, 9:45 AM Lukasz Cwik wrote:
> Those are used internally within Sum and its expected that users instead
> call Sum.integersPerKey, or Sum.doublesPerKey, or Sum.integersGlobally, or
> ...
>