> that the operation being combined is associative, commutative and have a
> valid unary operator '-' (which will be the result of the "retraction
> combine").
> On 4/23/24 18:08, Reuven Lax via dev wrote:
>
>
>
> On Tue, Apr 23, 2024 at 7:52 AM Jan Luk
nge
> Beam into a delta-processing engine, which it arguably should be, with
> whole append-only elements being a simplest degenerate case of a delta
> (which would be highly optimized in batch/archival processing).
>
> +1
>
>
> Kenn
>
> On Tue, Apr 16, 2024 at 2:36 AM
e
the trigger upstream of the stateful ParDo, though I'm not sure if that's
the best approach.
On Mon, Apr 15, 2024 at 11:31 PM Jan Lukavský wrote:
> On 4/11/24 18:20, Reuven Lax via dev wrote:
>
> I'm not sure it would require all that. A "basic" implementation could
Some initial thoughts:
Making schema inference handle generic classes would be a nice improvement
- users occasionally bump into this restriction, and there's no reason not
to improve it.
I would recommend using the new Java reflection APIs (i.e.
getRecordComponents) to directly infer the schema.
ed me clarify some more white
> spots.
>
> Jan
> On 4/10/24 19:24, Reuven Lax via dev wrote:
>
> Are you familiar with the "sink triggers" proposal?
>
> Essentially while windowing is usually a property of the data, and
> therefore flows downwards through
y
> not preserved across multiple transforms. Maybe the correct subject of this
> thread could be "are we sure our windowing and triggering semantics is 100%
> correct"? Probably the - wrong - expectations at the beginning of this
> thread were due to conflict in my mental model
So the problem here is that windowFn is a property of the PCollection, not
the element, and the result of Flatten is a single PCollection.
In various cases, there is a notion of "compatible" windows. Basically
given window functions W1 and W2, provide a W3 that "works" with both.
Note that Beam a
I do suspect that over time we'll find more and more cases we can't
express, and will be asked to extend this little templating in more
directions. To head that off - could we easily just reuse an existing
language (SQL, LUA, something of the form?) instead of creating something
new?
On Tue, Apr 2
Can the prefix still be generated programmatically at graph creation time?
On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw wrote:
> On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax wrote:
>
>> This does seem like the best compromise, though I think there will still
>> end up being performance issues.
This does seem like the best compromise, though I think there will still
end up being performance issues. A common pattern I've seen is that there
is a long common prefix to the dynamic destination followed the dynamic
component. e.g. the destination might be
long/common/path/to/destination/files/.
On Tue, Feb 27, 2024 at 10:22 AM Robert Bradshaw via dev <
dev@beam.apache.org> wrote:
> On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles wrote:
> >
> > Pulling out focus points:
> >
> > On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
> > > I can't act on s
ionale for why things are
> the way they are. It'll be good to record if not already somewhere.
>
> Kenn
>
> On Thu, Feb 22, 2024 at 2:43 AM Jan Lukavský wrote:
>
>>
>> On 2/21/24 18:27, Reuven Lax via dev wrote:
>>
>> Agreed, that event-time throttli
probably fine for many users.
>>
>> On Wed, Feb 21, 2024, 10:30 AM Reuven Lax via dev
>> wrote:
>>
>>> Yes, that's true. The technique I proposed will work for simple
>>> pipelines in streaming (e.g. basic ETL), where the throttling threads are
>>
odel solution is probably fine for many users.
>
> On Wed, Feb 21, 2024, 10:30 AM Reuven Lax via dev
> wrote:
>
>> Yes, that's true. The technique I proposed will work for simple pipelines
>> in streaming (e.g. basic ETL), where the throttling threads are probably
>
Is there a fundamental reason we serialize java classes into Flink
savepoints.
On Wed, Feb 21, 2024 at 9:51 AM Robert Bradshaw via dev
wrote:
> We could consider merging the gradle targets without renaming the
> classpaths as an intermediate step.
>
> Optimistically, perhaps there's a small numb
Yes, that's true. The technique I proposed will work for simple pipelines
in streaming (e.g. basic ETL), where the throttling threads are probably
all scheduled. For more complicated pipelines (or batch pipelines), we
might find that it overthrottles. Maybe a hybrid solution that uses state
would w
Agreed, that event-time throttling doesn't make sense here. In theory
processing-time timers have no SLA - i.e. their firing might be delayed -
so batch runners aren't violating the model by firing them all at the end;
however it does make processing time timers less useful in batch, as we see
here
Out of curiosity, did you add a warmup time before benchmarking? Schema and
row coder does codegen, so the first usage is very slow, but subsequent
usages should be much faster. I recommend running any test for a warmup
period before starting to measure.
On Fri, Dec 1, 2023, 9:13 AM Steven van Ros
Is the schema Group transform (in Java) something along these lines?
On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev
wrote:
> Beam Yaml has good support for IOs and mappings, but one key missing
> feature for even writing a WordCount is the ability to do Aggregations
> [1]. While the tra
Or are you specifically referring to the declarative YAML pipelines?
On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax wrote:
> Is the schema Group transform (in Java) something along these lines?
>
> On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev <
> dev@beam.apache.org> wrote:
>
>> Beam Yam
I suspect some simple pattern templating would solve most use cases. We
probably would want to support timestamp formatting (e.g. $ $M $D) as
well.
On Tue, Oct 10, 2023 at 3:35 PM Robert Bradshaw wrote:
> On Mon, Oct 9, 2023 at 3:09 PM Chamikara Jayalath
> wrote:
>
>> I would say:
>>
>>
Just FYI - the reason why names (including prefixes) in DynamicDestinations
were parameterized via a lambda instead of just having the user add it via
MapElements is performance. We discussed something along the lines of what
you are suggesting (essentially having the user create a KV where the key
Bundle-FinishBundle,
> are there such cases? I'd say yes, but I'm a little struggling finding a
> specific example that cannot be solved using Setup or lazy init.
> On 9/27/23 19:58, Reuven Lax via dev wrote:
>
> DoFns are allowed to be non deterministic, so they don't have
performance to reduce shuffle size, as
opposed to a guaranteed RemoveDuplicates. This scenario doesn't require
FinishBundle, though it does require a StartBundle.
On Tue, Sep 26, 2023 at 11:59 AM Kenneth Knowles wrote:
>
>
> On Tue, Sep 26, 2023 at 1:33 PM Reuven Lax via dev
>
reason that we did not do so. Maybe we (incorrectly) thought that this
> was an issue that only the Java SDK harness needed to know about.
>
> Kenn
>
>
>> [1] https://github.com/apache/beam/issues/28649
>>
>> [2] https://github.com/apache/beam/issues/28650
>>
On Mon, Sep 25, 2023 at 6:19 AM Jan Lukavský wrote:
>
> On 9/23/23 18:16, Reuven Lax via dev wrote:
>
> Two separate things here:
>
> 1. Yes, a watermark can update in the middle of a bundle.
> 2. The records in the bundle themselves will prevent the watermark from
> up
Two separate things here:
1. Yes, a watermark can update in the middle of a bundle.
2. The records in the bundle themselves will prevent the watermark from
updating as they are still in flight until after finish bundle. Therefore
simply caching the records should always be watermark safe, regardle
Let's be careful about whether these tests are included in our presubmits.
Contrib code with flaky tests has been a major pain point in the past.
On Sat, Sep 2, 2023 at 12:02 PM Austin Bennett wrote:
> Wanting us to not miss this. @Mazlum TOSUN is
> happy to donate Asgarde to our project.
>
> I
Curious why these failing tests didn't block submission.
For now rollback seems to be the simplest option. However is there a path
forward on Java 11, or is our model irretrievably broken on Java 11?
On Fri, Jul 21, 2023 at 8:57 AM Kenneth Knowles wrote:
> This is a tricky situation that I don'
The goal was to make schema transforms as efficient as hand-written coders.
We found the avro encoding/decoding to often be quite inefficient, which is
one reason we didn't use it for schemas.
Our schema encoding is internal to Beam though, and not suggested for use
external to a pipeline. For sou
+1 (binding)
On Tue, May 30, 2023 at 2:43 PM Ahmet Altay via dev
wrote:
> +1 (binding)
>
> On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev
> wrote:
>
>> Thanks Danny and Jack! Dataflow containers are up!
>>
>> Only PMC votes count but feel free to test your use cases and vote on
>> this t
Those particular errors are often expected in the sink due to the protocol
used. If a work item retries before committing (which could happen for many
reasons including worker crashes), it will experience those errors.
On Fri, Apr 28, 2023 at 12:55 PM Ahmed Abualsaud
wrote:
> @Danny McCormick @
On Tue, Mar 28, 2023 at 12:39 AM Jan Lukavský wrote:
>
> On 3/27/23 19:44, Reuven Lax via dev wrote:
>
>
>
> On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote:
>
>> Hi,
>>
>> I'd like to clarify my understanding. Side inputs generally perform a
>
On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote:
> Hi,
>
> I'd like to clarify my understanding. Side inputs generally perform a left
> (outer) join, LHS side is the main input, RHS is the side input.
>
Not completely - it's more of what I would call a nested-loop join. I.e. if
the side input
We are trying to reproduce and debug the issue we saw to validate whether
it was a real regression or not. Will update when we know more.
On Wed, Mar 8, 2023 at 11:31 AM Danny McCormick
wrote:
>
> @Reuven Lax found a new potential regression in
> BigQuery I/O, so I have paused the release rollo
If possible, I would like to see if we could include
https://github.com/apache/beam/pull/25642 as we believe this bug has been
impacting multiple users. This was merged 4 days ago, but this RC cut does
not seem to include it.
On Fri, Mar 3, 2023 at 12:18 PM Valentyn Tymofieiev via dev <
dev@beam.a
suggestion is to try to solve the problem in terms of what you
>>>>>> want to compute. Instead of trying to control the operational aspects
>>>>>> like
>>>>>> "read all the BQ before reading Pubsub" there is presumably some
; "happen" in a pipeline, it is often best to pivot your thinking to the data
>> and what you are trying to compute, and the built-in dependencies will
>> solve the ordering problems.
>>
>> So - is there a way to describe your problem in terms of the data and
>&g
First PCollections are completely unordered, so there is no guarantee on
what order you'll see events in the flattened PCollection.
There may be ways to process the BigQuery data in a separate transform
first, but it depends on the structure of the data. How large is the
BigQuery table? Are you do
Can you explain this use case some more? Is this a streaming pipeline? If
so, how are you reading from BigQuery?
On Thu, Feb 23, 2023 at 10:06 PM Sahil Modak via dev
wrote:
> Hi,
>
> We have a requirement wherein we are consuming input from pub/sub
> (PubSubIO) as well as BQ (BQIO)
>
> We want t
I believe that when zetasketch was added, it was also noticeably more
efficient than other sketch implementations. However this was a number of
years ago, and I don't know whether it still has an advantage or not.
On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev
wrote:
> Hi everyone,
>
> I w
le others like ListState we should have List<@Nullable T>.
>
> On Tue, Jan 3, 2023 at 12:37 PM Reuven Lax via dev
> wrote:
>
>> It should be @Nullable - I'm not sure why that was removed.
>>
>> On Tue, Jan 3, 2023 at 12:18 PM Ahmet Altay via dev
>> wrote:
It should be @Nullable - I'm not sure why that was removed.
On Tue, Jan 3, 2023 at 12:18 PM Ahmet Altay via dev
wrote:
> Forwarding, because this message got lost in the list moderation.
>
> -- Forwarded message --
> From: Jeeno Lentin
> To: dev@beam.apache.org
> Cc:
> Bcc:
> Da
Be very careful with the auto schema stuff around Avro. These classes
dynamically inspect Avro-generated classes (to codegen schema accessors) so
it will be easy to break this in a way that is not seen at compile time.
On Mon, Jan 2, 2023 at 12:17 PM Alexey Romanenko
wrote:
> Here is the recent
We have run into som JDK-specific issues with our use of ByteBuddy though.
On Thu, Dec 1, 2022 at 3:43 PM Luke Cwik via dev
wrote:
> We do support JDK8, JDK11 and JDK17. Our story around newer features
> within JDKs 9+ like modules is mostly non-existent though.
>
> We rarely run into JDK specif
Out of curiosity, several IOs (including PubSub) already do support
schemas. Are you planning on modifying those?
On Tue, Nov 15, 2022 at 11:50 AM Damon Douglas via dev
wrote:
> Hello Everyone,
>
> Do we like the following Java class naming convention for
> SchemaTransformProviders [1]? The pro
at for sure. :)
>
> Jan
> On 10/24/22 19:59, Kenneth Knowles wrote:
>
>
>
> On Mon, Oct 24, 2022 at 5:51 AM Jan Lukavský wrote:
>
>> On 10/22/22 21:47, Reuven Lax via dev wrote:
>>
>> I think we stated that CoGroupbyKey was also a primitive, though in
>> p
On Mon, Oct 24, 2022 at 5:50 AM Jan Lukavský wrote:
> On 10/22/22 21:47, Reuven Lax via dev wrote:
>
> I think we stated that CoGroupbyKey was also a primitive, though in
> practice it's implemented in terms of GroupByKey today.
>
> On Fri, Oct 21, 2022 at 3:05 PM
I think we stated that CoGroupbyKey was also a primitive, though in
practice it's implemented in terms of GroupByKey today.
On Fri, Oct 21, 2022 at 3:05 PM Kenneth Knowles wrote:
>
>
> On Fri, Oct 21, 2022 at 5:24 AM Jan Lukavský wrote:
>
>> Hi,
>>
>> I have some missing pieces in my understand
+1
This happens to me regularly. It fails on Jenkins but succeeds on my
machine, and it's hard to figure out why (since all you see on Jenkins is a
compile error). Then I'm always trying to remember how to enable it
locally. IMO development would be faster if this was enabled locally.
Anyone who d
PCollections's usually are persistent within a pipeline, so you can reuse
them in other parts of a pipeline with no problem.
There is no notion of state across pipelines - every pipeline is
independent. If you want state across pipelines you can write the
PCollection out to a set of files which ar
To add to this, I think one reason originally for using "sickbay" was to
emphasize that this should be temporary. Removing tests from pre/post
commits permanently is a bad state to be in - at that point why even have
the test? Ideally if a test is extremely flaky, fixing that is highly
prioritized.
How do you want to use the side input?
On Wed, Jul 20, 2022 at 10:45 PM Sahil Modak
wrote:
> Hi,
>
> We are looking to use the side input feature for one of our DoFns. The
> side input has to be a PCollection which is being constructed from a
> subscription using PubsubIO.read
>
> We want our pr
Welcome Steve!
On Tue, Jul 19, 2022 at 1:05 PM Connell O'Callaghan via dev <
dev@beam.apache.org> wrote:
>
> +++1 Woohoo! Congratulations Steven (and to the BEAM community) on this
> announcement!!!
>
> Thank you Luke for this update
>
>
> On Tue, Jul 19, 2022 at 12:34 PM Robert Burke wrote:
This generally means you have a CLASSPATH problem - either the JAR
containing that class isn't ending up being linked in, or the wrong version
of the JAR is.
On Fri, Jul 8, 2022 at 2:12 PM Sahith Nallapareddy via dev <
dev@beam.apache.org> wrote:
> Hello,
>
> I sent an email sometime ago about ja
55 matches
Mail list logo