Re: PCollection#applyWindowingStrategyInternal

2024-04-25 Thread Reuven Lax via dev
> that the operation being combined is associative, commutative and have a > valid unary operator '-' (which will be the result of the "retraction > combine"). > On 4/23/24 18:08, Reuven Lax via dev wrote: > > > > On Tue, Apr 23, 2024 at 7:52 AM Jan Luk

Re: PCollection#applyWindowingStrategyInternal

2024-04-23 Thread Reuven Lax via dev
nge > Beam into a delta-processing engine, which it arguably should be, with > whole append-only elements being a simplest degenerate case of a delta > (which would be highly optimized in batch/archival processing). > > +1 > > > Kenn > > On Tue, Apr 16, 2024 at 2:36 AM

Re: PCollection#applyWindowingStrategyInternal

2024-04-15 Thread Reuven Lax via dev
e the trigger upstream of the stateful ParDo, though I'm not sure if that's the best approach. On Mon, Apr 15, 2024 at 11:31 PM Jan Lukavský wrote: > On 4/11/24 18:20, Reuven Lax via dev wrote: > > I'm not sure it would require all that. A "basic" implementation could

Re: [Feature proposal] Java Record Schema inference

2024-04-15 Thread Reuven Lax via dev
Some initial thoughts: Making schema inference handle generic classes would be a nice improvement - users occasionally bump into this restriction, and there's no reason not to improve it. I would recommend using the new Java reflection APIs (i.e. getRecordComponents) to directly infer the schema.

Re: PCollection#applyWindowingStrategyInternal

2024-04-11 Thread Reuven Lax via dev
ed me clarify some more white > spots. > > Jan > On 4/10/24 19:24, Reuven Lax via dev wrote: > > Are you familiar with the "sink triggers" proposal? > > Essentially while windowing is usually a property of the data, and > therefore flows downwards through

Re: PCollection#applyWindowingStrategyInternal

2024-04-10 Thread Reuven Lax via dev
y > not preserved across multiple transforms. Maybe the correct subject of this > thread could be "are we sure our windowing and triggering semantics is 100% > correct"? Probably the - wrong - expectations at the beginning of this > thread were due to conflict in my mental model

Re: PCollection#applyWindowingStrategyInternal

2024-04-06 Thread Reuven Lax via dev
So the problem here is that windowFn is a property of the PCollection, not the element, and the result of Flatten is a single PCollection. In various cases, there is a notion of "compatible" windows. Basically given window functions W1 and W2, provide a W3 that "works" with both. Note that Beam a

Re: Supporting Dynamic Destinations in a portable context

2024-04-02 Thread Reuven Lax via dev
I do suspect that over time we'll find more and more cases we can't express, and will be asked to extend this little templating in more directions. To head that off - could we easily just reuse an existing language (SQL, LUA, something of the form?) instead of creating something new? On Tue, Apr 2

Re: Supporting Dynamic Destinations in a portable context

2024-03-27 Thread Reuven Lax via dev
Can the prefix still be generated programmatically at graph creation time? On Wed, Mar 27, 2024 at 9:40 AM Robert Bradshaw wrote: > On Wed, Mar 27, 2024 at 9:12 AM Reuven Lax wrote: > >> This does seem like the best compromise, though I think there will still >> end up being performance issues.

Re: Supporting Dynamic Destinations in a portable context

2024-03-27 Thread Reuven Lax via dev
This does seem like the best compromise, though I think there will still end up being performance issues. A common pattern I've seen is that there is a long common prefix to the dynamic destination followed the dynamic component. e.g. the destination might be long/common/path/to/destination/files/.

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Reuven Lax via dev
On Tue, Feb 27, 2024 at 10:22 AM Robert Bradshaw via dev < dev@beam.apache.org> wrote: > On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles wrote: > > > > Pulling out focus points: > > > > On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > > > I can't act on s

Re: Throttle PTransform

2024-02-22 Thread Reuven Lax via dev
ionale for why things are > the way they are. It'll be good to record if not already somewhere. > > Kenn > > On Thu, Feb 22, 2024 at 2:43 AM Jan Lukavský wrote: > >> >> On 2/21/24 18:27, Reuven Lax via dev wrote: >> >> Agreed, that event-time throttli

Re: Throttle PTransform

2024-02-21 Thread Reuven Lax via dev
probably fine for many users. >> >> On Wed, Feb 21, 2024, 10:30 AM Reuven Lax via dev >> wrote: >> >>> Yes, that's true. The technique I proposed will work for simple >>> pipelines in streaming (e.g. basic ETL), where the throttling threads are >>

Re: Throttle PTransform

2024-02-21 Thread Reuven Lax via dev
odel solution is probably fine for many users. > > On Wed, Feb 21, 2024, 10:30 AM Reuven Lax via dev > wrote: > >> Yes, that's true. The technique I proposed will work for simple pipelines >> in streaming (e.g. basic ETL), where the throttling threads are probably >

Re: Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Reuven Lax via dev
Is there a fundamental reason we serialize java classes into Flink savepoints. On Wed, Feb 21, 2024 at 9:51 AM Robert Bradshaw via dev wrote: > We could consider merging the gradle targets without renaming the > classpaths as an intermediate step. > > Optimistically, perhaps there's a small numb

Re: Throttle PTransform

2024-02-21 Thread Reuven Lax via dev
Yes, that's true. The technique I proposed will work for simple pipelines in streaming (e.g. basic ETL), where the throttling threads are probably all scheduled. For more complicated pipelines (or batch pipelines), we might find that it overthrottles. Maybe a hybrid solution that uses state would w

Re: Throttle PTransform

2024-02-21 Thread Reuven Lax via dev
Agreed, that event-time throttling doesn't make sense here. In theory processing-time timers have no SLA - i.e. their firing might be delayed - so batch runners aren't violating the model by firing them all at the end; however it does make processing time timers less useful in batch, as we see here

Re: Row compatible generated coders for custom classes

2023-12-02 Thread Reuven Lax via dev
Out of curiosity, did you add a warmup time before benchmarking? Schema and row coder does codegen, so the first usage is very slow, but subsequent usages should be much faster. I recommend running any test for a warmup period before starting to measure. On Fri, Dec 1, 2023, 9:13 AM Steven van Ros

Re: [YAML] Aggregations

2023-10-19 Thread Reuven Lax via dev
Is the schema Group transform (in Java) something along these lines? On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev wrote: > Beam Yaml has good support for IOs and mappings, but one key missing > feature for even writing a WordCount is the ability to do Aggregations > [1]. While the tra

Re: [YAML] Aggregations

2023-10-19 Thread Reuven Lax via dev
Or are you specifically referring to the declarative YAML pipelines? On Thu, Oct 19, 2023 at 12:53 PM Reuven Lax wrote: > Is the schema Group transform (in Java) something along these lines? > > On Wed, Oct 18, 2023 at 1:11 PM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> Beam Yam

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-10 Thread Reuven Lax via dev
I suspect some simple pattern templating would solve most use cases. We probably would want to support timestamp formatting (e.g. $ $M $D) as well. On Tue, Oct 10, 2023 at 3:35 PM Robert Bradshaw wrote: > On Mon, Oct 9, 2023 at 3:09 PM Chamikara Jayalath > wrote: > >> I would say: >> >>

Re: [YAML] Fileio sink parameterization (streaming, sharding, and naming)

2023-10-09 Thread Reuven Lax via dev
Just FYI - the reason why names (including prefixes) in DynamicDestinations were parameterized via a lambda instead of just having the user add it via MapElements is performance. We discussed something along the lines of what you are suggesting (essentially having the user create a KV where the key

Re: Runner Bundling Strategies

2023-09-27 Thread Reuven Lax via dev
Bundle-FinishBundle, > are there such cases? I'd say yes, but I'm a little struggling finding a > specific example that cannot be solved using Setup or lazy init. > On 9/27/23 19:58, Reuven Lax via dev wrote: > > DoFns are allowed to be non deterministic, so they don't have

Re: Runner Bundling Strategies

2023-09-27 Thread Reuven Lax via dev
performance to reduce shuffle size, as opposed to a guaranteed RemoveDuplicates. This scenario doesn't require FinishBundle, though it does require a StartBundle. On Tue, Sep 26, 2023 at 11:59 AM Kenneth Knowles wrote: > > > On Tue, Sep 26, 2023 at 1:33 PM Reuven Lax via dev >

Re: Runner Bundling Strategies

2023-09-26 Thread Reuven Lax via dev
reason that we did not do so. Maybe we (incorrectly) thought that this > was an issue that only the Java SDK harness needed to know about. > > Kenn > > >> [1] https://github.com/apache/beam/issues/28649 >> >> [2] https://github.com/apache/beam/issues/28650 >>

Re: Runner Bundling Strategies

2023-09-25 Thread Reuven Lax via dev
On Mon, Sep 25, 2023 at 6:19 AM Jan Lukavský wrote: > > On 9/23/23 18:16, Reuven Lax via dev wrote: > > Two separate things here: > > 1. Yes, a watermark can update in the middle of a bundle. > 2. The records in the bundle themselves will prevent the watermark from > up

Re: Runner Bundling Strategies

2023-09-23 Thread Reuven Lax via dev
Two separate things here: 1. Yes, a watermark can update in the middle of a bundle. 2. The records in the bundle themselves will prevent the watermark from updating as they are still in flight until after finish bundle. Therefore simply caching the records should always be watermark safe, regardle

Re: Contribution of Asgarde: Error Handling for Beam?

2023-09-04 Thread Reuven Lax via dev
Let's be careful about whether these tests are included in our presubmits. Contrib code with flaky tests has been a major pain point in the past. On Sat, Sep 2, 2023 at 12:02 PM Austin Bennett wrote: > Wanting us to not miss this. @Mazlum TOSUN is > happy to donate Asgarde to our project. > > I

Re: ByteBuddy ClassLoadingStrategy.Default.INJECTION vs getClassLoadingStrategy

2023-07-21 Thread Reuven Lax via dev
Curious why these failing tests didn't block submission. For now rollback seems to be the simplest option. However is there a path forward on Java 11, or is our model irretrievably broken on Java 11? On Fri, Jul 21, 2023 at 8:57 AM Kenneth Knowles wrote: > This is a tricky situation that I don'

Re: Proposal to reduce the steps to make a Java transform portable

2023-06-22 Thread Reuven Lax via dev
The goal was to make schema transforms as efficient as hand-written coders. We found the avro encoding/decoding to often be quite inefficient, which is one reason we didn't use it for schemas. Our schema encoding is internal to Beam though, and not suggested for use external to a pipeline. For sou

Re: [VOTE] Release 2.48.0 release candidate #2

2023-05-30 Thread Reuven Lax via dev
+1 (binding) On Tue, May 30, 2023 at 2:43 PM Ahmet Altay via dev wrote: > +1 (binding) > > On Tue, May 30, 2023 at 2:01 PM Ritesh Ghorse via dev > wrote: > >> Thanks Danny and Jack! Dataflow containers are up! >> >> Only PMC votes count but feel free to test your use cases and vote on >> this t

Re: [VOTE] Release 2.46.0, release candidate #1

2023-04-28 Thread Reuven Lax via dev
Those particular errors are often expected in the sink due to the protocol used. If a work item retries before committing (which could happen for many reasons including worker crashes), it will experience those errors. On Fri, Apr 28, 2023 at 12:55 PM Ahmed Abualsaud wrote: > @Danny McCormick @

Re: [DESIGN] Beam Triggered side input specification

2023-03-28 Thread Reuven Lax via dev
On Tue, Mar 28, 2023 at 12:39 AM Jan Lukavský wrote: > > On 3/27/23 19:44, Reuven Lax via dev wrote: > > > > On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote: > >> Hi, >> >> I'd like to clarify my understanding. Side inputs generally perform a >

Re: [DESIGN] Beam Triggered side input specification

2023-03-27 Thread Reuven Lax via dev
On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote: > Hi, > > I'd like to clarify my understanding. Side inputs generally perform a left > (outer) join, LHS side is the main input, RHS is the side input. > Not completely - it's more of what I would call a nested-loop join. I.e. if the side input

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-08 Thread Reuven Lax via dev
We are trying to reproduce and debug the issue we saw to validate whether it was a real regression or not. Will update when we know more. On Wed, Mar 8, 2023 at 11:31 AM Danny McCormick wrote: > > @Reuven Lax found a new potential regression in > BigQuery I/O, so I have paused the release rollo

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-03 Thread Reuven Lax via dev
If possible, I would like to see if we could include https://github.com/apache/beam/pull/25642 as we believe this bug has been impacting multiple users. This was merged 4 days ago, but this RC cut does not seem to include it. On Fri, Mar 3, 2023 at 12:18 PM Valentyn Tymofieiev via dev < dev@beam.a

Re: Consuming one PCollection before consuming another with Beam

2023-03-01 Thread Reuven Lax via dev
suggestion is to try to solve the problem in terms of what you >>>>>> want to compute. Instead of trying to control the operational aspects >>>>>> like >>>>>> "read all the BQ before reading Pubsub" there is presumably some

Re: Consuming one PCollection before consuming another with Beam

2023-02-27 Thread Reuven Lax via dev
; "happen" in a pipeline, it is often best to pivot your thinking to the data >> and what you are trying to compute, and the built-in dependencies will >> solve the ordering problems. >> >> So - is there a way to describe your problem in terms of the data and >&g

Re: Consuming one PCollection before consuming another with Beam

2023-02-24 Thread Reuven Lax via dev
First PCollections are completely unordered, so there is no guarantee on what order you'll see events in the flattened PCollection. There may be ways to process the BigQuery data in a separate transform first, but it depends on the structure of the data. How large is the BigQuery table? Are you do

Re: Consuming one PCollection before consuming another with Beam

2023-02-23 Thread Reuven Lax via dev
Can you explain this use case some more? Is this a streaming pipeline? If so, how are you reading from BigQuery? On Thu, Feb 23, 2023 at 10:06 PM Sahil Modak via dev wrote: > Hi, > > We have a requirement wherein we are consuming input from pub/sub > (PubSubIO) as well as BQ (BQIO) > > We want t

Re: Thoughts on extensions/datasketches vs adding to the existing sketching library?

2023-01-18 Thread Reuven Lax via dev
I believe that when zetasketch was added, it was also noticeably more efficient than other sketch implementations. However this was a number of years ago, and I don't know whether it still has an advantage or not. On Wed, Jan 18, 2023 at 10:41 AM Byron Ellis via dev wrote: > Hi everyone, > > I w

Re: Beam Java SDK - ReadableState.read() shouldn't it be Nullable?

2023-01-03 Thread Reuven Lax via dev
le others like ListState we should have List<@Nullable T>. > > On Tue, Jan 3, 2023 at 12:37 PM Reuven Lax via dev > wrote: > >> It should be @Nullable - I'm not sure why that was removed. >> >> On Tue, Jan 3, 2023 at 12:18 PM Ahmet Altay via dev >> wrote:

Re: Beam Java SDK - ReadableState.read() shouldn't it be Nullable?

2023-01-03 Thread Reuven Lax via dev
It should be @Nullable - I'm not sure why that was removed. On Tue, Jan 3, 2023 at 12:18 PM Ahmet Altay via dev wrote: > Forwarding, because this message got lost in the list moderation. > > -- Forwarded message -- > From: Jeeno Lentin > To: dev@beam.apache.org > Cc: > Bcc: > Da

Re: [DISCUSS] Avro dependency update, design doc

2023-01-02 Thread Reuven Lax via dev
Be very careful with the auto schema stuff around Avro. These classes dynamically inspect Avro-generated classes (to codegen schema accessors) so it will be easy to break this in a way that is not seen at compile time. On Mon, Jan 2, 2023 at 12:17 PM Alexey Romanenko wrote: > Here is the recent

Re: [DISCUSSION][JAVA] Current state of Java 17 support

2022-12-01 Thread Reuven Lax via dev
We have run into som JDK-specific issues with our use of ByteBuddy though. On Thu, Dec 1, 2022 at 3:43 PM Luke Cwik via dev wrote: > We do support JDK8, JDK11 and JDK17. Our story around newer features > within JDKs 9+ like modules is mostly non-existent though. > > We rarely run into JDK specif

Re: SchemaTransformProvider | Java class naming convention

2022-11-15 Thread Reuven Lax via dev
Out of curiosity, several IOs (including PubSub) already do support schemas. Are you planning on modifying those? On Tue, Nov 15, 2022 at 11:50 AM Damon Douglas via dev wrote: > Hello Everyone, > > Do we like the following Java class naming convention for > SchemaTransformProviders [1]? The pro

Re: Questions on primitive transforms hierarchy

2022-10-26 Thread Reuven Lax via dev
at for sure. :) > > Jan > On 10/24/22 19:59, Kenneth Knowles wrote: > > > > On Mon, Oct 24, 2022 at 5:51 AM Jan Lukavský wrote: > >> On 10/22/22 21:47, Reuven Lax via dev wrote: >> >> I think we stated that CoGroupbyKey was also a primitive, though in >> p

Re: Questions on primitive transforms hierarchy

2022-10-24 Thread Reuven Lax via dev
On Mon, Oct 24, 2022 at 5:50 AM Jan Lukavský wrote: > On 10/22/22 21:47, Reuven Lax via dev wrote: > > I think we stated that CoGroupbyKey was also a primitive, though in > practice it's implemented in terms of GroupByKey today. > > On Fri, Oct 21, 2022 at 3:05 PM

Re: Questions on primitive transforms hierarchy

2022-10-22 Thread Reuven Lax via dev
I think we stated that CoGroupbyKey was also a primitive, though in practice it's implemented in terms of GroupByKey today. On Fri, Oct 21, 2022 at 3:05 PM Kenneth Knowles wrote: > > > On Fri, Oct 21, 2022 at 5:24 AM Jan Lukavský wrote: > >> Hi, >> >> I have some missing pieces in my understand

Re: [PROPOSAL] Re-enable checkerframework by default

2022-10-21 Thread Reuven Lax via dev
+1 This happens to me regularly. It fails on Jenkins but succeeds on my machine, and it's hard to figure out why (since all you see on Jenkins is a compile error). Then I'm always trying to remember how to enable it locally. IMO development would be faster if this was enabled locally. Anyone who d

Re: Staging a PCollection in Beam | Dataflow Runner

2022-10-19 Thread Reuven Lax via dev
PCollections's usually are persistent within a pipeline, so you can reuse them in other parts of a pipeline with no problem. There is no notion of state across pipelines - every pipeline is independent. If you want state across pipelines you can write the PCollection out to a set of files which ar

Re: Inclusive terminology: "Sickbay" ==> "Disabled test"

2022-10-17 Thread Reuven Lax via dev
To add to this, I think one reason originally for using "sickbay" was to emphasize that this should be temporary. Removing tests from pre/post commits permanently is a bad state to be in - at that point why even have the test? Ideally if a test is extremely flaky, fixing that is highly prioritized.

Re: Using unbounded source as a side input for a DoFn

2022-07-20 Thread Reuven Lax via dev
How do you want to use the side input? On Wed, Jul 20, 2022 at 10:45 PM Sahil Modak wrote: > Hi, > > We are looking to use the side input feature for one of our DoFns. The > side input has to be a PCollection which is being constructed from a > subscription using PubsubIO.read > > We want our pr

Re: [ANNOUNCE] New committer: Steven Niemitz

2022-07-19 Thread Reuven Lax via dev
Welcome Steve! On Tue, Jul 19, 2022 at 1:05 PM Connell O'Callaghan via dev < dev@beam.apache.org> wrote: > > +++1 Woohoo! Congratulations Steven (and to the BEAM community) on this > announcement!!! > > Thank you Luke for this update > > > On Tue, Jul 19, 2022 at 12:34 PM Robert Burke wrote:

Re: ClassNotFoundException when using Java external transforms in a Java job

2022-07-09 Thread Reuven Lax via dev
This generally means you have a CLASSPATH problem - either the JAR containing that class isn't ending up being linked in, or the wrong version of the JAR is. On Fri, Jul 8, 2022 at 2:12 PM Sahith Nallapareddy via dev < dev@beam.apache.org> wrote: > Hello, > > I sent an email sometime ago about ja