Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Chamikara Jayalath via dev
Created related feature request https://github.com/apache/beam/issues/29789 We have to put more thought into exactly how to come up with merged environments that do not result in conflicts. I prefer trying to automatically do this on the SDK side instead of pushing the complexity to the user (for

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
Yeah, we already have `ResourceHint.get_merged_value(cls, outer_value, inner_value)` for reconciling resources within a composite, in the future we could possibly just have another similar method and have the environment merging logic hook into that. On Fri, Dec 15, 2023 at 3:53 PM Robert

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Robert Bradshaw via dev
There is definitely a body of future work in intelligently merging compatible-but-not-equal environments. (Dataflow does this for example.) Defining/detecting compatibility is not always easy, but sometimes is, and we should at least cover those cases and grow them over time. On Fri, Dec 15, 2023

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
Yeah I can confirm for the python runners (based on my reading of the translations.py [1]) that only identical environments are merged together. The funny thing is that we _originally_ implemented this hint as an annotation but then changed it to hint because it semantically felt more correct. I

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Robert Burke
That would do it. We got so tunnel visioned on side inputs we missed that! IIRC the python local runner and Prism both only fuse transforms in identical environments together. So any environmental diffs will prevent fusion. Runners as a rule are usually free to ignore/manage hints as they like.

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
I figured out my issue. I thought side inputs were breaking up my pipeline but after experimenting with my transforms I now realize what was actually breaking it up was different transform environments that weren't considered compatible. We have a custom resource hint (for specifying whether a

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Robert Burke
Building on what Robert Bradshaw has said, basically, if these fusion breaks don't exist, the pipeline can live lock, because the side input is unable to finish computing for a given input element's window. I have recently added fusion to the Go Prism runner based on the python side input

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
Thanks for the explanation! That matches with my intuition - are there any other rules with side inputs? I might be misunderstanding the actual cause of the fusion breaks in our pipeline, but we essentially have one part of the graph that produces many small collections that are used as side

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Robert Bradshaw via dev
That is correct. Side inputs give a view of the "whole" PCollection and hence introduce a fusion-producing barrier. For example, suppose one has a DoFn that produces two outputs, mainPColl and sidePColl, that are consumed (as the main and side input respectively) of DoFnB.

How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
Hey all, We have a pretty big pipeline and while I was inspecting the stages, I noticed there is less fusion than I expected. I suspect it has to do with the heavy use of side inputs in our workflow. In the python sdk, I see that side inputs are considered when determining whether two stages are