Created related feature request https://github.com/apache/beam/issues/29789
We have to put more thought into exactly how to come up with merged
environments that do not result in conflicts. I prefer trying to
automatically do this on the SDK side instead of pushing the complexity to
the user (for
Yeah, we already have `ResourceHint.get_merged_value(cls, outer_value,
inner_value)` for reconciling resources within a composite, in the future
we could possibly just have another similar method and have the environment
merging logic hook into that.
On Fri, Dec 15, 2023 at 3:53 PM Robert
There is definitely a body of future work in intelligently merging
compatible-but-not-equal environments. (Dataflow does this for example.)
Defining/detecting compatibility is not always easy, but sometimes is, and
we should at least cover those cases and grow them over time.
On Fri, Dec 15, 2023
Yeah I can confirm for the python runners (based on my reading of the
translations.py [1]) that only identical environments are merged together.
The funny thing is that we _originally_ implemented this hint as an
annotation but then changed it to hint because it semantically felt more
correct. I
That would do it. We got so tunnel visioned on side inputs we missed that!
IIRC the python local runner and Prism both only fuse transforms in
identical environments together. So any environmental diffs will prevent
fusion.
Runners as a rule are usually free to ignore/manage hints as they like.
I figured out my issue. I thought side inputs were breaking up my pipeline
but after experimenting with my transforms I now realize what was actually
breaking it up was different transform environments that weren't considered
compatible.
We have a custom resource hint (for specifying whether a
Building on what Robert Bradshaw has said, basically, if these fusion
breaks don't exist, the pipeline can live lock, because the side input is
unable to finish computing for a given input element's window.
I have recently added fusion to the Go Prism runner based on the python
side input
Thanks for the explanation!
That matches with my intuition - are there any other rules with side
inputs?
I might be misunderstanding the actual cause of the fusion breaks in our
pipeline, but we essentially have one part of the graph that produces many
small collections that are used as side
That is correct. Side inputs give a view of the "whole" PCollection and
hence introduce a fusion-producing barrier. For example, suppose one has a
DoFn that produces two outputs, mainPColl and sidePColl, that are consumed
(as the main and side input respectively) of DoFnB.
Hey all,
We have a pretty big pipeline and while I was inspecting the stages, I
noticed there is less fusion than I expected. I suspect it has to do with
the heavy use of side inputs in our workflow. In the python sdk, I see that
side inputs are considered when determining whether two stages are
10 matches
Mail list logo