There is no difference; FlatMapElements is implemented in terms of a DoFn that invokes context.output multiple times. And, yes, Dataflow will fuse consecutive operations automatically. So if you have something like
... -> DoFnA -> DoFnB -> GBK -> DoFnC -> ... Dataflow will fuse DoFnA and DoFnB together, and if DoFnA produces a lot of data for DoFnB to consume then more workers will be allocated to handle the (DoFnA + DoFnB) combination. If the fanout is so huge that a single worker would not be expected to handle the output DoFnA produces from a single input, you could look into making DoFnA into a SplittableDoFn https://beam.apache.org/blog/splittable-do-fn-is-available . If DoFnB is just really expensive, you can also decouple the parallelism between the two with a Reshuffle. Most of the time neither of these is needed. On Wed, Dec 27, 2023 at 5:44 PM hsy...@gmail.com <hsy...@gmail.com> wrote: > > Hello > > I have a question. If I have a transform for each input it will emit 1 or > many output (same collection) > I can do it with ParDo + DoFun while in processElement method for each input, > call context.output multiply times vs doing it with FlatMapElements, is there > any difference? Does the dataflow fuse the downstream transform > automatically? Eventually I want more downstream transform workers cause it > needs to handle more data, How do I supposed to do that? > > Regards, > Siyuan