Re: Question about Spark Runner's Filter in ParDo

Danny McCormick via dev Mon, 23 Sep 2024 04:48:35 -0700

This seems like a reasonable optimization to me. I think moving it to a
pull request is a good idea. Thanks!


- Danny

On Sun, Sep 22, 2024 at 11:58 PM LDesire <two_som...@icloud.com> wrote:

> Hello Beam community.
>
> I'm currently trying out the Spark Runner, and while going through the
> code I noticed that evaluating a ParDo operation applies more filter
> operations than necessary (starting at line 467 of
> TransformTranslator.java).
>
> The original intent of this code seems to be to apply filters because a
> ParDo can have multiple outputs.
> In other words, the filter operation makes sense when there are multiple
> outputs, but I believe that applying it when there is only one output
> degrades pipeline performance, because an equals comparison has to be
> run on every element.
>
>
> So I changed the code so that the filter is applied only when there are
> multiple outputs, and tested it.
> I need to do more testing, but the output was unchanged and the results
> weren't bad.
> If this is OK, may I open a PR?
>
> Also, if I'm missing anything, I'd be grateful if you could let me know.
>
> Cheers.
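
For readers following the thread, the idea in the quoted message can be sketched in plain Java. This is not the code in TransformTranslator.java and it uses no Spark or Beam APIs: the class `TaggedOutputSketch`, the method `valuesFor`, and the tag names are hypothetical, and a list of tag/value pairs stands in for the tagged output of a ParDo. It only illustrates skipping the per-element tag comparison when there is a single output.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only; all names here are hypothetical and this is
// not the Spark Runner's actual translation code.
final class TaggedOutputSketch {

  // Returns the values belonging to one output tag. Each input element
  // carries the tag of the output it was emitted to.
  static List<String> valuesFor(
      String tag, List<Map.Entry<String, String>> all, int outputCount) {
    if (outputCount == 1) {
      // Single output: every element belongs to it, so the per-element
      // equals() comparison on the tag can be skipped entirely.
      return all.stream().map(Map.Entry::getValue).collect(Collectors.toList());
    }
    // Multiple outputs: keep only the elements whose tag matches.
    return all.stream()
        .filter(e -> e.getKey().equals(tag))
        .map(Map.Entry::getValue)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Map.Entry<String, String>> single =
        List.of(Map.entry("main", "a"), Map.entry("main", "b"));
    List<Map.Entry<String, String>> multi =
        List.of(Map.entry("main", "a"), Map.entry("side", "x"), Map.entry("main", "b"));

    System.out.println(valuesFor("main", single, 1)); // prints [a, b]
    System.out.println(valuesFor("main", multi, 2));  // prints [a, b]
    System.out.println(valuesFor("side", multi, 2));  // prints [x]
  }
}
```

The single-output branch assumes that every element really does carry the one expected tag. Whether that always holds in the runner is the kind of thing the additional testing mentioned above would need to confirm.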
