Chinmay,

I would recommend to collapse ParDo sequences to the maximum extend
possible using THREAD_LOCAL affinity. The Apex runner has a configuration
file option that can be used to tune the chaining when needed (
https://issues.apache.org/jira/browse/BEAM-980). What you build into the
pipeline translation should just be the default behavior.

Note that in cases where operators have multiple inputs, you may not be
able to use THREAD_LOCAL and may have to use CONTAINER_LOCAL instead.

Thanks,
Thomas


On Wed, Feb 22, 2017 at 10:59 PM, Chinmay Kolhatkar <chin...@apache.org>
wrote:

> Dear Community,
>
> I'm working on BEAM-831 to implement ParDo chaining for Apache Apex Runner.
>
> As suggested on Jira, chaining needs to be done using Stream locality of
> Apache Apex engine.
>
> I got some links from Eugene Kirpichov on the Jira. I'm currently focusing
> on producer-consumer fusion optimization. I'm unsure how much good it is to
> do sibling fusion for Apex Runner as of now.
>
> For producer-consumer fusion, I am able to identify which stages are
> ParDos.
> Only thing that I'm not sure about is when to stop the merging of ParDos...
> i.e. if the DAG is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
> Then at time it might be efficient to merge only B & C and not merge all of
> them...
>
> How should this decision be made? Any reference for available for it?
>
> Please suggest.
>
> Thanks,
> Chinmay.
>

Reply via email to