[jira] [Comment Edited] (BEAM-831) ParDo Chaining

Chinmay Kolhatkar (JIRA) Wed, 15 Feb 2017 04:18:07 -0800

    [ 
https://issues.apache.org/jira/browse/BEAM-831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15867722#comment-15867722
 ]


Chinmay Kolhatkar edited comment on BEAM-831 at 2/15/17 12:17 PM:
------------------------------------------------------------------

[~thw], [~jkff], I went through the links provided and understood 
producer-consumer fusion and sibling fusion concepts.

I'm currently focusing on producer-consumer fusion optimization. I'm unsure how 
much good it is to do sibling fusion for Apex Runner.
To do that here is the approach I'm considering (Still working on a POC yet, so 
might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will 
later be used in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the 
streams/operators in topological order and find out adjacent ParDo stages for 
putting them in either thread local OR container local and update hte field in 
streams variable with right Locality.

Only thing that I'm not sure about is when to stop the merging of ParDos... 
i.e. if the DAG is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of 
them... Any thoughts on this?

Also, has any runner already done this?

I'm also considering to update ApexFlattenOperator and put the streams in 
ThreadLocal instead of Default locality. Might be a different Jira for that.

Please share your opinion.


was (Author: chinmay):
[~thw], [~jkff], I wet through the links provided and understood 
producer-consumer fusion and sibling fusion concepts.

I'm currently focusing on producer-consumer fusion optimization.
To do that here is the approach I'm considering (Still working on a POC yet, so 
might change):
1. Majority of the changes would go in TranslateContext of apex runner.
2. The streams variable can hold information about Locality as well which will 
later be used in TranslationContext.populateDAG method.
3. In populateDAG Api, before the streams are connected, We can traverse the 
streams/operators in topological order and find out adjacent ParDo stages for 
putting them in either thread local OR container local and update hte field in 
streams variable with right Locality.

Only thing that I'm not sure about is when to stop the merging of ParDos... 
i.e. if the DAG is like ParDo A -> ParDo B -> ParDo C -> ParDo D.
Then at time it might be efficient to merge only B & C and not merge all of 
them... Any thoughts on this?

Also, has any runner already done this?

I'm also considering to update ApexFlattenOperator and put the streams in 
ThreadLocal instead of Default locality. Might be a different Jira for that.

Please share your opinion.

> ParDo Chaining
> --------------
>
>                 Key: BEAM-831
>                 URL: https://issues.apache.org/jira/browse/BEAM-831
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-apex
>            Reporter: Thomas Weise
>
> Current state of Apex runner creates a plan that will place each operator in 
> a separate container (which would be processes when running on a YARN 
> cluster). Often the ParDo operators can be collocated in same thread or 
> container. Use Apex affinity/stream locality attributes for more efficient 
> execution plan.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Comment Edited] (BEAM-831) ParDo Chaining

Reply via email to