[ https://issues.apache.org/jira/browse/GIRAPH-273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446099#comment-13446099 ]

Eli Reisman commented on GIRAPH-273:
------------------------------------

That makes a lot of sense; maybe that's the right way to go, then. What I was 
thinking (just for reference) is along these lines:

We aggregate all values in aggregators at each worker during the compute() 
cycle, so the total number of aggregator messages per superstep comes to 
(# of workers) * (# of aggregators in that application).
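
Just to make that count concrete, a quick back-of-envelope sketch (the numbers 
here are made up for illustration, and "-w" is whatever the run was launched 
with):

{code:java}
public class AggregatorMessageCount {
  public static void main(String[] args) {
    int numWorkers = 500;      // "-w" from the Configuration (made-up value)
    int numAggregators = 4;    // aggregators registered by the app (made up)
    // One value per (worker, aggregator) pair has to reach the top:
    System.out.println(numWorkers * numAggregators
        + " aggregator messages per superstep");  // prints 2000
  }
}
{code}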

We set a single ZK node at the end of each superstep, which the master creates 
once all workers have put up nodes saying they are done with that superstep. 
When this new node appears, workers start sending their aggregated values from 
that superstep. Each worker already has its own worker ID number, and it can 
get "-w" (the total number of workers in the application run) from the 
Configuration. So it has a sort of heap swim() function that takes those two 
values and gives back its parent node in the tree (sketched below). Since all 
other network activity has ceased for the moment, passing these messages along 
should not be too expensive compared to the volumes we already send during the 
work phases of a superstep. When a worker gets aggregator messages, it passes 
them to its parent. It gets a bit busy at the top of the tree, but even there 
our typical messaging volume should be much higher, so it ought to be OK if we 
got this far without crashing already.
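
Here is a minimal sketch of that swim() idea, assuming worker IDs 0 through 
"-w" laid out as an implicit binary heap with the master at ID 0 (so workers 1 
and 2 both map to the master). The class and method names are made up for 
illustration, not existing Giraph code:

{code:java}
/** Implicit binary-heap layout over IDs 0..numWorkers; the master is ID 0. */
public final class AggregatorTree {

  /** Parent of a worker in the tree; workers 1 and 2 both return 0 (master). */
  public static int parentOf(int workerId) {
    return (workerId - 1) / 2;
  }

  /** Children a node should expect aggregator messages from, if any. */
  public static int[] childrenOf(int workerId, int numWorkers) {
    int left = 2 * workerId + 1;
    int right = 2 * workerId + 2;
    if (left > numWorkers) {
      return new int[] {};            // leaf worker: nothing to wait for
    } else if (right > numWorkers) {
      return new int[] { left };      // last interior node may have one child
    }
    return new int[] { left, right };
  }
}
{code}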

So... at the top, workers 1 and 2 report to the top of the heap, which is the 
master (worker 0), and that is where all the final aggregating takes place, 
since the master has nothing else to do. Alternatively, the top couple of 
nodes in the tree (as determined by their height in the tree) might do some 
sub-aggregating to cut down on message volume. This could be set up whichever 
way tests best (probably some sub-aggregating); a sketch of that merge step 
follows.
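
For the sub-aggregating variant, the merge step at an interior tree node could 
look something like this; the Aggregator interface here is just a stand-in for 
Giraph's aggregator abstraction, not its exact API:

{code:java}
import java.util.List;

/** Stand-in for the aggregator abstraction; not Giraph's actual interface. */
interface Aggregator<A> {
  void aggregate(A value);   // fold one more value into the running result
  A getAggregatedValue();    // the combined result so far
}

final class SubAggregation {
  /**
   * Fold the partial values received from this node's children into the
   * local aggregator, so only one message per aggregator goes up to the
   * parent instead of one per descendant worker.
   */
  static <A> A combineForParent(Aggregator<A> local, List<A> childValues) {
    for (A value : childValues) {
      local.aggregate(value);
    }
    return local.getAggregatedValue();
  }
}
{code}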

Finally, when the master gets (# of workers) * (# of aggregators) values (or, 
with sub-aggregation, 2 messages * # of aggregators), it writes under that 
znode a child that says "time to move on with the superstep" and we go 
forward. If we pass a timeout without hearing from everyone, we retry or fail 
the application, etc.
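
That master-side check might be sketched like this; the znode path, the latch, 
and the timeout handling are all assumptions for illustration (the real 
ZooKeeper handle and superstep paths would come from the existing coordination 
code):

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

final class MasterAggregatorBarrier {
  /**
   * Wait until the expected number of aggregator values has arrived, then
   * write the "time to move on" child under the superstep znode. The latch
   * is created with (# workers) * (# aggregators) counts (or 2 * # of
   * aggregators with sub-aggregation), and each incoming aggregator message
   * is assumed to call expected.countDown().
   */
  static void signalWhenComplete(ZooKeeper zk, String superstepPath,
      CountDownLatch expected, long timeoutMs)
      throws KeeperException, InterruptedException {
    if (expected.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // All values are in: tell the workers to proceed.
      zk.create(superstepPath + "/aggregatorsDone", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } else {
      // Timed out without hearing from everyone: retry or fail the app.
      throw new IllegalStateException("Aggregator gather timed out");
    }
  }
}
{code}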

This means no new connections except a single one to the master from workers 1 
and 2, which is nice. We would love to scale up further into four figures of 
workers, and the number of connections maintained per worker is starting to 
become a bit of a problem. I definitely agree that doing the work at the 
master when possible for aggregators (as with the one-connection-from-each-
worker method) is good, because the master is not busy in our current scheme.

Anyway, great work on the other sections so far; that's a lot of code to 
write! Looking forward to this!

                
> Aggregators shouldn't use Zookeeper
> -----------------------------------
>
>                 Key: GIRAPH-273
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-273
>             Project: Giraph
>          Issue Type: Improvement
>            Reporter: Maja Kabiljo
>            Assignee: Maja Kabiljo
>
> We use Zookeeper znodes to transfer aggregated values from workers to master 
> and back. Zookeeper is supposed to be used for coordination, and it also has 
> a memory limit which prevents users from having aggregators with large value 
> objects. These are the reasons why we should implement aggregators gathering 
> and distribution in a different way.
