[ 
https://issues.apache.org/jira/browse/GIRAPH-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451370#comment-13451370
 ] 

Eli Reisman commented on GIRAPH-322:
------------------------------------

When I run the instrumented copy, everything runs great and messages get where 
they are going until the very last flush of the leftovers at the end of super 
step 0 where we hit NettyWorkerClient#waitAllRequests. In here, I see a 
continual loop on all the flushing workers where they endlessly try to 
re-connect with their destinations. I am not seeing anything to indicate the 
destinations have crashed is the weird thing. I am very suspicious my request 
wiring is not quite right. Will take a deeper look in the next day or two. If 
anyone sees anything obvious, please let me know.

I will also attempt to try out the disk spill and tune it better, but I will be 
operating on a smaller setup now so I will need good results confirmed in the 
end by those better resourced than myself. ;) but thats a ways off still...

                
> Run Length Encoding for Vertex#sendMessageToAllEdges might curb out of 
> control message growth in large scale jobs
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: GIRAPH-322
>                 URL: https://issues.apache.org/jira/browse/GIRAPH-322
>             Project: Giraph
>          Issue Type: Improvement
>          Components: bsp
>    Affects Versions: 0.2.0
>            Reporter: Eli Reisman
>            Assignee: Eli Reisman
>            Priority: Minor
>             Fix For: 0.2.0
>
>         Attachments: GIRAPH-322-1.patch, GIRAPH-322-2.patch
>
>
> Vertex#sendMessageToAllEdges is a case that goes against the grain of the 
> data structures and code paths used to transport messages through a Giraph 
> application and out on the network. While messages to a single vertex can be 
> combined (and should be) in some applications that could make use of this 
> broadcast messaging, the out of control message growth of algorithms like 
> triangle closing means we need to de-duplicate messages bound for many 
> vertices/partitions.
> This will be an evolving solution (this first patch is just the first step) 
> and currently it does not present a robust solution for disk-spill message 
> stores. I figure I can get some advice about that or it can be a follow-up 
> JIRA if this turns out to be a fruitful pursuit. This first patch is also 
> Netty-only and simply defaults to the old sendMessagesToAllEdges() 
> implementation if USE_NETTY is false. All this can be cleaned up when we know 
> this works and/or is worth pursuing.
> The idea is to send as few broadcast messages as possible by run-length 
> encoding their delivery and only duplicating message on the network when they 
> are bound for different partitions. This is also best when combined with 
> "-Dhash.userPartitionCount=# of workers" so you don't do too much of that.
> If this shows promise I will report back and keep working on this. As it is, 
> it represents an end-to-end solution, using Netty, for in-memory messaging. 
> It won't break with spill to disk, but you do lose the de-duplicating effect.
> More to follow, comments/ideas welcome. I expect this to change a lot as I 
> test it and ideas/suggestions crop up.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to