[jira] [Commented] (FLINK-2548) VertexCentricIteration should avoid doing a coGroup with the edges and the solution set

Gabor Gevay (JIRA) Wed, 19 Aug 2015 10:36:14 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703423#comment-14703423
 ]


Gabor Gevay commented on FLINK-2548:
------------------------------------

{quote}
Actually, the second co-group is not a real co-group. It only queries the 
solution set for the vertices that have a message.
{quote}
Could you please explain how it achieves this? Looking at the code, I can't 
see how it avoids being called on every vertex.
{quote}
We used coGroup initially, because it fits the "pregel" model where you have an 
iterator over your neighbors.
{quote}
I do a groupBy after the join, so I get the same iterator over the neighbors.
{quote}
This could be realized by a join as well, although it is hard to realize that 
in a memory-safe fashion.
{quote}
What exactly do you mean by memory-safe here? I see one memory-related 
drawback of the join-then-groupBy approach: each workset vertex-value tuple 
gets replicated as many times as the vertex's out-degree. Is this the problem 
you meant?
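
To make the replication concern concrete, here is a minimal plain-Java sketch (no Flink dependency). The class `JoinReplicationSketch`, the `Joined` type and the `join` method are hypothetical names made up for illustration, and the in-memory maps only stand in for the workset and the edge data set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: models the output of a workset-with-edges join.
class JoinReplicationSketch {

    // One joined record: the vertex value is copied next to each out-edge.
    static final class Joined {
        final long vertexId;
        final double vertexValue;
        final long targetId;

        Joined(long vertexId, double vertexValue, long targetId) {
            this.vertexId = vertexId;
            this.vertexValue = vertexValue;
            this.targetId = targetId;
        }
    }

    // Joins the workset (vertexId -> value) with the out-edges on vertexId.
    // Each workset entry yields one record per out-edge of its vertex.
    static List<Joined> join(Map<Long, Double> workset,
                             Map<Long, List<Long>> outEdges) {
        List<Joined> result = new ArrayList<>();
        for (Map.Entry<Long, Double> w : workset.entrySet()) {
            for (long target : outEdges.getOrDefault(w.getKey(), List.of())) {
                result.add(new Joined(w.getKey(), w.getValue(), target));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, List<Long>> outEdges =
                Map.of(1L, List.of(2L, 3L, 4L), 2L, List.of(3L));
        Map<Long, Double> workset = Map.of(1L, 0.5);
        // Vertex 1 has out-degree 3, so its value is carried by 3 records.
        System.out.println(join(workset, outEdges).size());
    }
}
```

With a coGroup, the vertex value would be handed to the UDF once, next to an iterator over the edges; in the join output it travels with every edge.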
{quote}
Breaking this into three UDFs (Scatter / Gather / Apply), implemented as (Join, 
Reduce, Join) would work and give the efficiency you seek. The 
ConnectedComponents example follows pretty much that pattern.
{quote}
Thanks, I will look into this.
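
As a note to myself, this is how I currently read the (Join, Reduce, Join) suggestion, sketched in plain Java for min-label propagation. This is my assumption of the pattern, not the code of the ConnectedComponents example; `ScatterGatherApplySketch` and `step` are hypothetical names, and the maps stand in for the edges, the solution set and the workset.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: one superstep of min-label propagation.
class ScatterGatherApplySketch {

    // Updates `solution` in place and returns the next workset
    // (the vertices whose label changed in this superstep).
    static Map<Long, Long> step(Map<Long, List<Long>> outEdges,
                                Map<Long, Long> solution,
                                Map<Long, Long> workset) {
        // Scatter: join the workset with the edges on the source vertex.
        // Gather: reduce the candidate labels per target vertex with min.
        Map<Long, Long> minPerTarget = new HashMap<>();
        for (Map.Entry<Long, Long> w : workset.entrySet()) {
            for (Long target : outEdges.getOrDefault(w.getKey(), List.of())) {
                minPerTarget.merge(target, w.getValue(), Long::min);
            }
        }
        // Apply: join the reduced messages with the solution set and keep
        // only real improvements, which form the next workset.
        Map<Long, Long> next = new HashMap<>();
        for (Map.Entry<Long, Long> m : minPerTarget.entrySet()) {
            Long current = solution.get(m.getKey());
            if (current != null && m.getValue() < current) {
                solution.put(m.getKey(), m.getValue());
                next.put(m.getKey(), m.getValue());
            }
        }
        return next;
    }

    public static void main(String[] args) {
        // Path graph 1 - 2 - 3, stored with both edge directions.
        Map<Long, List<Long>> outEdges =
                Map.of(1L, List.of(2L), 2L, List.of(1L, 3L), 3L, List.of(2L));
        Map<Long, Long> solution = new HashMap<>(Map.of(1L, 1L, 2L, 2L, 3L, 3L));
        Map<Long, Long> workset = new HashMap<>(solution);
        while (!workset.isEmpty()) {
            workset = step(outEdges, solution, workset);
        }
        System.out.println(solution); // every vertex ends up with label 1
    }
}
```

Each stage only touches records that originate from the workset, which is the property I am after.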

> VertexCentricIteration should avoid doing a coGroup with the edges and the 
> solution set
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-2548
>                 URL: https://issues.apache.org/jira/browse/FLINK-2548
>             Project: Flink
>          Issue Type: Improvement
>          Components: Gelly
>    Affects Versions: 0.9, 0.10
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>
> Currently, the performance of vertex-centric iteration is suboptimal in those 
> iterations where the workset is small, because the coGroups make the 
> complexity of a single iteration depend on the number of edges and vertices 
> of the graph:
> VertexCentricIteration.buildMessagingFunction does a coGroup between the 
> edges and the workset, to pass the neighbors to the messaging UDF. This is 
> problematic from a performance point of view, because the coGroup UDF gets 
> called on all the edge groups, including those that are not getting any 
> messages.
> An analogous problem is present in 
> VertexCentricIteration.createResultSimpleVertex at the creation of the 
> updates: a coGroup happens between the messages and the solution set, whose 
> complexity includes the number of vertices of the graph.
> Both of these coGroups could be avoided by doing a join instead (with the 
> same keys that the coGroup uses), and then a groupBy. The complexity of these 
> operations would be dominated by the size of the workset, as opposed to the 
> number of edges or vertices of the graph. To achieve this complexity, the 
> joins should have the edges and the solution set on the build side. (The 
> build sides will not be rebuilt at every iteration.)
> I ran some experiments with this, and the initial results seem promising. On 
> some workloads, this achieves a 2x speedup, because later iterations often 
> have quite small worksets, and these benefit the most from the change.
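
The difference described above can be modeled with a small plain-Java sketch (no Flink dependency). It is a simplification under stated assumptions: `CoGroupVsJoinSketch`, `coGroupUdfCalls` and `joinThenGroup` are hypothetical names, and a `HashMap` stands in for the build side of the join.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: compares how much work each strategy does per iteration.
class CoGroupVsJoinSketch {

    // coGroup-style: the UDF runs once per key present on either input, so
    // every edge group is visited even when its vertex is not in the workset.
    static int coGroupUdfCalls(Map<Long, List<Long>> edgesBySource,
                               Map<Long, Double> workset) {
        Set<Long> keys = new HashSet<>(edgesBySource.keySet());
        keys.addAll(workset.keySet());
        return keys.size();
    }

    // join-then-groupBy: the edges act as a prebuilt hash table (build side)
    // and only workset entries probe it, so work scales with the workset.
    static Map<Long, List<Long>> joinThenGroup(Map<Long, List<Long>> edgesBySource,
                                               Map<Long, Double> workset) {
        Map<Long, List<Long>> neighborsByVertex = new HashMap<>();
        for (Long vertexId : workset.keySet()) {
            List<Long> targets = edgesBySource.get(vertexId); // probe
            if (targets != null) {
                neighborsByVertex.put(vertexId, new ArrayList<>(targets));
            }
        }
        return neighborsByVertex;
    }

    public static void main(String[] args) {
        // Four vertices with out-edges, but only vertex 3 is in the workset.
        Map<Long, List<Long>> edges = Map.of(
                1L, List.of(2L, 3L), 2L, List.of(3L),
                3L, List.of(4L), 4L, List.of(1L));
        Map<Long, Double> workset = Map.of(3L, 1.0);
        System.out.println("coGroup UDF calls: " + coGroupUdfCalls(edges, workset));
        System.out.println("join-then-groupBy groups: "
                + joinThenGroup(edges, workset).size());
    }
}
```

In this toy graph the coGroup model touches all four edge groups, while the join model touches only the one group that belongs to the workset vertex.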



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
