[ https://issues.apache.org/jira/browse/FLINK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703423#comment-14703423 ]
Gabor Gevay commented on FLINK-2548:
------------------------------------

{quote}
Actually, the second co-group is not a real co-group. It only queries the solution set for the vertices that have a message.
{quote}
Can you please explain how it achieves this? Looking at the code, I can't see how it avoids being called on every vertex.

{quote}
We used coGroup initially, because it fits the "pregel" model where you have an iterator over your neighbors.
{quote}
I do a groupBy after the join, so I have the same iterator.

{quote}
This could be realized by a join as well, although it is hard to realize that in a memory-safe fashion.
{quote}
What exactly do you mean here by memory-safe? I see one drawback of the join-then-groupBy approach memory-wise: each workset vertex-value tuple gets replicated as many times as the vertex's out-degree. Did you mean this problem?

{quote}
Breaking this into three UDFs (Scatter / Gather / Apply), implemented as (Join, Reduce, Join) would work and give the efficiency you seek. The ConnectedComponents example follows pretty much that pattern.
{quote}
Thanks, I will look into this.

> VertexCentricIteration should avoid doing a coGroup with the edges and the solution set
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-2548
>                 URL: https://issues.apache.org/jira/browse/FLINK-2548
>             Project: Flink
>          Issue Type: Improvement
>          Components: Gelly
>    Affects Versions: 0.9, 0.10
>            Reporter: Gabor Gevay
>            Assignee: Gabor Gevay
>
> Currently, the performance of vertex centric iteration is suboptimal in those
> iterations where the workset is small, because the cost of one iteration
> depends on the total number of edges and vertices of the graph. This is
> caused by two coGroups:
> VertexCentricIteration.buildMessagingFunction does a coGroup between the
> edges and the workset, to pass the neighbors to the messaging UDF. This is
> problematic from a performance point of view, because the coGroup UDF gets
> called on all the edge groups, including those that are not getting any
> messages.
> An analogous problem is present in
> VertexCentricIteration.createResultSimpleVertex at the creation of the
> updates: a coGroup happens between the messages and the solution set, so its
> cost includes the number of vertices of the graph.
> Both of these coGroups could be avoided by doing a join instead (with the
> same keys that the coGroup uses), and then a groupBy. The cost of these
> operations would be dominated by the size of the workset, as opposed to the
> number of edges or vertices of the graph. The joins should have the edges and
> the solution set on the build side to achieve this. (They will not be rebuilt
> at every iteration.)
> I made some experiments with this, and the initial results seem promising. On
> some workloads, this achieves a 2x speedup, because later iterations often
> have quite small worksets, and those iterations benefit the most.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
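
To make the cost argument in the issue description concrete, here is a minimal plain-Python sketch of the two strategies for the messaging step. It does not use the Flink API: the function names, the dictionary-based edge index, and the "forward the vertex value to every neighbor" messaging rule are all assumptions made for illustration. It only models how many groups or probes each strategy touches when the workset is small.

```python
from collections import defaultdict


def build_edge_index(edges):
    """Hash the edges by source vertex once (stands in for the join's build side)."""
    index = defaultdict(list)
    for src, dst in edges:
        index[src].append(dst)
    return index


def messages_via_cogroup(edge_index, workset):
    """CoGroup-style: the UDF is invoked for every key present on either input."""
    messages, groups_visited = [], 0
    for vertex in sorted(set(edge_index) | set(workset)):
        groups_visited += 1  # invoked even when the workset side is empty
        if vertex in workset:
            for neighbor in edge_index.get(vertex, []):
                messages.append((neighbor, workset[vertex]))
    return messages, groups_visited


def messages_via_join(edge_index, workset):
    """Join-then-groupBy style: only workset entries probe the prebuilt index."""
    joined, probes = [], 0
    for vertex in sorted(workset):
        probes += 1
        for neighbor in edge_index.get(vertex, []):
            # The vertex value is copied once per out-edge; this is the
            # replication drawback mentioned in the comment above.
            joined.append((vertex, workset[vertex], neighbor))
    grouped = defaultdict(list)  # the groupBy on the source vertex
    for vertex, value, neighbor in joined:
        grouped[vertex].append((neighbor, value))
    messages = [m for vertex in sorted(grouped) for m in grouped[vertex]]
    return messages, probes


if __name__ == "__main__":
    edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 1)]
    index = build_edge_index(edges)
    small_workset = {3: 7}  # only vertex 3 is active in this iteration
    print(messages_via_cogroup(index, small_workset))  # ([(4, 7)], 4)
    print(messages_via_join(index, small_workset))     # ([(4, 7)], 1)
```

Both functions produce the same messages, but the coGroup model visits one group per vertex with outgoing edges, while the join model does work proportional to the workset and the out-degrees of its vertices. Whether a real Flink job sees the same ratio depends on the join strategy the optimizer picks and on the edge index being cached across iterations, which this sketch simply assumes.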