On 6/10/10 6:22 PM, Ted Dunning wrote:
I think I understand the problem, but I haven't been reading super carefully
as you describe your algorithm.
The basic idea, though, is pretty simple. You are producing higher and
higher powers of the adjacency matrix while labeling each connected
component with its lowest node id.
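A minimal serial sketch of that idea, assuming a dict-of-sets graph
representation (the node ids and function name here are illustrative, not
from any Mahout code): pulling each node's label down to the minimum over
its neighborhood, pass after pass, is the pointwise analogue of taking
successive powers of the adjacency matrix.

```python
def connected_components(adj):
    """adj: {node: set of neighbors}. Returns {node: label}, where each
    connected component ends up labeled by its smallest node id."""
    label = {n: n for n in adj}          # every node starts as its own label
    changed = True
    while changed:                       # each pass ~ one more matrix power
        changed = False
        for n, neighbors in adj.items():
            best = min([label[n]] + [label[m] for m in neighbors])
            if best < label[n]:          # adopt the smaller label
                label[n] = best
                changed = True
    return label

# Two components: {0, 1, 2} and {3, 4}
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}
```

The number of passes is bounded by the longest shortest path inside a
component, which matches the log-like bound on iterations when the matrix
is squared at each step.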
The algorithm as you described it sounds like you are keeping around the
entire matrix with appropriate labeling. An alternative is to reduce the
matrix each time that you discover that a connected component has merged
with another. That would mean that the graph you are processing would
decrease in size on each iteration, terminating when it has a single node for
each connected sub-graph. The problem with that approach is remembering the
labeling of each node in the original graph. One way to do that fairly
efficiently without an explosion in the amount of data being carried around
would be to create a set of side files at each step that contain the mapping
from nodes in one generation to their label in the next generation. If you
have k steps of reduction (k \le log_2 N where N is the size of the graph, I
think), then you would have k side mapping files. After you converge in the
merging process, you can run k joins to reverse the process and propagate
the labels from the last generation back to the first. The first set of k
map-reduces should run progressively faster as the graph collapses, and the
second set of k map-reduces in which you do the joins should run
progressively slower as the number of labels being processed increases.
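A serial sketch of that contraction scheme, with in-memory dicts standing in
for the per-step side files and the reverse loop standing in for the k joins
(all names are illustrative assumptions, not Hadoop or Mahout APIs):

```python
def contract(adj):
    """One reduction step: merge each node into the smallest label in its
    closed neighborhood, and rebuild the graph on the surviving labels.
    Returns (smaller graph, this generation's node -> next-label mapping)."""
    mapping = {n: min({n} | nbrs) for n, nbrs in adj.items()}
    new_adj = {}
    for n, nbrs in adj.items():
        src = mapping[n]
        new_adj.setdefault(src, set())
        for m in nbrs:
            dst = mapping[m]
            if dst != src:               # drop edges internal to a merger
                new_adj[src].add(dst)
                new_adj.setdefault(dst, set()).add(src)
    return new_adj, mapping

def components_by_contraction(adj):
    side_files = []                      # one mapping per generation
    while any(adj.values()):             # stop when no edges remain
        adj, mapping = contract(adj)
        side_files.append(mapping)
    # Reverse pass: the k joins, propagating final labels back to the
    # original nodes one generation at a time.
    label = {n: n for n in adj}
    for mapping in reversed(side_files):
        label = {n: label[mapping[n]] for n in mapping}
    return label
```

Each `contract` call plays the role of one forward map-reduce over a
shrinking graph, and each step of the reverse loop plays the role of one
join over a growing set of labels.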
I've been reading this thread with interest as it may offer some
scalability improvements to the current MeanShiftCanopy clustering
implementation. That code currently runs a sequence of
Canopy-clustering-like iterations to build, bottom-up, a graph of clusters.
Each cluster carries the identities of the points it has subsumed as its
center shifts toward its local density maximum and clusters merge together.
These clusters can become very large objects as the list of subsumed ids
grows, and that ultimately limits scalability. I've been stewing about an
approach similar to the one above: write out the mergers in each iteration
to a side-mapping file. It would run pretty much as the last two sentences
above describe.