I think I understand the problem, although I haven't read your description of the algorithm super carefully.
The basic idea, though, is pretty simple. You are producing higher and higher powers of the adjacency matrix while labeling each connected component with its lowest-numbered node. The algorithm as you described it sounds like you are keeping around the entire matrix with appropriate labeling.

An alternative is to reduce the matrix each time you discover that a connected component has merged with another. That would mean that the graph you are processing would decrease in size on each iteration, terminating when there is a single node for each connected sub-graph. The problem with that approach is remembering the labeling of each node in the original graph. One way to do that fairly efficiently, without an explosion in the amount of data being carried around, would be to create a set of side files at each step that contain the mapping from nodes in one generation to their label in the next generation. If you have k steps of reduction (k <= log_2(N) where N is the size of the graph, I think), then you would have k side mapping files. After the merging process converges, you can run k joins to reverse the process and propagate the labels from the last generation back to the first.

The first set of k map-reduces should run progressively faster as the graph collapses, and the second set of k map-reduces, in which you do the joins, should run progressively slower as the number of labels being processed grows back toward the original size. Doesn't this approach avoid the problems you mention with lots of labels being passed around all the time?

The optimization that I would apply to this algorithm (someday, as you mention) is to detect when the data becomes small enough to process in-memory on a single machine. This would avoid the overhead of invoking a map-reduce program. My guess is that you should be able to process a graph with up to 10^7 or 10^8 nodes in memory faster than you can with a multi-node map-reduce program.
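To make the contract-then-join scheme concrete, here is a hypothetical single-machine Python sketch of what each map-reduce round would do. The function name and data structures are mine, not anything from your code: each round relabels every node to the minimum label in its neighborhood, records the old-to-new mapping (standing in for one of the k side files), and contracts the graph; after convergence, the mappings are replayed in reverse, as the k joins would be.

```python
def contract_and_label(edges):
    """edges: iterable of (u, v) pairs. Returns {node: component_label}.

    In-memory sketch of the iterative graph contraction with per-step
    side mappings; in the map-reduce version, each `mapping` below
    would be written out as one side file.
    """
    # Build an undirected adjacency structure.
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)

    side_files = []  # one old->new mapping per reduction step
    while True:
        # Each node takes the minimum label among itself and its neighbors.
        mapping = {n: min([n] + list(nbrs)) for n, nbrs in graph.items()}
        if all(old == new for old, new in mapping.items()):
            break  # no node changed label: merging has converged
        side_files.append(mapping)

        # Contract: rebuild the (smaller) graph on the new labels,
        # dropping self-loops so fully merged components become bare nodes.
        new_graph = {}
        for n, nbrs in graph.items():
            ln = mapping[n]
            new_graph.setdefault(ln, set())
            for m in nbrs:
                lm = mapping[m]
                if lm != ln:
                    new_graph.setdefault(lm, set())
                    new_graph[ln].add(lm)
                    new_graph[lm].add(ln)
        graph = new_graph

    # Reverse pass: the k joins, propagating final labels back through
    # each generation to the nodes of the original graph.
    labels = {n: n for n in graph}
    for mapping in reversed(side_files):
        labels = {old: labels[new] for old, new in mapping.items()}
    return labels
```

The same function also illustrates the small-data optimization: once a contracted generation fits in memory, a single process can finish the job (or run a standard union-find) instead of scheduling further map-reduce rounds.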
On Thu, Jun 10, 2010 at 6:09 PM, Neal Clark <[email protected]> wrote:

> The problem is too many duplicate edges but the fact that as the algorithm
> grows the number of edges adjacent to the minimum vertexes increases. In a
> worst case scenario the graph is completely connected and only has a single
> component. So far I have not figured out an efficient method of partitioning
> the edges of a single vertex while still passing on the minimum vertex to
> each partition.
