I have uploaded a very basic version of my connectivity MapReduce job here http://pastebin.org/322216
<http://pastebin.org/322216>The algorithm takes an edge list consists of integer vertex names. <v1>\t<v2>\n <v2>\t<v1>\n The current algorithms modifies the graph by adding and removing edges such so that each vertex eventually connects only too its smallest numbered neighbour. In each phase the current vertex finds the smallest numbered vertex among itself and its neighbours. It then creates edges for each of its neighbours to the smallest known vertex. It also creates a final edge between the smallest known vertex back to itself. This ensures that if the smallest known vertex learns of a smaller vertex it can propagate this information to the rest of the graph. The algorithm is designed to extract the maximum amount of information from the graph. The output data can easily be processed to create list of vertexes and which component it belongs to. Arbitrary graphs can easily be converted to the necessary integer pair edge list format. If the graph consists of text labelled vertexes a random 3*log2(n) bit integer can be assigned to each vertex in parallel. Where n is the number of vertexes in the initial graph. using 3*log2(n) bits ensures that it is extremely unlikely that two vertexes will receive the same integer label. I am currently looking for a good way to generate large graphs or get access to large graph datasets. Since the number of vertexes usually far exceeds the number of reduce nodes the overloading a single vertex shouldn't be too much of a problem. More testing is needed to confirm this. If overloading does prove to be a problem we can use a two phase approach to determine the min vertexes. Open questions: 1) What input formats should be supported? 2) Do you have any suggestions on what intermediary format could be used between phases? 3) How best to approach integrating these algorithms into Mahout? 4) Does anyone know where I can find some large test graphs? 5) Do you think that this type of algorithm is a good fit for Mahout? Thanks, Neal. On Mon, May 31, 2010 at 2:54 PM, Ted Dunning <[email protected]> wrote: > To help you get started, we have a collection of sparse matrix structures, > some of which are amenable to row-wise distribution to mappers in > map-reduce > programs. If your connectivity program is basically just the transitive > closure of the graph, then that would probably suffice (although I would > worry about the output getting large). The MST algorithm will probably > stress things a bit more. > > On Mon, May 31, 2010 at 2:47 PM, Neal Clark <[email protected]> wrote: > > > I will have to take a closer look at the Mahout data structures before > > I can be certain how hard it would be. > > >
