I have uploaded a very basic version of my connectivity MapReduce job here
http://pastebin.org/322216

 <http://pastebin.org/322216>The algorithm takes an edge list consists of
integer vertex names.

<v1>\t<v2>\n
<v2>\t<v1>\n

The current algorithms modifies the graph by adding and removing edges such
so that each vertex eventually connects only too its smallest numbered
neighbour.

In each phase the current vertex finds the smallest numbered vertex among
itself and its neighbours. It then creates edges for each of its neighbours
to the smallest known vertex. It also creates a final edge between the
smallest known vertex back to itself. This ensures that if the smallest
known vertex learns of a smaller vertex it can propagate this information to
the rest of the graph.

The algorithm is designed to extract the maximum amount of information from
the graph. The output data can easily be processed to create list of
vertexes and which component it belongs to.

Arbitrary graphs can easily be converted to the necessary integer pair edge
list format. If the graph consists of text labelled vertexes a random
3*log2(n) bit integer can be assigned to each vertex in parallel. Where n is
the number of vertexes in the initial graph. using 3*log2(n) bits ensures
that it is extremely unlikely that two vertexes will receive the same
integer label.

I am currently looking for a good way to generate large graphs or get access
to large graph datasets. Since the number of vertexes usually far exceeds
the number of reduce nodes the overloading a single vertex shouldn't be too
much of a problem. More testing is needed to confirm this. If overloading
does prove to be a problem we can use a two phase approach to determine the
min vertexes.

Open questions:
1) What input formats should be supported?
2) Do you have any suggestions on what intermediary format could be used
between phases?
3) How best to approach integrating these algorithms into Mahout?
4) Does anyone know where I can find some large test graphs?
5) Do you think that this type of algorithm is a good fit for Mahout?

Thanks,

Neal.




On Mon, May 31, 2010 at 2:54 PM, Ted Dunning <[email protected]> wrote:

> To help you get started, we have a collection of sparse matrix structures,
> some of which are amenable to row-wise distribution to mappers in
> map-reduce
> programs.   If your connectivity program is basically just the transitive
> closure of the graph, then that would probably suffice (although I would
> worry about the output getting large).  The MST algorithm will probably
> stress things a bit more.
>
> On Mon, May 31, 2010 at 2:47 PM, Neal Clark <[email protected]> wrote:
>
> > I will have to take a closer look at the Mahout data structures before
> > I can be certain how hard it would be.
> >
>

Reply via email to