On Wed, Jun 9, 2010 at 5:12 PM, Jake Mannix <[email protected]> wrote:

> On Wed, Jun 9, 2010 at 5:04 PM, Ted Dunning <[email protected]> wrote:
> > > Open questions:
> > > 1) What input formats should be supported?
> >
> Your text input format is good, and fairly standard, actually.  Another
> would be something like SequenceFile<IntWritable,IntWritable>, which is
> basically what your current output format looks like!


With large datasets I think using a binary intermediate format would give
much better results. I will experiment with SequenceFile and see how it
performs.
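For reference, the payload of a SequenceFile<IntWritable,IntWritable> is essentially a stream of 4-byte int pairs. A minimal plain-Java sketch of that binary round trip (no Hadoop dependency; the class and method names here are my own, not Mahout's):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: edges serialized as consecutive (v1, v2) int pairs -- the same
// payload a SequenceFile<IntWritable,IntWritable> would carry, minus the
// container's header, sync markers, and record framing.
public class EdgeIO {

  // Write each edge as two 4-byte big-endian ints.
  static byte[] writeEdges(int[][] edges) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (int[] e : edges) {
      out.writeInt(e[0]);
      out.writeInt(e[1]);
    }
    out.close();
    return bytes.toByteArray();
  }

  // Read the pairs back; each record occupies exactly 8 bytes.
  static int[][] readEdges(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    int n = data.length / 8;
    int[][] edges = new int[n][2];
    for (int i = 0; i < n; i++) {
      edges[i][0] = in.readInt();
      edges[i][1] = in.readInt();
    }
    return edges;
  }
}
```

At 8 bytes per edge this is half the footprint of a text line like "12345\t67890\n" for typical vertex ids, which is where the win over the text format comes from.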

>
> > > 2) Do you have any suggestions on what intermediary format could be
> > > used between phases?
> >
> > These should be sequence files of some kind.  Using the Mahout vector
> > format would probably work well at the cost of a bit of overhead due to
> > using doubles to store integers.
>
> Yeah, we really should extend that at some point to allow ints and
> booleans too.  But then again, a double is only double the size of an
> int.  It's not like it's a *huge* factor.
>
>
>
> > > 3) How best to approach integrating these algorithms into Mahout?
> > >
> >
> > You are breaking new ground here with graph algorithms in Mahout.
> >
>
> I agree - do what you feel comfortable with; we don't have anything
> on this currently.
>
>
> > >  4) Does anyone know where I can find some large test graphs?
> > >
> >
> > Consider the wikipedia link graph.  Also interesting might be the
> > cooccurrence graph of words in a large corpus.  The twitter social graph
> > might be interesting as well.
> >
>

> The twitter social graph is pretty humongous - you can get the torrent
> here: http://an.kaist.ac.kr/traces/WWW2010.html
> And I've got it hiding on Amazon S3 too, ask me offline if you want
> access to that one.
>

Thanks for the link to the Twitter dataset. I have been running the
connectivity algorithm on a 40-node blade cluster with a four-core 2.66 GHz
Xeon processor in each blade.  It has become pretty clear that I need to
slightly rethink my approach. This doesn't surprise me in the least.

Instead of finding the min vertex in memory, I am going to try the
SecondarySort implementation from the Hadoop examples to sort the edges by
first and second vertex, grouping them by v1. This would allow a Mapper to
read the first pair of each partition and use it as the minimum vertex,
eliminating the need to buffer any edges. Duplicates can be removed by
comparing each edge against the previous one.

Using this approach it should be possible to do the bulk of the algorithm
in the Mapper and then use the SecondarySort Reducer to sort the output.
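The sort-and-scan idea is easy to sketch outside Hadoop: order the edges by (v1, v2), and within each v1 group the first record carries the minimum neighbor while duplicate edges land adjacently. A minimal plain-Java sketch under those assumptions (class and method names are my own, not from the Hadoop examples):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch of the secondary-sort scan: edges ordered by (v1, v2) and grouped
// by v1.  The first edge of each group holds the minimum second vertex, so
// nothing needs buffering; duplicates are dropped by comparing each edge
// to the previous one.
public class MinVertexScan {

  // Returns {v1, minNeighbor} for each distinct v1, in one pass.
  static List<int[]> groupMinima(int[][] edges) {
    int[][] sorted = edges.clone();
    // In the MapReduce version this ordering comes from the shuffle with a
    // composite (v1, v2) sort key and a grouping comparator on v1 alone.
    Arrays.sort(sorted, Comparator.<int[]>comparingInt(e -> e[0])
                                  .thenComparingInt(e -> e[1]));
    List<int[]> minima = new ArrayList<>();
    int[] prev = null;
    for (int[] e : sorted) {
      if (prev != null && prev[0] == e[0] && prev[1] == e[1]) {
        continue;  // duplicate of the previous edge -- skip it
      }
      if (prev == null || prev[0] != e[0]) {
        minima.add(new int[] {e[0], e[1]});  // first edge in group = minimum
      }
      prev = e;
    }
    return minima;
  }
}
```

The single `prev` variable is the whole point: because the sort guarantees group order, the scan needs O(1) state per group instead of buffering every edge for a vertex in memory.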
