On Wed, Jun 9, 2010 at 5:04 PM, Ted Dunning <[email protected]> wrote:
>
>  > Open questions:
> > 1) What input formats should be supported?
>

Your text input format is good, and fairly standard, actually.  Another
would be
something like SequenceFile<IntWritable,IntWritable>, which is basically
what your current output format looks like!



> > 2) Do you have any suggestions on what intermediary format could be used
> > between phases?
> >
>
> These should be sequence files of some kind.  Using the Mahout vector
> format
> would probably work well at the cost of a bit of overhead due to using
> doubles to store integers.
>

Yeah, we really should extend at some point to allowing ints and booleans
too.
But then again, double is only double the size of an int.
It's not like it's a *huge* factor.



> > 3) How best to approach integrating these algorithms into Mahout?
> >
>
> you are breaking new ground here with graph algorithms in mahout.
>

I agree - do what you feel comfortable with, we don't have anything
currently on this.


> >  4) Does anyone know where I can find some large test graphs?
> >
>
> Consider the wikipedia link graph.  Also interesting might be the
> cooccurrence graph of words in a large corpus.  The twitter social graph
> might be interesting as well.
>

The twitter social graph is pretty humongous - you can get the torrent
here: http://an.kaist.ac.kr/traces/WWW2010.html
And I've got it hiding on Amazon S3 too, ask me offline if you want
access to that one.

  -jake

Reply via email to