On Wed, Jun 9, 2010 at 5:04 PM, Ted Dunning <[email protected]> wrote:

>
> > Open questions:
> > 1) What input formats should be supported?
>

Your text input format is good, and fairly standard, actually. Another
would be something like SequenceFile<IntWritable,IntWritable>, which is
basically what your current output format looks like!

> > 2) Do you have any suggestions on what intermediary format could be used
> > between phases?
> >
>
> These should be sequence files of some kind. Using the Mahout vector format
> would probably work well at the cost of a bit of overhead due to using
> doubles to store integers.
>

Yeah, we really should extend it at some point to allow ints and
booleans too. But then again, a double is only double the size of an int.
It's not like it's a *huge* factor.

> > 3) How best to approach integrating these algorithms into Mahout?
> >
>
> you are breaking new ground here with graph algorithms in mahout.
>

I agree - do what you feel comfortable with, we don't have anything
currently on this.

> > 4) Does anyone know where I can find some large test graphs?
> >
>
> Consider the wikipedia link graph. Also interesting might be the
> cooccurrence graph of words in a large corpus. The twitter social graph
> might be interesting as well.
>

The Twitter social graph is pretty humongous - you can get the torrent here:

http://an.kaist.ac.kr/traces/WWW2010.html

And I've got it hiding on Amazon S3 too, ask me offline if you want access
to that one.

  -jake
