Hi Sören,

RowSimilarityJob expects IntWritable keys and VectorWritable values as input. The job itself should be a reasonable choice for computing the pairwise similarities between text documents.

I suggest you throw away the 1% most frequent terms as described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I think SparseVectorsFromSequenceFiles already does that by default.
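
To make the two points above concrete, here is a minimal sketch in plain Python. It is not Mahout code and only illustrates the idea: string document keys are replaced by sequential integer ids (keeping an index to map back), and the most frequent terms by document frequency are dropped before comparing rows. The function names, the dict-based vector representation and the `fraction` parameter are all assumptions made for this illustration. I believe Mahout's `rowid` job (RowIdJob) does this kind of re-keying on the cluster, but please check that against your version.

```python
from collections import Counter


def rekey_with_int_ids(named_vectors):
    """Replace string keys with sequential int ids.

    named_vectors: dict mapping a document name to a sparse vector,
    where a sparse vector is a dict of term -> weight (hypothetical layout).
    Returns (rows, doc_index): rows maps int id -> vector,
    doc_index maps int id -> original document name.
    """
    rows = {}
    doc_index = {}
    # Sort by name so the id assignment is deterministic.
    for row_id, (name, vector) in enumerate(sorted(named_vectors.items())):
        rows[row_id] = vector
        doc_index[row_id] = name
    return rows, doc_index


def drop_most_frequent_terms(rows, fraction=0.01):
    """Remove the given fraction of terms with the highest document frequency.

    The number of terms to drop is rounded down, so a very small
    vocabulary may lose no terms at all.
    """
    # Document frequency: in how many rows does each term occur?
    df = Counter(term for vector in rows.values() for term in vector)
    num_to_drop = int(len(df) * fraction)
    # Highest document frequency first; ties are broken alphabetically.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    too_frequent = set(ranked[:num_to_drop])
    return {
        row_id: {t: w for t, w in vector.items() if t not in too_frequent}
        for row_id, vector in rows.items()
    }
```

You would call `rekey_with_int_ids` first and pass the resulting `rows` to `drop_most_frequent_terms`; the returned `doc_index` lets you translate row ids in the similarity output back to document names.
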

It would be great if you let the mailing list know how it worked out for you.

Greetings to Galway!
Sebastian

On 08.11.2011 17:33, Sören Brunk wrote:
> Hi,
>
> I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise
> similarities between feature vectors but I'm struggling a bit with the
> correct input format.
>
> I used SparseVectorsFromSequenceFiles to create a bunch of vectors from
> documents. But using the tfidf vectors directly as input doesn't work as
> it produces vectors with Strings as keys, while RowSimilarityJob seems
> to expect IntWritable.
> I've also seen something about DistributedRowMatrix as input in some
> older docs.
>
> Any hints? Is RowSimilarityJob a good choice for that task at all?
>
> Thanks for your help,
> Sören