Hi Sören,

RowSimilarityJob expects (IntWritable, VectorWritable) pairs as input. It should
be a reasonable choice for comparing the pairwise similarities between
text documents. I suggest you throw away the 1% most frequent terms as
described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
think SparseVectorsFromSequenceFiles already does that by default.
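
To make the two preprocessing steps concrete, here is a rough sketch in
plain Python. It is not Mahout code. The function names and the toy data
layout (a dict of document name to token list) are made up for
illustration, and I am assuming "most frequent" means highest document
frequency, which is how I read the paper. The first function drops the
top 1% of terms, the second gives every document an integer row id in
place of its string key.

```python
import math
from collections import Counter


def drop_most_frequent_terms(docs, fraction=0.01):
    """Remove the given fraction of terms with the highest document frequency.

    docs maps a document name to its list of tokens (hypothetical layout).
    Returns the pruned documents and the set of dropped terms.
    """
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    # Number of terms to drop, rounded up, so a non-empty vocabulary
    # always loses at least one term.
    n_drop = math.ceil(len(df) * fraction)

    # Highest document frequency first; ties are broken alphabetically
    # so the result is deterministic.
    ranked = sorted(df, key=lambda term: (-df[term], term))
    dropped = set(ranked[:n_drop])

    pruned = {
        name: [t for t in tokens if t not in dropped]
        for name, tokens in docs.items()
    }
    return pruned, dropped


def assign_int_ids(docs):
    """Map each string document key to a consecutive integer row id."""
    return {name: row_id for row_id, name in enumerate(sorted(docs))}
```

Keep the name-to-id mapping around, since you will need it to translate
the integer row ids in the similarity output back to your documents.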

It would be great if you let the mailing list know how it worked out for you.

Greetings to Galway!
Sebastian

On 08.11.2011 17:33, Sören Brunk wrote:
> Hi,
> 
> I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise
> similarities between feature vectors but I'm struggling a bit with the
> correct input format.
> 
> I used SparseVectorsFromSequenceFiles to create a bunch of vectors from
> documents. But using the tfidf vectors directly as input doesn't work as
> it produces vectors with Strings as keys, while RowSimilarityJob seems
> to expect IntWritable.
> I've also seen something about DistributedRowMatrix as input in some
> older docs.
> 
> Any hints? Is RowSimilarityJob a good choice for that task at all?
> 
> Thanks for your help,
> Sören
