OK, after simply converting the vector keys from Text to IntWritable, it worked fine for me. It took a while, but I only ran it on my local machine with default vectorization settings and almost no preprocessing, so there's much room for improvement.
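
For anyone reading along, here is a minimal sketch of the key conversion idea in plain Java, without Hadoop or Mahout dependencies so it runs standalone. It is not the code I actually ran: in the real job the keys are Text and IntWritable records in a SequenceFile, and the class name KeyRemapper and the sample document paths are hypothetical. Each string key gets the next free int id, and a reverse list is kept so results can be mapped back to document names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: assigns a stable int id to each string key, in first-seen order. */
final class KeyRemapper {
    private final Map<String, Integer> idByKey = new LinkedHashMap<>();
    private final List<String> keyById = new ArrayList<>();

    /** Returns the existing id for the key, or assigns the next free one. */
    int idFor(String key) {
        Integer id = idByKey.get(key);
        if (id == null) {
            id = keyById.size();
            idByKey.put(key, id);
            keyById.add(key);
        }
        return id;
    }

    /** Reverse lookup, to translate row ids back to the original string keys. */
    String keyFor(int id) {
        return keyById.get(id);
    }

    /** Number of distinct keys seen so far. */
    int size() {
        return keyById.size();
    }

    public static void main(String[] args) {
        KeyRemapper remapper = new KeyRemapper();
        // Sample keys are made up; a repeated key must map to the same id.
        String[] docKeys = {"/docs/a.txt", "/docs/b.txt", "/docs/a.txt"};
        for (String docKey : docKeys) {
            System.out.println(docKey + " -> " + remapper.idFor(docKey));
        }
    }
}
```

This single-process map is only an illustration; how ids are best assigned across several mappers in a real Hadoop job is not something this thread covers.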

Thanks for your help!
Sören

On 08/11/11 16:45, Sebastian Schelter wrote:
Hi Sören,

RowSimilarityJob expects (IntWritable, VectorWritable) pairs as input. It should
be a reasonable choice for comparing the pairwise similarities between
text documents. I suggest you throw away the 1% most frequent terms as
described in http://terpconnect.umd.edu/~oard/pdf/acl08elsayed2.pdf. I
think SparseVectorsFromSequenceFiles is already doing that by default.
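
To illustrate the pruning step, here is a rough sketch in plain Java of dropping the most frequent terms by document frequency. It is not how SparseVectorsFromSequenceFiles implements it; the class name FrequentTermPruner, the method names, and the choice to always drop at least one term are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: removes the most frequent terms from a set of documents. */
final class FrequentTermPruner {

    /** Counts in how many documents each term occurs (document frequency). */
    static Map<String, Integer> documentFrequencies(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /**
     * Returns the given fraction of the vocabulary with the highest document
     * frequency. Assumption of this sketch: at least one term is returned
     * whenever the vocabulary is non-empty.
     */
    static Set<String> mostFrequentTerms(Map<String, Integer> df, double fraction) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(df.entrySet());
        // Highest document frequency first; ties broken by term for determinism.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        int cutoff = entries.isEmpty()
                ? 0
                : Math.max(1, (int) Math.ceil(entries.size() * fraction));
        Set<String> frequent = new HashSet<>();
        for (int i = 0; i < cutoff; i++) {
            frequent.add(entries.get(i).getKey());
        }
        return frequent;
    }

    /** Returns copies of the documents with the given terms removed. */
    static List<Set<String>> prune(List<Set<String>> docs, Set<String> termsToDrop) {
        List<Set<String>> pruned = new ArrayList<>();
        for (Set<String> doc : docs) {
            Set<String> kept = new HashSet<>(doc);
            kept.removeAll(termsToDrop);
            pruned.add(kept);
        }
        return pruned;
    }
}
```

Calling mostFrequentTerms(df, 0.01) followed by prune(docs, ...) corresponds to the 1% cut. Very frequent terms occur in many documents, so removing them greatly reduces the number of document pairs that share a term and have to be compared.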

It would be great if you let the mailing list know how it worked out for you.

Greetings to Galway!
Sebastian

On 08.11.2011 17:33, Sören Brunk wrote:
Hi,

I'm trying to use RowSimilarityJob (current trunk) to calculate pairwise
similarities between feature vectors but I'm struggling a bit with the
correct input format.
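
To make the goal concrete, here is a small sketch in plain Java of the kind of pairwise comparison meant here. It is not RowSimilarityJob itself: cosine similarity is chosen purely as an example measure, the class name PairwiseCosine is hypothetical, and sparse vectors are modelled as maps from term index to weight.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: cosine similarity between sparse vectors (term index -> weight). */
final class PairwiseCosine {

    /** Euclidean length of a sparse vector. */
    static double norm(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double weight : v.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Cosine of the angle between a and b; 0.0 if either vector is all zeros. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        // Iterate over the smaller vector; only shared indices contribute to the dot product.
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> entry : small.entrySet()) {
            Double other = large.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Symmetric matrix of similarities; the list position serves as the int row id. */
    static double[][] allPairs(List<Map<Integer, Double>> rows) {
        int n = rows.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                double s = cosine(rows.get(i), rows.get(j));
                sim[i][j] = s;
                sim[j][i] = s;
            }
        }
        return sim;
    }
}
```

The all-pairs loop is quadratic in the number of rows, which is the reason for wanting a distributed job once the document collection grows.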

I used SparseVectorsFromSequenceFiles to create a bunch of vectors from
documents. But using the tfidf vectors directly as input doesn't work as
it produces vectors with Strings as keys, while RowSimilarityJob seems
to expect IntWritable.
I've also seen something about DistributedRowMatrix as input in some
older docs.

Any hints? Is RowSimilarityJob a good choice for that task at all?

Thanks for your help,
Sören
