You have to supply that number; however, if you don't use it in the
similarity computation (only SIMILARITY_LOGLIKELIHOOD uses it), you
can safely ignore it and pass in any number.
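
To illustrate why a log-likelihood measure would need the column count
while other measures would not, here is a minimal sketch in plain Java.
It is not taken from the Mahout source: the class and method names are
hypothetical, and the way the 2x2 contingency table is built from
numberOfColumns is an assumption about how such a measure typically
works.

```java
public class LogLikelihoodSketch {

    // x * ln(x), using the convention that 0 * ln(0) = 0.
    public static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a list of counts.
    public static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    // Log-likelihood ratio of a 2x2 contingency table.
    public static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumption: two rows are compared by counting the columns where both,
    // only one, or neither row is non-zero. The "neither" cell is the only
    // place where the total number of columns enters the computation.
    public static double rowSimilarity(long cooccurrences, long nonZerosA,
                                       long nonZerosB, long numberOfColumns) {
        long k11 = cooccurrences;
        long k12 = nonZerosA - cooccurrences;
        long k21 = nonZerosB - cooccurrences;
        long k22 = numberOfColumns - nonZerosA - nonZerosB + cooccurrences;
        return logLikelihoodRatio(k11, k12, k21, k22);
    }

    public static void main(String[] args) {
        // Same overlap between two rows, different matrix widths.
        System.out.println(rowSimilarity(2, 3, 3, 10));
        System.out.println(rowSimilarity(2, 3, 3, 1000));
    }
}
```

Under these assumptions the same overlap scores higher in a wider
matrix, which is why a wrong column count would distort this measure
but would not matter for measures that only look at the two rows.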
--sebastian
On 28.10.2010 12:02, Divya wrote:
Hi Sebastian,
Where can I get the value for numberOfColumns?
How can I work out how many columns my matrix has, given that
SparseVectorsFromSequenceFiles generates vectors in binary format?
Regards,
Divya
-----Original Message-----
From: Sebastian Schelter [mailto:[email protected]]
Sent: Thursday, October 28, 2010 4:28 PM
To: [email protected]
Subject: Re: generate similar documents
Hi Divya,
--similarityClassname should point to an implementation of
org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity.
You can use any value from
org.apache.mahout.math.hadoop.similarity.SimilarityType to select a
predefined similarity measure, or you can point to an implementation of
your own.
--numberOfColumns is the number of columns of the input matrix, which
would be the number of unique terms, as I suppose your matrix is
documents x terms.
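
As a rough sketch of the two points above, the following plain Java
derives a column count from sparse rows and assembles the argument
list described in the RowSimilarityJob Javadoc. Everything here is
illustrative: the class and method names are hypothetical, the rows
are plain maps standing in for Mahout vectors, and the paths are
placeholders. It also assumes term indices are dense and start at 0,
so that the highest index plus one equals the number of unique terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RowSimilarityArgsSketch {

    // Column count = highest term index seen in any row, plus one.
    // Assumes indices start at 0 and every term has its own index.
    public static int numberOfColumns(List<Map<Integer, Double>> rows) {
        int maxIndex = -1;
        for (Map<Integer, Double> row : rows) {
            for (int index : row.keySet()) {
                maxIndex = Math.max(maxIndex, index);
            }
        }
        return maxIndex + 1;
    }

    // Builds the argument list using only the option names quoted in the
    // Javadoc excerpt from this thread.
    public static String[] buildArgs(String inputDir, String outputDir,
                                     int numberOfColumns, String similarity,
                                     int maxSimilaritiesPerRow) {
        List<String> args = new ArrayList<>();
        args.add("-Dmapred.input.dir=" + inputDir);
        args.add("-Dmapred.output.dir=" + outputDir);
        args.add("--numberOfColumns");
        args.add(Integer.toString(numberOfColumns));
        args.add("--similarityClassname");
        args.add(similarity);
        args.add("--maxSimilaritiesPerRow");
        args.add(Integer.toString(maxSimilaritiesPerRow));
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // Two toy documents as term-index -> weight maps (made-up data).
        List<Map<Integer, Double>> rows = List.of(
                Map.of(0, 1.0, 4, 2.0),
                Map.of(2, 1.0));
        int columns = numberOfColumns(rows);
        // "/path/to/vectors" and "/path/to/output" are placeholder paths.
        String[] jobArgs = buildArgs("/path/to/vectors", "/path/to/output",
                columns, "SIMILARITY_LOGLIKELIHOOD", 100);
        System.out.println(String.join(" ", jobArgs));
    }
}
```

How the resulting arguments are handed to the job (command line or
programmatically) is not shown here and depends on your setup.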
--sebastian
On 28.10.2010 10:11, Divya wrote:
Hi,
I have a directory of documents from which I generated a sequence file
using SequenceFilesFromDirectory, and then converted it into vectors
with SparseVectorsFromSequenceFiles.
I am now following the link below to generate a list of the most similar
documents:
http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED
[email protected]%3e
How can I use RowSimilarityJob to generate a list of similar documents?
 * <ol>
 * <li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
 * SequenceFile<IntWritable,VectorWritable></li>
 * <li>-Dmapred.output.dir=(path): output path where the computations output should go
 * (a {@link DistributedRowMatrix} stored as a SequenceFile<IntWritable,VectorWritable>)</li>
 * <li>--numberOfColumns: the number of columns in the input matrix</li>
 * <li>--similarityClassname (classname): an implementation of
 * {@link DistributedVectorSimilarity} used to compute the similarity</li>
 * <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this
 * number (100)</li>
 * </ol>
Which values should I pass for numberOfColumns and similarityClassname?
Regards,
Divya