Re: generate similar documents

Sebastian Schelter Thu, 28 Oct 2010 01:28:23 -0700

Hi Divya,

--similarityClassname should point to an implementation of org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity. You can use any value from org.apache.mahout.math.hadoop.similarity.SimilarityType to select a predefined similarity measure, or you can point to an implementation of your own.
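
To illustrate what a similarity measure computes for a pair of rows, here is a minimal plain-Java sketch of cosine similarity over sparse term vectors. It is only an illustration and does not implement Mahout's DistributedVectorSimilarity interface; the class and method names (CosineSketch, cosine, norm) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {

    // Hypothetical helper, not part of Mahout: cosine similarity of two
    // sparse vectors stored as (term index -> weight) maps.
    public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        // Only terms present in both vectors contribute to the dot product.
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // An empty vector has no direction; treat its similarity as 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    public static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) {
            sum += x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(0, 1.0);
        doc1.put(1, 2.0);
        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(1, 2.0);
        doc2.put(2, 1.0);
        // The two toy documents share only term 1.
        System.out.println(cosine(doc1, doc2));
    }
}
```

A custom measure plugged into the job would have to follow the actual interface in your Mahout version; the sketch only shows the per-pair arithmetic.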

--numberOfColumns is the number of columns of the input matrix, which would be the number of unique terms, as I suppose your matrix is documents x terms.
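
As a rough sketch of how the two values fit together, the plain-Java snippet below counts the unique terms in a toy corpus and assembles an argument list from the options quoted further down. Everything apart from the option names is an assumption for illustration: the class and method names, the paths, the whitespace tokenization, and the constant name SIMILARITY_UNCENTERED_COSINE, which should be checked against SimilarityType in your Mahout version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RowSimilarityArgsSketch {

    // Counts distinct lower-cased, whitespace-separated terms across all
    // documents. For a documents x terms matrix this is the column count.
    // (Toy tokenization; a real vectorizer may tokenize differently.)
    public static int countUniqueTerms(List<String> documents) {
        Set<String> terms = new HashSet<>();
        for (String doc : documents) {
            for (String token : doc.trim().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                }
            }
        }
        return terms.size();
    }

    // Assembles the options listed in the RowSimilarityJob Javadoc.
    public static List<String> buildArgs(String inputDir, String outputDir,
                                         int numberOfColumns,
                                         String similarityClassname,
                                         int maxSimilaritiesPerRow) {
        List<String> args = new ArrayList<>();
        args.add("-Dmapred.input.dir=" + inputDir);
        args.add("-Dmapred.output.dir=" + outputDir);
        args.add("--numberOfColumns");
        args.add(Integer.toString(numberOfColumns));
        args.add("--similarityClassname");
        args.add(similarityClassname);
        args.add("--maxSimilaritiesPerRow");
        args.add(Integer.toString(maxSimilaritiesPerRow));
        return args;
    }

    public static void main(String[] unused) {
        List<String> docs = Arrays.asList("apple banana", "banana cherry");
        int columns = countUniqueTerms(docs);
        // Paths and the similarity constant are placeholders (assumptions).
        List<String> args = buildArgs("/path/to/vectors", "/path/to/output",
                columns, "SIMILARITY_UNCENTERED_COSINE", 100);
        System.out.println(String.join(" ", args));
    }
}
```

The resulting strings are what would be handed to RowSimilarityJob; how the job is launched depends on your Hadoop and Mahout setup.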


--sebastian

On 28.10.2010 10:11, Divya wrote:

Hi,

I have a directory of documents from which I have generated a sequence file
using SequenceFilesFromDirectory and then converted it into vectors using
SparseVectorsFromSequenceFiles.

Now I am referring to the link below to generate a list of the most similar documents:



http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED.6070...@googlemail.com%3e



How can I use RowSimilarityJob to generate a list of similar documents?



 * <ol>
 * <li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
 * SequenceFile<IntWritable,VectorWritable></li>
 * <li>-Dmapred.output.dir=(path): output path where the computations output should go (a {@link DistributedRowMatrix}
 * stored as a SequenceFile<IntWritable,VectorWritable>)</li>
 * <li>--numberOfColumns: the number of columns in the input matrix</li>
 * <li>--similarityClassname (classname): an implementation of {@link DistributedVectorSimilarity} used to compute the
 * similarity</li>
 * <li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (100)</li>
 * </ol>
 *



Which values should I pass for numberOfColumns and similarityClassname?





Regards,

Divya
