Hi Sebastian,

Where can I get numberOfColumns from? How can I calculate how many columns my matrix has, given that SparseVectorsFromSequenceFiles generates its vectors in binary format?

Regards,
Divya

-----Original Message-----
From: Sebastian Schelter [mailto:[email protected]]
Sent: Thursday, October 28, 2010 4:28 PM
To: [email protected]
Subject: Re: generate similar documents

Hi Divya,

--similarityClassname should point to an implementation of
org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity,
you can use any value from
org.apache.mahout.math.hadoop.similarity.SimilarityType to use a
predefined similarity measure or you can point to an implementation of
your own

--numberOfColumns is the number of columns of the input matrix, which
would be the number of unique terms as I suppose your matrix is
documents x terms

--sebastian

On 28.10.2010 10:11, Divya wrote:
> Hi,
>
> I have directory of documents from which I have generated Sequence file
> using SequenceFilesFromDirectory and then converted it into vectors
> SparseVectorsFromSequenceFiles
>
> Now referring below link to generate a list of most similar documents
>
> http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED
> [email protected]%3e
>
> How can I use RowSimilarityJob to generate list of similar documents .
>
> <ol>
> *<li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
> * SequenceFile<IntWritable,VectorWritable></li>
> *<li>-Dmapred.output.dir=(path): output path where the computations output should go (a {@link DistributedRowMatrix}
> * stored as a SequenceFile<IntWritable,VectorWritable>)</li>
> *<li>--numberOfColumns: the number of columns in the input matrix</li>
> *<li>--similarityClassname (classname): an implementation of {@link DistributedVectorSimilarity} used to compute the
> * similarity</li>
> *<li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per row to this number (100)</li>
> *</ol>
> *
>
> Which argument should I pass numberOfColumns and similarityClassname ?
>
> Regards,
> Divya
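
To illustrate what "number of unique terms" means for a documents x terms matrix, here is a minimal plain-Java sketch. It does not use any Mahout API: the class name ColumnCount, the method countUniqueTerms and the lower-case, whitespace-split tokenization are all assumptions made for this example. The analyzer and any term pruning used in the real vectorization step will likely give a different count, so treat this as a conceptual illustration only, not as a way to derive the value to pass to RowSimilarityJob.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class ColumnCount {

    // Counts the distinct terms across all documents. In a documents x terms
    // matrix, every distinct term corresponds to one column.
    // Assumption: lower-casing and splitting on whitespace stands in for
    // whatever analyzer the real vectorization step used.
    static int countUniqueTerms(List<String> documents) {
        Set<String> terms = new HashSet<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    terms.add(token);
                }
            }
        }
        return terms.size();
    }

    public static void main(String[] args) {
        // Two toy documents; "sparse" and "vectors" occur in both
        // but are counted only once.
        List<String> docs = Arrays.asList(
            "mahout builds sparse vectors",
            "sparse vectors form a matrix");
        System.out.println("numberOfColumns = " + countUniqueTerms(docs));
    }
}
```

Terms shared between documents add only one column, which is why the count is taken over the whole corpus rather than per document.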
