Re: Help needed on TF IDF.

Grant Ingersoll Mon, 09 Jan 2012 05:44:48 -0800

On Jan 8, 2012, at 7:44 PM, Junaid Surve wrote:

> Hi
> 
> I got your email address from one of the Mahout forums.
> 
> I need some help.
> 
> I have about 60 docs for which I am calculating the TF IDF.


Just so I am not making any assumptions, is this just a sample of the docs you 
are going to work with or is this the total size of your collection?

> 
> The steps that I am following:
> 1. Convert the files into a sequence file using the
> SequenceFilesFromDirectory run() method.
> 2. Tokenize the generated sequence file using the DocumentProcessor
> tokenizeDocuments() method.
> 3. Create the term frequency vectors using the DictionaryVectorizer
> createTermFrequencyVectors() method.
> 4. Create the TF-IDF vectors using the TFIDFConverter processTfIdf() method.

Have a look at the SparseVectorsFromSequenceFiles class (or see "bin/mahout 
seq2sparse"), which takes the sequence files from step 1 and does steps 2 
through 4 for you.
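
To make it concrete what the tokenize, term-frequency and TF-IDF steps
compute, here is a minimal Python sketch. It is not Mahout code: the
whitespace tokenizer, the function names and the two toy documents are
assumptions made for illustration, and it uses the textbook weight
tf * log(N / df), which may differ from the exact weighting Mahout applies.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def tf_idf(docs):
    """Map each document name to a sparse {term: weight} vector."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokens)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in tokens.values():
        df.update(set(terms))

    vectors = {}
    for name, terms in tokens.items():
        tf = Counter(terms)  # raw term counts for this document
        vectors[name] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return vectors


# Hypothetical two-document collection.
docs = {
    "doc1": "apache mahout vectors",
    "doc2": "apache lucene index",
}
vectors = tf_idf(docs)
```

A term that occurs in every document (here "apache") gets weight zero under
this formula, which is the point of the IDF factor: it discounts terms that
do not help tell documents apart.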

> 5. Create the Matrix using code from RowIdJob.
> 
> What more is to be done?
> 
> I want to find the similarity between each pair of documents. Something like:
> Doc 1 - Doc 2 is XXX similar
> Doc 1 - Doc 3 is YYY similar
> Doc 2 - Doc 3 is ZZZ similar

Have a look at the RowSimilarityJob, which will do pairwise similarity.
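
As a rough picture of what pairwise similarity over document vectors means,
here is a small Python sketch using cosine similarity. It is not the
RowSimilarityJob implementation: cosine is only one possible measure, and
the function names and the three toy vectors are made up for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def pairwise(vectors):
    """Similarity for every unordered pair of documents."""
    return {
        (x, y): cosine(vectors[x], vectors[y])
        for x, y in combinations(sorted(vectors), 2)
    }


# Hypothetical weighted vectors for three documents.
vectors = {
    "doc1": {"mahout": 1.0, "vector": 2.0},
    "doc2": {"mahout": 1.0, "vector": 2.0},
    "doc3": {"lucene": 3.0},
}
sims = pairwise(vectors)
for (x, y), score in sims.items():
    print(f"{x} - {y} is {score:.3f} similar")
```

For 60 documents this is 60 * 59 / 2 = 1770 pairs, which is small enough to
compute in memory; a distributed job matters once the collection is large.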

> Can you please help?
> 
> -- 
> Regards
> Junaid

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
