Hi,

As Chirag said, try LDA. You can also look at an implementation of pLSA. It is not part of Mahout, but you can find it here: https://github.com/akopich/dplsa
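
To make the seeding idea from Chirag's message (quoted below) a bit more concrete, here is a minimal sketch of how such a prior could be built. It is plain Python and not Mahout code: the function names, the example vocabulary, and the `base` and `boost` values are all made up for illustration, and whether a particular LDA implementation accepts a per-word prior like this is something you would need to check.

```python
from typing import Dict, List


def seeded_topic_word_prior(
    vocab: List[str],
    seed_words: Dict[str, List[str]],
    base: float = 0.01,   # hypothetical default pseudo-count for every word
    boost: float = 1.0,   # hypothetical extra pseudo-count for seed words
) -> Dict[str, List[float]]:
    """Build one Dirichlet parameter vector over the vocabulary per topic.

    Seed words get a larger pseudo-count in their own topic, so that topic
    starts out biased towards them.
    """
    index = {word: i for i, word in enumerate(vocab)}
    prior = {}
    for topic, seeds in seed_words.items():
        row = [base] * len(vocab)
        for word in seeds:
            if word in index:  # silently skip seeds that are not in the vocabulary
                row[index[word]] = base + boost
        prior[topic] = row
    return prior


def expected_word_probs(row: List[float]) -> List[float]:
    """Mean of a Dirichlet with parameters `row`: each parameter over their sum."""
    total = sum(row)
    return [value / total for value in row]


# Illustrative vocabulary and seed lists (not taken from any real corpus).
vocab = ["flight", "hotel", "match", "goal", "exam", "school"]
seeds = {
    "travel": ["flight", "hotel"],
    "sports": ["match", "goal"],
    "education": ["exam", "school"],
}
prior = seeded_topic_word_prior(vocab, seeds)
travel_probs = expected_word_probs(prior["travel"])
```

With these values, the expected word distribution for the "travel" topic puts almost all of its mass on "flight" and "hotel" before any document has been seen. The data can still move the topics away from the seeds during training.
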
--David

On Thu, Mar 26, 2015 at 2:01 PM, 3316 Chirag Nagpal <chiragnagpal_12...@aitpune.edu.in> wrote:

> A better approach I can think of for the aforementioned task is to use
> Latent Dirichlet Allocation.
>
> You can force LDA to learn topics with certain specific words by
> assigning higher probability values to those words in the initial Dirichlet
> distribution.
>
> That way you will be able to discover topics better.
>
> Chirag Nagpal
> Department of Computer Engineering
> Army Institute of Technology, Pune
>
> ________________________________________
> From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> Sent: Thursday, March 26, 2015 6:25 PM
> To: user@mahout.apache.org
> Subject: Latent Semantic Analysis for Document Categorization
>
> Hi,
>
> I'm working on a document categorization project wherein I have some
> crawled text documents on different topics which I want to categorize into
> pre-decided categories like travel, sports, education, etc.
> Currently the approach I've used is to build a NaiveBayes classification
> model in Mahout, which has given a good accuracy of 70%-75%. But I
> would still like to improve the accuracy by capturing the semantic
> dependencies between the words of the documents.
> I've read about Latent Semantic Analysis (LSA), which creates a term-document
> matrix and subjects it to a mathematical transformation called Singular Value
> Decomposition (SVD).
> I'd thought of first subjecting the raw documents to LSA, followed by
> k-means clustering on the LSA output, and then giving the clustered output as
> input to the NaiveBayes classifier.
> But on trying out LSA in Mahout, the end result seemed to be in numerical
> format, and after clustering it was not accepted by the NaiveBayes
> classifier.
>
> Is my experimental approach wrong? Has anybody worked on a similar issue?
> Could someone help me with the implementation of LSA or suggest any other
> approach for semantic analysis of text documents?
>
> Thanks
> -Hersheeta
>
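
On the "numerical format" of the LSA output mentioned in the question: a rough sketch of what LSA computes may help show why. The code below builds a small term-document matrix and finds its largest singular value and vectors by power iteration. It is standalone Python for illustration only, not Mahout's SVD implementation; the function names and the three toy documents are invented, and a real LSA run would keep many singular vectors rather than one.

```python
import math
from collections import Counter
from typing import List, Tuple


def term_document_matrix(docs: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Rows are terms (sorted alphabetically), columns are documents, entries are raw counts."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    counts = [Counter(doc.split()) for doc in docs]
    matrix = [[float(c[word]) for c in counts] for word in vocab]
    return vocab, matrix


def top_singular_triple(
    a: List[List[float]], iterations: int = 200
) -> Tuple[float, List[float], List[float]]:
    """Largest singular value of `a` with its left (term) and right (document) vectors.

    Uses power iteration on A^T A. Assumes `a` is non-empty and not all zeros.
    """
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # uniform, positive starting vector
    for _ in range(iterations):
        # u = A v, then w = A^T u, then renormalise w to get the next v
        u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v


# Three toy documents, invented for this example.
docs = ["flight hotel flight", "match goal", "hotel flight"]
vocab, matrix = term_document_matrix(docs)
sigma, term_vector, doc_vector = top_singular_triple(matrix)

# One real-valued coordinate per document along the strongest latent dimension.
doc_coords = [sigma * x for x in doc_vector]
```

Each document ends up as real-valued coordinates in a latent space instead of a vector of term counts. A multinomial Naive Bayes model is typically built around count-like features, so coordinates like these (or cluster assignments derived from them) are not a natural input for it, which may be what the error in Mahout was about.
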