Also, if you can include linking information between documents, you should be able to substantially improve accuracy. Same goes for behavioral data like browsing history.
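To make the linking idea concrete, here is a minimal sketch in plain Python (not Mahout). It assumes each crawled document comes with a list of outgoing link targets and adds each target as an extra token next to the words. The `featurize` helper, the `TinyNaiveBayes` class, the `link:` prefix and the toy data are all made up for illustration; in practice you would feed the same kind of tokens to whatever classifier you already use.

```python
import math
from collections import Counter, defaultdict


def featurize(text, links):
    """Turn a document into tokens: its words plus one 'link:' token per outgoing link."""
    words = text.lower().split()
    link_tokens = ["link:" + target for target in links]
    return words + link_tokens


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: (text, outgoing link hosts, category)
training = [
    ("cheap flights and hotel deals", ["airline.example"], "travel"),
    ("top beaches to visit", ["airline.example", "hotels.example"], "travel"),
    ("match report and final score", ["league.example"], "sports"),
    ("team wins the cup", ["league.example"], "sports"),
]

model = TinyNaiveBayes().fit(
    [featurize(text, links) for text, links, _ in training],
    [label for _, _, label in training],
)

# The words give mixed evidence ("final" was seen in sports, "deals" in travel);
# the link token adds further evidence for sports.
prediction = model.predict(featurize("final deals", ["league.example"]))
print(prediction)
```

Behavioral signals such as browsing history could be encoded the same way, as extra prefixed tokens per document, assuming you can attribute them to individual documents.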

On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar <hersheetachandan...@gmail.com> wrote:

> Thank you so much Chirag and David for your suggestion.
> I'll surely try it.
>
> On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal <chiragnagpal_12...@aitpune.edu.in> wrote:
>
> > A better approach I can think of for the aforementioned task is to use
> > Latent Dirichlet Allocation.
> >
> > You can force LDA to learn topics with certain specific words by
> > assigning higher probability values to those words in the initial
> > Dirichlet distribution.
> >
> > That way you will be able to discover topics better.
> >
> > Chirag Nagpal
> > Department of Computer Engineering
> > Army Institute of Technology, Pune
> >
> > ________________________________________
> > From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> > Sent: Thursday, March 26, 2015 6:25 PM
> > To: user@mahout.apache.org
> > Subject: Latent Semantic Analysis for Document Categorization
> >
> > Hi,
> >
> > I'm working on a document categorization project wherein I have some
> > crawled text documents on different topics which I want to categorize
> > into pre-decided categories like travel, sports, education etc.
> > Currently the approach I've used is to build a NaiveBayes classification
> > model in Mahout, which has given a good accuracy of 70%-75%. But I
> > would still like to improve the accuracy by retrieving the semantic
> > dependencies between words of the documents.
> > I've read about Latent Semantic Analysis (LSA), which creates a
> > term-document matrix and subjects it to a mathematical transformation
> > called Singular Value Decomposition (SVD).
> > I'd thought of first subjecting the raw documents to LSA, followed by
> > k-means clustering on the LSA output, and then giving the clustered
> > output as input to the NaiveBayes classifier.
> > But on trying out LSA in Mahout, the end result was in numerical
> > format, which after clustering was not accepted by the NaiveBayes
> > classifier.
> >
> > Is my experimental approach wrong? Has anybody worked on a similar
> > issue?
> > Could someone help me with the implementation of LSA or suggest any
> > other approach for semantic analysis of text documents?
> >
> > Thanks
> > -Hersheeta