Hi Ted,

Thank you for the quick reply. It would be a great help if you could explain what kind of 'linking information between documents' I should look for.
On Fri, Mar 27, 2015 at 2:45 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Also, if you can include linking information between documents, you should
> be able to substantially improve accuracy. Same goes for behavioral data
> like browsing history.
>
> On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar
> <hersheetachandan...@gmail.com> wrote:
>
> > Thank you so much Chirag and David for your suggestions.
> > I'll surely try it.
> >
> > On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal
> > <chiragnagpal_12...@aitpune.edu.in> wrote:
> >
> > > A better approach I can think of for the aforementioned task is to use
> > > Latent Dirichlet Allocation.
> > >
> > > You can force LDA to learn topics with certain specific words by
> > > assigning higher probability values to those words in the initial
> > > Dirichlet distribution.
> > >
> > > That way you will be able to discover topics better.
> > >
> > > Chirag Nagpal
> > > Department of Computer Engineering
> > > Army Institute of Technology, Pune
> > >
> > > ________________________________________
> > > From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> > > Sent: Thursday, March 26, 2015 6:25 PM
> > > To: user@mahout.apache.org
> > > Subject: Latent Semantic Analysis for Document Categorization
> > >
> > > Hi,
> > >
> > > I'm working on a document categorization project wherein I have some
> > > crawled text documents on different topics which I want to categorize
> > > into pre-decided categories like travel, sports, education etc.
> > > Currently the approach I've used is building a NaiveBayes classification
> > > model in Mahout, which has given a good accuracy of 70%-75%. But I
> > > would still like to improve the accuracy by retrieving the semantic
> > > dependencies between words of the documents.
> > >
> > > I've read about Latent Semantic Analysis (LSA), which creates a
> > > term-document matrix and subjects it to a mathematical transformation
> > > called Singular Value Decomposition (SVD).
> > > I'd thought of first subjecting the raw documents to LSA, followed by
> > > k-means clustering on the LSA output, and then giving the clustered
> > > output as input to the NaiveBayes classifier.
> > > But on trying out LSA in Mahout, the end result seemed to be in
> > > numerical format, which after clustering was not accepted by the
> > > NaiveBayes classifier.
> > >
> > > Is my experimental approach wrong? Has anybody worked on a similar
> > > issue?
> > > Could someone help me with the implementation of LSA or suggest any
> > > other approach for semantic analysis of text documents?
> > >
> > > Thanks
> > > -Hersheeta
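
For reference, the LSA step described in the original question can be sketched roughly as follows. This is only a minimal illustration in plain Python, not Mahout code: the toy corpus, the function names, the power-iteration routine standing in for a real SVD, and the nearest-centroid classifier are all assumptions made for the example. It shows that LSA turns each document into a short vector of real numbers, some of which can be negative, so a classifier that accepts dense numeric features may be a better fit than one that expects word counts.

```python
import math
import random

# Toy corpus (hypothetical data, not taken from the thread).
DOCS = [
    ("flight hotel beach", "travel"),
    ("hotel beach trip", "travel"),
    ("flight trip hotel", "travel"),
    ("match goal team", "sports"),
    ("team goal score", "sports"),
    ("match score team", "sports"),
]

def build_vocab(texts):
    """Map every distinct word to a column index."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}

def vectorize(text, vocab):
    """Term-count vector for one document; unknown words are ignored."""
    vec = [0.0] * len(vocab)
    for w in text.split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_components(A, k, iters=200, seed=0):
    """Approximate the top-k right singular vectors of A by power
    iteration on A^T A, deflating against the vectors already found."""
    rng = random.Random(seed)
    n = len(A[0])
    comps = []
    for _ in range(k):
        v = [rng.random() for _ in range(n)]
        for _ in range(iters):
            Av = [dot(row, v) for row in A]
            w = [sum(A[i][j] * Av[i] for i in range(len(A))) for j in range(n)]
            for u in comps:  # remove directions already captured
                c = dot(w, u)
                w = [x - c * y for x, y in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        comps.append(v)
    return comps

def project(vec, comps):
    """LSA features: coordinates of a document in the reduced space."""
    return [dot(vec, c) for c in comps]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def train(docs, k=2):
    texts = [t for t, _ in docs]
    vocab = build_vocab(texts)
    A = [vectorize(t, vocab) for t in texts]
    comps = top_components(A, k)
    feats = [project(row, comps) for row in A]
    centroids = {}  # per-label sum of LSA vectors (direction is what matters)
    for f, (_, label) in zip(feats, docs):
        acc = centroids.setdefault(label, [0.0] * k)
        for i, x in enumerate(f):
            acc[i] += x
    return vocab, comps, centroids, feats

def classify(text, vocab, comps, centroids):
    """Assign the label whose centroid is closest by cosine similarity."""
    f = project(vectorize(text, vocab), comps)
    return max(centroids, key=lambda lbl: cosine(f, centroids[lbl]))

vocab, comps, centroids, feats = train(DOCS)
print(classify("cheap flight and beach hotel", vocab, comps, centroids))
print(classify("team score", vocab, comps, centroids))
```

The nearest-centroid step is only a stand-in here. Any classifier that takes real-valued input could be used on the projected vectors instead, and on a real corpus a proper SVD implementation would replace the power-iteration helper.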