Also, if you can include linking information between documents, you should
be able to substantially improve accuracy.  Same goes for behavioral data
like browsing history.
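
As a rough illustration (a sketch only, not Mahout code): one way to feed
link information into a bag-of-words classifier such as Naive Bayes is to
encode each outgoing link as an extra pseudo-term next to the ordinary
words. The function names, the LINKS_TO= prefix, and the sample documents
below are all made up for the example.

```python
from collections import Counter


def text_tokens(text):
    # Very naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def link_tokens(doc_id, links):
    # links maps a document id to the list of targets it links to.
    # Each target becomes one pseudo-term, e.g. "LINKS_TO=sports-news.example".
    return ["LINKS_TO=" + target for target in links.get(doc_id, [])]


def features(doc_id, text, links):
    # Term counts over words and link pseudo-terms together, so a
    # count-based classifier can treat both kinds of evidence uniformly.
    return Counter(text_tokens(text) + link_tokens(doc_id, links))


# Hypothetical sample data.
docs = {"d1": "cheap flights to goa", "d2": "cricket world cup final"}
links = {"d1": ["travel-portal.example"], "d2": ["sports-news.example"]}

feats = {doc_id: features(doc_id, text, links) for doc_id, text in docs.items()}
```

Behavioral data could be encoded the same way, with a different prefix
for each kind of signal.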



On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar <
hersheetachandan...@gmail.com> wrote:

> Thank you so much Chirag and David for your suggestion.
> I'll surely try it.
>
> On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal <
> chiragnagpal_12...@aitpune.edu.in> wrote:
>
> > A better approach I can think of for the aforementioned task is to use
> > Latent Dirichlet Allocation (LDA).
> >
> > You can force LDA to learn topics around certain specific words by
> > assigning higher probability values to those words in the initial
> > Dirichlet distribution.
> >
> > That way you will be able to discover topics better.
> >
> > Chirag Nagpal
> > Department of Computer Engineering
> > Army Institute of Technology, Pune
> >
> > ________________________________________
> > From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> > Sent: Thursday, March 26, 2015 6:25 PM
> > To: user@mahout.apache.org
> > Subject: Latent Semantic Analysis for Document Categorization
> >
> > Hi,
> >
> > I'm working on a document categorization project in which I have some
> > crawled text documents on different topics that I want to categorize
> > into pre-decided categories like travel, sports, education, etc.
> > Currently, the approach I've used is to build a NaiveBayes
> > classification model in Mahout, which has given a good accuracy of
> > 70%-75%. But I would still like to improve the accuracy by retrieving
> > the semantic dependencies between the words of the documents.
> > I've read about Latent Semantic Analysis (LSA), which creates a
> > term-document matrix and subjects it to a mathematical transformation
> > called Singular Value Decomposition (SVD).
> > I'd thought of first subjecting the raw documents to LSA, followed by
> > k-means clustering on the LSA output, and then giving the clustered
> > output as input to the NaiveBayes classifier.
> > But on trying out LSA in Mahout, the end result was in a numerical
> > format which, after clustering, was not accepted by the NaiveBayes
> > classifier.
> >
> > Is my experimental approach wrong? Has anybody worked on a similar
> > issue?
> > Could someone help me with the implementation of LSA or suggest any
> > other approach for semantic analysis of text documents?
> >
> > Thanks
> > -Hersheeta
> >
>
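
For the seeded-LDA suggestion quoted above, the idea of "assigning higher
probability values to those words in the initial Dirichlet distribution"
can be sketched as building an asymmetric topic-word prior. This is a
minimal sketch under assumptions: the function name, the base and boost
values, and the toy vocabulary are hypothetical, and whether a given LDA
implementation (Mahout's included) accepts such a per-topic, per-word
prior needs to be checked.

```python
def seeded_topic_word_prior(vocab, seed_words, base=0.01, boost=1.0):
    """Build one row of prior weights per topic.

    vocab      -- list of vocabulary words (column order of the prior)
    seed_words -- list of sets; seed_words[k] holds the words topic k
                  should be nudged towards
    base       -- small symmetric prior weight given to every word
    boost      -- extra weight added for a topic's seed words
    """
    prior = []
    for seeds in seed_words:
        row = [base + boost if word in seeds else base for word in vocab]
        prior.append(row)
    return prior


# Hypothetical vocabulary and seed words for three target categories.
vocab = ["flight", "hotel", "cricket", "goal", "exam", "school"]
seed_words = [{"flight", "hotel"}, {"cricket", "goal"}, {"exam", "school"}]

prior = seeded_topic_word_prior(vocab, seed_words)
```

Each row is a vector of unnormalized Dirichlet parameters for one topic:
seed words start with a much larger weight than the rest, which biases
that topic towards them without forbidding other words.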
