Hi Ted,

Thank you for the quick reply.
It would be of great help if you could please explain what kind of 'linking
information between documents' I should look for.

On Fri, Mar 27, 2015 at 2:45 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Also, if you can include linking information between documents, you should
> be able to substantially improve accuracy.  Same goes for behavioral data
> like browsing history.
>
>
>
> On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar <
> hersheetachandan...@gmail.com> wrote:
>
> > Thank you so much, Chirag and David, for your suggestions.
> > I'll surely try them.
> >
> > On Thu, Mar 26, 2015 at 6:31 PM, 3316 Chirag Nagpal <
> > chiragnagpal_12...@aitpune.edu.in> wrote:
> >
> > > A better approach I can think of for the aforementioned task is to use
> > > Latent Dirichlet Allocation (LDA).
> > >
> > > You can force LDA to learn topics with certain specific words by
> > > assigning higher probability values to those words in the initial
> > > Dirichlet distribution.
> > >
> > > That way you will be able to discover topics better.
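> > >
> > > A minimal sketch of this idea, using gensim's LdaModel with a
> > > per-topic-word eta prior (illustrative only, not Mahout's LDA; the
> > > corpus, seed words and prior values below are made up):
> > >
> > > import numpy as np
> > > from gensim import corpora, models
> > >
> > > # Toy corpus; in practice this would be the tokenized crawled documents.
> > > docs = [["football", "match", "score"], ["flight", "hotel", "beach"]]
> > > dictionary = corpora.Dictionary(docs)
> > > corpus = [dictionary.doc2bow(d) for d in docs]
> > >
> > > num_topics = 2
> > > # Small symmetric prior everywhere, then boost the seed words so each
> > > # topic is pulled towards its intended theme.
> > > eta = np.full((num_topics, len(dictionary)), 0.01)
> > > seeds = {0: ["football", "match"],   # intended "sports" topic
> > >          1: ["flight", "hotel"]}     # intended "travel" topic
> > > for topic, words in seeds.items():
> > >     for w in words:
> > >         eta[topic, dictionary.token2id[w]] += 1.0
> > >
> > > lda = models.LdaModel(corpus, num_topics=num_topics,
> > >                       id2word=dictionary, eta=eta, passes=10)
> > > print(lda.print_topics())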
> > >
> > > Chirag Nagpal
> > > Department of Computer Engineering
> > > Army Institute of Technology, Pune
> > >
> > > ________________________________________
> > > From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
> > > Sent: Thursday, March 26, 2015 6:25 PM
> > > To: user@mahout.apache.org
> > > Subject: Latent Semantic Analysis for Document Categorization
> > >
> > > Hi,
> > >
> > > I'm working on a document categorization project wherein I have some
> > > crawled text documents on different topics which I want to categorize
> > > into pre-decided categories like travel, sports, education, etc.
> > > Currently the approach I've used is building a NaiveBayes
> > > classification model in Mahout, which has given a good accuracy of
> > > 70%-75%. But I would still like to improve the accuracy by capturing
> > > the semantic dependencies between the words of the documents.
> > > I've read about Latent Semantic Analysis (LSA), which creates a
> > > term-document matrix and subjects it to a mathematical transformation
> > > called Singular Value Decomposition (SVD).
> > > My plan was to first subject the raw documents to LSA, then run
> > > k-means clustering on the LSA output, and finally give the clustered
> > > output as input to the NaiveBayes classifier.
> > > But on trying out LSA in Mahout, the end result was in numerical
> > > format, and after clustering it was not accepted by the NaiveBayes
> > > classifier.
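> > >
> > > To make the data flow concrete, here is a rough, illustrative sketch
> > > of that pipeline in scikit-learn rather than Mahout (the documents and
> > > labels are made up), including the non-negativity problem I hit:
> > >
> > > from sklearn.feature_extraction.text import TfidfVectorizer
> > > from sklearn.decomposition import TruncatedSVD
> > > from sklearn.cluster import KMeans
> > > from sklearn.naive_bayes import MultinomialNB
> > >
> > > docs = ["cheap flights to goa", "cricket world cup score",
> > >         "university admission exam dates"]
> > > labels = ["travel", "sports", "education"]
> > >
> > > tfidf = TfidfVectorizer().fit_transform(docs)   # term-document matrix
> > > lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # LSA via SVD
> > > clusters = KMeans(n_clusters=3, n_init=10).fit_predict(lsa)
> > >
> > > # The LSA output is dense and can contain negative values, so a
> > > # multinomial NaiveBayes (which expects non-negative counts) rejects it:
> > > try:
> > >     MultinomialNB().fit(lsa, labels)
> > > except ValueError as e:
> > >     print("NaiveBayes rejected LSA features:", e)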
> > >
> > > Is my experimental approach wrong? Has anybody worked on a similar
> > > issue like this?
> > > Could someone help me with the implementation of LSA or suggest any
> > > other approach for the semantic analysis of text documents?
> > >
> > > Thanks
> > > -Hersheeta
> > >
> >
>
