A better approach I can think of for the aforementioned task is to use Latent
Dirichlet Allocation (LDA).

You can nudge LDA toward learning topics around certain specific words by
assigning higher probability to those words in the initial Dirichlet prior.

That way the topics it discovers will align better with your pre-decided
categories.
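To make the idea concrete, here is a minimal sketch of seeding LDA with
preferred words. It uses gensim rather than Mahout's CVB implementation
(the principle is the same, but the API shown is gensim's), and the toy
documents, seed words, and prior weights are all illustrative assumptions:

import numpy as np
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents (placeholder data).
docs = [
    ["flight", "hotel", "beach", "visa"],
    ["match", "score", "tournament", "team"],
    ["exam", "syllabus", "college", "lecture"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

num_topics = 3
num_terms = len(dictionary)

# Start from a uniform symmetric prior over words, then raise the prior
# mass of the seed words in the topic we want them to dominate.
eta = np.full((num_topics, num_terms), 0.01)
seeds = {0: ["flight", "hotel"],   # topic 0 ~ travel (assumed labels)
         1: ["match", "score"],    # topic 1 ~ sports
         2: ["exam", "college"]}   # topic 2 ~ education
for topic, words in seeds.items():
    for w in words:
        eta[topic, dictionary.token2id[w]] = 1.0  # boosted prior weight

# gensim accepts eta as a (num_topics, num_terms) matrix of priors.
lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, eta=eta)
print(lda.show_topics())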

Chirag Nagpal
Department of Computer Engineering
Army Institute of Technology, Pune

________________________________________
From: Hersheeta Chandankar <hersheetachandan...@gmail.com>
Sent: Thursday, March 26, 2015 6:25 PM
To: user@mahout.apache.org
Subject: Latent Semantic Analysis for Document Categorization

Hi,

I'm working on a document categorization project wherein I have some
crawled text documents on different topics which I want to categorize into
pre-decided categories like travel, sports, education, etc.
Currently the approach I've used is to build a Naive Bayes classification
model in Mahout, which has given a good accuracy of 70%-75%. But I
would still like to improve the accuracy by capturing the semantic
dependencies between the words of the documents.
I've read about Latent Semantic Analysis (LSA), which creates a
term-document matrix and subjects it to a mathematical transformation
called Singular Value Decomposition (SVD).
I'd thought of first subjecting the raw documents to LSA, then running
k-means clustering on the LSA output, and finally giving the clustered
output as input to the Naive Bayes classifier.
But on trying out LSA in Mahout, the end result was in numerical
format, and even after clustering it was not accepted by the Naive Bayes
classifier.
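
For reference, here is a minimal sketch of the pipeline described above,
using scikit-learn rather than Mahout (the dataset and parameter values
are illustrative). It shows why the output causes trouble: the SVD step
produces a dense, real-valued matrix, which a multinomial Naive Bayes
(expecting non-negative term counts) will reject:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Placeholder documents standing in for the crawled corpus.
docs = [
    "cheap flights and hotel deals for your next trip",
    "the team won the tournament after a close match",
    "university announces new syllabus for the exam",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)        # sparse term-document matrix

svd = TruncatedSVD(n_components=2)   # LSA: low-rank SVD projection
X_lsa = svd.fit_transform(X)         # dense, real-valued (can be negative)

labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_lsa)
print(X_lsa)   # numeric LSA vectors, not word counts
print(labels)  # cluster id per document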

Is my experimental approach wrong? Has anybody worked on a similar issue
like this?
Could someone help me with the implementation of LSA or suggest any other
approach for semantic analysis of text documents?

Thanks
-Hersheeta
