Hi, I'm working on a document categorization project in which I have crawled text documents on different topics that I want to categorize into pre-decided categories like travel, sports, education, etc. The approach I've used so far is to build a Naive Bayes classification model in Mahout, which has given a good accuracy of 70%-75%. I would still like to improve the accuracy by capturing the semantic dependencies between words in the documents.

I've read about Latent Semantic Analysis (LSA), which builds a term-document matrix and applies a mathematical transformation called Singular Value Decomposition (SVD). My plan was to first apply LSA to the raw documents, then run k-means clustering on the LSA output, and finally feed the clustered output to the Naive Bayes classifier. But when I tried LSA in Mahout, the end result was in numeric format, and after clustering it was not accepted by the Naive Bayes classifier.
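To make concrete what I mean: since LSA is essentially a truncated SVD of the term-document matrix, here is a toy pure-Python sketch of the idea (the documents, vocabulary, and k=2 are made-up examples for illustration, not my actual Mahout data):

```python
# Toy LSA sketch: build a small term-document count matrix, find the
# top-k components via power iteration on the Gram matrix A^T A, and
# project each document into the k-dimensional latent space.
import math
import random

docs = [
    "flight hotel beach travel",
    "hotel booking travel flight",
    "football match score sports",
    "score team sports match",
]
vocab = sorted({w for d in docs for w in d.split()})

# term-document matrix A: rows = documents, columns = terms (raw counts)
A = [[d.split().count(t) for t in vocab] for d in docs]

def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]

def gram(M):  # M^T M
    n = len(M[0])
    return [[sum(M[r][i] * M[r][j] for r in range(len(M)))
             for j in range(n)] for i in range(n)]

def top_eigvec(G, iters=200):
    # power iteration for the dominant eigenvector of symmetric G
    random.seed(0)
    v = [random.random() for _ in G]
    for _ in range(iters):
        w = matvec(G, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def lsa_project(A, k=2):
    G = gram(A)
    comps = []
    for _ in range(k):
        v = top_eigvec(G)
        lam = sum(matvec(G, v)[i] * v[i] for i in range(len(v)))
        comps.append(v)
        # deflate: subtract the component just found
        G = [[G[i][j] - lam * v[i] * v[j] for j in range(len(v))]
             for i in range(len(v))]
    # each document becomes a dense k-dimensional vector
    return [[sum(row[j] * c[j] for j in range(len(row))) for c in comps]
            for row in A]

vecs = lsa_project(A, k=2)
```

In this toy example the two travel documents end up with nearly identical latent vectors, as do the two sports documents. This is exactly the kind of dense numeric output I get from LSA in Mahout, and it is this output that the Naive Bayes classifier then refuses to take as input.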
Is my experimental approach wrong? Has anybody worked on a similar issue? Could someone help me with the implementation of LSA, or suggest another approach for semantic analysis of text documents?

Thanks -Hersheeta