Hi, I have a collection of crawled text documents on various topics that I want to sort into predefined categories such as travel, sports, and education. To do this, I first clustered the documents with k-means and then trained a complementary naive Bayes model on the clustered documents. The model's accuracy and reliability were 83% and 63% respectively. The problem is that the deployed model gives absurd results (e.g. a sports document is assigned to the business category). On investigating, I found that the clusters were not clean (they contained unrelated documents), which may have led to an incorrect dictionary file being created.
To avoid this, is there another way to preprocess and cluster the input data? Or is there an alternative approach that could be used for the categorization? Thanks, -Hersheeta