Hello Hersheeta,

Are you vectorizing the new text with the same dictionary that you used to
train the models?  If not, the term indices in the new vectors will not match
the ones the classifier was trained on, which will likely hurt its accuracy
severely.
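
To illustrate what I mean, here is a minimal sketch in plain Python (not
Mahout code; the function names and the toy documents are made up for the
example). It builds a dictionary from the training text, then vectorizes a
new document once with that dictionary and once with a dictionary rebuilt
from the new text alone:

```python
from collections import Counter

def build_dictionary(docs):
    """Map each distinct term in the corpus to a fixed column index."""
    terms = sorted({t for d in docs for t in d.lower().split()})
    return {term: i for i, term in enumerate(terms)}

def vectorize(doc, dictionary):
    """Term-frequency vector over `dictionary`; unseen terms are dropped."""
    vec = [0] * len(dictionary)
    for term, count in Counter(doc.lower().split()).items():
        idx = dictionary.get(term)
        if idx is not None:
            vec[idx] = count
    return vec

# Hypothetical toy training corpus: one sports and one business document.
train_docs = ["football match score", "stock market shares"]
train_dict = build_dictionary(train_docs)

new_doc = "football score tonight"

# Correct: reuse the training dictionary, so columns mean the same thing.
good = vectorize(new_doc, train_dict)

# Wrong: a dictionary rebuilt from the new text assigns different indices.
wrong_dict = build_dictionary([new_doc])
bad = vectorize(new_doc, wrong_dict)

print(good)                                        # [1, 0, 0, 1, 0, 0]
print(bad)                                         # [1, 1, 1]
print(wrong_dict["score"], train_dict["market"])   # 1 1
```

With the rebuilt dictionary, "score" ends up at index 1, which the trained
model reads as "market". A mismatch of this kind could plausibly produce the
sports-under-business results you describe, though I cannot tell from here
whether it is the actual cause.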



> Date: Fri, 24 Oct 2014 21:28:06 +0530
> Subject: Categorization of documents using clustering and classification
> From: hersheetachandan...@gmail.com
> To: user@mahout.apache.org
> 
> Hi,
> 
> I have a collection of crawled text documents on different topics which I
> want to categorize into pre-decided categories like travel, sports,
> education, etc.
> For this I first clustered these documents using k-means clustering
> and then built a complementary naive Bayes model of these clustered
> documents.
> The accuracy and reliability of the model were 83% and 63%, respectively.
> Now the problem is that, on deploying the model, the results recorded are
> absurd
> (e.g. a sports document is categorized under the business category).
> On analyzing the problem, I found that the clusters formed were not clean
> (they contained unrelated documents), which may have led to the creation of
> a wrong dictionary file.
> 
> In order to avoid this, is there any other way to get the input data
> preprocessed and clustered?
> Or
> is there any alternative approach that could be used for the
> categorization?
> 
> Thanks,
> -Hersheeta
                                          
