Hi All,

I have run the 20newsgroups example and got a good idea of how classification
works on a predefined dataset.

But I have a slightly different situation here.

* I have tens of thousands of documents (about 50k).
* Every day I receive roughly 1k new documents, of which about 600 are already
classified, so I only need to classify the remaining ~400 each day.

So my approach would be:

1. Load all the documents into HDFS.
2. Train a classifier on the data in HDFS.
3. Classify the new, unclassified documents.
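To make the daily workflow above concrete, here is a toy sketch in plain Python (not the Mahout implementation; all names are made up) of steps 2 and 3: train a multinomial naive Bayes model on the already-classified documents, then label an unclassified one.

```python
from collections import Counter, defaultdict
import math

def train(labeled_docs):
    """labeled_docs: list of (label, token_list) pairs."""
    word_counts = defaultdict(Counter)   # per-label word frequencies
    label_counts = Counter()             # per-label document counts
    for label, tokens in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
    return word_counts, label_counts

def classify(model, tokens):
    word_counts, label_counts = model
    total_docs = sum(label_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihood with add-one smoothing
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Train on the already-classified docs, then classify a new one.
model = train([
    ("sports", "ball game team win".split()),
    ("tech",   "code compile bug software".split()),
])
print(classify(model, "fix the bug in the code".split()))  # → tech
```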

Right now I don't see a way to add more training documents (the ~600
already-classified docs arriving each day) to the system. Am I missing
something?

Also, I'd rather not delete the model and rebuild it from scratch every time.

Thanks!
Mani Kumar
