Hi All, I have run the 20newsgroups example and got a very good idea of how the classifier works on a defined dataset.
But I have a slightly different situation here:

* I already have a few thousand documents (~50k) on hand.
* Every day I receive roughly 1k new documents, of which about 600 arrive already classified, so I only need to classify the remaining ~400 each day.

My approach would be:

1. Load all the documents into HDFS.
2. Train the classifier on the data in HDFS.
3. Classify the new, unclassified documents.

Right now I don't see a way to add more training documents (the 600 already-classified docs) to an existing model. Am I missing something? I also don't want to drop the model and rebuild it from scratch every day, which is the only option I can see at the moment (see the sketch in the P.S. below).

Thanks!
Mani Kumar
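P.S. For reference, this is roughly the kind of daily pipeline I mean, adapted from the classify-20newsgroups script. The HDFS paths are placeholders and the exact commands/flags may differ by Mahout version, so treat it as a sketch rather than my literal setup:

  # step 1: push all labelled docs (existing corpus + today's 600) into HDFS
  hadoop fs -put /local/labelled-docs /user/mani/labelled-docs

  # step 2: rebuild sequence files, TF-IDF vectors and the naive Bayes model from scratch
  mahout seqdirectory -i /user/mani/labelled-docs -o /user/mani/docs-seq -ow
  mahout seq2sparse -i /user/mani/docs-seq -o /user/mani/docs-vectors -lnorm -nv -wt tfidf
  mahout trainnb -i /user/mani/docs-vectors/tfidf-vectors -el \
      -o /user/mani/model -li /user/mani/labelindex -ow

  # step 3: score the ~400 unlabelled docs against the model
  # (they would need to be vectorized with the same dictionary as the training set)
  mahout testnb -i /user/mani/new-docs-vectors -m /user/mani/model \
      -l /user/mani/labelindex -ow -o /user/mani/classified

It's step 2 that I'd like to avoid re-running over the whole 50k+ corpus every day just to fold in 600 new training documents.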
