On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <[email protected]> wrote:
> Long answer: you can use your 600 docs to test the classifier and see your
> accuracy. Then retrain on the entire set of documents and test on a test data
> set. So daily you can choose to include or exclude the 600 documents that
> come in and ensure that you keep your classifier at top performance. After
> some number of documents, you don't get much benefit from retraining. Further
> training would only add overfitting errors.

The suggestion to use the 600 new documents to monitor performance is an excellent one.

It should be pretty easy to add a "train on incremental data" option to k-means. Also, the k-means algorithm will definitely reach a point of diminishing returns, but it should be very resistant to overtraining.
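To make the incremental idea concrete, here is a minimal sketch of one common way to fold new points into existing centroids: a running-mean (online) k-means update. This is an illustration only, not the library's implementation. The function names `nearest` and `incremental_update`, the `counts` bookkeeping, and the sample numbers are all assumptions made up for the example.

```python
import math


def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def incremental_update(centroids, counts, new_points):
    """Fold new_points into existing centroids, one point at a time.

    centroids: list of coordinate lists, updated in place
    counts:    how many points each centroid has absorbed so far, updated in place
    """
    for p in new_points:
        i = nearest(centroids, p)
        counts[i] += 1
        # Step size 1/count keeps each centroid equal to the mean of its points.
        rate = 1.0 / counts[i]
        centroids[i] = [c + rate * (x - c) for c, x in zip(centroids[i], p)]
    return centroids, counts


# Hypothetical example: two centroids that have each already seen 100 points.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [100, 100]
incremental_update(centroids, counts, [[1.0, 1.0], [9.0, 11.0]])
```

Because the step size is 1/count, each new document moves a well-established centroid less and less, which is one way to see the diminishing returns mentioned above: the centroids stay plain averages of the points assigned to them, so extra data refines them rather than fitting noise.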
