Thanks for the quick response. @Robin, I absolutely agree with your suggestion to use the 600 docs for monitoring performance.
Let's talk about bigger numbers. For example, I have more than 1 million docs and I get 10k new docs every day, of which 6k are already classified. Monitoring performance is good, but it can be done weekly instead of daily to reduce cost.

I actually wanted to avoid retraining as much as possible because it comes with a huge cost for a large dataset. A better solution could be to use the 50k most recent docs from every category (ordered by created_at desc), to reduce the amount of data and stay current with the latest trends.

Thanks a lot guys.

-Mani Kumar

On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <[email protected]> wrote:

> On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <[email protected]> wrote:
>
> > Long answer, You can use your 600 docs to test the classifier and see your
> > accuracy. Then retrain with the entire documents and then test a test data
> > set. So daily you can choose to include or exclude the 600 documents that
> > come and ensure that you keep your classifier at the top performance. After
> > some amount of documents, you don't get much benefit of retraining. Further
> > training would only add over fitting errors.
> >
>
> The suggestion that the 600 new documents be used to monitor performance is
> an excellent one.
>
> It should be pretty easy to add the "train on incremental data" option to
> K-means.
>
> Also, the k-means algorithm definitely will reach a point of diminishing
> returns, but it should be very resistant to over training.
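
For illustration, here is a rough sketch of the "most recent 50k docs per category" selection proposed in this message. It is only a sketch under assumptions: the docs are taken to be in-memory dicts with hypothetical "category" and "created_at" keys, and the function name latest_per_category is made up for this example. In practice the same selection would more likely be an ORDER BY created_at DESC query with a limit per category.

```python
from collections import defaultdict
from datetime import datetime
import heapq


def latest_per_category(docs, limit=50_000):
    """Return {category: up to `limit` most recent docs, newest first}.

    Assumes each doc is a dict with hypothetical keys "category"
    (None when the doc is not yet classified) and "created_at".
    """
    by_category = defaultdict(list)
    for doc in docs:
        # Unclassified docs cannot be used as training data, so skip them.
        if doc.get("category") is not None:
            by_category[doc["category"]].append(doc)

    # nlargest on created_at is equivalent to
    # "ORDER BY created_at DESC LIMIT limit" within each category.
    return {
        category: heapq.nlargest(limit, items, key=lambda d: d["created_at"])
        for category, items in by_category.items()
    }


if __name__ == "__main__":
    sample = [
        {"id": 1, "category": "sports", "created_at": datetime(2009, 12, 1)},
        {"id": 2, "category": "sports", "created_at": datetime(2009, 12, 20)},
        {"id": 3, "category": "sports", "created_at": datetime(2009, 12, 28)},
        {"id": 4, "category": "politics", "created_at": datetime(2009, 12, 5)},
        {"id": 5, "category": None, "created_at": datetime(2009, 12, 29)},
    ]
    training_set = latest_per_category(sample, limit=2)
    for category, items in training_set.items():
        print(category, [d["id"] for d in items])
```

With a cap like this the training set stays bounded at (number of categories x limit) however large the archive grows, so the cost of each retraining run stays roughly constant.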
