With a 50K set, you may or may not lose out on some features; it depends entirely on the data. If you don't mind answering: how many categories do you have?
I agree that re-training on 1 million docs is cumbersome. But if I remember correctly, I trained CBayes on a 3GB subset of Wikipedia on 6 Pentium-4 HT systems in 20 minutes. I don't know how big your data or your cluster is, but a daily one-hour map/reduce job is not that expensive (maybe I'm blind and have no sense of what is big after working at Google). I'd say try and estimate it yourself.

On the other hand, you could also try a two-fold approach: a sturdy classifier trained on the full 1 million docs, plus a second classifier trained on the recent 50K docs, with some form of voting between them. I am sure you will not be able to load the 1M model into memory; you might need to use HBase there. The 50K model, however, can be kept in memory for fast classification. Then run a batch classification job daily to re-classify your dataset against the 1M model.

Robin

On Tue, Dec 29, 2009 at 3:03 AM, Mani Kumar <[email protected]> wrote:

> Thanks for the quick response.
>
> @Robin: absolutely agree with your suggestion about using the 600 docs for
> monitoring performance.
>
> Let's talk about bigger numbers: e.g. I have more than 1 million docs and
> I get 10K new docs every day, of which 6K are already classified.
>
> Monitoring performance is good, but it can be done weekly instead of
> daily, just to reduce cost.
>
> I actually wanted to avoid retraining as much as possible because it
> comes with a huge cost for a large dataset.
>
> A better solution could be that we use 50K docs from every category,
> ordered by created_at desc, to reduce the amount of data and stay tuned
> to the latest trends.
>
> Thanks a lot guys.
>
> -Mani Kumar
>
> On Tue, Dec 29, 2009 at 1:22 AM, Ted Dunning <[email protected]> wrote:
>
> > On Mon, Dec 28, 2009 at 11:24 AM, Robin Anil <[email protected]> wrote:
> >
> > > Long answer: you can use your 600 docs to test the classifier and see
> > > your accuracy. Then retrain with the entire document set and test
> > > against a test data set.
> > > So daily you can choose to include or exclude the 600 documents that
> > > come in, and ensure that you keep your classifier at top performance.
> > > After some amount of documents, you don't get much benefit from
> > > retraining. Further training would only add overfitting errors.
> >
> > The suggestion that the 600 new documents be used to monitor
> > performance is an excellent one.
> >
> > It should be pretty easy to add the "train on incremental data" option
> > to k-means.
> >
> > Also, the k-means algorithm definitely will reach a point of
> > diminishing returns, but it should be very resistant to over-training.
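To make the two-classifier voting idea from Robin's reply concrete, here is a minimal sketch. Everything in it is illustrative, not Mahout API: the two models are stand-ins for any classifiers that expose per-category scores (one trained on the full 1M corpus, one on the recent 50K docs), and the weights are assumptions you would tune on a held-out set such as the 600 monitoring docs.

```python
# Sketch of weighted voting between two classifiers (assumed setup:
# each model yields a dict mapping category -> score for a document).

def combine_votes(scores_big, scores_recent, w_big=0.7, w_recent=0.3):
    """Weighted vote between the 1M-doc model and the 50K recent model.

    The weights are hypothetical; tune them on held-out data.
    Missing categories in either model default to a score of 0.
    """
    categories = set(scores_big) | set(scores_recent)
    combined = {
        c: w_big * scores_big.get(c, 0.0) + w_recent * scores_recent.get(c, 0.0)
        for c in categories
    }
    # Return the highest-scoring category.
    return max(combined, key=combined.get)

# Example: the big model mildly prefers "sports", the recent model
# strongly prefers "tech"; the weighted vote settles the disagreement.
big = {"sports": 0.6, "tech": 0.4}
recent = {"sports": 0.1, "tech": 0.9}
print(combine_votes(big, recent))  # -> tech  (0.55 vs 0.45)
```

In this split, the in-memory 50K model answers online requests quickly, while the 1M model (served from HBase or applied in a daily batch job) keeps long-term accuracy; the voting step is where the two are reconciled.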
