Re: Bayes expiration logic
On Mon, 06 Jul 2009 16:13:17 -0400 Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:

Has anybody considered revising the Bayes expiration logic? Maybe it's just our data that's weird, but the built-in expiration logic doesn't seem to work very well for us. Here are my observations:

There's no point in checking anything older than oldest_atime. For this value and older, zero tokens will be expired. The current estimation pass logic goes back 256 days, even if the oldest atime is one week and the calculations have already started returning zeroes.

And there's another problem there. If deleting tokens older than 256 days would delete more than the target number, then no tokens at all are deleted. If the database was trained from historic corpora, then most of the tokens could be older, and in the worst case, the database could grow to 175% of its configured maximum.
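To make the two complaints above concrete, here is a rough Python model of an estimation pass with doubling deltas. It is a sketch of the behaviour as described in this thread, not SpamAssassin's actual code: the function names, the 12-hour starting delta, and the "first delta whose count fits the target" rule are assumptions made for illustration. It includes the early stop suggested above (no point looking past oldest_atime).

```python
HALF_DAY = 12 * 60 * 60  # assumed starting delta, in seconds


def candidate_deltas(max_days=256):
    """Yield doubling deltas in seconds: 12h, 1d, 2d, ... up to max_days."""
    delta = HALF_DAY
    while delta <= max_days * 2 * HALF_DAY:
        yield delta
        delta *= 2


def count_expired(atimes, newest_atime, delta):
    """Count tokens whose atime is older than newest_atime - delta."""
    cutoff = newest_atime - delta
    return sum(1 for a in atimes if a < cutoff)


def estimate_delta(atimes, target, max_days=256):
    """Return (delta, expired_count), or (None, 0) if nothing gets expired.

    Hypothetical selection rule: take the first (smallest) delta that
    would expire no more than `target` tokens.
    """
    newest = max(atimes)
    oldest = min(atimes)
    for delta in candidate_deltas(max_days):
        if newest - delta <= oldest:
            # Cutoff is at or before oldest_atime: every larger delta
            # would also expire zero tokens, so stop here.
            break
        expired = count_expired(atimes, newest, delta)
        if expired <= target:
            return delta, expired
    # Even the largest delta tried would expire more than the target
    # (or there was nothing to expire), so no tokens are deleted.
    return None, 0
```

In this model, a database whose tokens are almost all older than 256 days never finds an acceptable delta, which is the "no tokens at all are deleted" case.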
Re: Bayes expiration logic
On Fri, 10 Jul 2009 02:28:30 +0100 RW rwmailli...@googlemail.com wrote:

On Mon, 06 Jul 2009 16:13:17 -0400 Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:

Has anybody considered revising the Bayes expiration logic? Maybe it's just our data that's weird, but the built-in expiration logic doesn't seem to work very well for us. Here are my observations:

There's no point in checking anything older than oldest_atime. For this value and older, zero tokens will be expired. The current estimation pass logic goes back 256 days, even if the oldest atime is one week and the calculations have already started returning zeroes.

And there's another problem there. If deleting tokens older than 256 days would delete more than the target number, then no tokens at all are deleted. If the database was trained from historic corpora, then most of the tokens could be older, and in the worst case, the database could grow to 175% of its configured maximum.

On reflection that should be 175% of the unique tokens in the corpora, which means that the database could grow to an unlimited size.
Re: Bayes expiration logic
On Mon, 06 Jul 2009 16:13:17 -0400 Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:

Has anybody considered revising the Bayes expiration logic? Maybe it's just our data that's weird, but the built-in expiration logic doesn't seem to work very well for us. Here are my observations:

There's no point in checking anything older than oldest_atime. For this value and older, zero tokens will be expired. The current estimation pass logic goes back 256 days, even if the oldest atime is one week and the calculations have already started returning zeroes.

If your target corresponds to a delta of more than a few days, you're unlikely to get very close to it because the estimation pass logic uses exponentially increasing intervals. There could be a big difference between 8 days and 16 days for delta.

This sounds pretty bad. What I would do is this: Compute the fraction of the tokens to be deleted. Sample a few thousand random tokens, and apply the same fraction to the samples. Take the atime of the oldest sample token that would survive, use that as the new max-atime, and delete anything older from the database.
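The sampling idea above can be sketched in a few lines of Python. This is only an illustration of the proposal, assuming the token atimes are available as a plain list; the names `sampled_cutoff` and `expire`, the `target_size` parameter, and the default sample size of 2000 are all made up for the example and do not correspond to anything in SpamAssassin.

```python
import random


def sampled_cutoff(atimes, target_size, sample_size=2000, rng=None):
    """Estimate an atime cutoff that leaves roughly target_size tokens.

    Returns None when the database is already at or below the target.
    """
    rng = rng or random.Random()
    total = len(atimes)
    if total <= target_size:
        return None
    # Fraction of all tokens that has to be deleted.
    fraction = (total - target_size) / total
    # Draw a random sample and apply the same fraction to it.
    sample = sorted(rng.sample(atimes, min(sample_size, total)))
    drop = int(len(sample) * fraction)
    # sample[drop] is the oldest sampled token that would survive.
    return sample[drop]


def expire(atimes, cutoff):
    """Keep only tokens whose atime is not older than the cutoff."""
    return [a for a in atimes if a >= cutoff]
```

Because the cutoff comes from a quantile of the sample rather than from a fixed set of doubling deltas, the number of tokens removed should land close to the target whatever the age distribution looks like, at the cost of some sampling error.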