Re: Bayes expiration logic

2009-07-09 Thread RW
On Mon, 06 Jul 2009 16:13:17 -0400
Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:

 Has anybody considered revising the Bayes expiration logic?  Maybe
 it's just our data that's weird, but the built-in expiration logic
 doesn't seem to work very well for us.  Here are my observations:
 
 There's no point in checking anything older than oldest_atime.  For
 this value and older, zero tokens will be expired.  The current
 estimation pass logic goes back 256 days, even if the oldest atime is
 one week and the calculations have already started returning zeroes.

And there's another problem there. If deleting tokens over 256
days would delete more than the target number, then no tokens at all
are deleted. If the database was trained from historic corpora, then
most of the tokens could be older, and in the worst case, the database
could grow to 175% of its configured maximum.


Re: Bayes expiration logic

2009-07-09 Thread RW
On Fri, 10 Jul 2009 02:28:30 +0100
RW rwmailli...@googlemail.com wrote:

 On Mon, 06 Jul 2009 16:13:17 -0400
 Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:
 
  Has anybody considered revising the Bayes expiration logic?  Maybe
  it's just our data that's weird, but the built-in expiration logic
  doesn't seem to work very well for us.  Here are my observations:
  
  There's no point in checking anything older than oldest_atime.  For
  this value and older, zero tokens will be expired.  The current
  estimation pass logic goes back 256 days, even if the oldest atime
  is one week and the calculations have already started returning
  zeroes.
 
 And there's another problem there. If deleting tokens over 256
 days would delete more than the target number, then no tokens at all
 are deleted. If the database was trained from historic corpora, then
 most of the tokens could be older, and in the worst case, the database
 could grow to 175% of its configured maximum.

On reflection that should be 175% of the unique tokens in the corpora,
which means that the database could grow to an unlimited size.


Re: Bayes expiration logic

2009-07-07 Thread RW
On Mon, 06 Jul 2009 16:13:17 -0400
Rosenbaum, Larry M. rosenbau...@ornl.gov wrote:

 Has anybody considered revising the Bayes expiration logic?  Maybe
 it's just our data that's weird, but the built-in expiration logic
 doesn't seem to work very well for us.  Here are my observations:
 
 There's no point in checking anything older than oldest_atime.  For
 this value and older, zero tokens will be expired.  The current
 estimation pass logic goes back 256 days, even if the oldest atime is
 one week and the calculations have already started returning zeroes.
 
 If your target corresponds to a delta of more than a few days, you're
 unlikely to get very close to it because the estimation pass logic
 uses exponentially increasing intervals.  There could be a big
 difference between 8 days and 16 days for delta.

This sounds pretty bad. What I would do is this:

Compute the fraction of the tokens to be deleted. Sample a few thousand
random tokens, and apply the same fraction to the samples. Take the
atime of the oldest sample token that would survive, use that as the
new max-atime, and delete anything older from the database.
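
A rough sketch of that idea in Python, just to make the steps concrete. This is
not SpamAssassin code: the function names, the in-memory list of atimes, and
the default sample size of 2000 are all assumptions made for illustration. A
real store would need its own way to draw random tokens.

```python
import random


def estimate_cutoff_atime(atimes, fraction_to_delete, sample_size=2000, rng=None):
    """Estimate an atime cutoff from a random sample of token atimes.

    Tokens with an atime strictly older (smaller) than the returned value
    would be expired. All names here are hypothetical.
    """
    if not atimes:
        raise ValueError("no tokens to sample")
    if not 0.0 <= fraction_to_delete < 1.0:
        raise ValueError("fraction_to_delete must be in [0, 1)")
    rng = rng or random.Random()

    # Sample a few thousand tokens (or all of them if the store is small).
    sample = rng.sample(atimes, min(sample_size, len(atimes)))
    sample.sort()  # oldest atime first

    # Apply the same deletion fraction to the sample.
    n_delete = int(len(sample) * fraction_to_delete)

    # The oldest sampled token that survives sets the cutoff.
    return sample[n_delete]


def expire(atimes, cutoff):
    """Keep only tokens whose atime is not older than the cutoff."""
    return [t for t in atimes if t >= cutoff]
```

Because the cutoff comes from a sample, the number of tokens actually removed
will only approximate the target, but the error does not depend on how the
atimes are distributed, unlike stepping through doubling intervals.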