-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Parker writes:
>On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
>> On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
>> > BTW, having said that, I'd reckon it might be worthwhile just providing
>> > a tool that'll take "sa-learn --dump" output and reload it into a db.
>> > Much easier than mucking with the binary data...
>> 
>
>I worked up a quick script and sent it to Adam to try out, would be
>trivial and actually a little smaller to fold it into sa-learn.  I'll
>work up a patch.
>
>> Yeah, I was thinking of a similar tool for letting people merge 2 DBs
>> together since that seems to come up occasionally.  I haven't really
>> considered it a high priority though.
>> 
>> The whole thing would be pretty simple I'd say.  Something like:
>> 
>> sa-learn --dump > output
>> sa-learn --loaddb output
>> sa-learn --mergedb output
>> 
>> Where loaddb would overwrite, and mergedb would, well, merge. ;)
>> 
>> 
>> This then brings up the question of the seen DB and whether that should
>> be dump/merge-able, if it should expire, etc, etc.
>
>
>Here is my problem with merging two databases, maybe my concerns are
>unfounded and it doesn't matter.  It basically has to do with
>collisions.  If you are merging two databases that may have "learned"
>from the same data then you could skew your results.  It would be
>similar to learning the same message twice.  One or two messages
>probably won't matter, but if it's a good number, then you basically
>double the numbers on those tokens.  Like I said, perhaps this isn't
>such a big deal.

yes -- this is an "emergency use only" tool, and that issue has to
be noted very clearly.
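
To make the double-counting concrete, here is a rough Python sketch. It is
not sa-learn code: the dump is simplified to a bare token -> count map (real
dump output carries more per token), and load_db / merge_db are hypothetical
names standing in for the proposed --loaddb / --mergedb behaviour.

```python
from collections import Counter


def load_db(dump):
    """Overwrite semantics: the resulting db is exactly the dump."""
    return Counter(dump)


def merge_db(current, dump):
    """Merge semantics: add the dumped per-token counts onto the current ones."""
    merged = Counter(current)
    merged.update(dump)  # Counter.update adds counts, it does not replace them
    return merged


# Two dbs that learned from the same mail (hypothetical counts).
db_a = Counter({"viagra": 3, "meeting": 1})
db_b = Counter({"viagra": 3, "meeting": 1})

merged = merge_db(db_a, db_b)   # every shared token is now counted twice
reloaded = load_db(db_b)        # overwrite leaves the counts untouched
```

Nothing in the merged counts says whether the overlap came from the same
messages or from different ones, which is why the caveat has to live in the
documentation rather than in the tool.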

> Now, if we stored which tokens were associated with
>which message ids, then it would be much easier.

But a bigger DB, probably :(

(aside: here's a possibly-good way to do this.

Basically, go back to a message counter, so the first msg learned is 1,
the second 2, etc.  Store the msg reception time -- as used in expiry --
in a per-message db, possibly db_seen.

Then use the message counter in the per-token db, instead of msg reception
time, and when doing expiry, expire whole messages -- including
decrementing all of the tokens that were in the message being expired.

as far as I can see that would not be a big db bloat issue.)
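
A minimal Python sketch of that scheme, under stated assumptions: in-memory
dicts stand in for the dbs, the class and method names are made up, and a
per-message token list is kept so that expiry knows what to decrement. That
list is an addition for the sake of the example and is where any extra
space would go in a real db.

```python
class TokenStore:
    """Toy model: per-message db keyed by a counter, per-token db of counts."""

    def __init__(self):
        self.counter = 0   # message counter: first msg learned is 1, then 2, ...
        self.msgs = {}     # msg number -> (reception time, tokens in that msg)
        self.tokens = {}   # token -> [count, number of last msg that used it]

    def learn(self, tokens, received):
        """Learn one message and return the message number assigned to it."""
        self.counter += 1
        unique = sorted(set(tokens))          # count each token once per msg
        self.msgs[self.counter] = (received, unique)
        for tok in unique:
            entry = self.tokens.setdefault(tok, [0, 0])
            entry[0] += 1
            entry[1] = self.counter
        return self.counter

    def expire(self, cutoff):
        """Expire whole messages received before cutoff; return how many."""
        old = [n for n, (t, _) in self.msgs.items() if t < cutoff]
        for n in old:
            _, toks = self.msgs.pop(n)
            for tok in toks:
                entry = self.tokens[tok]
                entry[0] -= 1                 # undo this message's contribution
                if entry[0] == 0:
                    del self.tokens[tok]
        return len(old)


store = TokenStore()
first = store.learn(["cheap", "pills"], received=100)
second = store.learn(["pills", "meeting"], received=200)
expired = store.expire(cutoff=150)   # drops message 1 and its token counts
```

After the expire call only the second message is left, so "cheap" is gone
and "pills" is back to a count of 1.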

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAK/nUQTcbUG5Y7woRAr+KAJ9WWMwNob9PQVAHsFHdJjBfSbAAGwCdF98X
KmFPhnEMJAkojz5BLWUhbVg=
=KxkB
-----END PGP SIGNATURE-----
