-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Parker writes:
>On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
>> On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
>> > BTW, having said that, I'd reckon it might be worthwhile just providing
>> > a tool that'll take "sa-learn --dump" output and reload it into a db.
>> > Much easier than mucking with the binary data...
>> >
>I worked up a quick script and sent it to Adam to try out, would be
>trivial and actually a little smaller to fold it into sa-learn. I'll
>work up a patch.
>
>> Yeah, I was thinking of a similar tool for letting people merge 2 DBs
>> together since that seems to come up occasionally. I haven't really
>> considered it a high priority though.
>>
>> The whole thing would be pretty simple I'd say. Something like:
>>
>> sa-learn --dump > output
>> sa-learn --loaddb output
>> sa-learn --mergedb output
>>
>> Where loaddb would overwrite, and mergedb would, well, merge. ;)
>>
>>
>> This then brings up the question of the seen DB and whether that should
>> be dump/merge-able, if it should expire, etc, etc.
>
>
>Here is my problem with merging two databases, maybe my concerns are
>unfounded and it doesn't matter. It basically has to do with
>collisions. If you are merging two databases that may have "learned"
>from the same data then you could skew your results. It would be
>similar to learning the same message twice. One or two messages
>probably won't matter, but if it's a good number, then you basically
>double the numbers on those tokens. Like I said, perhaps this isn't
>such a big deal.

yes -- this is an "emergency use only" tool, and that issue has to be
noted very clearly.

> Now, if we stored which tokens were associated with
>which message ids, then it would be much easier.

But a bigger DB, probably :(

(aside: here's a possibly-good way to do this. Basically, go back to a
message counter. so first msg learned is 1, second 2, etc.
Store msg reception time -- as used in expiry -- in a per-message db,
possibly db_seen. Then use the message counter in the per-token db,
instead of msg reception time, and when doing expiry, expire whole
messages -- including decrementing all of the tokens that were in the
message being expired. as far as I can see that would not be a big db
bloat issue.)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAK/nUQTcbUG5Y7woRAr+KAJ9WWMwNob9PQVAHsFHdJjBfSbAAGwCdF98X
KmFPhnEMJAkojz5BLWUhbVg=
=KxkB
-----END PGP SIGNATURE-----
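
For illustration only, the message-counter idea in the aside could be sketched roughly as below. This is a toy in-memory model in Python, not SpamAssassin code: the class `MessageCounterStore` and all of its attribute and method names are hypothetical. The aside does not say how the tokens of an expired message would be found, so the sketch assumes a per-message token list is kept (which does cost space), and it models only whole-message expiry with token decrements.

```python
from collections import Counter


class MessageCounterStore:
    """Toy model of whole-message expiry; names and layout are hypothetical."""

    def __init__(self):
        self.next_id = 1              # message counter: first msg learned is 1
        self.received = {}            # per-message db: counter -> reception time
        self.msg_tokens = {}          # ASSUMPTION: counter -> tokens in that message
        self.token_count = Counter()  # per-token db: token -> messages containing it

    def learn(self, tokens, received_at):
        """Record one message and return the counter assigned to it."""
        msg_id = self.next_id
        self.next_id += 1
        unique = set(tokens)          # count each token once per message
        self.received[msg_id] = received_at
        self.msg_tokens[msg_id] = unique
        for tok in unique:
            self.token_count[tok] += 1
        return msg_id

    def expire_before(self, cutoff):
        """Expire whole messages received before cutoff; return their counters."""
        expired = [m for m, t in self.received.items() if t < cutoff]
        for msg_id in expired:
            # Decrement every token that was in the message being expired.
            for tok in self.msg_tokens.pop(msg_id):
                self.token_count[tok] -= 1
                if self.token_count[tok] <= 0:
                    del self.token_count[tok]
            del self.received[msg_id]
        return expired


store = MessageCounterStore()
store.learn(["free", "offer"], received_at=100)    # becomes message 1
store.learn(["free", "meeting"], received_at=200)  # becomes message 2
store.expire_before(150)                           # expires message 1 only
print(sorted(store.token_count.items()))  # [('free', 1), ('meeting', 1)]
```

Whether the per-message token list is acceptable is exactly the "bigger DB" trade-off raised above; a real implementation might find the tokens some other way.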
