On Thu, Feb 12, 2004 at 04:14:52PM -0500, Theo Van Dinter wrote:
> On Thu, Feb 12, 2004 at 11:06:25AM -0800, Justin Mason wrote:
> > BTW, having said that, I'd reckon it might be worthwhile just providing
> > a tool that'll take "sa-learn --dump" output and reload it into a db.
> > Much easier than mucking with the binary data...
> 

I worked up a quick script and sent it to Adam to try out; it would be
trivial (and actually a little smaller) to fold it into sa-learn.  I'll
work up a patch.

> Yeah, I was thinking of a similar tool for letting people merge 2 DBs
> together since that seems to come up occasionally.  I haven't really
> considered it a high priority though.
> 
> The whole thing would be pretty simple I'd say.  Something like:
> 
> sa-learn --dump > output
> sa-learn --loaddb output
> sa-learn --mergedb output
> 
> Where loaddb would overwrite, and mergedb would, well, merge. ;)
> 
> 
> This then brings up the question of the seen DB and whether that should
> be dump/merge-able, if it should expire, etc, etc.
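
A minimal sketch of what the proposed loaddb/mergedb semantics could
look like, assuming a simplified per-line dump format of
"spam_count ham_count token" (the real `sa-learn --dump` output has more
columns; the format and function names here are illustrative, not
SpamAssassin's actual code):

```python
def parse_dump(lines):
    """Parse simplified dump lines of 'spam_count ham_count token'
    into a dict of token -> (spam_count, ham_count)."""
    db = {}
    for line in lines:
        spam, ham, token = line.split(None, 2)
        db[token] = (int(spam), int(ham))
    return db

def loaddb(db, dump_lines):
    """Overwrite: replace the database wholesale with the dump."""
    db.clear()
    db.update(parse_dump(dump_lines))

def mergedb(db, dump_lines):
    """Merge: sum per-token counts from the dump into the database."""
    for token, (spam, ham) in parse_dump(dump_lines).items():
        old_spam, old_ham = db.get(token, (0, 0))
        db[token] = (old_spam + spam, old_ham + ham)
```

The key design difference is just clear-then-load versus per-token
addition; everything else (file format, locking, the seen DB) is where
the real work would be.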


Here is my problem with merging two databases (maybe my concerns are
unfounded and it doesn't matter): it basically comes down to
collisions.  If you are merging two databases that may have "learned"
from the same data, then you could skew your results.  It would be
similar to learning the same message twice.  One or two messages
probably won't matter, but if there are a good number of them, then you
basically double the counts on those tokens.  Like I said, perhaps this
isn't such a big deal.  Now, if we stored which tokens were associated
with which message ids, then it would be much easier.
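
The double-counting concern can be made concrete with a sketch, using
the same simplified token -> (spam_count, ham_count) representation as
above; the message-id bookkeeping shown is hypothetical, not
SpamAssassin's actual schema:

```python
def merge(db_a, db_b):
    """Naive merge: sum per-token counts.  Tokens learned from the
    same message in both databases are counted twice."""
    out = dict(db_a)
    for token, (spam, ham) in db_b.items():
        s, h = out.get(token, (0, 0))
        out[token] = (s + spam, h + ham)
    return out

def merge_with_ids(db_a, seen_a, db_b, seen_b, contrib_b):
    """If each database also recorded which message-ids it learned
    (seen_a / seen_b) and which token counts each message contributed
    (contrib_b: message-id -> {token: (spam, ham)}), a merge could
    subtract the double-counted contribution of messages both sides
    already saw."""
    merged = merge(db_a, db_b)
    for msgid in seen_a & seen_b:
        for token, (spam, ham) in contrib_b.get(msgid, {}).items():
            s, h = merged[token]
            merged[token] = (s - spam, h - ham)
    return merged, seen_a | seen_b

# Both databases learned the same spam message (id "m1") containing
# the token "cheap"; only db_b also learned "m2":
a = {"cheap": (1, 0)}
b = {"cheap": (1, 0), "pills": (1, 0)}
naive = merge(a, b)          # "cheap" counted as if learned twice
```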

Michael
