-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Theo Van Dinter writes:
> On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
> > thanks Theo. I would love to send my bayes_toks thru db_dump and fix the
> > "broken" records.  However i am not familiar with the format. is there
> > an existing script, or a site that will allow me to properly remove
> > entries with bad atime values?
> 
> Not that I know of.  If you're really keen on trying this, here's the
> basics...  Some of this probably should be documented somewhere besides
> the code anyway ...:

(cough) wiki.SpamAssassin.org ;)

- --j.

> # stop spamassassin ...
> # make a backup copy of bayes_toks!
> 
> $ db_dump -p -f out .spamassassin/bayes_toks
> $ sa-learn --dump data | perl -nle 'print if ( (split)[3] > time )' > out2
> 
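
For anyone who would rather not squint at the one-liner: a minimal Python sketch of the same filter, assuming each line of "sa-learn --dump data" has the columns probability, spam count, ham count, atime, token (as in the dump output quoted further down). The function name and the "brokentoken" sample line are made up for illustration.

```python
import time

def future_atime_tokens(dump_lines, now=None):
    """Return (token, atime) pairs whose atime lies in the future."""
    now = int(time.time()) if now is None else now
    bad = []
    for line in dump_lines:
        # Assumed columns: probability, spam count, ham count, atime, token.
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[3].isdigit():
            continue  # skip blank or unexpected lines
        atime = int(fields[3])
        if atime > now:
            bad.append((fields[4].strip(), atime))
    return bad

sample = [
    "0.158        250        224 1076529004  anticipate",
    "0.500          2          0 2000000000  brokentoken",  # hypothetical bad record
]
print(future_atime_tokens(sample, now=1076685969))
```

Passing "now" explicitly keeps the check repeatable; leave it out to compare against the current clock, as the perl version does.
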
> out2 now contains the list of tokens you need to fix.  go through each
> one in "out" and fix it.  for instance, assume "anticipate" was a token
> that needed fixing, in "out" you'd see something like:
> 
>  anticipate
>  \00\fa\00\00\00\e0\00\00\00l\87*@
> 
> That's 13 bytes, which means it's the CVVV format.  If it was 5 bytes,
> it's CV format, fyi.  Now you want to throw the data through unpack to
> get the actual values out:
> 
> $ perl -e 'print join("\n", unpack("CVVV", 
> "\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
> 0
> 250
> 224
> 1076529004
> 
> There's probably an easier way to do that, but ...  perl expects hex
> values in "\x##" format, but db_dump outputs them in "\##" format, so you
> have to insert the "x" after each backslash there.
> 
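
One possible "easier way", as a hedged sketch rather than anything official: let a script do the escape conversion and the unpack in one go. It assumes the db_dump -p convention (a backslash plus two hex digits for non-printing bytes, a doubled backslash for a literal one) and that the leading space db_dump puts on each line has already been stripped. dump_to_bytes is a made-up helper name; "<BIII" is Python's spelling of perl's "CVVV".

```python
import struct

def dump_to_bytes(text):
    """Turn one db_dump -p value line into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            if text[i + 1] == "\\":       # doubled backslash = literal backslash
                out.append(0x5C)
                i += 2
            else:                          # backslash followed by two hex digits
                out.append(int(text[i + 1:i + 3], 16))
                i += 3
        else:
            out.append(ord(text[i]))       # printable character, stored as-is
            i += 1
    return bytes(out)

raw = dump_to_bytes(r"\00\fa\00\00\00\e0\00\00\00l\87*@")
print(len(raw))                            # 13 bytes, so CVVV
print(struct.unpack("<BIII", raw))         # (format, spam, ham, atime)
```

For the "anticipate" record this prints 13 and then (0, 250, 224, 1076529004), matching the perl output above.
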
> The first 3 numbers are ones you can leave alone: the packing format
> (0 for CVVV, or 192 for CV), the # of spam matches, and the # of ham matches.
> The fourth number is atime.  Change the atime to whatever you want; I'd
> choose the current time() value (use the same one for all of the ones
> you want to fix...)  In my case, I'm going to use 1076685969 just as an example.
> Now you get to put it back in the right format ...
> 
> $ perl -e 'print map { sprintf "\\%02x",$_ } unpack("C13", pack("CVVV", 0, 
> 250, 224, 1076685969));print "\n"'
> \00\fa\00\00\00\e0\00\00\00\91\ec\2c\40
> 
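
The same repacking step as a small Python sketch, in case that is handier than the one-liner. It only covers the 13-byte CVVV layout described above (format byte 0, then three little-endian 32-bit values); pack_cvvv is a made-up name.

```python
import struct

def pack_cvvv(spam, ham, atime):
    """Pack a CVVV record and render every byte as a backslash-hex escape."""
    packed = struct.pack("<BIII", 0, spam, ham, atime)
    return "".join("\\%02x" % b for b in packed)

print(pack_cvvv(250, 224, 1076685969))
```

This prints \00\fa\00\00\00\e0\00\00\00\91\ec\2c\40, the same string the perl command produces.
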
> Take that and put it in the "out" file appropriately.  Now repeat
> for the other tokens.  At the end, find the newest atime magic token
> "\0d\01\07\09\03NEWESTAGE", and change the value (it's just a string)
> from the current one to whatever atime you used, 1076685969 in this case.
> 
> $ db_load -f out .spamassassin/bayes_toks
> 
> You can now do a "sa-learn --dump" to make sure it all looks right...:
> 
> [...]
> 0.000          0 1076685969          0  non-token data: newest atime
> [...]
> 0.158        250        224 1076685969  anticipate
> [...]
> 
> Now, here's the fun part -- if you have tokens in CV format (which is
> very likely in your case since the ham/spam counts are very likely to be
> both < 8), this whole thing becomes a lot more complicated to do by hand...
> So, let's switch to the simpler, but uglier, way of doing things:
> 
> $ perl -MMail::SpamAssassin::BayesStore -e 'print join("\n",
> Mail::SpamAssassin::BayesStore::tok_unpack({db_version => 2},
> "\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
> 250
> 224
> 1076529004
> 
> $ perl -MMail::SpamAssassin::BayesStore -e 'print map { sprintf "\\%02x",$_ }
> unpack("C*", Mail::SpamAssassin::BayesStore->tok_pack(250, 224,
> 1076529004));print "\n"'
> \00\fa\00\00\00\e0\00\00\00\6c\87\2a\40
> 
> This code has the benefit of working for both CVVV and CV formats...
> 
> For example: "\xd0\x1fU(@"
> 2
> 0
> 1076385055
> 
> [...]
> 
> \d0\1f\55\28\40
> 
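
To make the CV case less mysterious, here is a rough Python sketch of what the two layouts appear to look like. The bit layout is an assumption inferred from the numbers in this message (192 = 0xc0 as the CV flag, counts below 8, and "\xd0" decoding to spam 2 / ham 0), not a statement of what BayesStore does internally; the function names are made up.

```python
import struct

FORMAT_FLAG = 0xC0   # top two bits of the first byte (assumed): 0xc0 = CV, 0x00 = CVVV

def tok_unpack_sketch(raw):
    """Decode a 5-byte CV or 13-byte CVVV record into (spam, ham, atime)."""
    first = raw[0]
    if (first & FORMAT_FLAG) == 0xC0 and len(raw) == 5:
        spam = (first & 0x38) >> 3          # assumed: bits 5..3 hold the spam count
        ham = first & 0x07                  # assumed: bits 2..0 hold the ham count
        (atime,) = struct.unpack("<I", raw[1:5])
    elif (first & FORMAT_FLAG) == 0x00 and len(raw) == 13:
        _, spam, ham, atime = struct.unpack("<BIII", raw)
    else:
        raise ValueError("unrecognised token record")
    return spam, ham, atime

def tok_pack_sketch(spam, ham, atime):
    """Use the short CV form when both counts fit in 3 bits, else CVVV."""
    if spam < 8 and ham < 8:
        return struct.pack("<BI", 0xC0 | (spam << 3) | ham, atime)
    return struct.pack("<BIII", 0, spam, ham, atime)

print(tok_unpack_sketch(b"\xd0\x1fU(@"))
print(tok_pack_sketch(2, 0, 1076385055))
```

The first print gives (2, 0, 1076385055) for the CV example above, and packing those values yields the same five bytes again. For a real database, stick with tok_pack/tok_unpack from the module itself.
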
> Please note that by editing your DB by hand, any future issues that
> arise will be blamed on the editing.  aka: no support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAK8bLQTcbUG5Y7woRAkwgAKCrO/28yt1JpwdVzbD8IYXm9U5D8gCgzd8W
JFuRyMgL4Jb2tChTydnqn+g=
=b0an
-----END PGP SIGNATURE-----
