On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
> thanks Theo. I would love to send my bayes_toks thru db_dump and fix the
> "broken" records. However i am not familiar with the format. is there
> an existing script, or a site that will allow me to properly remove
> entries with bad atime values?

Not that I know of. If you're really keen on trying this, here's the
basics... Some of this probably should be documented somewhere besides
the code anyway ...:
# stop spamassassin ...
# make a backup copy of bayes_toks!
$ db_dump -p -f out .spamassassin/bayes_toks
$ sa-learn --dump data | perl -nle 'print if ( (split)[3] > time )' > out2
out2 now contains the list of tokens you need to fix. Go through each
one in "out" and fix it. For instance, assume "anticipate" was a token
that needed fixing; in "out" you'd see something like:
anticipate
\00\fa\00\00\00\e0\00\00\00l\87*@
That's 13 bytes, which means it's the CVVV format. (If it were 5 bytes,
it would be the CV format, fyi.) Now you want to throw the data through
unpack to get the actual values out:
$ perl -e 'print join("\n", unpack("CVVV",
"\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
0
250
224
1076529004
There's probably an easier way to do that, but ... perl expects hex
values in "\x##" format, while db_dump outputs them in "\##" format, so
you have to put the "x" in appropriately there.
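
If you'd rather not add the "x" by hand, something like the following
sketch should do the conversion. It assumes db_dump -p writes
non-printable bytes as a backslash plus two hex digits and a literal
backslash as "\\". dump_to_bytes is a made-up helper name for this
example, not part of SpamAssassin or Berkeley DB:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a db_dump -p style value ("\##" escapes)
# into the raw bytes that unpack() wants.
sub dump_to_bytes {
    my ($s) = @_;
    # "\\" is a literal backslash, "\hh" is the byte with hex value hh;
    # everything else is already a literal character.
    $s =~ s/\\(\\|[0-9a-fA-F]{2})/$1 eq "\\" ? "\\" : chr(hex($1))/ge;
    return $s;
}

# Single quotes, so perl leaves the backslashes alone.
my $raw = dump_to_bytes('\00\fa\00\00\00\e0\00\00\00l\87*@');

my ($format, $spam, $ham, $atime) = unpack("CVVV", $raw);
print join("\n", length($raw), $format, $spam, $ham, $atime), "\n";
```

For the "anticipate" value this prints the length (13) followed by the
same four numbers as above.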
You don't care about the first 3 numbers, but they're the packing format
(0 for CVVV, or 192 for CV), the # of spam matches, and the # of ham matches.
The fourth number is atime. Change the atime to whatever you want, I'd
choose the current time() value (use the same one for all of the ones
you want to fix...) In my case, I'm going to use 1076685969 just for example.
Now you get to put it back in the right format ...
$ perl -e 'print map { sprintf "\\%02x",$_ } unpack("C13", pack("CVVV", 0, 250,
224, 1076685969));print "\n"'
\00\fa\00\00\00\e0\00\00\00\91\ec\2c\40
Take that and put it in the "out" file appropriately. Now repeat
for the other tokens. At the end, find the newest atime magic token
"\0d\01\07\09\03NEWESTAGE", and change the value (it's just a string)
from the current one to whatever atime you used, 1076685969 in this case.
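
To save some typing, the unpack/pack steps above could be rolled into
one small script. This is only a sketch for the 13-byte CVVV case.
set_atime_cvvv and bytes_to_dump are hypothetical helper names, and like
the one-liner above it writes every byte as a "\##" escape:

```perl
use strict;
use warnings;

# Hypothetical helper: write every byte as a "\##" escape.
sub bytes_to_dump {
    my ($raw) = @_;
    return join "", map { sprintf "\\%02x", $_ } unpack("C*", $raw);
}

# Hypothetical helper: replace the atime in a 13-byte CVVV record,
# leaving the format byte and the spam/ham counts untouched.
sub set_atime_cvvv {
    my ($raw, $new_atime) = @_;
    die "not a 13-byte CVVV record\n" unless length($raw) == 13;
    my ($format, $spam, $ham) = unpack("CVV", $raw);
    return pack("CVVV", $format, $spam, $ham, $new_atime);
}

# The "anticipate" record from above, given as plain hex.
my $old = pack("H*", "00fa000000e00000006c872a40");

print bytes_to_dump(set_atime_cvvv($old, 1076685969)), "\n";
# prints \00\fa\00\00\00\e0\00\00\00\91\ec\2c\40
```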
$ db_load -f out .spamassassin/bayes_toks
You can now do a "sa-learn --dump" to make sure it all looks right...:
[...]
0.000 0 1076685969 0 non-token data: newest atime
[...]
0.158 250 224 1076685969 anticipate
[...]
Now, here's the fun part -- if you have tokens in CV format (which is
very likely in your case, since the ham/spam counts are very likely to
both be < 8), this whole thing becomes a lot more complicated to do by hand...
So, let's switch to the simpler, but uglier, way of doing things:
$ perl -MMail::SpamAssassin::BayesStore -e 'print join("\n",
Mail::SpamAssassin::BayesStore::tok_unpack({db_version => 2},
"\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
250
224
1076529004
$ perl -MMail::SpamAssassin::BayesStore -e 'print map { sprintf "\\%02x",$_ }
unpack("C*", Mail::SpamAssassin::BayesStore->tok_pack(250, 224,
1076529004));print "\n"'
\00\fa\00\00\00\e0\00\00\00\6c\87\2a\40
This code has the benefit of working for both CVVV and CV formats...
For example: "\xd0\x1fU(@"
2
0
1076385055
[...]
\d0\1f\55\28\40
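
If you're curious what the CV format looks like on the inside, the
example above is consistent with the first byte being 192 plus the spam
count in the next three bits and the ham count in the low three bits
(0xd0 = 192 + (2 << 3) + 0), followed by the atime as a V. Here is a
sketch decoder under that assumption. decode_token is a made-up name;
tok_unpack is the real thing and is what you should trust:

```perl
use strict;
use warnings;

# Hypothetical decoder. ASSUMPTION: a first byte with both top bits set
# means CV format, with spam in bits 3-5 and ham in bits 0-2; anything
# else is treated as the 13-byte CVVV format.
sub decode_token {
    my ($raw) = @_;
    my $first = unpack("C", $raw);
    if (($first & 0xc0) == 0xc0) {
        my (undef, $atime) = unpack("CV", $raw);
        return (($first & 0x38) >> 3, $first & 0x07, $atime);
    }
    my (undef, $spam, $ham, $atime) = unpack("CVVV", $raw);
    return ($spam, $ham, $atime);
}

# The two example records from this mail, given as plain hex.
print join(" ", decode_token(pack("H*", "d01f552840"))), "\n";
# prints 2 0 1076385055
print join(" ", decode_token(pack("H*", "00fa000000e00000006c872a40"))), "\n";
# prints 250 224 1076529004
```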
Please note that by editing your DB by hand, any future issues that
arise will be blamed on the editing. aka: no support.
--
Randomly Generated Tagline:
"The programmer needs the machine to run long enough to destroy it."
- Prof. Michaelson
