-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Theo Van Dinter writes:
> On Tue, Feb 10, 2004 at 12:08:56PM -0500, Adam Denenberg wrote:
> > Thanks, Theo. I would love to send my bayes_toks through db_dump and
> > fix the "broken" records. However, I am not familiar with the format.
> > Is there an existing script, or a site, that will allow me to properly
> > remove entries with bad atime values?
>
> Not that I know of. If you're really keen on trying this, here's the
> basics... Some of this probably should be documented somewhere besides
> the code anyway ...:
(cough) wiki.SpamAssassin.org ;)
- --j.
> # stop spamassassin ...
> # make a backup copy of bayes_toks!
>
> $ db_dump -p -f out .spamassassin/bayes_toks
> $ sa-learn --dump data | perl -nle 'print if ( (split)[3] > time )' > out2
>
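For anyone who'd rather not squint at the one-liner: here is a rough Python
sketch of the same filter (Python instead of Perl, purely for illustration).
It assumes the dump lines look like the "sa-learn --dump" output quoted
further down (probability, spam count, ham count, atime, token); the
function name and the second sample line are made up for the example.

```python
def future_atime_tokens(dump_lines, now):
    """Return (token, atime) pairs whose atime lies after `now`."""
    bad = []
    for line in dump_lines:
        # Columns: probability, spam count, ham count, atime, token.
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        atime = int(fields[3])
        if atime > now:
            bad.append((fields[4].rstrip("\n"), atime))
    return bad

sample = [
    "0.158        250        224 1076685969  anticipate",
    "0.000          0          2 2000000000  hypothetical",  # invented line
]
print(future_atime_tokens(sample, now=1076700000))
# -> [('hypothetical', 2000000000)]
```

In real use you would pass time.time() as "now", which matches the
"> time" test in the Perl version.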
> out2 now contains the list of tokens you need to fix. Go through each
> one in "out" and fix it. For instance, assume "anticipate" was a token
> that needed fixing; in "out" you'd see something like:
>
> anticipate
> \00\fa\00\00\00\e0\00\00\00l\87*@
>
> That's 13 bytes, which means it's the CVVV format. (If it were 5 bytes,
> it would be the CV format, fyi.) Now you want to throw the data through
> unpack to get the actual values out:
>
> $ perl -e 'print join("\n", unpack("CVVV",
> "\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
> 0
> 250
> 224
> 1076529004
>
> There's probably an easier way to do that, but ... Perl expects hex
> escapes in "\x##" format, while db_dump outputs them in "\##" format,
> so you have to insert the "x" in each escape yourself.
>
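If inserting the "x" by hand gets tedious, the decoding step can be
sketched in Python (not Perl, just for illustration). Two assumptions here:
that "db_dump -p" writes non-printable bytes as a backslash plus two hex
digits and a literal backslash as two backslashes, and that Perl's "CVVV"
corresponds to struct's "<BIII" (one unsigned byte, three little-endian
32-bit unsigned ints). decode_db_dump is a hypothetical helper name.

```python
import struct

def decode_db_dump(text):
    """Turn db_dump -p style text (backslash + 2 hex digits) into bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            if text[i + 1] == "\\":       # escaped literal backslash
                out.append(0x5C)
                i += 2
            else:                         # backslash followed by two hex digits
                out.append(int(text[i + 1:i + 3], 16))
                i += 3
        else:                             # printable character, taken as-is
            out.append(ord(text[i]))
            i += 1
    return bytes(out)

raw = decode_db_dump(r"\00\fa\00\00\00\e0\00\00\00l\87*@")
fmt, nspam, nham, atime = struct.unpack("<BIII", raw)
print(len(raw), fmt, nspam, nham, atime)
# -> 13 0 250 224 1076529004
```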
> The first 3 numbers you can leave alone: they're the packing format
> byte (0 for CVVV; a CV entry starts at 192 and folds the counts into
> the same byte), the # of spam matches, and the # of ham matches.
> The fourth number is atime. Change the atime to whatever you want; I'd
> choose the current time() value (use the same one for all of the ones
> you want to fix...). In my case, I'm going to use 1076685969 just as
> an example.
> Now you get to put it back in the right format ...
>
> $ perl -e 'print map { sprintf "\\%02x",$_ } unpack("C13", pack("CVVV", 0,
> 250, 224, 1076685969));print "\n"'
> \00\fa\00\00\00\e0\00\00\00\91\ec\2c\40
>
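The repacking step looks much the same in Python (again just an
illustrative sketch, not the Perl above), assuming "<BIII" matches "CVVV"
and that every byte is written as a backslash plus two hex digits, as in
the output above. encode_all_hex is a hypothetical helper name.

```python
import struct

def encode_all_hex(raw):
    """Write every byte as a backslash followed by two lowercase hex digits."""
    return "".join("\\%02x" % b for b in raw)

# Format byte 0 (CVVV), 250 spam hits, 224 ham hits, and the new atime.
packed = struct.pack("<BIII", 0, 250, 224, 1076685969)
print(encode_all_hex(packed))
# -> \00\fa\00\00\00\e0\00\00\00\91\ec\2c\40
```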
> Take that and put it in the "out" file appropriately. Now repeat
> for the other tokens. At the end, find the newest atime magic token
> "\0d\01\07\09\03NEWESTAGE", and change the value (it's just a string)
> from the current one to whatever atime you used, 1076685969 in this case.
>
> $ db_load -f out .spamassassin/bayes_toks
>
> You can now do a "sa-learn --dump" to make sure it all looks right...:
>
> [...]
> 0.000 0 1076685969 0 non-token data: newest atime
> [...]
> 0.158 250 224 1076685969 anticipate
> [...]
>
> Now, here's the fun part -- if you have tokens in CV format (which is
> very likely in your case, since the ham and spam counts are probably
> both < 8), doing this by hand becomes a lot more complicated...
> So, let's switch to the simpler, but uglier, way of doing things:
>
> $ perl -MMail::SpamAssassin::BayesStore -e 'print join("\n",
> Mail::SpamAssassin::BayesStore::tok_unpack({db_version => 2},
> "\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"),"")'
> 250
> 224
> 1076529004
>
> $ perl -MMail::SpamAssassin::BayesStore -e 'print map { sprintf "\\%02x",$_ }
> unpack("C*", Mail::SpamAssassin::BayesStore->tok_pack(250, 224,
> 1076529004));print "\n"'
> \00\fa\00\00\00\e0\00\00\00\6c\87\2a\40
>
> This code has the benefit of working for both CVVV and CV formats...
>
> For example, the 5-byte CV value "\xd0\x1fU(@" unpacks to:
> 2
> 0
> 1076385055
>
> [...]
>
> \d0\1f\55\28\40
>
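To make the CV layout less mysterious, here is a Python sketch of how the
two formats could be told apart (illustrative only; these are hypothetical
Python helpers, not the SpamAssassin functions). The bit layout is an
assumption that fits the example above: top two bits set (192) mark CV,
the next three bits hold the spam count, and the low three bits hold the
ham count, which is why both counts must be < 8.

```python
import struct

ONE_BYTE_FORMAT = 0xC0   # 192: top two bits set, 5-byte "CV" layout
TWO_LONGS_FORMAT = 0x00  # 0: 13-byte "CVVV" layout

def tok_unpack(raw):
    """Return (spam count, ham count, atime) for either layout."""
    flag = raw[0] & 0xC0
    if flag == ONE_BYTE_FORMAT:
        packed, atime = struct.unpack("<BI", raw[:5])
        return (packed >> 3) & 0x07, packed & 0x07, atime
    if flag == TWO_LONGS_FORMAT:
        _, nspam, nham, atime = struct.unpack("<BIII", raw[:13])
        return nspam, nham, atime
    raise ValueError("unknown packing format")

def tok_pack(nspam, nham, atime):
    """Use the compact layout only when both counts fit in three bits."""
    if nspam < 8 and nham < 8:
        return struct.pack("<BI", ONE_BYTE_FORMAT | (nspam << 3) | nham, atime)
    return struct.pack("<BIII", TWO_LONGS_FORMAT, nspam, nham, atime)

print(tok_unpack(b"\xd0\x1fU(@"))
# -> (2, 0, 1076385055)
print(tok_unpack(b"\x00\xfa\x00\x00\x00\xe0\x00\x00\x00l\x87*@"))
# -> (250, 224, 1076529004)
```

Changing the atime of a CV token then comes down to unpacking, swapping
the third value, and packing again; the first byte stays the same because
the counts are untouched.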
> Please note that by editing your DB by hand, any future issues that
> arise will be blamed on the editing. aka: no support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFAK8bLQTcbUG5Y7woRAkwgAKCrO/28yt1JpwdVzbD8IYXm9U5D8gCgzd8W
JFuRyMgL4Jb2tChTydnqn+g=
=b0an
-----END PGP SIGNATURE-----