I am interested in getting the list of tokens used by SpamAssassin for
Bayesian classification so that I can investigate misclassifications. A
lot of pump-and-dump emails are getting through, and I'm trying to
understand why.

My email setup has SpamAssassin storing tokens in a MySQL database. It
seems like the following query should do the trick:

select token, spam_count, ham_count from bayes_token;

However, the tokens it returns all look like gibberish. The same happens
when I call sa-learn with the --dump data flag. For example, if I create
a file 'test' with these contents:

From MAILER-DAEMON Mon Jan 15 08:45:25 2007
Date: 15 Jan 2007 08:45:25 -0800
From: Mail System Internal Data <[EMAIL PROTECTED]>
Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
X-IMAP: 1168879525 0000000000
Status: RO

This text is part of the internal format of your mail folder, and is not
a real message.  It is created automatically by the mail system software.
If deleted, important folder data will be lost, and it will be re-created
with the data reset to initial values.

Then when I train on this file and do a token dump, here's what I get:

$ sa-learn --mbox --ham test
$ sa-learn --dump data
0.500          0          1 1168848000  05c058cc67
0.500          0          1 1168848000  0c0b650cae
0.500          0          1 1168848000  10f9ce0a62
0.500          0          1 1168848000  1bfa4f11ec
0.500          0          1 1168848000  1cc46ab630
0.500          0          1 1168848000  229c0dfeb1
0.500          0          1 1168848000  23e8c1e3f7
0.500          0          1 1168848000  2c5cad209a
0.500          0          1 1168848000  3edbea3386
0.500          0          1 1168848000  42587ddee2
0.500          0          1 1168848000  caaba7ab0e
0.500          0          1 1168848000  46fe8a7433
0.500          0          1 1168848000  edfa2fd24a
0.500          0          1 1168848000  ee8ecd5f2b
0.500          0          1 1168848000  4aab899f37
0.500          0          1 1168848000  4d32c676ea
0.500          0          1 1168848000  4ec210c255
0.500          0          1 1168848000  6f45cd947f
0.500          0          1 1168848000  4fd5757cca
0.500          0          1 1168848000  f59226e27d
0.500          0          1 1168848000  5375f14875
0.500          0          1 1168848000  d98711f145
0.500          0          1 1168848000  f9b78251bd
0.500          0          1 1168848000  fc150c37f6
0.500          0          1 1168848000  fd26cf0e94
0.500          0          1 1168848000  e575d0e91b
0.500          0          1 1168848000  5b7712d045
0.500          0          1 1168848000  c6db94bcd0
0.500          0          1 1168848000  82c60d59cc
0.500          0          1 1168848000  8499c0cabe
0.500          0          1 1168848000  8d55b32b1d
0.500          0          1 1168848000  93db6e3b76
0.500          0          1 1168848000  974df77ac8
0.500          0          1 1168848000  9d0f1ef80a
0.500          0          1 1168848000  9f6b0c4046
0.500          0          1 1168848000  a20f335f33
0.500          0          1 1168848000  b24c2ab56a
0.500          0          1 1168848000  b573ba3f96
0.500          0          1 1168848000  baa9800bbd
0.500          0          1 1168848000  bffa795277
0.500          0          1 1168848000  deab0e9bd0
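
To sort or filter a dump like this one, the lines can be split into
records. The following is only a minimal sketch, not part of
SpamAssassin: it assumes the five whitespace-separated columns are
probability, spam count, ham count, last-access time (Unix seconds),
and token, and the names TokenRow and parse_dump are made up for
illustration.

```python
from typing import List, NamedTuple


class TokenRow(NamedTuple):
    # Column meanings are an assumption based on the dump shown above.
    prob: float
    spam_count: int
    ham_count: int
    atime: int
    token: str


def parse_dump(text: str) -> List[TokenRow]:
    """Parse the text output of `sa-learn --dump data` into records."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            # Skip blank lines and anything that is not a five-column row.
            continue
        prob, spam, ham, atime, token = parts
        try:
            rows.append(
                TokenRow(float(prob), int(spam), int(ham), int(atime), token)
            )
        except ValueError:
            # Skip rows whose numeric columns do not parse.
            continue
    return rows


sample = """\
0.500          0          1 1168848000  05c058cc67
0.500          0          1 1168848000  0c0b650cae
"""
rows = parse_dump(sample)
ham_only = [r.token for r in rows if r.spam_count == 0 and r.ham_count > 0]
```

With the two sample lines, ham_only holds both token strings, since each
was seen once in ham and never in spam.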

Why don't I see any of the words from the test email in this list?

Thanks,
Stuart Robinson

+----------------------------------------+
| Stuart Robinson                        |
| Email: stuart at zapata dot org        |
| Homepage: http://www.zapata.org/stuart |
+----------------------------------------+

