I am interested in getting the list of tokens used by SpamAssassin for Bayesian classification so that I can investigate misclassifications. A lot of pump-and-dump emails are getting through, and I'm trying to understand why.
My email set-up has SpamAssassin storing tokens in a MySQL database. It seems like the following query should do the trick:

    select token, spam_count, ham_count from bayes_token

However, the tokens it returns all seem like gibberish. The same happens when I call sa-learn with the --dump data flag. For example, if I create a file 'test' with these contents

    >From MAILER-DAEMON Mon Jan 15 08:45:25 2007
    Date: 15 Jan 2007 08:45:25 -0800
    From: Mail System Internal Data <[EMAIL PROTECTED]>
    Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
    X-IMAP: 1168879525 0000000000
    Status: RO

    This text is part of the internal format of your mail folder, and is not
    a real message. It is created automatically by the mail system software.
    If deleted, important folder data will be lost, and it will be re-created
    with the data reset to initial values.

then train on this file and do a token dump, here's what I get:

    $ sa-learn --mbox --ham test
    $ sa-learn --dump data
    0.500 0 1 1168848000 05c058cc67
    0.500 0 1 1168848000 0c0b650cae
    0.500 0 1 1168848000 10f9ce0a62
    0.500 0 1 1168848000 1bfa4f11ec
    0.500 0 1 1168848000 1cc46ab630
    0.500 0 1 1168848000 229c0dfeb1
    0.500 0 1 1168848000 23e8c1e3f7
    0.500 0 1 1168848000 2c5cad209a
    0.500 0 1 1168848000 3edbea3386
    0.500 0 1 1168848000 42587ddee2
    0.500 0 1 1168848000 caaba7ab0e
    0.500 0 1 1168848000 46fe8a7433
    0.500 0 1 1168848000 edfa2fd24a
    0.500 0 1 1168848000 ee8ecd5f2b
    0.500 0 1 1168848000 4aab899f37
    0.500 0 1 1168848000 4d32c676ea
    0.500 0 1 1168848000 4ec210c255
    0.500 0 1 1168848000 6f45cd947f
    0.500 0 1 1168848000 4fd5757cca
    0.500 0 1 1168848000 f59226e27d
    0.500 0 1 1168848000 5375f14875
    0.500 0 1 1168848000 d98711f145
    0.500 0 1 1168848000 f9b78251bd
    0.500 0 1 1168848000 fc150c37f6
    0.500 0 1 1168848000 fd26cf0e94
    0.500 0 1 1168848000 e575d0e91b
    0.500 0 1 1168848000 5b7712d045
    0.500 0 1 1168848000 c6db94bcd0
    0.500 0 1 1168848000 82c60d59cc
    0.500 0 1 1168848000 8499c0cabe
    0.500 0 1 1168848000 8d55b32b1d
    0.500 0 1 1168848000 93db6e3b76
    0.500 0 1 1168848000 974df77ac8
    0.500 0 1 1168848000 9d0f1ef80a
    0.500 0 1 1168848000 9f6b0c4046
    0.500 0 1 1168848000 a20f335f33
    0.500 0 1 1168848000 b24c2ab56a
    0.500 0 1 1168848000 b573ba3f96
    0.500 0 1 1168848000 baa9800bbd
    0.500 0 1 1168848000 bffa795277
    0.500 0 1 1168848000 deab0e9bd0

Why don't I see any of the words from the test email in this list?

Thanks,
Stuart Robinson

+----------------------------------------+
| Stuart Robinson                        |
| Email: stuart at zapata dot org        |
| Homepage: http://www.zapata.org/stuart |
+----------------------------------------+
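
P.S. In case it helps anyone who wants to slice the dump, here is a minimal Python sketch for reading the sa-learn --dump data output into records. It assumes the five columns are, in order, probability, spam count, ham count, last-access time, and token; that reading is my assumption and is only checked against the rows above (one ham message trained, so ham count 1 and spam count 0). The names TokenRow and parse_dump are made up for this example and are not part of SpamAssassin.

```python
from typing import List, NamedTuple


class TokenRow(NamedTuple):
    """One row of the token dump (column meanings are assumed, see above)."""
    prob: float
    spam_count: int
    ham_count: int
    atime: int
    token: str


def parse_dump(text: str) -> List[TokenRow]:
    """Parse whitespace-separated dump rows, skipping lines that do not fit."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # not a five-column token row
        try:
            rows.append(TokenRow(float(parts[0]), int(parts[1]),
                                 int(parts[2]), int(parts[3]), parts[4]))
        except ValueError:
            continue  # numeric columns did not parse; ignore the line
    return rows


# Two rows copied from the dump above.
sample = """0.500 0 1 1168848000 05c058cc67
0.500 0 1 1168848000 0c0b650cae"""

rows = parse_dump(sample)
ham_only = [r.token for r in rows if r.ham_count > 0 and r.spam_count == 0]
```

With the records in this form it is easy to sort by spam or ham count, although it does not by itself explain what the token values stand for.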