Gateways, analyze first, insert into bayes later ?

Herold Heiko 11 Apr 2005 16:31:12 -0000

Newbie alert: I am new to SpamAssassin. I am pondering an enhancement to my
current basic setup, which is a filter gateway in front of MS Exchange.
The filter gateway is amavisd-new + dual-sendmail setup + ClamAV +
SpamAssassin 3.02.


I'm looking for a way to feed sorted spam/ham information back into the
SpamAssassin Bayes database. Skimming through the list archives, I found
people discussing several possibilities that I had been considering too:

- Feed the spam/ham messages back with a "forward". Problem: Outlook munges
vital headers, and attachments may end up in a different encoding, since
Exchange decodes the whole body and attachments and re-encodes them on
forward. After all, Exchange isn't based on SMTP internally (at least not
Exchange 5.5, which we are still using).

- Feed messages back by having the users copy/paste the headers into the
"forward" email, then extract and reconstruct the message somehow. Problem:
cumbersome (management would certainly yell), and the body/attachment
encoding problem remains.

- Have users sort spam (and wrongly marked ham) into separate folders, and
access them with CDO or OLE automation of Outlook. Users are happy, but the
whole message would need to be reconstructed from the original headers, body
and attachments, losing valuable information.

- Have users sort spam and ham into separate folders, and extract them with
IMAP. Users are happy and headers should be fine, but I think the original
encoding used for body and attachments is still lost: what we feed back to
sa-learn is a mail freshly re-encoded by Exchange.
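
As a rough illustration of the IMAP option, here is a minimal Python sketch
that pulls every message from a folder and pipes it to sa-learn on stdin.
The helper names (sa_learn_command, fetch_raw_messages, learn_folder) are
made up for this example, and it assumes sa-learn is on the PATH and reads a
single message from standard input when no file is given.

```python
import imaplib  # needed by the caller to open the connection, see note below
import subprocess


def sa_learn_command(is_spam):
    """Build the sa-learn invocation; the message itself is piped on stdin."""
    return ["sa-learn", "--spam" if is_spam else "--ham"]


def fetch_raw_messages(conn, folder):
    """Yield every message in `folder` as raw RFC 822 bytes.

    `conn` is a logged-in imaplib.IMAP4 (or IMAP4_SSL) connection.
    """
    conn.select(folder, readonly=True)  # read-only: leave message flags alone
    status, data = conn.search(None, "ALL")
    if status != "OK":
        return
    for num in data[0].split():
        status, parts = conn.fetch(num, "(RFC822)")
        if status == "OK" and parts and isinstance(parts[0], tuple):
            yield parts[0][1]


def learn_folder(conn, folder, is_spam):
    """Feed each message of `folder` to sa-learn; return how many were fed."""
    count = 0
    for raw in fetch_raw_messages(conn, folder):
        subprocess.run(sa_learn_command(is_spam), input=raw, check=True)
        count += 1
    return count
```

A caller would open the connection itself, for example with
imaplib.IMAP4_SSL("mail.example.com") followed by login(), and then call
learn_folder(conn, "Spam", True). The host and folder names here are
placeholders, not part of the setup described above.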

Can anybody with more knowledge of the workings of SpamAssassin tell me
whether the loss of the original encoding of body and attachments is a VERY
BAD THING?
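
To make the question concrete, the following Python sketch builds the same
text body twice, once base64-encoded and once quoted-printable, and compares
the decoded payloads. It only demonstrates the MIME-level point that a
different transfer encoding can carry identical decoded content. It says
nothing about which parts of a message SpamAssassin actually tokenizes, and
the sample message is invented for the example.

```python
import base64
import quopri
from email import message_from_bytes

# Sample body with one non-ASCII character so the encodings really differ.
BODY = "Buy now at caf\u00e9 prices, 100% free!\n".encode("utf-8")


def build_message(transfer_encoding, encoded_body):
    """Assemble a minimal single-part message with the given encoding."""
    headers = (
        b"Message-ID: <demo@example.com>\n"
        b"Subject: test\n"
        b"MIME-Version: 1.0\n"
        b"Content-Type: text/plain; charset=utf-8\n"
        b"Content-Transfer-Encoding: " + transfer_encoding + b"\n"
        b"\n"
    )
    return headers + encoded_body


def decoded_body(raw):
    """Return the body bytes after undoing the transfer encoding."""
    return message_from_bytes(raw).get_payload(decode=True)


as_base64 = build_message(b"base64", base64.encodebytes(BODY))
as_qp = build_message(b"quoted-printable", quopri.encodestring(BODY))

print("raw bytes identical:    ", as_base64 == as_qp)
print("decoded bodies identical:", decoded_body(as_base64) == decoded_body(as_qp))
```

If the learner only looks at decoded text, re-encoding by Exchange would
matter much less than changes to headers or MIME structure, but that is
exactly the part that needs confirming.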

If it is, I was thinking: SpamAssassin already analysed all those (inbound)
messages when they were first delivered.
Is it possible (are there any hooks) to extract the logical result of that
analysis?
I haven't yet found anything relevant in the Mail::SpamAssassin POD; I
suppose I'll have to check the gory details of the learn() and parse()
methods. Possibly the returned Mail::SpamAssassin::PerMsgLearner object will
be useful.

So we could save that information (for some time, say a couple of weeks,
depending on size and so on) using the Message-ID as a key.
Later, instead of running sa-learn --spam <path_to_spam_msg>, we could
retrieve that info (extract the Message-ID from the headers, retrieve the
analysis data from the db) and feed it back.
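
The storage side of this idea could look roughly like the Python sketch
below: an SQLite table keyed by Message-ID, holding whatever the analysis
produced, with entries expiring after two weeks. The list of tokens is only
a stand-in for the real analysis data, since whether SpamAssassin exposes
that data is the open question. All function and table names are invented
for the example.

```python
import json
import sqlite3
import time
from email import message_from_bytes

RETENTION_SECONDS = 14 * 24 * 3600  # "a couple of weeks"


def open_store(path=":memory:"):
    """Open (or create) the store; ':memory:' is handy for experiments."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS analysis ("
        "msgid TEXT PRIMARY KEY, stored_at REAL NOT NULL, tokens TEXT NOT NULL)"
    )
    return db


def message_id(raw):
    """Extract the Message-ID header from raw message bytes, or None."""
    value = message_from_bytes(raw)["Message-ID"]
    return value.strip() if value else None


def save_analysis(db, raw, tokens, now=None):
    """Store the analysis data at delivery time, keyed by Message-ID."""
    msgid = message_id(raw)
    if msgid is None:
        return None
    stamp = time.time() if now is None else now
    db.execute(
        "INSERT OR REPLACE INTO analysis VALUES (?, ?, ?)",
        (msgid, stamp, json.dumps(sorted(tokens))),
    )
    db.commit()
    return msgid


def load_analysis(db, raw):
    """Look up stored data using the Message-ID of a (re-encoded) copy."""
    row = db.execute(
        "SELECT tokens FROM analysis WHERE msgid = ?", (message_id(raw),)
    ).fetchone()
    return json.loads(row[0]) if row else None


def expire(db, now=None):
    """Delete entries older than the retention period; return the count."""
    cutoff = (time.time() if now is None else now) - RETENTION_SECONDS
    cur = db.execute("DELETE FROM analysis WHERE stored_at < ?", (cutoff,))
    db.commit()
    return cur.rowcount
```

This only works if the Message-ID header survives the trip through Exchange
unchanged, and messages without a Message-ID are simply skipped.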

Could anybody with better knowledge of the internal workings of SpamAssassin
tell me:
- whether this is even necessary / useful? After all I AM a newbie in this
area; maybe there is some other easy way I haven't spotted yet, or the loss
of the original encoding is not that important

- if this is already possible

- if not, whether this would be possible with the current codebase. I suppose
so: basically, in learn(), locate the necessary data structures, encode them
in a standard and portable format, and save them somewhere. Reverse the
process at the inserting stage.

- any pointers on where to start implementing the hooks, or pitfalls to avoid

- whether something similar is possibly already WIP somewhere

Thanks

Heiko Herold

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax
