RE: Gateways, analyze first, insert into bayes later ?

Matt Yackley 13 Apr 2005 01:16:09 -0000

Herold Heiko said:
>> From: Matt Yackley [mailto:[EMAIL PROTECTED]
>> Are you using a sitewide bayes DB?  This may affect your
>
> I will at first, I need to start as soon as possible,


This should be a bit easier to manage and quicker to set up, and you may find
that it works well enough to skip setting up per-user bayes.

>
>> I use public folders for message submission, users can see
>> the folders, create
>
> I suppose a public folder in order to avoid needing to access individual
> mailboxes with IMAP?

That's one of the reasons I went with this type of setup: it's much easier to
manage. Set it up once and you pretty much don't have to mess with it again.
Multiple servers in distant locations? No problem, just set up replicas on all
servers.

>
>> Are you thinking of having the users "push" the messages to the
>> relay server or pulling
>> the message out of Exchange from the relay server?
>
> Pull with IMAP I think (another possibility would be extract with CDO/MAPI
> and push, but that has the drawback of more encoding work).
> At least until migration to ex2k*

An IMAP pull would most likely be the quickest and easiest setup, but you may
lose some headers when messages are pulled from public folders...

One warning... when we migrated to ex2k and had users on both, we continued to
use the IMC on 5.5 for inbound mail; when messages were transferred to an ex2k
box, all SMTP headers were removed. Our fix was to shut down the IMC on 5.5 and
have inbound mail come into ex2k first; then, when messages were delivered to a
5.5 box, the SMTP headers were left intact.

>
snip
>
> I'd try to strip big binary attachments before storing, should probably save
> a lot of space.
> Still I'd prefer going the analyse first, insert into bayes later route,
> since it needs to store only bayes data, not whole emails (potentially huge
> db). But that has the counterpart of needing a spamassassin patch.
> For me it is a moot point for now anyway, not enough time, I'll try the imap
> route first, think about a better solution later.

AFAIK, that would require a lot of hacking on the SA and sa-learn code. When a
message goes through SA, bayes compares various "tokens" in the message to its
DB; if it finds enough data to work with, it simply says, "hey, I'm x% sure
this looks like ham or spam", assigns a score and moves on. The only time it
would retain any information is if you were running bayes with autolearn and
the message scored high enough or low enough to be "learned". At that point I
believe it records the message ID in bayes_seen and inserts a hash of the
chosen "tokens" into the DB as spam or ham. It does not retain any info on
which "tokens" came from which email, only whether they were counted as ham or
spam.
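
To make the bookkeeping described above concrete, here is a toy sketch in
Python. It is not SpamAssassin's code or schema: the class ToyTokenStore, the
truncated SHA-1 hashing and the in-memory counters are all assumptions made
for illustration. It only shows why the per-message token list cannot be
recovered once a message has been learned.

```python
import hashlib
from collections import Counter


class ToyTokenStore:
    """Toy stand-in for a Bayes token DB (hypothetical, not SA's schema)."""

    def __init__(self):
        self.spam = Counter()  # token hash -> number of spam messages seen
        self.ham = Counter()   # token hash -> number of ham messages seen
        self.seen = {}         # message ID -> "spam" or "ham"

    @staticmethod
    def _hash(token):
        # Assumption: any short, stable hash is enough for the illustration.
        return hashlib.sha1(token.encode("utf-8")).hexdigest()[:10]

    def learn(self, msg_id, tokens, is_spam):
        """Record a message; return False if its ID was already learned."""
        if msg_id in self.seen:
            return False
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[self._hash(tok)] += 1
        # Only the ID and the verdict are kept, never the token list itself.
        self.seen[msg_id] = "spam" if is_spam else "ham"
        return True
```

After learn() returns, the store holds aggregate counts plus the message ID,
so nothing links a given hash back to the message it came from.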

Under this model I don't think there is a way to come back later and reuse
the "bayes analysis" data. This is why I'm thinking about a tool that would
store all messages in a rolling window of, say, two weeks, and then use the
message ID from a submitted message to pull the original and feed it to
sa-learn.
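
A minimal sketch of such a rolling-window store, in Python, might look like
the following. Everything here is hypothetical: the RollingStore class, its
method names and the in-memory dict are placeholders, and a real tool would
need persistent storage and would have to cope with missing or duplicate
Message-ID headers.

```python
import time
from email import message_from_string

TWO_WEEKS = 14 * 24 * 3600  # rolling window length in seconds


class RollingStore:
    """Keep raw messages for a limited time, indexed by Message-ID."""

    def __init__(self, window_seconds=TWO_WEEKS):
        self.window = window_seconds
        self._msgs = {}  # message ID -> (arrival time, raw message text)

    def add(self, raw, now=None):
        """Store a raw message; return its Message-ID, or None if absent."""
        now = time.time() if now is None else now
        msg_id = message_from_string(raw)["Message-ID"]
        if msg_id is None:
            return None
        msg_id = msg_id.strip()
        self._msgs[msg_id] = (now, raw)
        return msg_id

    def expire(self, now=None):
        """Drop messages older than the window; return how many were dropped."""
        now = time.time() if now is None else now
        cutoff = now - self.window
        old = [k for k, (t, _) in self._msgs.items() if t < cutoff]
        for k in old:
            del self._msgs[k]
        return len(old)

    def lookup(self, msg_id):
        """Return the original raw message for a submitted ID, or None."""
        entry = self._msgs.get(msg_id.strip())
        return entry[1] if entry else None
```

The text returned by lookup() is the original message as it arrived at the
gateway, which could then be handed to sa-learn (with its --spam or --ham
option) instead of the copy the user submitted.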


Cheers,
matt
