Hmmm.  Doesn't sound good.  I sent a simple text message through a large ISP
to my server, where it arrived in an mbox.  I then compared that message to
the copy that was POPed, sent back as an attachment, and stripped out via the
existing script.

These sanitized messages are pretty short, but I put them on pastebin:
https://pastebin.com/b38RXHgx

Viewed in Outlook, the headers all appear intact, but forwarding as an
attachment appears to strip these:
- Delivered-To:
- all X- headers added by my SA
- all X- headers added by the sending ISP (X-Yahoo*)
- Authentication-Results and DKIM-Signature
- Status: R

The rest of the headers were unaffected.
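
For anyone who wants to repeat this comparison without eyeballing two copies, here is a rough Python sketch using only the standard library. The function names and the two sample messages are made up for illustration (they are not the pastebin contents), and it compares header field names only, not their values.

```python
from email import message_from_string


def header_names(raw_message):
    """Return the lower-cased set of header field names in a raw message."""
    msg = message_from_string(raw_message)
    return {name.lower() for name in msg.keys()}


def stripped_headers(original, forwarded):
    """List header names present in the original but missing after forwarding."""
    return sorted(header_names(original) - header_names(forwarded))


# Hypothetical sample messages, for illustration only.
ORIGINAL = (
    "Delivered-To: user@example.com\n"
    "X-Spam-Status: No, score=0.1\n"
    "From: sender@example.net\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
FORWARDED = (
    "From: sender@example.net\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(stripped_headers(ORIGINAL, FORWARDED))
# prints ['delivered-to', 'x-spam-status']
```

In practice the two strings would be read from the mbox copy and from the extracted attachment.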

I'm not sure how badly stripping the X- headers, DKIM, etc. screws up bayes
learning.  It doesn't SEEM that bad, but it's outside my skillset.  Nor do I
know how badly it munges the other things SA needs to see in a more complex
message, which some of you mentioned.

I need a way to get from Outlook to SA training if I'm to train at all.  For
most of my users the inbound mail is handed off to a 3rd-party Exchange
server that I don't have access to, so a solution along the lines of a public
IMAP folder on the Exchange server is probably not possible.  And I presume
that process messes with the messages too anyway.  I can't cc the users' mail
to my server for later review; there would be too many messages.

If I'm forwarded spam as an attachment for learning, I would need ham
collected by the same method.

My plan wasn't to make this a daily routine, only to help a few users who
say too much spam is slipping through all the other checks untagged, by
training bayes on those problem accounts.  These are old email accounts that
can't be changed and are on the golden spam lists.

The reason to "reassemble" the extracted attachments was just to make it
easier for me to access the messages and review them; it's too tedious at
the console.  I don't know how to use formail to do it, and won't it add some
more headers to the mess too?

FWIW, I did try sa-learn on a sample of extracted attachments in their raw
form.  It was happy with them:
[root@tn3 msg-1502747659-31280-0]# sa-learn --spam *
Learned tokens from 97 message(s) (97 message(s) examined)

But picking through them to vet them would be too tedious at the console.
The extraction gives them random-number-style filenames.
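
One possible way to make the vetting less tedious is to print a one-line summary per extracted file, so the random filenames can be matched to a sender and subject. This is only a sketch with Python's standard library. The function names are hypothetical, and it assumes each extracted file holds exactly one raw RFC 822 message.

```python
from email import message_from_bytes
from email.header import decode_header, make_header
from pathlib import Path


def _decoded(value):
    """Decode an RFC 2047 encoded header value; fall back to the raw text."""
    if value is None:
        return ""
    try:
        return str(make_header(decode_header(value)))
    except Exception:
        # Spam often carries malformed headers; show them undecoded.
        return str(value)


def summarize_message(raw):
    """Return (From, Date, Subject) for one raw message given as bytes."""
    msg = message_from_bytes(raw)
    return tuple(_decoded(msg.get(name)) for name in ("From", "Date", "Subject"))


def summarize_dir(directory):
    """Yield (filename, From, Date, Subject) for each regular file in directory."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            yield (path.name,) + summarize_message(path.read_bytes())
```

Looping over `summarize_dir("msg-1502747659-31280-0")` and printing each tuple would give a quick listing to scan before anything is handed to sa-learn.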

My constraints are:
- messages are sent to a 3rd-party Exchange server
- I have no access to the Exchange server at this time
- users use the Outlook client, v2003 or later
- I use site-wide bayes
- I don't trust the users to feed bayes
- I can't cc their email to my server for later feeding
- I want to use this process for corpus building, not daily maintenance

My plan was:
- receive spam and ham (separately) "as attachments" from Outlook
- extract attachments
- review attachments
- feed attachments to sa-learn

Open to a better method.

If someone is a formail guru, I'd be grateful for help with a command to
assemble the messages so I can try it out.  The goal is to get them into the
mboxcl2 format that my Dovecot uses and that SA would be happy with
(https://wiki2.dovecot.org/MailboxFormat/mbox)
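
In case formail turns out to be awkward, a possible alternative is Python's standard `mailbox` module. This is a different technique from formail, not a substitute command. The sketch below appends each extracted file to a single mbox. The function name and paths are hypothetical. Note that `mailbox.mbox` writes a plain mbox with ">From " quoting and does not add Content-Length headers, so the result is not strictly mboxcl2; whether Dovecot accepts it as-is is an assumption that needs testing.

```python
import mailbox
from pathlib import Path


def build_mbox(source_dir, mbox_path):
    """Append every regular file in source_dir to an mbox; return the count.

    Assumes each file holds exactly one raw message. Files are passed as
    bytes so the headers are not re-serialized; a "From " separator line is
    generated for messages that lack one.
    """
    target = Path(mbox_path).resolve()
    # List the inputs first so the mbox itself is never picked up as input.
    files = [
        p for p in sorted(Path(source_dir).iterdir())
        if p.is_file() and p.resolve() != target
    ]
    box = mailbox.mbox(str(mbox_path), create=True)
    box.lock()
    try:
        for path in files:
            box.add(path.read_bytes())
        box.flush()
    finally:
        box.unlock()
        box.close()
    return len(files)
```

After something like `build_mbox("msg-1502747659-31280-0", "spam.mbox")`, the file can be opened in a mail client for review and then fed to SA with `sa-learn --spam --mbox spam.mbox`.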

Thanks