sanitizing/normalizing messages for feeding sa-learn

btb Wed, 27 Aug 2014 14:07:52 -0700

hi-

we have a system [zimbra] where users can select a message in the mua interface and click a spam or not spam button. this generates a message [containing the selected message] which is ultimately delivered to a mailbox. i intend on retrieving these messages via imap and feeding sa-learn, but they've been a bit adulterated by the time they're retrieved, and i believe some cleanup is probably necessary prior to feeding sa-learn.


here are two samples:

http://dpaste.com/0B6S3FN.txt [claimed to be spam]
http://dpaste.com/3ZZ733Z.txt [claimed to be not spam]

the original message is encapsulated as an attachment, so i was planning on extracting this and discarding the rest of the message - unless sa-learn is magical enough to handle this?
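
a minimal sketch of the extraction step, assuming the original message is attached as a message/rfc822 part (i have not confirmed this is what zimbra always produces). the function name extract_attached_message is just illustrative. it uses python's standard email package and re-serializes the inner message, so header folding may differ slightly from the original bytes:

```python
from email import message_from_bytes


def extract_attached_message(raw):
    """Return the first message/rfc822 attachment as bytes, or None.

    Assumption: the report wraps the original message as a
    message/rfc822 MIME part; other encapsulations are not handled.
    """
    wrapper = message_from_bytes(raw)
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # For message/rfc822 parts the payload is a one-element
            # list holding the encapsulated Message object.
            payload = part.get_payload()
            if isinstance(payload, list) and payload:
                return payload[0].as_bytes()
    return None
```

the returned bytes could then be written to a file and passed to sa-learn --spam or sa-learn --ham, depending on which button produced the report.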

aside from that, i've read https://wiki.apache.org/spamassassin/BayesInSpamAssassin and man 1 sa-learn about spamassassin markup/headers, but would appreciate any feedback for the above samples that might be pertinent - particular headers that i may not have considered removing, etc.


thanks
-ben
