After some research, I've put together this documentation describing the different message 'parts' that SA uses to evaluate messages. This is a work in progress, so please let me know of any corrections that need to be made. The documentation is based on the behavior I have seen with SA from an SVN checkout a few days old. Once this documentation is confirmed, I will generate some test messages to demonstrate the behavior, then work on writing tests for SA so that they can be included in future SA versions to make sure the parts behavior doesn't change inadvertently.

Full: The 'Full' part of a message is the complete, un-decoded RFC822 message. It includes all parts (textual or otherwise) and all headers, including headers from all parts. Essentially, the 'Full' part of a message is a text dump of the original message, except that any \r\n linebreaks are converted to \n.

Rawbody: The 'Rawbody' part of a message is a concatenation of all the decoded textual parts (parts with a Content-Type of 'text/') of the message, and also includes the headers from message/rfc822 child parts. Any \r\n linebreaks are converted to \n.

Body: The 'Body' part of a message is a concatenation of all the decoded textual parts of the message, including headers from any textual child parts, just like the Rawbody part, with these exceptions: Single linebreaks are removed, multiple whitespace characters in a row are converted to a single whitespace, and multiple linebreaks in a row are converted to a single linebreak. HTML-like tags are removed (anything starting with <, with at least one character in it, and ending with >. Tags like <>, <<<>>>, < >, <.>, etc., are not removed). Anything that looked like a URI in the text (in an HTML tag or not) will be listed as URI:uri. Then, the value from the top-part Subject header is added to the top of the Body part.

URI: The 'URI' part is a list of all the URIs collected from the Rawbody part. Valid URIs are http://, https://, ftp://, mailto://, javascript:// and file://. URIs not in this valid list (such as news://) are not included in the URI part. The URI collection will also include schemeless URIs which have an obvious scheme, such as www.domain.com, [EMAIL PROTECTED] and ftp.domain.com. These are added to the list both in their original form (www.domain.com, ftp.domain.com) and as they would be schemed (http://www.domain.com, ftp://ftp.domain.com), except for mail addresses, which are only added schemed (mailto://[EMAIL PROTECTED]).
Also, URIs that are malformed, such as http://www.domain.com?var, are added to the list in 'fixed' form: http://www.domain.com/?var. There are no duplicate entries in the URI list.

Based on what I've seen, I've made a few assumptions about the reasoning behind the parts. Please let me know if these sound about right.

Full is the unaltered RFC822 message, other than \r\n to \n newlines. You would write rules against this part to match against specific headers, message boundaries, and any other text not normally visible in a mail client (including encoded text that would normally be decoded by a client).

Rawbody is all of the decoded body text from a message. You would write rules against this part to match against HTML, relations between HTML and text, and rules that are dependent on whitespace or linebreaks.

Body is all of the decoded body text from a message, mostly as it would be displayed in a mail client. You would write rules against this part to match specific sentences, words, or sequences of either.

URI is all of the known URIs in the message. You would write rules against this part to match against a specific URI.

With that in mind, I have a few suggestions:

Full: is great as it is. Not removing parts (as 2.63 did) is a good idea, in my opinion.

Rawbody: is good in that it doesn't put in the boundaries for missing parts as 2.63 did, though it only includes headers for some message parts. Would it possibly be a good idea to make Rawbody exactly the same as Full, except with decoded textual parts and decoded headers, and non-textual parts removed? That would mean Rawbody would have all the headers for each part (even the top part), except that any that were encoded would be decoded.

Body: I don't understand why the value from the Subject header is added to the top of Body. If you want to match the Subject header against something, wouldn't you use Full?
Also, I don't understand why URIs are inserted as URI:uri. I thought perhaps it was to make the URI list collection easier later, but from what I can see the URI collection is done from Rawbody anyway. I can see the URI:uri text being undesirable, as it could interfere with a rule (for example, if the URI: part was 'hidden' in a client, and a rule was written for the two words around it). Removing the HTML is good too, and though Daniel mentioned the HTML was 'rendered', I can't see how. It would be a good idea, I think, to replace specific tags (<br>, <p>, etc.) with their 'rendered' equivalents.

URI: As is, some URIs are 'missed' in this list (try news://news.com, or <img src=www.missed.com). I also think it would be a good idea to add unschemed mail addresses to the URI list both schemed and unschemed, as someone could write a rule to match the start ([EMAIL PROTECTED], for example). Also, I wonder why SA doesn't use the RFC 2396 regex to match all the valid URIs. It seems it could grab all the valid URIs from a message in one sweep, unless I'm missing something? Then one could go through the list and add a scheme to any URIs that needed it.

Please let me know of any comments/suggestions!
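
To make the Body description above easier to check against, here is a rough sketch of the behavior as I have described it. It is written in Python purely for illustration (SA itself is Perl) and is not SA's actual code. The function name normalize_body and the tag pattern are my own assumptions: I treat a tag as '<' followed by a letter, '/' or '!', which leaves <>, < > and <.> alone, and I assume a removed single linebreak becomes a space. The URI:uri insertion is left out.

```python
import re

# Assumption: a "tag" is '<' followed by a letter, '/' or '!', up to the
# next '>'. Runs such as <>, < > and <.> do not match and are kept.
TAG_RE = re.compile(r"<[A-Za-z/!][^<>]*>")


def normalize_body(rawbody, subject):
    """Approximate the observed Rawbody -> Body transformation."""
    # \r\n linebreaks are converted to \n.
    text = rawbody.replace("\r\n", "\n")
    # HTML-like tags are removed outright (not rendered).
    text = TAG_RE.sub("", text)
    # Runs of two or more linebreaks separate paragraphs; each run
    # collapses to a single linebreak when the paragraphs are rejoined.
    paragraphs = re.split(r"\n{2,}", text)
    # Inside a paragraph, single linebreaks and whitespace runs become
    # one space (assumption: "removed" means replaced by a space).
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    cleaned = [p for p in cleaned if p]
    # The top-level Subject value goes at the top of the Body part.
    return "\n".join([subject] + cleaned)


if __name__ == "__main__":
    sample = "Hello\r\nworld,  this is <b>bold</b>\n\n\nNext <> para"
    print(normalize_body(sample, "Hi"))
```

Note that removing a tag outright can join the words on either side of it (a<br>b becomes ab), which is part of why I suggest replacing some tags with their 'rendered' equivalents.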

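Similarly, the URI list behavior described above could be sketched roughly as follows. Again this is Python for illustration only, not SA's code. The names collect_uris and fix_missing_slash, the whitespace tokenizing, the simple mail address pattern, and the address user@example.com are all hypothetical stand-ins of mine.

```python
import re

# Schemes that were observed to be collected.
SCHEMES = ("http", "https", "ftp", "mailto", "javascript", "file")

# Assumption: a very loose mail address pattern, for illustration only.
EMAIL_RE = re.compile(r"[^@\s:]+@[^@\s:]+\.[^@\s:]+")


def fix_missing_slash(uri):
    """Turn scheme://host?query into scheme://host/?query."""
    return re.sub(r"^(\w+://[^/?#]+)\?", r"\1/?", uri)


def collect_uris(rawbody):
    """Approximate the observed URI list: filtered, schemed, no duplicates."""
    found = []

    def add(uri):
        if uri not in found:  # no duplicate entries in the list
            found.append(uri)

    # Assumption: candidates are whitespace-separated tokens with
    # surrounding punctuation stripped.
    for token in rawbody.split():
        token = token.strip("<>\"'(),.;")
        scheme, sep, _rest = token.partition("://")
        if sep:
            # Schemed URIs are kept only for the known schemes.
            if scheme.lower() in SCHEMES:
                add(fix_missing_slash(token))
        elif token.lower().startswith("www."):
            add(token)              # original form
            add("http://" + token)  # schemed form
        elif token.lower().startswith("ftp."):
            add(token)
            add("ftp://" + token)
        elif EMAIL_RE.fullmatch(token):
            add("mailto://" + token)  # mail addresses: schemed form only
    return found


if __name__ == "__main__":
    text = ("See http://www.domain.com?var and www.domain.com or "
            "news://news.com, mail user@example.com")
    for uri in collect_uris(text):
        print(uri)
```

A token-based approach like this misses URIs embedded in attributes such as src=www.missed.com, which is the kind of case a single RFC 2396 style pattern run over the whole text might catch.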