After some research, I've come up with this documentation to describe
the different message 'parts' that SA uses to evaluate messages. This is
a work in progress, so please let me know of any corrections that need
to be made. The documentation is based on the behavior I have seen with
SA from an SVN checkout a few days old. Once this documentation is
confirmed, I will generate some test messages to demonstrate the
behavior, then work on writing tests for SA so that they can be included
in future SA versions to make sure the behavior of the parts doesn't
change inadvertently.

Full:
The 'Full' part of a message is the complete, un-decoded, RFC822
message. It includes all parts (textual or otherwise) and all headers,
including headers from all parts. Essentially, the 'Full' part of a
message is a text dump of the original message, except any \r\n
linebreaks are converted to \n.

Rawbody:
The 'Rawbody' part of a message is a concatenation of all the decoded
textual parts (parts with a Content-Type of 'text/') of the message, and
also includes the headers from message/rfc822 child parts. Any \r\n
linebreaks are converted to \n.
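
To make the description above concrete, here is a rough sketch in Python (not SA's own Perl code) of how the decoded text/ parts could be concatenated. The function name rawbody_sketch and the sample message are made up for illustration, the fallback to UTF-8 for unknown charsets is an assumption, and the sketch leaves out the message/rfc822 child headers mentioned above.

```python
from email import message_from_string


def rawbody_sketch(raw_message):
    """Concatenate the decoded text/* parts of a message (illustration only)."""
    msg = message_from_string(raw_message)
    chunks = []
    for part in msg.walk():
        # Skip multipart containers and non-textual parts.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to UTF-8 (an assumption).
            text = payload.decode("utf-8", errors="replace")
        chunks.append(text.replace("\r\n", "\n"))
    return "".join(chunks)


# Hypothetical two-part message: one base64 text part, one binary part.
SAMPLE = (
    "Subject: demo\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "aGVsbG8gd29ybGQK\n"
    "--b\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "not text\n"
    "--b--\n"
)

print(repr(rawbody_sketch(SAMPLE)))
```

Only the base64 text part contributes to the result; the application/octet-stream part is skipped.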

Body:
The 'Body' part of a message is a concatenation of all the decoded
textual parts of the message, including headers from any textual child
parts, just like the Rawbody part, with these exceptions: Single
linebreaks are removed, runs of whitespace are converted to a single
space, and multiple linebreaks in a row are converted to a single
linebreak. HTML-like tags are removed (anything starting with <, with at
least one character in it, and ending with >; tags like <>, <<<>>>, < >,
<.>, etc., are not removed). Anything that looks like a URI in the text
(in an HTML tag or not) is listed as URI:uri. Finally, the value of the
top part's Subject header is prepended to the Body part.
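
The normalization steps described above can be approximated with a short Python sketch. This is my reading of the observed behavior, not SA's implementation: the names body_sketch and TAG_RE are hypothetical, the tag pattern (a letter, / or ! right after the <) is an assumption chosen so that <>, < > and <.> survive, single linebreaks are replaced by a space rather than dropped outright, and the URI:uri listing is left out.

```python
import re

# Assumed definition of an "HTML-like tag": '<' followed by a letter, '/'
# or '!', then anything up to the next '>'. This leaves <>, < > and <.> alone.
TAG_RE = re.compile(r"<[A-Za-z/!][^<>]*>")


def body_sketch(subject, rawbody):
    """Approximate the Body part from a Subject value and Rawbody text."""
    text = TAG_RE.sub("", rawbody)
    # Two or more linebreaks in a row separate paragraphs.
    paragraphs = re.split(r"\n{2,}", text)
    # Inside a paragraph, single linebreaks and runs of whitespace
    # collapse to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    cleaned = [p for p in cleaned if p]
    # The Subject value goes first, one paragraph per line after it.
    return "\n".join([subject] + cleaned)


print(body_sketch("Hi", "<b>Buy</b>  now\nplease\n\n\nThanks <> < > <.>"))
```

With that input the sketch yields three lines: the subject, "Buy now please", and "Thanks <> < > <.>".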

URI:
The 'URI' part is a list of all the URIs collected from the Rawbody
part. Recognized schemes are http://, https://, ftp://, mailto://,
javascript:// and file://. URIs with other schemes (such as news://) are not
included in the URI part. The URI collection will also include
schemeless URIs which have an obvious scheme, such as www.domain.com,
[EMAIL PROTECTED] and ftp.domain.com. Some will be added to the list as
both the original (www.domain.com, ftp.domain.com) and as they would be
schemed (http://www.domain.com, ftp://ftp.domain.com), except mailto://,
which is only added schemed (mailto://[EMAIL PROTECTED]). Also, malformed
URIs such as http://www.domain.com?var are added to the list in 'fixed'
form (http://www.domain.com/?var). There are no duplicate entries in the
URI list.
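
As a rough illustration of the collection rules above, here is a Python sketch. The regular expressions, the trailing-punctuation trimming and the names uri_list_sketch and fix_missing_slash are all my own assumptions rather than SA's code, and the handling of bare e-mail addresses is left out.

```python
import re

# Schemes taken from the list above; everything else about these patterns
# is an assumption made for the sketch.
URI_RE = re.compile(
    r"\b(?P<schemed>(?:https?|ftp|mailto|javascript|file)://[^\s<>\"']+)"
    r"|(?<![\w/.@-])(?P<bare>(?:www|ftp)\.[^\s<>\"']+)"
)
TRAILING = ".,;:!?)"


def fix_missing_slash(uri):
    """Turn scheme://host?query into scheme://host/?query."""
    return re.sub(r"^(\w+://[^/?#]+)\?", r"\1/?", uri)


def uri_list_sketch(text):
    """Collect URIs in order of appearance, without duplicates."""
    found = []

    def add(uri):
        if uri not in found:
            found.append(uri)

    for match in URI_RE.finditer(text):
        token = match.group(0).rstrip(TRAILING)
        if match.group("schemed"):
            add(fix_missing_slash(token))
        else:
            # Schemeless host: keep the original and add a schemed copy.
            add(token)
            scheme = "ftp" if token.startswith("ftp.") else "http"
            add(fix_missing_slash(scheme + "://" + token))
    return found


text = ("See http://www.domain.com?var and www.domain.com or "
        "ftp.domain.com, not news://news.com; again www.domain.com")
for uri in uri_list_sketch(text):
    print(uri)
```

The news:// URI is skipped, the second www.domain.com is not repeated, and the ?var URI comes out with the slash inserted.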



Based on what I've seen, I've made a few assumptions as to reasoning for
the parts. Please let me know if these sound about right.

Full is the unaltered RFC822, other than \r\n to \n newlines. You would
write rules against this part to match against specific headers, message
boundaries, and any other normally not mail-client-visible text
(including encoded text that would normally be decoded by a client).

Rawbody is all of the decoded body text from a message. You would
write rules against this part to match against HTML, relations between
HTML and text, and rules that are dependent on whitespace or linebreaks.

Body is all of the decoded body text from a message, mostly as it
would be displayed in a mail client. You would write rules against this
part to match specific sentences, words, or sequences of either.

URI is all of the known-URIs in the message. You would write rules
against this part to match against a specific URI.


With that in mind, I have a few suggestions:

Full: is great as it is. Not removing parts (as 2.63 did) is a good
idea, in my opinion.

Rawbody: is good, and unlike 2.63 it no longer puts in the boundaries
for missing parts, though it only includes headers for some message parts. Would
it possibly be a good idea to make Rawbody exactly the same as full,
except with decoded textual parts and decoded headers, and non-textual
parts removed? That would mean rawbody would have all the headers for
each part (even top part), except any that were encoded would be
decoded.

Body: I don't understand why the value from the Subject header is
prepended to the body. If you want to match the Subject header
against something, wouldn't you use Full? Also, I don't understand why
URIs are inserted as URI:uri, I thought perhaps to make the URI list
collection easier later, but from what I can see the URI collection is
done from Rawbody anyway. I can see the inserted URI:uri being
undesirable, as it could interfere with a rule (for example, if the URI
is hidden in a client and a rule is written to match the two words
around it).
Removing the HTML is good too, and though Daniel mentioned the HTML was
'rendered' I can't see how. It would be a good idea, I think, to replace
specific tags with the 'rendered' equivalents. (<br>, <p>, etc).

URI: As is, some URIs are 'missed' in this list. (Try news://news.com,
or <img src=www.missed.com). I also think it would be a good idea to add
unschemed mailto:// as both schemed and unschemed to the URI list as
someone could write a rule to match start ([EMAIL PROTECTED], for
example). Also, I wonder why SA doesn't use the RFC 2396 regex to match
all the valid URIs. Seems it could grab all the valid URIs from a
message in one sweep, unless I'm missing something? Then one could go
through the list and scheme any URIs that needed it.
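
For reference, here is the regular expression from Appendix B of RFC 2396 wrapped in a small Python sketch (the function name split_uri is hypothetical). One thing worth noting: every group in that expression is optional, so it matches any string at all. It splits a single, already isolated URI reference into its components rather than locating URIs inside running text, so a separate tokenizing step would still be needed before it.

```python
import re

# Regular expression from RFC 2396, Appendix B.
RFC2396_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split one URI reference into the five components named in the RFC."""
    m = RFC2396_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_uri("http://www.domain.com?var"))
print(split_uri("www.domain.com"))
```

For the schemeless www.domain.com the whole string lands in the path component with no scheme or authority, which is where the "scheme any URIs that need it" pass would come in.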

Please let me know of any comments/suggestions!
