Re: SA 'Parts' Documentation WIP

David Muir Sharnoff 18 Feb 2004 23:21:52 -0000

I'm currently using 2.63 and I wish that full, body & rawbody acted as 
you describe and suggest.  It would make it much easier for me to write rules!


As for URIs, I think it would be a mistake to include <img src=> 
URIs, as many of those point to innocent bystanders.

URIs should be HTML decoded.  I've run into a number of spammers who
write things like:

        h&#116;&#0x00054;p://

HTML encoding of plain text should cause a high score all by itself.
HTML encoding of URIs should score even higher.
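
To illustrate the decoding I mean, here is a minimal Python sketch (not
SpamAssassin code). The function name decode_numeric_refs is made up for
this example, and accepting the non-standard "&#0x..." form is an
assumption based on the spammer sample above. It handles numeric
character references only, not named entities.

```python
import re

# Matches &#116; (decimal), &#x54; (hex) and the non-standard &#0x00054;
# form seen in the spam sample. The trailing semicolon is optional.
_NUMERIC_REF = re.compile(r"&#(?:0?[xX]([0-9a-fA-F]+)|([0-9]+));?")


def decode_numeric_refs(text):
    """Replace numeric character references in text with the characters
    they encode. Hypothetical helper, for illustration only."""

    def repl(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        code = int(hex_digits, 16) if hex_digits else int(dec_digits, 10)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Out-of-range code point: leave the reference untouched.
            return match.group(0)

    return _NUMERIC_REF.sub(repl, text)


decoded = decode_numeric_refs("h&#116;&#0x00054;p://")
```

A rule could then match on the decoded text, and a difference between the
raw and decoded forms could itself be treated as a sign of obfuscation.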

In terms of naming, "full" and "rawbody" don't have predictable meanings
without reading the documentation.  (With 2.63 they aren't predictable
even with the documentation.)

I suggest that "full" and "rawbody" be deprecated in favor of new names:
"raw" (or "rawmessage") would replace "full", and "alltext" would
replace "rawbody".

Thanks,

-Dave


*  ...
*
* Full:
* The 'Full' part of a message is the complete, un-decoded, RFC822
* message. It includes all parts (textual or otherwise) and all headers,
* including headers from all parts. Essentially, the 'Full' part of a
* message is a text dump of the original message, except any \r\n
* linebreaks are converted to \n.
* 
* Rawbody:
* The 'Rawbody' part of a message is a concatenation of all the decoded
* textual parts (parts with a Content-Type of 'text/') of the message, and
* also includes the headers from message/rfc822 child parts. Any \r\n
* linebreaks are converted to \n.
* 
* Body:
* The 'Body' part of a message is a concatenation of all the decoded
* textual parts of the message, including headers from any textual child
* parts, just like the Rawbody part, with these exceptions: Single
* linebreaks are removed, multiple whitespaces in a row are converted to
* just one whitespace, and multiple linebreaks in a row are converted to a
* single linebreak. HTML-like tags are removed (anything starting with <,
* with at least one character in it, and ending with >. Tags like <>,
* <<<>>>, < >, <.>, etc, are not removed). Anything that looked like a URI
* in the text (in an HTML tag or not) will be listed as URI:uri. Then, the
* value from the top-part Subject header is prepended to the
* Body part.
* 
* URI:
* The 'URI' part is a list of all the URIs collected from the Rawbody
* part. Valid URIs are http://, https://, ftp://, mailto://, javascript://
* and file://. URIs not in this valid list (such as news://) are not
* included in the URI part. The URI collection will also include
* schemeless URIs which have an obvious scheme, such as www.domain.com,
* [EMAIL PROTECTED] and ftp.domain.com. Some will be added to the list as
* both the original (www.domain.com, ftp.domain.com) and as they would be
* schemed (http://www.domain.com, ftp://ftp.domain.com), except mailto://,
* which is only added schemed (mailto://[EMAIL PROTECTED]). Also, URIs that
* are malformed, such as http://www.domain.com?var, are added to the list
* in 'fixed' form, e.g. http://www.domain.com/?var. There are no duplicate
* entries in the URI list.
* 
*  ...
* 
* URI: As is, some URIs are 'missed' in this list. (Try news://news.com,
* or <img src=www.missed.com). I also think it would be a good idea to add
* unschemed mailto:// as both schemed and unschemed to the URI list as
* someone could write a rule to match start ([EMAIL PROTECTED], for
* example). Also, I wonder why SA doesn't use the RFC 2396 regex to match
* all the valid URIs. Seems it could grab all the valid URIs from a
* message in one sweep, unless I'm missing something? Then one could go
* through the list and scheme any URIs that needed it.
* 
* Please let me know of any comments/suggestions!
*
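
The URI list rules quoted above can be sketched roughly as follows. This
is only a Python illustration of the description, not how SA actually
does it: build_uri_list and fix_missing_slash are hypothetical names, the
input is assumed to be URI-like strings that have already been extracted,
and user@example.com is a placeholder address. The "mailto://" form
follows the quoted text.

```python
import re

# Schemes the quoted description lists as valid.
ALLOWED_SCHEMES = ("http", "https", "ftp", "mailto", "javascript", "file")

_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*)://")


def fix_missing_slash(uri):
    """Turn scheme://host?query into scheme://host/?query."""
    return re.sub(r"^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]+)\?", r"\1/?", uri)


def build_uri_list(candidates):
    """Build a duplicate-free URI list from already-extracted strings."""
    seen = set()
    result = []

    def add(uri):
        if uri not in seen:
            seen.add(uri)
            result.append(uri)

    for raw in candidates:
        match = _SCHEME.match(raw)
        if match:
            # Schemed URI: drop it unless the scheme is in the valid list.
            if match.group(1).lower() not in ALLOWED_SCHEMES:
                continue
            add(fix_missing_slash(raw))
        elif raw.lower().startswith("www."):
            # Listed both as written and with the obvious scheme.
            add(raw)
            add(fix_missing_slash("http://" + raw))
        elif raw.lower().startswith("ftp."):
            add(raw)
            add("ftp://" + raw)
        elif "@" in raw:
            # Bare addresses are listed in schemed form only.
            add("mailto://" + raw)
    return result


uris = build_uri_list([
    "www.domain.com",
    "http://www.domain.com",
    "news://news.com",
    "http://www.domain.com?var",
    "ftp.domain.com",
    "user@example.com",
])
```

Adding the unschemed address as well, as proposed in the quoted text,
would be one extra add(raw) call in the last branch.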
