Re: HTML Validator

2006-11-27 Thread Kenneth Porter
--On Friday, March 10, 2006 5:08 PM -0800 Kenneth Porter 
[EMAIL PROTECTED] wrote:



Anyone know of a good validator that can be run over a MIME part to
report on the quality of the HTML? This might be used as a go/no-go
filter at milter level, or it could be used as an SA plugin to assign a
variable score based on the quality of the HTML.

For mailing lists catering to newbies who love HTML and can't understand
why us old-timers hate it, we can set the list to exclude all invalid
HTML. Sure, we'll accept your HTML. But only if it's really HTML. Not
that crap that most MUA's write.


I was trying to remember a web page I found that counseled not to use 
DOCTYPE and HTML tags around email to escape spam filters (pretty weird 
advice IMO) and I ran across indications that AOL is rejecting mail that 
fails to pass validation:


http://www.petefreitag.com/item/307.cfm
http://info.aol.co.uk/about/spam/mailer-daemon.adp
http://postmaster.info.aol.com/errors/554hvufo.html
http://www.clickz.com/showPage.html?page=3490146


Re: HTML Validator

2006-03-16 Thread Philip Prindeville
Theo Van Dinter wrote:

On Wed, Mar 15, 2006 at 09:58:52PM -0700, Philip Prindeville wrote:
  

Ok, does anyone have *recent* statistical analysis (i.e. not almost a
year old) on this?  It could be that the people using this boneheaded
construct have realized the error of their ways, and stopped doing it.



Unfortunately not.  I updated the ticket
(http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255) with new
stats and a plugin that implements the check so people can play with it.
The best version was comparing domains:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023   0.850    0.00    0.00  (all messages)
    0.0  84.9921  15.0079   0.850    0.00    0.00  (all messages as %)
  0.302   0.3340   0.1195   0.737    0.00    0.01  T_HTTPS_HTTP_MISMATCH

If people want to play with the plugin and can improve the hit rate to
a usable level (or if you find a bug in the code), please let us know!
But otherwise this rule sucks pretty badly.  :(
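
For anyone reading the figures: the S/O column looks like the share of a rule's hits that fall on spam, i.e. SPAM% / (SPAM% + HAM%). That formula is an assumption, not something stated in this thread, but the quoted rows agree with it to within rounding of the published percentages. A minimal Python check (the function name `so_ratio` is made up for illustration):

```python
def so_ratio(spam: float, ham: float) -> float:
    """Share of hits that land on spam, given spam and ham hit figures.

    Assumed definition: spam / (spam + ham). Works on counts or percentages.
    """
    total = spam + ham
    if total == 0:
        return 0.0  # no hits at all, so no meaningful ratio
    return spam / total


# Figures copied from the table quoted above.
rows = {
    "(all messages)": (28446, 5023),
    "(all messages as %)": (84.9921, 15.0079),
    "T_HTTPS_HTTP_MISMATCH": (0.3340, 0.1195),
}
for name, (spam, ham) in rows.items():
    print(f"{so_ratio(spam, ham):.4f}  {name}")
```

The rule row also shows why the hit rate is the complaint: it fires on only about a third of a percent of spam while still touching some ham.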

  

Hmm.  Thanks.  Trying out the attachment, but having issues.  Using 3.1.0
on FC3 Linux.

Updated the bug.

-Philip



Re: HTML Validator

2006-03-16 Thread Theo Van Dinter
On Thu, Mar 16, 2006 at 12:50:34PM -0700, Philip Prindeville wrote:
 Hmm.  Thanks.  Trying out the attachment, but having issues.  Using 3.1.0
 on FC3 Linux.
 
 Updated the bug.

In general, it's bad to have the same conversation in multiple locations.
I'd prefer to discuss issues with the plugin here as opposed to bugzilla since
the plugin was put there so that people in the future can easily access it.
Debugging problems and such I'd prefer to talk about here.

I also responded to your issue in the ticket.  It essentially came down to:
yes, the plugin works fine with 3.1.0.  The errors you saw indicate that
you're not using 3.1.x.

-- 
Randomly Generated Tagline:
Diversity is God's way of amusing himself.




Re: HTML Validator

2006-03-15 Thread Philip Prindeville
Kenneth Porter wrote:

On Friday, March 10, 2006 9:43 PM -0700 Philip Prindeville 
[EMAIL PROTECTED] wrote:

  

Do you mean:

http://validator.w3.org/source/



I thought that was just a web form-based validator. I'll have to look at it 
to see if the validator can be run over an attachment (i.e. an HTML MIME 
part) from a separate mail filter (e.g. MIMEDefang).
  


I'm wondering what would be involved in putting in an HTML parser
that could call various rules to check things, like the case of:

<a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

where the link disagrees with the text between the anchor tags (yeah, you
could limit it to partial matches on the host-portion)...

This seems to be the Korean Chase issue that Chris encountered.

-Philip



Re: HTML Validator

2006-03-15 Thread Theo Van Dinter
On Wed, Mar 15, 2006 at 08:13:48PM -0700, Philip Prindeville wrote:
 I'm wondering what would be involved in putting in an HTML parser
 that could call various rules to check things, like the case of:

Well, you wouldn't call various rules, you'd look for a behavior while
parsing and flag it for later detection by a rule.  The current code
means modifications have to be made to HTML.pm.

 <a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

This kind of rule actually doesn't need to be in the HTML parser,
you could easily write a plugin that uses the already parsed anchor
information.

FWIW though, this rule has previously been discussed and dismissed as
being non-useful (too many FPs).  Earlier today on this list even. ;)

-- 
Randomly Generated Tagline:
You can lead a bigot to water, but if you don't tie him up you can't
 make him drown. - The Psychodots




Re: HTML Validator

2006-03-15 Thread Craig Morrison

Philip Prindeville wrote:

I'm wondering what would be involved in putting in an HTML parser
that could call various rules to check things, like the case of:

<a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

where the link disagrees with the text between the anchor tags (yeah, you
could limit it to partial matches on the host-portion)...


This is the functional equivalent of pissing in the wind. If you are 
downwind, you are going to get wet.


Anchor text in too many/most cases will not match the HREF. grep is 
good, but it isn't good enough to catch all cases without significant 
overhead. Anchor text is a descriptor, nothing more than that. It is not 
a regurgitation of the link HREF.




Re: HTML Validator

2006-03-15 Thread Philip Prindeville
Craig Morrison wrote:

Philip Prindeville wrote:
  

I'm wondering what would be involved in putting in an HTML parser
that could call various rules to check things, like the case of:

<a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

where the link disagrees with the text between the anchor tags (yeah, you
could limit it to partial matches on the host-portion)...



This is the functional equivalent of pissing in the wind. If you are 
downwind, you are going to get wet.

Anchor text in too many/most cases will not match the HREF. grep is 
good, but it isn't good enough to catch all cases without significant 
overhead. Anchor text is a descriptor, nothing more than that. It is not 
a regurgitation of the link HREF.

  


Usually it's not.  That's the point.  It's when the anchor text tries to
look like a URL that one needs to be suspicious.  At the very least, if
the anchor text starts with "https://" but the anchor URL looks like
"http://", I'd say that this is a definite spam.
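
A minimal sketch of that check in Python, for illustration only. It is not SpamAssassin code and not the plugin attached to bug 4255; the names `AnchorCollector` and `suspicious_anchors` are hypothetical. It only looks at anchors whose visible text itself starts with "http://" or "https://", and flags them when the scheme is downgraded or the host differs from the href:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def suspicious_anchors(html):
    """Return anchors whose text looks like a URL that disagrees with the href."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    hits = []
    for href, text in parser.anchors:
        if not text.lower().startswith(("http://", "https://")):
            continue  # ordinary descriptive anchor text is left alone
        try:
            shown, target = urlsplit(text), urlsplit(href)
        except ValueError:
            continue  # unparseable URL; this sketch simply skips it
        scheme_downgrade = shown.scheme == "https" and target.scheme == "http"
        host_mismatch = (shown.hostname or "") != (target.hostname or "")
        if scheme_downgrade or host_mismatch:
            hits.append((href, text))
    return hits


example = '<a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>'
print(suspicious_anchors(example))
```

Whether such hits are frequent enough in spam, and rare enough in ham, to be worth a rule is exactly the statistical question asked next.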

Does anyone have a way of doing a statistical analysis of ham that contains
http(s?):// as the beginning of the anchor text?

-Philip





Re: HTML Validator

2006-03-15 Thread Theo Van Dinter
On Wed, Mar 15, 2006 at 08:40:51PM -0700, Philip Prindeville wrote:
 Does anyone have a way of doing a statistical analysis of ham that contains
 http(s?):// as the beginning of the anchor text?

So for the second time today:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255

-- 
Randomly Generated Tagline:
We are what we pretend to be.
-- Kurt Vonnegut, Jr.




Re: HTML Validator

2006-03-15 Thread Philip Prindeville
Theo Van Dinter wrote:

On Wed, Mar 15, 2006 at 08:40:51PM -0700, Philip Prindeville wrote:
  

Does anyone have a way of doing a statistical analysis of ham that contains
http(s?):// as the beginning of the anchor text?



So for the second time today:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255

  


Ok, does anyone have *recent* statistical analysis (i.e. not almost a
year old) on this?  It could be that the people using this boneheaded
construct have realized the error of their ways, and stopped doing it.

-Philip



Re: HTML Validator

2006-03-11 Thread Eric W. Bates
Kenneth Porter wrote:
 On Wednesday, March 08, 2006 6:46 PM -0800 Kenneth Porter
 [EMAIL PROTECTED] wrote:
 
 Makes me wonder about installing outbound filters that run a validator
 and reject anything that fails. I often see flame wars on mailing lists
 about allowing HTML posts to the list, but I wonder how the arguments
 would change if one allowed only *validated* HTML. I'll bet most who
 insist on using HTML would immediately be rejected by the validator.
 Sorry, your message was rejected because your MUA vendor writes garbage
 that we can't parse, and makes you look like a spammer. ;)
 
 
 Anyone know of a good validator that can be run over a MIME part to
 report on the quality of the HTML? This might be used as a go/no-go
 filter at milter level, or it could be used as an SA plugin to assign a
 variable score based on the quality of the HTML.
 
 For mailing lists catering to newbies who love HTML and can't understand
 why us old-timers hate it, we can set the list to exclude all invalid
 HTML. Sure, we'll accept your HTML. But only if it's really HTML. Not
 that crap that most MUA's write.

I have never used it in a mail context; but tidy (from our friends at w3
http://www.w3.org/People/Raggett/tidy/) is a very nice validator. Might
be too big a load for SA, tho.  I think you will also find that M$ html
output from OE is probably full of errors anyway...
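
A rough sketch of driving tidy from a filter, in Python. It assumes a `tidy` binary is on the PATH and relies on tidy's documented exit statuses (0 clean, 1 warnings, 2 errors) and its `-q` (quiet) and `-e` (errors only) options; the function names and the 10-second timeout are illustrative choices, and nothing here is tuned for mail volume:

```python
import subprocess

# Exit statuses documented for HTML Tidy.
VERDICTS = {0: "clean", 1: "warnings", 2: "errors"}


def verdict_from_status(status):
    """Map a tidy exit status to a short label."""
    return VERDICTS.get(status, "unknown")


def tidy_verdict(html_bytes, tidy_path="tidy", timeout=10):
    """Run tidy in errors-only mode over one HTML part.

    Returns (label, diagnostics). Raises FileNotFoundError if tidy is not
    installed and subprocess.TimeoutExpired if it runs past the timeout.
    """
    proc = subprocess.run(
        [tidy_path, "-q", "-e"],
        input=html_bytes,            # the HTML part is fed on stdin
        stdout=subprocess.DEVNULL,   # the cleaned-up markup is not needed
        stderr=subprocess.PIPE,      # warnings and errors arrive here
        timeout=timeout,
    )
    return verdict_from_status(proc.returncode), proc.stderr.decode("utf-8", "replace")
```

Calling `tidy_verdict(b"<p>unclosed")` on a machine with tidy installed would return a label plus the diagnostics text. Spawning a process per message is part of the load concern, which is why the timeout is there.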

 
 



Re: HTML Validator

2006-03-11 Thread Philip Prindeville
Eric W. Bates wrote:
 I have never used it in a mail context; but tidy (from our friends at w3
 http://www.w3.org/People/Raggett/tidy/) is a very nice validator. Might
 be too big a load for SA, tho.  I think you will also find that M$ html
 output from OE is probably full of errors anyway...

All the better.  Maybe they can be shamed into fixing it.  ;-)

And maybe pigs will grow wings...  Sigh.

-Philip




HTML Validator (was: Interesting Phishing Trick)

2006-03-10 Thread Kenneth Porter
On Wednesday, March 08, 2006 6:46 PM -0800 Kenneth Porter 
[EMAIL PROTECTED] wrote:



Makes me wonder about installing outbound filters that run a validator
and reject anything that fails. I often see flame wars on mailing lists
about allowing HTML posts to the list, but I wonder how the arguments
would change if one allowed only *validated* HTML. I'll bet most who
insist on using HTML would immediately be rejected by the validator.
Sorry, your message was rejected because your MUA vendor writes garbage
that we can't parse, and makes you look like a spammer. ;)


Anyone know of a good validator that can be run over a MIME part to report 
on the quality of the HTML? This might be used as a go/no-go filter at 
milter level, or it could be used as an SA plugin to assign a variable 
score based on the quality of the HTML.


For mailing lists catering to newbies who love HTML and can't understand 
why us old-timers hate it, we can set the list to exclude all invalid HTML. 
Sure, we'll accept your HTML. But only if it's really HTML. Not that crap 
that most MUA's write.
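
To make the idea concrete, here is a minimal Python sketch of the plumbing: pull each text/html MIME part out of a raw message and hand it to a checker that returns a problem count. The checker shown is only a crude tag-balance counter standing in for a real validator, and it is stricter than HTML itself, since it also counts omitted optional end tags such as </p> and </li>. All names are hypothetical, and it assumes each part declares a charset Python knows:

```python
import email
from email import policy
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class TagBalanceChecker(HTMLParser):
    """Crude stand-in for a validator: counts mismatched or unclosed tags."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open elements
        self.problems = 0   # mismatches seen so far

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag in self.stack:
            # Everything opened after `tag` was left unclosed.
            while self.stack[-1] != tag:
                self.stack.pop()
                self.problems += 1
            self.stack.pop()
        else:
            self.problems += 1  # closing tag with no matching opener


def html_problem_counts(raw_message: bytes):
    """Return one problem count per text/html part in the message."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    counts = []
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        checker = TagBalanceChecker()
        checker.feed(part.get_content())
        checker.close()
        # Tags still open at the end of the part count as problems too.
        counts.append(checker.problems + len(checker.stack))
    return counts
```

A go/no-go filter would reject when any count is above zero (or above some threshold), while a scoring plugin could scale its score with the count.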


Re: HTML Validator

2006-03-10 Thread Philip Prindeville
Kenneth Porter wrote:

 Anyone know of a good validator that can be run over a MIME part to report 
 on the quality of the HTML? This might be used as a go/no-go filter at 
 milter level, or it could be used as an SA plugin to assign a variable 
 score based on the quality of the HTML.
 
 For mailing lists catering to newbies who love HTML and can't understand 
 why us old-timers hate it, we can set the list to exclude all invalid HTML. 
 Sure, we'll accept your HTML. But only if it's really HTML. Not that crap 
 that most MUA's write.

Do you mean:

http://validator.w3.org/source/

-Philip




Re: HTML Validator

2006-03-10 Thread Kenneth Porter
On Friday, March 10, 2006 9:43 PM -0700 Philip Prindeville 
[EMAIL PROTECTED] wrote:



Do you mean:

http://validator.w3.org/source/


I thought that was just a web form-based validator. I'll have to look at it 
to see if the validator can be run over an attachment (i.e. an HTML MIME 
part) from a separate mail filter (e.g. MIMEDefang).