> -----Original Message-----
> From: Chris Santerre [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, December 31, 2003 10:13 AM
> To: 'Fred'; Dallas L. Engelken; 
> [EMAIL PROTECTED]
> Subject: RE: [SAtalk] Spell Checking the Subject Header (RESULTS)
> 
> 
> LOL, I wondered when you would chime in Fred :) I think you 
> have the Dictionary memorized by now! Good ideas on limiting 
> the subject matter. But I think the numbers have to be 
> watched closely. Or are you thinking to ignore them and let 
> the OBFU rules get them?
> 
> I would like to see result of that. 
> 
> --Chris
> 

spell checking hurts obfu because splitting a correctly spelled word
with a word boundary produces 2 or more misspelled words...

Subject: looking for xa/nax,

looking: ok
for: ok
xa: not found
nax: not found
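
A minimal sketch of the check above, in Python. The tiny WORDS set is a
hypothetical stand-in for a real dictionary, and splitting on non-letter
characters is an assumption about how the word boundaries are found; it
is not the code actually used for the corpus run.

```python
import re

# Hypothetical stand-in dictionary; a real run would load a full word list.
WORDS = {"looking", "for", "xanax"}

def check_subject(subject, words=WORDS):
    """Split on non-letter characters and look each token up in the dictionary."""
    tokens = re.findall(r"[A-Za-z]+", subject)
    return [(tok, "ok" if tok.lower() in words else "not found") for tok in tokens]

if __name__ == "__main__":
    for token, status in check_subject("looking for xa/nax,"):
        print(f"{token}: {status}")
```

The unbroken word is in the dictionary, but the "/" turns it into two
tokens and neither half is found.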

i'm running subject spell checking through corpus again with the
following rules applied to it.

 1) require at least 3 tokens in subject before doing a spell check.  1
word subjects were killing it.
 2) remove tokens that start with numbers.. this clears up dates, and
version numbers pretty well.
 3) remove tokens that are single characters. (this one is up for debate
still until i rerun corpus without it.)
 4) remove urls.
 5) remove email addresses.
 6) remove mailing list tags [SA-TALK].
 7) remove Re: and Fw:/Fwd: just in case.
 8) remove 3 and 4 letter non-vowel uppercase acronyms... i know some
vowels are in acronyms, but that causes many false matches.
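
The eight rules above could be sketched roughly like this in Python. All
names and regexes here are hypothetical, not the code behind the corpus
run. Assumptions: tokens are split on whitespace, the 3-token minimum is
applied to what is left after filtering, Y counts as a consonant, and a
list tag is a single token wrapped in square brackets.

```python
import re

# Hypothetical patterns, one per rule; real ones may need to be stricter.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)   # rule 4
EMAIL_RE = re.compile(r"^\S+@\S+\.\S+$")                          # rule 5
LIST_TAG_RE = re.compile(r"^\[[^\]]+\]$")                         # rule 6
REPLY_RE = re.compile(r"^(?:re|fw|fwd):$", re.IGNORECASE)         # rule 7
ACRONYM_RE = re.compile(r"^[B-DF-HJ-NP-TV-Z]{3,4}$")              # rule 8

MIN_TOKENS = 3  # rule 1

def filter_subject(subject):
    """Return the tokens worth spell checking, or [] if too few remain."""
    kept = []
    for raw in subject.split():
        if URL_RE.match(raw) or EMAIL_RE.match(raw):
            continue
        if LIST_TAG_RE.match(raw) or REPLY_RE.match(raw):
            continue
        tok = raw.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if not tok or tok[0].isdigit() or len(tok) == 1:  # rules 2 and 3
            continue
        if ACRONYM_RE.match(tok):
            continue
        kept.append(tok)
    # Assumption: the minimum applies after the other rules have run.
    return kept if len(kept) >= MIN_TOKENS else []

if __name__ == "__main__":
    print(filter_subject("Re: [SAtalk] Spell Checking the Subject Header (RESULTS)"))
```

An empty list means "skip the spell check for this subject". Dropping
rule 3 to test it is a matter of removing the len(tok) == 1 condition.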

i'll post some results here in a few...

dallas



_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk
