> -----Original Message-----
> From: Fred [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 30, 2003 5:36 PM
> To: Chris Santerre; Dallas L. Engelken; 
> [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: [SAtalk] Spell Checking the Subject Header (RESULTS)
> 
> 
> 
> Chris Santerre wrote:
> > WOW!!! Nice work!!
> >
> > Thanks for sharing the results!! We can put that whole spellcheck
> > thing to rest now ;)
> >
> > --Chris
> 
> I won't let this die yet, I have a few ideas to play with,
> and more when I get more time to look at some ham subjects 
> which could cause these results...
> 
> 

I added a few checks, based on some easy false matches I can see
happening on ham:

 1) require at least 3 tokens in the subject before doing a spell
    check.  That way the possible results include at least 0, 33, 66
    and 100%, which gives it a little flavor.  1-word subjects were
    killing SUBJ_SPELLING_100.
 2) skip the spell check on the subject when $self->detect_mailing_list
    returns true...  anyone have any objections to that?
 3) remove tokens that start with numbers
 4) remove tokens that are single characters
 5) remove URLs
 6) remove email addresses
 7) remove mailing list tags such as [SA-Talk]
 8) remove Re: and Fw:/Fwd:, just in case
 9) remove 3- and 4-letter uppercase acronyms with no vowels... I know
    some acronyms contain vowels, but matching those causes many false
    matches.
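
For anyone who wants to play with the same filtering outside of SA, here is a rough Python sketch of the checks listed above.  It is not the plugin code (that is Perl and isn't shown in this post); the function names, the regexes and the punctuation stripping are all assumptions made for illustration.

```python
import re

# All names and patterns below are hypothetical; they only approximate
# the checks described in the list above.
URL_RE = re.compile(r'^(?:https?://|www\.)\S+$', re.IGNORECASE)
EMAIL_RE = re.compile(r'^\S+@\S+\.\S+$')
LIST_TAG_RE = re.compile(r'\[[^\]]*\]')                # e.g. [SA-Talk]
REPLY_RE = re.compile(r'^(?:re|fwd?):$', re.IGNORECASE)  # Re:, Fw:, Fwd:
ACRONYM_RE = re.compile(r'^[B-DF-HJ-NP-TV-Z]{3,4}$')   # uppercase, no vowels


def subject_tokens(subject):
    """Return the subject tokens that would be passed to the spell checker."""
    subject = LIST_TAG_RE.sub(' ', subject)   # drop mailing list tags
    kept = []
    for raw in subject.split():
        if URL_RE.match(raw) or EMAIL_RE.match(raw) or REPLY_RE.match(raw):
            continue
        tok = raw.strip('.,;:!?()"\'')        # assumption: trim punctuation
        if len(tok) <= 1:                     # empty or single character
            continue
        if tok[0].isdigit():                  # starts with a number
            continue
        if ACRONYM_RE.match(tok):             # 3-4 letter vowel-less acronym
            continue
        kept.append(tok)
    return kept


def should_spell_check(tokens, is_mailing_list=False):
    """Spell check only non-list mail with at least 3 remaining tokens."""
    return not is_mailing_list and len(tokens) >= 3
```

Here is_mailing_list stands in for whatever $self->detect_mailing_list returns; how that detection works is outside this sketch.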

Here are the new, improved results, so it's looking better.  Next I'm
going to look at pulling the first 512 bytes of text from the body (the
stripped HTML body if the message is multipart or text/html, otherwise
the text/plain part) and including those tokens in the check.  We'll see
where it goes from there.

# Mon Jan 5 09:31:00 CST 2004 -- beginning test of testrule.SPELLING_7.txt:

header SUBJ_SPELLING_01         eval:spell_check_subject('1','10')
describe SUBJ_SPELLING_01       1-9% mis-spelled words in subject

header SUBJ_SPELLING_10         eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10       10-19% mis-spelled words in subject

header SUBJ_SPELLING_20         eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20       20-29% mis-spelled words in subject

header SUBJ_SPELLING_30         eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30       30-39% mis-spelled words in subject

header SUBJ_SPELLING_40         eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40       40-49% mis-spelled words in subject

header SUBJ_SPELLING_50         eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50       50-59% mis-spelled words in subject

header SUBJ_SPELLING_60         eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60       60-69% mis-spelled words in subject

header SUBJ_SPELLING_70         eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70       70-79% mis-spelled words in subject

header SUBJ_SPELLING_80         eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80       80-89% mis-spelled words in subject

header SUBJ_SPELLING_90         eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90       90-99% mis-spelled words in subject

header SUBJ_SPELLING_100        eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100      100% mis-spelled words in subject
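
To make the bucketing concrete, here is a small Python sketch of how a subject's mis-spelled percentage could be mapped onto the rules above.  The real spell_check_subject eval isn't shown in this post, so the integer rounding, the inclusive-low/exclusive-high range test and the special case for ('100','100') are assumptions, and the word set stands in for a real dictionary lookup.

```python
def misspelled_percent(tokens, dictionary):
    """Integer percentage of tokens not found in the dictionary.

    Returns None when there are no tokens. Integer division is an
    assumption; it yields 0, 33, 66, 100 for a 3-token subject.
    """
    if not tokens:
        return None
    bad = sum(1 for t in tokens if t.lower() not in dictionary)
    return 100 * bad // len(tokens)


def in_bucket(percent, low, high):
    """Assumed range test: low <= percent < high, or an exact match
    when low == high (as in the '100','100' rule)."""
    if low == high:
        return percent == low
    return low <= percent < high


# Tiny stand-in dictionary, purely for illustration.
words = {'cheap', 'pills', 'now'}
pct = misspelled_percent(['cheap', 'p1lls', 'now'], words)
print(pct, in_bucket(pct, 30, 40))
```

Under these assumptions a subject with 0% mis-spelled words hits none of the rules, since the lowest range starts at 1.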

############################################################
# SUBJ_SPELLING_01 -- 70s/32h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_01 0.5

############################################################
# SUBJ_SPELLING_10 -- 965s/186h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_10 0.5

############################################################
# SUBJ_SPELLING_20 -- 659s/245h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_20 0.5

############################################################
# SUBJ_SPELLING_30 -- 252s/138h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_30 0.5

############################################################
# SUBJ_SPELLING_40 -- 78s/54h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_40 0.5

############################################################
# SUBJ_SPELLING_50 -- 74s/50h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_50 0.5

############################################################
# SUBJ_SPELLING_60 -- 47s/54h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_60 0.5

############################################################
# SUBJ_SPELLING_70 -- 14s/4h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_70 0.5

############################################################
# SUBJ_SPELLING_80 -- 14s/4h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_80 0.5

############################################################
# SUBJ_SPELLING_90 -- 0s/0h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_90 0.5

############################################################
# SUBJ_SPELLING_100 -- 19s/0h of 10971 corpus, 2004-01-05             #
############################################################
score SUBJ_SPELLING_100 0.5
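
To compare the rules at a glance, here is a quick Python sketch that parses the "NNs/NNh" comment lines above and works out what share of each rule's hits were spam.  The parsing regex and function names are just for illustration.  Note the post doesn't say how the 10971 messages split between spam and ham, so this is only the share of hits, not a proper per-class hit rate.

```python
import re

# Matches comment lines like:
# "# SUBJ_SPELLING_10 -- 965s/186h of 10971 corpus, 2004-01-05"
HITS_RE = re.compile(r'#\s*(\S+)\s+--\s+(\d+)s/(\d+)h of (\d+) corpus')


def parse_hits(line):
    """Return (rule, spam_hits, ham_hits, corpus_size) or None."""
    m = HITS_RE.search(line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))


def spam_fraction(spam_hits, ham_hits):
    """Share of a rule's hits that were spam; None if it never hit."""
    total = spam_hits + ham_hits
    if total == 0:
        return None
    return spam_hits / total


if __name__ == "__main__":
    lines = [
        "# SUBJ_SPELLING_10 -- 965s/186h of 10971 corpus, 2004-01-05",
        "# SUBJ_SPELLING_60 -- 47s/54h of 10971 corpus, 2004-01-05",
        "# SUBJ_SPELLING_90 -- 0s/0h of 10971 corpus, 2004-01-05",
        "# SUBJ_SPELLING_100 -- 19s/0h of 10971 corpus, 2004-01-05",
    ]
    for line in lines:
        rule, s, h, _ = parse_hits(line)
        frac = spam_fraction(s, h)
        shown = "n/a" if frac is None else "%.2f" % frac
        print(rule, shown)
```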






