> -----Original Message-----
> From: Fred [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 30, 2003 5:36 PM
> To: Chris Santerre; Dallas L. Engelken; [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: [SAtalk] Spell Checking the Subject Header (RESULTS)
>
> Chris Santerre wrote:
> > WOW!!! Nice work!!
> >
> > Thanks for sharing the results!! We can put that whole spellcheck
> > thing to rest now ;)
> >
> > --Chris
>
> I won't let this die yet, I have a few ideas to play with,
> and more when I get more time to look at some ham subjects
> which could cause these results...

I added a few checks, based on some easy matches I can see happening on ham:

1) Require at least 3 tokens in the subject before doing a spell check.
   With three tokens the possible results are 0, 33, 66 or 100%, which
   gives it a little flavor. One-word subjects were killing
   SUBJ_SPELLING_100.
2) Skip the spell check on the subject when $self->detect_mailing_list
   returns true. Anyone have any objections to that?
3) Remove tokens that start with numbers.
4) Remove tokens that are single characters.
5) Remove URLs.
6) Remove email addresses.
7) Remove mailing list tags such as [SA-Talk].
8) Remove Re: and Fw:/Fwd:, just in case.
9) Remove 3- and 4-letter non-vowel uppercase acronyms. I know some
   acronyms contain vowels, but removing those too causes many false
   matches.

Here are the new, improved results, so it's looking better. Next I'm
going to look at pulling the first 512 bytes of text from the body (the
stripped HTML body if the message is multipart or text/html, otherwise
the text/plain part) and including those tokens in the check. We'll see
where it goes from there.
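
The token filters and the 3-token minimum listed above could look roughly like the sketch below. This is Python for illustration only, not the actual eval code behind spell_check_subject. The names clean_tokens, misspelled_percent and KNOWN_WORDS are hypothetical, the tiny word set stands in for a real dictionary lookup, the regular expressions are guesses at what each filter matches, Y is treated as a consonant, and the mailing-list skip via detect_mailing_list is not modelled.

```python
import re

# Hypothetical stand-in for a real dictionary lookup (assumption).
KNOWN_WORDS = {"your", "account", "has", "been", "updated",
               "meeting", "notes", "for", "today"}

URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^\S+@\S+\.\S+$")
LIST_TAG_RE = re.compile(r"\[[^\]]*\]")                # e.g. [SA-Talk]
REPLY_RE = re.compile(r"^(?:re|fwd?):$", re.IGNORECASE)
ACRONYM_RE = re.compile(r"^[B-DF-HJ-NP-TV-Z]{3,4}$")   # 3-4 uppercase consonants


def clean_tokens(subject):
    """Return the subject tokens that survive the filters from the list."""
    subject = LIST_TAG_RE.sub(" ", subject)            # drop mailing list tags
    tokens = []
    for tok in subject.split():
        if REPLY_RE.match(tok):                        # Re:, Fw:, Fwd:
            continue
        if URL_RE.match(tok) or EMAIL_RE.match(tok):   # URLs and addresses
            continue
        tok = tok.strip(".,!?;:\"'()")                 # surrounding punctuation
        if len(tok) <= 1:                              # empty or single character
            continue
        if tok[0].isdigit():                           # starts with a number
            continue
        if ACRONYM_RE.match(tok):                      # vowel-free acronym
            continue
        tokens.append(tok)
    return tokens


def misspelled_percent(subject, is_known=None, min_tokens=3):
    """Integer percentage of unknown tokens, or None if too few tokens remain."""
    if is_known is None:
        is_known = lambda word: word.lower() in KNOWN_WORDS
    tokens = clean_tokens(subject)
    if len(tokens) < min_tokens:
        return None                                    # skip the check entirely
    bad = sum(1 for tok in tokens if not is_known(tok))
    return 100 * bad // len(tokens)                    # floors 1/3 to 33, 2/3 to 66
```

Integer division is used so that a three-token subject lands on 0, 33, 66 or 100, matching the figures quoted above. Returning None for short subjects keeps "no check was done" distinct from "0% misspelled".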

# Mon Jan 5 09:31:00 CST 2004 -- beginning test of testrule.SPELLING_7.txt:

header   SUBJ_SPELLING_01   eval:spell_check_subject('1','10')
describe SUBJ_SPELLING_01   1-9% mis-spelled words in subject
header   SUBJ_SPELLING_10   eval:spell_check_subject('10','20')
describe SUBJ_SPELLING_10   10-19% mis-spelled words in subject
header   SUBJ_SPELLING_20   eval:spell_check_subject('20','30')
describe SUBJ_SPELLING_20   20-29% mis-spelled words in subject
header   SUBJ_SPELLING_30   eval:spell_check_subject('30','40')
describe SUBJ_SPELLING_30   30-39% mis-spelled words in subject
header   SUBJ_SPELLING_40   eval:spell_check_subject('40','50')
describe SUBJ_SPELLING_40   40-49% mis-spelled words in subject
header   SUBJ_SPELLING_50   eval:spell_check_subject('50','60')
describe SUBJ_SPELLING_50   50-59% mis-spelled words in subject
header   SUBJ_SPELLING_60   eval:spell_check_subject('60','70')
describe SUBJ_SPELLING_60   60-69% mis-spelled words in subject
header   SUBJ_SPELLING_70   eval:spell_check_subject('70','80')
describe SUBJ_SPELLING_70   70-79% mis-spelled words in subject
header   SUBJ_SPELLING_80   eval:spell_check_subject('80','90')
describe SUBJ_SPELLING_80   80-89% mis-spelled words in subject
header   SUBJ_SPELLING_90   eval:spell_check_subject('90','100')
describe SUBJ_SPELLING_90   90-99% mis-spelled words in subject
header   SUBJ_SPELLING_100  eval:spell_check_subject('100','100')
describe SUBJ_SPELLING_100  100% mis-spelled words in subject

############################################################
# SUBJ_SPELLING_01 -- 70s/32h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_01 0.5

############################################################
# SUBJ_SPELLING_10 -- 965s/186h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_10 0.5

############################################################
# SUBJ_SPELLING_20 -- 659s/245h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_20 0.5

############################################################
# SUBJ_SPELLING_30 -- 252s/138h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_30 0.5

############################################################
# SUBJ_SPELLING_40 -- 78s/54h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_40 0.5

############################################################
# SUBJ_SPELLING_50 -- 74s/50h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_50 0.5

############################################################
# SUBJ_SPELLING_60 -- 47s/54h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_60 0.5

############################################################
# SUBJ_SPELLING_70 -- 14s/4h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_70 0.5

############################################################
# SUBJ_SPELLING_80 -- 14s/4h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_80 0.5

############################################################
# SUBJ_SPELLING_90 -- 0s/0h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_90 0.5

############################################################
# SUBJ_SPELLING_100 -- 19s/0h of 10971 corpus, 2004-01-05 #
############################################################
score SUBJ_SPELLING_100 0.5
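
To compare the rules at a glance, the result comments can be parsed and reduced to the share of each rule's hits that were spam. The sketch below is Python for illustration. It assumes that "s" and "h" in a line like "70s/32h" are the spam and ham hit counts, and the names parse_result and spam_share are hypothetical. The post does not say how the 10971-message corpus splits between spam and ham, so this share is a raw hit ratio, not a normalized one.

```python
import re

# Matches result comments such as:
#   # SUBJ_SPELLING_01 -- 70s/32h of 10971 corpus, 2004-01-05 #
RESULT_RE = re.compile(r"#\s*(\S+)\s+--\s+(\d+)s/(\d+)h of (\d+) corpus")


def parse_result(line):
    """Parse one result comment; return None if the line does not match."""
    m = RESULT_RE.search(line)
    if m is None:
        return None
    spam, ham, corpus = (int(g) for g in m.group(2, 3, 4))
    hits = spam + ham
    return {
        "rule": m.group(1),
        "spam": spam,          # assumed: spam messages hit by the rule
        "ham": ham,            # assumed: ham messages hit by the rule
        "corpus": corpus,
        # Fraction of this rule's hits that were spam; None when it never fired.
        "spam_share": spam / hits if hits else None,
    }


if __name__ == "__main__":
    for text in (
        "# SUBJ_SPELLING_60 -- 47s/54h of 10971 corpus, 2004-01-05 #",
        "# SUBJ_SPELLING_100 -- 19s/0h of 10971 corpus, 2004-01-05 #",
    ):
        print(parse_result(text))
```

A rule that never fires, such as one reported as 0s/0h, gets a spam_share of None instead of a division by zero.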

_______________________________________________
Spamassassin-talk mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/spamassassin-talk