>-----Original Message-----
>From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
>Sent: Monday, January 03, 2005 1:18 PM
>To: Murty Rompalli
>Cc: users@spamassassin.apache.org
>Subject: Re: Rule based on English words 
>
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>
>Hi Murty --
>
>It should be easy enough to write a plugin which
>
>- registers an eval rule function
>- calls $permsgstatus->get_decoded_stripped_body_text_array() in that, to
>  get the array of decoded lines in the message (HTML stripped, MIME
>  decoded etc.)
>- splits those line strings into words and analyzes them.
>
>The results would be interesting, I think.
>
>--j.
>
>Murty Rompalli writes:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>> 
>> Hi
>> Any ideas on how to implement this are appreciated:
>> 
>> Frequency Analysis of English Vocabulary and Grammar: Based on
>> the LOB Corpus by Stig Johansson and Knut Hofland (OUP, 1989, ISBN
>> 0-19-824221-2) gives the top eighteen words and their frequencies
>> as:
>> 
>>       1.  the       68315
>>       2.  of        35716
>>       3.  and       27856
>>       4.  to        26760
>>       5.  a         22744
>>       6.  in        21108
>>       7.  that      11188
>>       8.  is        10978
>>       9.  was       10499
>>      10.  it        10010
>>      11.  for        9299
>>      12.  he         8776
>>      13.  as         7337
>>      14.  with       7197
>>      15.  be         7186
>>      16.  on         7027
>>      17.  I          6696
>>      18.  his        6266
>> 
>> If the body contains an http:, ftp:, or https: link, I want to test it
>> further; otherwise, skip this test. The test is as follows:
>> 
>> Check each paragraph that does not contain any of the above 18 words
>> (paragraphs separated by \n).
>> 
>> 1. For each para without common English words, assign a score.
>> 2. For each para containing words with 0-9, ', " (anywhere), : and ~
>> (middle), assign score based on number of matches
>> 
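
For reference, the test described above could be sketched roughly like
this. It is plain Python rather than an actual SpamAssassin plugin, and the
function names, the regexes, and the default scores are all made up for
illustration. It also assumes step 2 only applies to the paragraphs picked
out by step 1, which the description leaves open.

```python
import re

# The eighteen most frequent words from the list quoted above (lowercased).
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "that", "is", "was",
    "it", "for", "he", "as", "with", "be", "on", "i", "his",
}

# Gate: only test bodies that contain an http:, https: or ftp: link.
URL_RE = re.compile(r"\b(?:https?|ftp):", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z]+")
# A token with a digit or quote anywhere, or with ':' or '~' in the middle.
ODD_TOKEN_RE = re.compile(r"""\S*[0-9'"]\S*|\S+[:~]\S+""")


def paragraphs_without_common_words(body):
    """Return non-empty paragraphs (split on \\n) with none of the 18 words."""
    result = []
    for para in body.split("\n"):
        words = {w.lower() for w in WORD_RE.findall(para)}
        if para.strip() and not (words & COMMON_WORDS):
            result.append(para)
    return result


def count_odd_tokens(para):
    """Count whitespace-separated tokens that match ODD_TOKEN_RE."""
    return sum(1 for tok in para.split() if ODD_TOKEN_RE.fullmatch(tok))


def score_body(body, para_score=1.0, token_score=0.1):
    """Score a body; the two weights are placeholders, not tuned values."""
    if not URL_RE.search(body):
        return 0.0  # no link, so skip the test entirely
    score = 0.0
    for para in paragraphs_without_common_words(body):
        score += para_score
        score += token_score * count_odd_tokens(para)
    return score
```

Note that a bare URL on its own line counts as a paragraph without common
words, and the URL itself counts as an odd token because of the ':' in the
middle. Whether that is wanted is a judgment call.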

SARE took a crack at this earlier. Fred (or possibly Carl) had been working
on code to look for these types of words. IIRC, we had horrible results. We
had such high hopes for it, but found that it FP'd far too much. 

But I'd like to test more. In my heart I still believe it could have worked.
I believe we were looking into the percentage of these "small words"
compared to the overall size of the message. It was an attempt to flag off
the Bayes fodder. 
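
To make the percentage idea concrete, here is a minimal Python sketch. It
is not the SARE code, which I don't have in front of me; the function name
and the tokenizing regex are assumptions, and the word list is just the
eighteen words quoted above.

```python
import re

# Assumed "small word" list: the eighteen words from the quoted message.
SMALL_WORDS = frozenset({
    "the", "of", "and", "to", "a", "in", "that", "is", "was",
    "it", "for", "he", "as", "with", "be", "on", "i", "his",
})


def small_word_ratio(text):
    """Return the fraction of words in text that are in SMALL_WORDS."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty or wordless text
    hits = sum(1 for w in words if w in SMALL_WORDS)
    return hits / len(words)
```

For example, small_word_ratio("The cat sat on the mat") gives 0.5, while a
run of random dictionary words would presumably come out near zero. Where
to put the cutoff is exactly the part that FP'd for us, so no threshold is
suggested here.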

Chris Santerre 
System Admin and SARE/SURBL Ninja
http://www.rulesemporium.com
http://www.surbl.org
'It is not the strongest of the species that survives,
not the most intelligent, but the one most responsive to change.'
Charles Darwin 
