On 03/29/2011 04:57 PM, Martin Gregorie wrote:
> On Wed, 2011-03-30 at 00:58 +0200, mar...@swetech.se wrote:
>> Recently I have been getting a lot of mail with subjects like the one
>> below, each containing a link to some scam/Chinese crap factory.
>>
>> I run the latest SpamAssassin along with amavis, but these mails keep
>> getting through. Any ideas?
>>
>> Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat
> 
> Since the longest (English) word I know has 28 letters
> (antidisestablishmentarianism), a private rule like:
> 
> header VERY_LONG_WORD  Subject =~ /Re:\s+\S{29}/
> 
> should catch that spam.

The multi-lingual dictionary that I use for this kind of purpose has 132
words that are 29+ characters long.  Its longest word has 58 characters:
Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch, the name of
a large village on the Welsh island of Anglesey; see
http://en.wikipedia.org/wiki/Llanfairpwllgwyngyll for more.  Wikipedia
also notes a hill in New Zealand (short name Taumata) with an even
longer name.  The next longest word in my dictionary is
pneumonoultramicroscopicsilicovolcanoconiosis with 45 letters.  German
words, which I would have expected to take the cake, seem to top out at
35 or so letters.
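
To reproduce that kind of count against your own word list, something
like the following Python sketch should do.  It assumes one word per
line; the function name long_words and the three-word sample are made up
for illustration, and the real input would be the merged dictionary file.

```python
def long_words(lines, min_len=29):
    """Return the distinct words of at least min_len characters, longest first."""
    unique = {line.strip() for line in lines if line.strip()}
    hits = [w for w in unique if len(w) >= min_len]
    # Sort by descending length, then alphabetically for a stable order.
    return sorted(hits, key=lambda w: (-len(w), w))


# Illustrative sample only; the counts quoted above come from the full list.
sample = [
    "antidisestablishmentarianism",
    "pneumonoultramicroscopicsilicovolcanoconiosis",
    "Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch",
]

for word in long_words(sample):
    print(len(word), word)
```

Against a merged list you would pass the open file instead, for example
long_words(open("all", errors="replace")); the errors="replace" part is
only a guess at how to cope with word lists in mixed encodings.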

Maybe try this instead:

header VERY_LONG_WORD  Subject =~ /Re:\s+\w(?![a-z]{40})[A-Za-z]{40}/
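
The lookahead is what keeps real words out: once "Re:" and the first
character have matched, (?![a-z]{40}) refuses a run of 40 lowercase
letters, so an ordinary long word fails while a 40-letter run with
capitals mixed in still hits.  A quick way to check this outside
SpamAssassin is a Python sketch like the one below.  It assumes Python's
re module treats these particular patterns the same way Perl does (they
use nothing exotic), and the variable names are made up.

```python
import re

# The two rule bodies from this thread, written as Python regexes.
OLD_RULE = re.compile(r"Re:\s+\S{29}")
NEW_RULE = re.compile(r"Re:\s+\w(?![a-z]{40})[A-Za-z]{40}")

SPAM = "Re: YouWillNotBelieveYourPennisCanBbeThhatHardAndThick!GiveYouserlfATreat"
WELSH = "Re: Llanfairpwllgwyngyllgogerychwyrndrobwllllantysiliogogogoch"
LUNG = "Re: pneumonoultramicroscopicsilicovolcanoconiosis"

for subject in (SPAM, WELSH, LUNG):
    old_hit = OLD_RULE.search(subject) is not None
    new_hit = NEW_RULE.search(subject) is not None
    print("old:", old_hit, "new:", new_hit, subject)
```

Note that the new rule also lets a long all-lowercase run through and
fires on a long all-caps one; whether that matters depends on your mail.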


If anybody is interested in the dictionary I use, this should be enough
to replicate it:

$ ls -lGg |sed 's/^.* 1 //; s/ ... .. ..... / /'
total 18M
 17M all
  32 american-english -> /usr/share/dict/american-english
  37 american-english-huge -> /usr/share/dict/american-english-huge
  39 american-english-insane -> /usr/share/dict/american-english-insane
 86K beale.wordlist.asc
  25 brazilian -> /usr/share/dict/brazilian
  36 british-english-huge -> /usr/share/dict/british-english-huge
  37 canadian-english-huge -> /usr/share/dict/canadian-english-huge
 86K diceware.wordlist.asc
1.6K expurgated
  22 french -> /usr/share/dict/french
  23 italian -> /usr/share/dict/italian
 135 make-all
  23 ngerman -> /usr/share/dict/ngerman
  23 ogerman -> /usr/share/dict/ogerman
  23 spanish -> /usr/share/dict/spanish
1.7M twl06.txt
  21 words -> /usr/share/dict/words
$ cat make-all
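#!/bin/sh

# Merge the plain word lists, skipping the output file, this script, and
# the diceware lists (those are handled by sed below).
( cat `ls |grep -Ev '^(all|make-all)$|[.]wordlist[.]asc$'`
  # Diceware lines look like "11111 word": keep those, strip the number.
  sed -r '/^[0-9]{5}\s+/!d; s///; /\w/!d' *.wordlist.asc
) |sort -f |uniq -i >all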


Expurgated and twl06.txt are Scrabble dictionaries that you'll have to
track down yourself.  The .wordlist.asc files are Diceware word lists.
Everything else came from a Debian package.  If you're not a word nut
like me, all you really need is the largest of each of the languages,
plus perhaps the standard English dictionary so you can determine if
something is an edge case.

This made it really easy for me to verify the cialis-in-word problem we
had here earlier; `grep -ci cialis all` currently counts 287 words.
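
For anyone who missed that thread: the trouble is that an unanchored
pattern matches the drug name inside ordinary words.  Here is a minimal
Python sketch of the effect, using a tiny made-up word list rather than
the real dictionary.  The helper name count_matches is hypothetical, and
\b is just one common way to anchor such a pattern, not necessarily what
anybody's actual rule uses.

```python
import re


def count_matches(pattern, words):
    """Count the words in which the case-insensitive pattern is found."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sum(1 for w in words if rx.search(w))


# Made-up sample; the real check would run over the merged word list.
words = ["Cialis", "specialist", "socialism", "officialism", "special", "word"]

print("unanchored:", count_matches(r"cialis", words))
print("anchored:  ", count_matches(r"\bcialis\b", words))
```

The unanchored count is the same idea as the grep above; the anchored
one shows how much of it disappears once word boundaries are required.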
