Re: filtering on language

Alastair Scott Tue, 20 Nov 2001 07:15:17 -0800


----- Original Message -----
From: "Thomas F" <[EMAIL PROTECTED]>
To: "Alastair Scott on TBUDL" <[EMAIL PROTECTED]>
Sent: 20 November 2001 2:49 pm
Subject: Re: filtering on language



> Hello Alastair,
>
> On Tue, 20 Nov 2001 14:29:05 -0000 GMT (20/11/2001, 22:29 +0800 GMT),
> Alastair Scott wrote:
>
> AS> There may be more clever statistical methods - the above is Turkish, and
> AS> it's pretty obvious the relative frequency of various letters (eg "z" and
> AS> "i") is entirely different from that of English -
>
> This would be difficult to implement in a TB filter. But I just had an
> idea:
>
> You can actually filter for certain words that are likely to occur in
> most Turkish-language spams, such as siteler (web sites), for example.
> You can also use other simple words from the Turkish language. Skip
> the scoring mechanism: if any one of those five or ten words is found,
> it's a hit. That gives you your own, very simple, language parser in
> the form of a TB filter.

That would work - translations of "sex" and "money" would probably catch 95
per cent of spam ;)
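
As a rough illustration of the "any one word is a hit" idea, here is a
minimal sketch in Python rather than as an actual TB filter. Only "siteler"
comes from this thread; the other words and the names TURKISH_WORDS and
looks_turkish are assumptions made up for the example.

```python
import re

# Hypothetical word list: "siteler" is from the thread, the rest are
# illustrative guesses at words common in Turkish-language spam.
TURKISH_WORDS = {"siteler", "para", "seks", "için", "ücretsiz"}

def looks_turkish(text, words=TURKISH_WORDS):
    """Return True if any listed word occurs as a whole word (no scoring)."""
    # Split into lowercase word tokens; \w matches Unicode letters in Python 3.
    tokens = set(re.findall(r"\w+", text.lower()))
    # A single shared word counts as a hit.
    return not tokens.isdisjoint(words)
```

Matching whole words rather than substrings keeps a short entry such as
"para" from firing on "paragraph" or "separate".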

Frequency analysis is actually quite subtle, though: German and Polish, for
example, also have lots of "z"s, so one letter on its own won't single out
Turkish. The mass of rules needed to tell one language from another would
probably be just as slow as the dictionary lookup.
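
For what it's worth, the letter-frequency comparison could be sketched
roughly as below, again in Python and not as something TB can do. The
function names are made up for the example, and the reference profile for
each language is assumed to be built separately from a sample of known
text; no real frequency figures are given here.

```python
from collections import Counter

def letter_frequencies(text):
    """Relative frequency of each alphabetic character in text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: n / total for ch, n in counts.items()}

def profile_distance(freq_a, freq_b):
    """Sum of absolute differences between two frequency profiles."""
    # Letters missing from one profile count as frequency 0 there.
    keys = set(freq_a) | set(freq_b)
    return sum(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)
```

A message would be assigned to whichever language's profile is closest.
Comparing whole profiles instead of just "z" is what might separate
Turkish from German or Polish, though short messages give noisy counts.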

Alastair





