Probably should also replace the obvious numeric and special characters like
zer0, thr33, f|ve, $even, etc. while you are at it.
Loren
I have to wonder if it is worth the processor time though. Might be faster to
simply build a thesaurus of creative misspellings and analyze the sentence that
results from the substitutions. I expect that is probably essentially what the
Bayes stuff does.
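
For what it's worth, here is a rough sketch of the substitution idea in
Python. It is not anything SpamAssassin ships; the LEET_MAP table and the
normalize_leet name are made up for illustration, and the table is nowhere
near complete. Tokens with no letters in them (plain numbers, prices) are
left alone so that "2004" does not turn into "2oo4".

```python
import re

# Hypothetical look-alike table; entries are illustrative, not exhaustive.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "|": "i",
    "$": "s",
    "@": "a",
})


def normalize_leet(text):
    """Replace look-alike digits and symbols inside words with letters."""

    def fix(match):
        token = match.group(0)
        # Skip tokens that contain no letters at all (e.g. "2004", "$5").
        if not any(ch.isalpha() for ch in token):
            return token
        return token.translate(LEET_MAP)

    # Process the text one whitespace-delimited token at a time.
    return re.sub(r"\S+", fix, text)


print(normalize_leet("zer0 thr33 f|ve $even in 2004"))
```

The normalized text could then be run through the usual rules. Whether the
extra pass is worth the processor time is still an open question.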
-----Original Message-----
From: Michal Szymanski <[EMAIL PROTECTED]>
Sent: Feb 6, 2004 6:44 AM
To: Robert Menschel <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
Subject: Re: dealing with subjects forged with accented letters
On Thu, Feb 05, 2004 at 11:23:02PM -0800, Robert Menschel wrote:
>
> I use the following (we get foreign email, but since we only understand
> English, we expect all subject headings to be in English):
>
> header RM_sl_ForeignChar Subject =~ /\w[����]\w/
> ...
Hi Robert,
unfortunately, a solution that simple is not for me. We get emails in
Polish and occasionally also in Spanish or German (not to mention
English, of course, but these are no problem :) so we cannot just
spam-them-all. What we need is to filter Subject lines (changing
all "����" to "aeou") and *then* apply SA rules to them.
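
A minimal sketch of that folding step, written in Python rather than as an
SA rule, and assuming the Subject has already been decoded to a Unicode
string. The fold_accents name is hypothetical. It decomposes each character
and drops the combining marks, so it only handles letters that Unicode
treats as base letter plus accent; the Polish "ł", for one, has no such
decomposition and passes through unchanged.

```python
import unicodedata


def fold_accents(text):
    """Strip combining accents, leaving the base letters."""
    # NFKD splits e.g. "á" into "a" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("áéóú"))
```

Letters like "ł" would need an explicit mapping table on top of this.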
Michal.
--
Michal Szymanski ([EMAIL PROTECTED])
Warsaw University Observatory, Warszawa, POLAND