On Fri, 2007-02-16 at 15:35 +0000, Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again.  So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> > 
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?

Another thing that might be worth adding to GSoC 2007.

Internal Encoding/Charset used by SA.

I haven't found anything like that, but that doesn't mean SA does not do
this already.  If it does, sorry :)

Mail messages can have multiple encodings like ISO-8859-*, utf-8,
utf-16, windows-*, and so on.

Also, Perl (unless "use utf8" is set) will not assume UTF-8; strings are
treated as bytes in whatever encoding the system uses (e.g. per LC_CTYPE).

Rule writers need a way to tell SA which encoding their rules are in.

This is not a real issue for English rules, but it is for other languages,
like Portuguese, French, Russian, Chinese, Japanese and so on.

The real problem is that a string with special characters in one
encoding is not the same sequence of bytes in another encoding.
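
A minimal sketch of that point, in Python purely for illustration (SA
itself is Perl).  The word and the "rule" are made-up examples, not
real SA rules; the rule is assumed to have been saved as ISO-8859-1.

```python
import re

word = "promoção"  # example word with non-ASCII characters

latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")

# Same text, different byte sequences.
print(latin1_bytes)  # b'promo\xe7\xe3o'
print(utf8_bytes)    # b'promo\xc3\xa7\xc3\xa3o'

# Hypothetical byte-level rule, written in an editor that saved ISO-8859-1.
rule = re.compile(re.escape(latin1_bytes))

print(bool(rule.search(b"Grande " + latin1_bytes)))  # True
print(bool(rule.search(b"Grande " + utf8_bytes)))    # False
```

The same visible word in a UTF-8 message slips past the rule.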

So, what is needed is:
1 - a way to tell SA the encoding/charset used in some rules
2 - SA converts the rules to a universal encoding internally
    (e.g. utf-8/16).
3 - Temporarily reconvert to the message encoding/charset to match properly.
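
The three steps could look roughly like this.  It is only a sketch in
Python, not how SA works today: Rule, load_rule and rule_hits are
hypothetical names, and it handles literal phrases only, since
re-encoding a full regex into the message charset is harder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    phrase: str  # step 2: held internally as Unicode text


def load_rule(name: str, raw_phrase: bytes, charset: str) -> Rule:
    # Step 1: the rule author states the charset the rule was written in.
    return Rule(name, raw_phrase.decode(charset))


def rule_hits(rule: Rule, body: bytes, msg_charset: str) -> bool:
    # Step 3: temporarily convert the phrase to the message's charset
    # and compare raw bytes.
    try:
        needle = rule.phrase.encode(msg_charset)
    except UnicodeEncodeError:
        # The phrase cannot be written in this charset at all,
        # so it cannot occur in the body.
        return False
    return needle in body


rule = load_rule("PT_PROMO", b"promo\xe7\xe3o", "iso-8859-1")

print(rule_hits(rule, "Grande promoção".encode("utf-8"), "utf-8"))            # True
print(rule_hits(rule, "Grande promoção".encode("iso-8859-1"), "iso-8859-1"))  # True
print(rule_hits(rule, "Большая скидка".encode("koi8-r"), "koi8-r"))           # False
```

Charsets with a byte-order mark (e.g. utf-16) would need extra care
here.  Going the other way, decoding the message into the internal
encoding and matching there, is also possible; which fits SA better is
an open question.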

I really don't know if SA does something like this internally, but I
think it does not.
Doing this will require a considerable amount of work (so, GSoC 2007).

Without this kind of support, I expect it will become easier for
spammers to play with charsets to evade specific rules.

-Raul Dias
