Daniel Quinlan wrote:
Just to play devil's advocate, I have one other question: would it be
cheaper and safer to simply run tests for certain languages using
multiple character sets?
I'm interested in more than just tests. I want the rendered data so I
can do Bayes-like things with it.
I've seen Japanese spam with GB2312 encoded-words in the headers. So
for Japanese, you'd need a test for each of five character sets:
iso-2022-jp, euc, shift-jis, utf-8, and gb2312. Spammers would still
have more than five other Chinese and Korean character sets to choose
from to hide Japanese spam from those tests.
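
To make that concrete, here's a quick Python sketch (not SpamAssassin
code; the phrase is just an example) of how the same Japanese text turns
into completely different byte sequences under each character set, which
is why a raw byte-matching rule needs one variant per encoding:

    # the "unsolicited advertisement" tag often seen in Japanese spam subjects
    text = "未承諾広告"
    for charset in ("iso-2022-jp", "euc_jp", "shift_jis", "utf-8", "gb2312"):
        try:
            print(charset, text.encode(charset).hex())
        except UnicodeEncodeError:
            # some of these kanji have no GB2312 code point at all, which is
            # part of why mixing character sets defeats per-charset rules
            print(charset, "(cannot represent all of the characters)")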
iso-2022-jp could have obscuring escape sequences placed between any two
characters. Writing a test to match against encoded iso-2022-jp would
be sort of like trying to write a test against encoded
quoted-printable. Then you have potential problems with the test firing
incorrectly because it is missing important context (like which
character set has been selected by the last escape sequence).
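
Here's a rough Python illustration of that trick (a sketch only;
re-encoding each character separately is just one way to inject the
redundant escape sequences): the decoded text is identical, but the raw
bytes no longer match any fixed pattern.

    plain = "未承諾広告".encode("iso-2022-jp")
    # redundant "shift to ASCII, shift back to JIS" escape pairs end up
    # between every pair of characters when each one is encoded separately
    obscured = b"".join(ch.encode("iso-2022-jp") for ch in "未承諾広告")

    print(plain == obscured)                 # False: the bytes differ
    print(plain.decode("iso-2022-jp") ==
          obscured.decode("iso-2022-jp"))    # True: the rendered text is the same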
Safer: what if you guess wrong? What if the character set is hard to
determine correctly (intentionally mixed up, binary inserted,
half-and-half, jumbled character sets, etc.)?
Then you have to update the code. This is no different from handling
MIME multiparts.
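
For what it's worth, the "render first, test later" approach I have in
mind looks roughly like this in Python (a sketch only: the render()
helper, the candidate list, and the fallback order are all made up here,
not SpamAssassin code). A wrong guess in the fallback order is still
possible, and that is exactly the sort of thing you fix by updating the
code when it bites you:

    def render(raw, declared=None):
        # try the declared charset first, then fall back through likely
        # candidates; order matters, since e.g. shift_jis can "succeed"
        # on some utf-8 bytes and produce mojibake (the guess-wrong case)
        candidates = ([declared] if declared else []) + [
            "iso-2022-jp", "euc_jp", "shift_jis", "utf-8", "gb2312"]
        for charset in candidates:
            try:
                return raw.decode(charset)
            except (LookupError, UnicodeDecodeError):
                continue
        # jumbled or half-and-half data: keep what we can rather than die
        return raw.decode("utf-8", errors="replace")

    # mislabeled message: declared iso-2022-jp, actually euc-jp
    print(render("未承諾広告".encode("euc_jp"), declared="iso-2022-jp"))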