Daniel Quinlan wrote:
Just to play devil's advocate, I have one other question: would it be
cheaper and safer to simply run tests for certain languages using
multiple character sets?
I'm interested in more than just tests. I want the rendered data so I
can do Bayes-like things with it.
I've seen Japanese spam with GB2312 encoded-words in the headers. So
for Japanese, you'd need a test for each of five character sets:
iso-2022-jp, euc, shift-jis, utf-8, and gb2312. Spammers would still
have more than five other Chinese and Korean character sets to choose
from to hide Japanese spam from those tests.
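
To make that concrete, here's a quick Python sketch (not SpamAssassin
code; the phrase is just an example) of how the same Japanese text turns
into completely different byte sequences under each character set, which
is why a raw byte-matching rule needs one variant per encoding:

    # the "unsolicited advertisement" tag often seen in Japanese spam subjects
    text = "未承諾広告"
    for charset in ("iso-2022-jp", "euc_jp", "shift_jis", "utf-8", "gb2312"):
        try:
            print(charset, text.encode(charset).hex())
        except UnicodeEncodeError:
            # some of these kanji have no GB2312 code point at all, which is
            # part of why mixing character sets defeats per-charset rules
            print(charset, "(cannot represent all of the characters)")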
iso-2022-jp could have obscuring escape sequences placed between any two
characters. Writing a test to match against encoded iso-2022-jp would
be sort of like trying to write a test against encoded
quoted-printable. Then you have potential problems with the test firing
incorrectly because it is missing important context (like which
character set has been selected by the last escape sequence).
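
Here's a rough Python illustration of that trick (a sketch only;
re-encoding each character separately is just one way to inject the
redundant escape sequences): the decoded text is identical, but the raw
bytes no longer match any fixed pattern.

    plain = "未承諾広告".encode("iso-2022-jp")
    # redundant "shift to ASCII, shift back to JIS" escape pairs end up
    # between every pair of characters when each one is encoded separately
    obscured = b"".join(ch.encode("iso-2022-jp") for ch in "未承諾広告")

    print(plain == obscured)                 # False: the bytes differ
    print(plain.decode("iso-2022-jp") ==
          obscured.decode("iso-2022-jp"))    # True: the rendered text is the same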
Safer: what if you guess wrong? What if the character set is hard to
determine correctly (intentionally mixed up, binary inserted,
half-and-half, jumbled character sets, etc.)?
Then you have to update the code. This is no different from handling
MIME multiparts.
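
For what it's worth, the "render first, test later" approach I have in
mind looks roughly like this in Python (a sketch only: the render()
helper, the candidate list, and the fallback order are all made up here,
not SpamAssassin code). A wrong guess in the fallback order is still
possible, and that is exactly the sort of thing you fix by updating the
code when it bites you:

    def render(raw, declared=None):
        # try the declared charset first, then fall back through likely
        # candidates; order matters, since e.g. shift_jis can "succeed"
        # on some utf-8 bytes and produce mojibake (the guess-wrong case)
        candidates = ([declared] if declared else []) + [
            "iso-2022-jp", "euc_jp", "shift_jis", "utf-8", "gb2312"]
        for charset in candidates:
            try:
                return raw.decode(charset)
            except (LookupError, UnicodeDecodeError):
                continue
        # jumbled or half-and-half data: keep what we can rather than die
        return raw.decode("utf-8", errors="replace")

    # mislabeled message: declared iso-2022-jp, actually euc-jp
    print(render("未承諾広告".encode("euc_jp"), declared="iso-2022-jp"))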