On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:


The problem with slavishly following the charset parameter is that it is often incorrect. However, the charset parameter is a signal into the character detection module, so if the charset is correctly supplied with the message, then the results of the detection will be weighted in that direction.

The weighting factor / mechanism may be something to look at for possible improvement.
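For reference, here is a minimal sketch of how a declared charset usually enters the detection as a hint rather than an override, assuming a detector along the lines of ICU4J's CharsetDetector (the class and method names below are ICU4J's; whether the module you describe works exactly this way is an assumption on my part):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DeclaredCharsetHint {
    public static void main(String[] args) throws Exception {
        byte[] body = "Gr\u00fc\u00dfe aus M\u00fcnchen".getBytes("windows-1252");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(body);
        // The charset parameter from the message header is passed in as a
        // hint; it biases the confidence scores rather than overriding them.
        detector.setDeclaredEncoding("windows-1252");

        for (CharsetMatch match : detector.detectAll()) {
            System.out.printf("%-16s confidence=%d%n",
                    match.getName(), match.getConfidence());
        }
    }
}
```

The open question is how strong that bias should be.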

Doug raised an interesting argument, namely that some values of the charset parameter might have a higher probability of being correct than others.

If something is tagged Latin-1 or Windows-1252, the chances are that this is merely an unexamined default setting. Most of the other 8859 values should be much less likely to be such "blind" defaults.

I wonder whether the probability of successful charset assignment would increase if you were to give these more "specific" charset values a higher weight.
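To make the idea concrete, here is a sketch of such differentiated weighting. The bonus values, the adjustConfidence helper, and the candidate/declared interface are all hypothetical, purely for illustration; the point is only that a "blind default" label like Latin-1 or Windows-1252 would buy a smaller bonus than a more deliberately chosen label:

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical prior weighting for declared charsets (illustrative only). */
public class DeclaredCharsetPrior {

    // "Blind default" labels get a small bonus; more specific labels a larger one.
    private static final Map<String, Integer> DECLARED_BONUS = Map.of(
            "iso-8859-1",   5,   // commonly an unexamined default
            "windows-1252", 5,   // likewise
            "iso-8859-2",  15,   // Central European: someone probably chose it
            "iso-8859-5",  15,   // Cyrillic
            "iso-8859-7",  15,   // Greek
            "koi8-r",      15,
            "utf-8",       10);

    /** Adds a declared-charset bonus to a raw detection confidence (0..100). */
    public static int adjustConfidence(String candidate, String declared, int rawConfidence) {
        if (declared == null) {
            return rawConfidence;
        }
        String d = declared.toLowerCase(Locale.ROOT);
        if (!candidate.equalsIgnoreCase(d)) {
            return rawConfidence;          // bonus applies only to the declared charset itself
        }
        int bonus = DECLARED_BONUS.getOrDefault(d, 10);
        return Math.min(100, rawConfidence + bonus);
    }

    public static void main(String[] args) {
        System.out.println(adjustConfidence("ISO-8859-7", "iso-8859-7", 40)); // 55
        System.out.println(adjustConfidence("ISO-8859-1", "iso-8859-1", 40)); // 45
    }
}
```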

When I played with simple recognition algorithms about 15 years ago, I found that some simple methods for crude language detection gave signatures that would allow charset detection. Even though these methods weren't sophisticated enough to resolve actual languages (especially among closely related ones), they were good enough to narrow things down to the point where one could pick or confirm charsets.
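By way of illustration, a toy version of such a crude signature: counting a handful of common ASCII trigrams per language. The trigram lists below are ad hoc rather than tuned profiles, and a real detector would use proper n-gram statistics, but even this level of crudeness can narrow the language candidates:

```java
import java.util.List;
import java.util.Map;

/** Toy language "signature" built from a few common ASCII trigrams (illustrative). */
public class CrudeLanguageHint {

    private static final Map<String, List<String>> TRIGRAMS = Map.of(
            "de", List.of("der", "ein", "ich", "sch", "und", "die", "cht"),
            "fr", List.of("les", "ent", "que", "des", "ion", "eur", "ait"),
            "en", List.of("the", "and", "ing", "ion", "ent", "her", "tha"));

    /** Returns the language whose trigrams occur most often in the ASCII text. */
    public static String guessLanguage(String asciiText) {
        String text = asciiText.toLowerCase();
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, List<String>> e : TRIGRAMS.entrySet()) {
            int score = 0;
            for (String tri : e.getValue()) {
                for (int i = text.indexOf(tri); i >= 0; i = text.indexOf(tri, i + 1)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage(
                "Dieser Text kommt ohne Umlaute aus und sieht trotzdem deutsch aus."));
    }
}
```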

For example, significant stretches of German can be written without diacritics and can fool charset detection unless it picks up on the statistical patterns for German. With that in hand, the first non-ASCII character encountered is then likely to "nail" the charset. Or, absent such a character, the statistics can be used to confirm that an existing charset assignment is plausible. (8859-15, having been deliberately designed to be "undetectable", is the exception, unless there's a Euro sign in the scanned part of the document...)
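Here is a sketch of that "nail it" step, assuming the crude statistics already point to German: decode the first high byte under each candidate charset and keep those that yield a plausible German character. The candidate list and repertoire string are illustrative. Note that ISO-8859-1, ISO-8859-15, and windows-1252 agree on all the German letters, which is exactly the undetectability mentioned above; only a Euro sign (0xA4 in 8859-15) would separate them:

```java
import java.nio.charset.Charset;
import java.util.List;

/** Resolve a charset from the first non-ASCII byte, given a German language guess. */
public class FirstHighByteCheck {

    private static final String GERMAN_REPERTOIRE = "äöüÄÖÜß€";

    private static final List<String> CANDIDATES = List.of(
            "ISO-8859-1", "ISO-8859-15", "windows-1252", "ISO-8859-5", "ISO-8859-7");

    public static void check(byte[] data) {
        for (byte b : data) {
            if ((b & 0xFF) < 0x80) {
                continue;                 // plain ASCII carries no charset information
            }
            for (String name : CANDIDATES) {
                String decoded = new String(new byte[] { b }, Charset.forName(name));
                boolean plausible = GERMAN_REPERTOIRE.indexOf(decoded.charAt(0)) >= 0;
                System.out.printf("0x%02X as %-12s -> %s %s%n",
                        b & 0xFF, name, decoded, plausible ? "(plausible German)" : "");
            }
            return;                       // only the first non-ASCII byte matters here
        }
        System.out.println("No non-ASCII bytes: statistics alone must carry the decision.");
    }

    public static void main(String[] args) throws Exception {
        check("Schöne Grüße".getBytes("ISO-8859-1"));
    }
}
```

For the byte 0xF6 the Western candidates all decode to "ö", while the Cyrillic and Greek candidates are ruled out, which is the kind of narrowing the language statistics buy you.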

A./
