On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:


The problem with slavishly following the charset parameter is that it is often incorrect. However, the charset parameter is a signal into the character detection module, so if the charset is correctly supplied with the message, then the results of the detection will be weighted in that direction.

The weighting factor / mechanism may be something to look at for possible improvement.
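For reference, here is a minimal sketch of how a declared charset usually enters the detection as a hint rather than an override, assuming a detector along the lines of ICU4J's CharsetDetector (the class and method names below are ICU4J's; whether the module you describe works exactly this way is an assumption on my part):

```java
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DeclaredCharsetHint {
    public static void main(String[] args) throws Exception {
        byte[] body = "Gr\u00fc\u00dfe aus M\u00fcnchen".getBytes("windows-1252");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(body);
        // The charset parameter from the message header is passed in as a
        // hint; it biases the confidence scores rather than overriding them.
        detector.setDeclaredEncoding("windows-1252");

        for (CharsetMatch match : detector.detectAll()) {
            System.out.printf("%-16s confidence=%d%n",
                    match.getName(), match.getConfidence());
        }
    }
}
```

The open question is how strong that bias should be.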

Doug raised an interesting argument, namely that some values of the charset parameter might have a higher probability of being correct than others.

If something is tagged Latin-1 or Windows-1252, the chances are that this is merely an unexamined default setting. Most of the other 8859 values should be much less likely to be such "blind" defaults.

I wonder whether the probability of successful charset assignment would increase if you were to give these more "specific" charset values a higher weight.
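To make the idea concrete, here is a sketch of such differentiated weighting. The bonus values, the adjustConfidence helper, and the candidate/declared interface are all hypothetical, purely for illustration; the point is only that a "blind default" label like Latin-1 or Windows-1252 would buy a smaller bonus than a more deliberately chosen label:

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical prior weighting for declared charsets (illustrative only). */
public class DeclaredCharsetPrior {

    // "Blind default" labels get a small bonus; more specific labels a larger one.
    private static final Map<String, Integer> DECLARED_BONUS = Map.of(
            "iso-8859-1",   5,   // commonly an unexamined default
            "windows-1252", 5,   // likewise
            "iso-8859-2",  15,   // Central European: someone probably chose it
            "iso-8859-5",  15,   // Cyrillic
            "iso-8859-7",  15,   // Greek
            "koi8-r",      15,
            "utf-8",       10);

    /** Adds a declared-charset bonus to a raw detection confidence (0..100). */
    public static int adjustConfidence(String candidate, String declared, int rawConfidence) {
        if (declared == null) {
            return rawConfidence;
        }
        String d = declared.toLowerCase(Locale.ROOT);
        if (!candidate.equalsIgnoreCase(d)) {
            return rawConfidence;          // bonus applies only to the declared charset itself
        }
        int bonus = DECLARED_BONUS.getOrDefault(d, 10);
        return Math.min(100, rawConfidence + bonus);
    }

    public static void main(String[] args) {
        System.out.println(adjustConfidence("ISO-8859-7", "iso-8859-7", 40)); // 55
        System.out.println(adjustConfidence("ISO-8859-1", "iso-8859-1", 40)); // 45
    }
}
```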

When I played with simple recognition algorithms about 15 years ago, I found that some simple methods for crude language detection gave signatures that would allow charset detection. Even though these methods weren't sophisticated enough to resolve actual languages (especially among closely related ones), they were good enough to narrow things down to the point where one could pick or confirm charsets.
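By way of illustration, a toy version of such a crude signature: counting a handful of common ASCII trigrams per language. The trigram lists below are ad hoc rather than tuned profiles, and a real detector would use proper n-gram statistics, but even this level of crudeness can narrow the language candidates:

```java
import java.util.List;
import java.util.Map;

/** Toy language "signature" built from a few common ASCII trigrams (illustrative). */
public class CrudeLanguageHint {

    private static final Map<String, List<String>> TRIGRAMS = Map.of(
            "de", List.of("der", "ein", "ich", "sch", "und", "die", "cht"),
            "fr", List.of("les", "ent", "que", "des", "ion", "eur", "ait"),
            "en", List.of("the", "and", "ing", "ion", "ent", "her", "tha"));

    /** Returns the language whose trigrams occur most often in the ASCII text. */
    public static String guessLanguage(String asciiText) {
        String text = asciiText.toLowerCase();
        String best = null;
        int bestScore = 0;
        for (Map.Entry<String, List<String>> e : TRIGRAMS.entrySet()) {
            int score = 0;
            for (String tri : e.getValue()) {
                for (int i = text.indexOf(tri); i >= 0; i = text.indexOf(tri, i + 1)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage(
                "Dieser Text kommt ohne Umlaute aus und sieht trotzdem deutsch aus."));
    }
}
```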

For example, significant stretches of German can be written without diacritics and can fool charset detection unless it picks up on the statistical patterns for German. With that in hand, the first non-ASCII character encountered is then likely to "nail" the charset. Or, absent such a character, the statistics can be used to confirm that an existing charset assignment is plausible. (8859-15, having been deliberately designed to be "undetectable", is the exception, unless there's a Euro sign in the scanned part of the document...)
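Here is a sketch of that "nail it" step, assuming the crude statistics already point to German: decode the first high byte under each candidate charset and keep those that yield a plausible German character. The candidate list and repertoire string are illustrative. Note that ISO-8859-1, ISO-8859-15, and windows-1252 agree on all the German letters, which is exactly the undetectability mentioned above; only a Euro sign (0xA4 in 8859-15) would separate them:

```java
import java.nio.charset.Charset;
import java.util.List;

/** Resolve a charset from the first non-ASCII byte, given a German language guess. */
public class FirstHighByteCheck {

    private static final String GERMAN_REPERTOIRE = "äöüÄÖÜß€";

    private static final List<String> CANDIDATES = List.of(
            "ISO-8859-1", "ISO-8859-15", "windows-1252", "ISO-8859-5", "ISO-8859-7");

    public static void check(byte[] data) {
        for (byte b : data) {
            if ((b & 0xFF) < 0x80) {
                continue;                 // plain ASCII carries no charset information
            }
            for (String name : CANDIDATES) {
                String decoded = new String(new byte[] { b }, Charset.forName(name));
                boolean plausible = GERMAN_REPERTOIRE.indexOf(decoded.charAt(0)) >= 0;
                System.out.printf("0x%02X as %-12s -> %s %s%n",
                        b & 0xFF, name, decoded, plausible ? "(plausible German)" : "");
            }
            return;                       // only the first non-ASCII byte matters here
        }
        System.out.println("No non-ASCII bytes: statistics alone must carry the decision.");
    }

    public static void main(String[] args) throws Exception {
        check("Schöne Grüße".getBytes("ISO-8859-1"));
    }
}
```

For the byte 0xF6 the Western candidates all decode to "ö", while the Cyrillic and Greek candidates are ruled out, which is the kind of narrowing the language statistics buy you.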

A./
