On Tue, 2 Jun 2009 09:24:01 -0500, chasd wrote:
> On Jun 1, 2009, at 8:36 PM, till wrote:
>
>> And what's the performance trade off to always converting?
>
>
> It isn't the processing to do the encoding conversion, it is that
> each message has a regex search to see how it should be converted.
> The way I read that code, even if the message is UTF-8, the regex
> will be done to determine the validity of the if statement.
>
>> if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
>> $from = "WINDOWS-1252";

Like most other languages, PHP won't evaluate the second sub-expression in
an "&&" expression if the first evaluates to false. My proposed order was
intentional based on that fact.
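
To make the short-circuit behaviour concrete, here is a minimal sketch. The
names counting_match(), effective_charset() and $calls are hypothetical and
exist only for this demonstration; they are not RoundCube code.

```php
<?php
// Hypothetical helper: wraps the regex test and counts how often it runs.
$calls = 0;
function counting_match($str)
{
    global $calls;
    $calls++;
    return preg_match("/[\x80-\x9F]/", $str) === 1;
}

// Hypothetical wrapper around the proposed condition.
function effective_charset($from, $str)
{
    // The right-hand side only runs when the label is ISO-8859-1.
    if ($from == "ISO-8859-1" && counting_match($str)) {
        $from = "WINDOWS-1252";
    }
    return $from;
}

$utf8 = effective_charset("UTF-8", "caf\xC3\xA9");      // regex is skipped
$callsAfterUtf8 = $calls;
$latin = effective_charset("ISO-8859-1", "\x93hi\x94"); // regex runs once
```

After the UTF-8 call the counter is still zero, so messages labeled UTF-8
never pay for the regex search.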
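
In any case (see later e-mails) it seems most efficient to skip the regex
search and just interpret ISO-8859-1 as Windows-1252 in all cases. No harm
done if the text was labeled correctly: the two encodings differ only in
the 0x80-0x9F range, which ISO-8859-1 reserves for rarely used C1 control
characters.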
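
A minimal sketch of that unconditional re-interpretation, assuming the
conversion is done with PHP's iconv(). The function name to_utf8() is
hypothetical and is not RoundCube's actual conversion routine.

```php
<?php
// Hypothetical helper, not RoundCube's real function.
function to_utf8($str, $from)
{
    // Treat the ISO-8859-1 label as Windows-1252 in all cases; no regex.
    if (strtoupper($from) == "ISO-8859-1") {
        $from = "WINDOWS-1252";
    }
    return iconv($from, "UTF-8", $str);
}

// Mislabeled Windows-1252 "smart quotes" now convert to real quotes.
$quoted = to_utf8("\x93quoted\x94", "ISO-8859-1");
// Correctly labeled ISO-8859-1 text converts exactly as before.
$plain = to_utf8("caf\xE9", "ISO-8859-1");
```

One caveat: some iconv implementations may reject the few byte values that
Windows-1252 leaves undefined, so a real patch would want to check the
return value.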
>
> Maybe there should be a nested if statement, so that only messages
> that are marked as ISO-8859-1 are tested for the Black Hole of Windows-1252.
>
> if ($from == "ISO-8859-1")
>     if (preg_match("/[\x80-\x9F]/", $str))
>         $from = "WINDOWS-1252";
>
> With UTF-8 becoming more common, that would make the regex be skipped
> for likely the bulk of messages.
>
> However, the same problem could occur no matter what the message
> header says the encoding should be. A message that has a UTF-8 header
> could very well have WINDOWS-1252 encoding inside it. The above
> solution works because as the OP said :
>
>> The Windows-1252 character set is effectively a superset of the
>> iso-8859-1 character set,
>
> Not true of WINDOWS-1252 encoded data marked as, or should I say
> masquerading as, UTF-8 content.
>
> Does RC really want to parse all messages and apply heuristics to
> determine the encoding ?
> Yes, this is a relatively simple case, but you open the door for
> other patches to solve other specific encoding mismatches.
> We have no numbers as to how often this exact encoding mismatch
> happens other than " I ran into this once. "
> No offense to the OP, he provided a simple fix to the problem, but it
> is a very specific problem.

It is a very specific problem, but a common one nonetheless. For example,
HTML 5 *requires* this misinterpretation: content labeled ISO-8859-1 must
be decoded as Windows-1252.
http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

> Here's one to fix :
> If you subscribe to a mail list run by mailman in plain digest mode,
> it doesn't convert the incoming messages to a consistent encoding, it
> just mashes the original message in its original encoding into the
> digest message that is labeled as 7-bit us-ascii. How does RoundCube
> handle that ? It punts because it is an upstream problem.
>
> BTW, the MIME digest mode of mailman makes each message a separate
> part that is labeled with its own encoding ( but then you get
> attachments to messages, which is sub-optimal for me).

In a case like you described, RoundCube has no knowledge of the original
encoding. In the workaround I'm suggesting, a specific no-cost
re-interpretation would be applied based on foreknowledge of common
mislabeling.

--
Eric Stadtherr