Re: [RCD] Encoding issue, possible workaround

Eric Stadtherr Tue, 02 Jun 2009 08:09:46 -0700

On Tue, 2 Jun 2009 09:24:01 -0500, chasd <[email protected]> wrote:
> On Jun 1, 2009, at 8:36 PM, till wrote:
> 
>> And what's the performance trade off to always converting?
> 
> 
> It isn't the processing to do the encoding conversion, it is that  
> each message has a regex search to see how it should be converted.
> The way I read that code, even if the message is UTF-8, the regex  
> will be done to determine the validity of the if statement.
> 
>> if ($from == "ISO-8859-1" && preg_match("/[\x80-\x9F]/", $str))
>> $from = "WINDOWS-1252";


Like most other languages, PHP short-circuits: it won't evaluate the second
operand of an "&&" expression if the first evaluates to false. My proposed
order was intentional, based on that fact.
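
To make that concrete, here is a minimal sketch of the same condition with
the regex wrapped in a counting helper. The helper name
has_high_control_bytes() and the $calls counter are made up for
illustration only; they are not RoundCube code.

```php
<?php
// Hypothetical wrapper around the regex from the patch; it counts how
// often the search actually runs.
function has_high_control_bytes($str, &$calls)
{
    $calls++;
    return preg_match("/[\x80-\x9F]/", $str) === 1;
}

$calls = 0;

// Message labeled UTF-8: the first operand is false, so the regex is
// never evaluated, even though the string contains bytes in 0x80-0x9F.
$from = "UTF-8";
$str  = "\xE2\x80\x9Cquoted\xE2\x80\x9D";
if ($from == "ISO-8859-1" && has_high_control_bytes($str, $calls)) {
    $from = "WINDOWS-1252";
}
$calls_for_utf8 = $calls;
$from_for_utf8  = $from;

// Message labeled ISO-8859-1 but containing Windows-1252 curly quotes:
// the first operand is true, so the regex runs once and the label changes.
$from = "ISO-8859-1";
$str  = "\x93quoted\x94";
if ($from == "ISO-8859-1" && has_high_control_bytes($str, $calls)) {
    $from = "WINDOWS-1252";
}
$calls_for_latin1 = $calls;
$from_for_latin1  = $from;
```

The nested-if form suggested further down behaves identically; the "&&"
version just says it on one line.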

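In any case (see later e-mails) it seems most efficient to skip the regex
search and just interpret ISO-8859-1 as Windows-1252 in all cases. No harm
done if the text was labeled correctly: the two character sets differ only
in the 0x80-0x9F range, which ISO-8859-1 reserves for control characters.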
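
A rough sketch of what that unconditional re-interpretation could look like
follows. The function names effective_charset() and to_utf8() are
hypothetical, the case-insensitive label match is my own assumption, and it
assumes the iconv extension is available; it is not the actual RoundCube
conversion routine.

```php
<?php
// Hypothetical helper: always treat an ISO-8859-1 label as Windows-1252.
// No regex search is needed, so nothing extra is done per message.
function effective_charset($from)
{
    // Assumption: labels are matched case-insensitively.
    return strtoupper($from) == "ISO-8859-1" ? "WINDOWS-1252" : $from;
}

// Hypothetical wrapper: convert $str to UTF-8 using the adjusted label.
function to_utf8($str, $from)
{
    return iconv(effective_charset($from), "UTF-8", $str);
}

// Mislabeled text: Windows-1252 curly quotes (0x93, 0x94) around "café".
$mislabeled = to_utf8("\x93caf\xE9\x94", "ISO-8859-1");

// Correctly labeled text: only bytes that both character sets share.
$correct = to_utf8("caf\xE9", "ISO-8859-1");
```

The curly quotes in the mislabeled string come out as real quotation marks
instead of control characters, and the correctly labeled string converts
exactly as it would have under its original label.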

> 
> Maybe there should be a nested if statement, so that only messages  
> that are marked as ISO-8859-1 are tested for the Black Hole of Windows-1252.
> 
> if ($from == "ISO-8859-1")
>       if (preg_match("/[\x80-\x9F]/", $str))
>               $from = "WINDOWS-1252";
> 
> With UTF-8 becoming more common, that would make the regex be skipped  
> for likely the bulk of messages.
> 
> However, the same problem could occur no matter what the message  
> header says the encoding should be. A message that has a UTF-8 header  
> could very well have WINDOWS-1252 encoding inside it. The above  
> solution works because as the OP said :
> 
>> The Windows-1252 character set is effectively a superset of the  
>> iso-8859-1
>> character set,
> 
> Not true of WINDOWS-1252 encoded data marked as, or should I say  
> masquerading as, UTF-8 content.
> 
> Does RC really want to parse all messages and apply heuristics to  
> determine the encoding ?
> Yes, this is a relatively simple case, but you open the door for  
> other patches to solve other specific encoding mismatches.
> We have no numbers as to how often this exact encoding mismatch  
> happens other than " I ran into this once. "
> No offense to the OP, he provided a simple fix to the problem, but it  
> is a very specific problem.

It is a very specific problem, but a common problem nonetheless. For
example, HTML 5 *requires* this misinterpretation:

http://dev.w3.org/html5/spec/Overview.html#character-encodings-0

> Here's one to fix :
> If you subscribe to a mail list run by mailman in plain digest mode,  
> it doesn't convert the incoming messages to a consistent encoding, it  
> just mashes the original message in its original encoding into the  
> digest message that is labeled as 7-bit us-ascii. How does RoundCube  
> handle that ? It punts because it is an upstream problem.
> 
> BTW, the MIME digest mode of mailman makes each message a separate  
> part that is labeled with its own encoding ( but then you get  
> attachments to messages, which is sub-optimal for me).

In a case like you described, RoundCube has no knowledge of the original
encoding. In the workaround I'm suggesting, a specific no-cost
re-interpretation would be applied based on foreknowledge of common
mislabeling.


-- 
Eric Stadtherr
[email protected]
_______________________________________________
List info: http://lists.roundcube.net/dev/
