Re: mojibake

Dave Anderson Sun, 01 Jul 2012 17:59:39 -0700

On Sun, 1 Jul 2012, Anthony J. Bentley wrote:

>ropers writes:
>> This diff fixes things:
>>
>> --- bsdcan11-mandoc-openbsd.html     2012-06-30 22:18:52.000000000 +0200
>> +++ bsdcan11-mandoc-openbsd.html.newentities 2012-06-30 22:34:58.000000000 +0200
>> @@ -13,7 +13,7 @@
>>
>>  <p><a href="http://www.flickr.com/photos/tomkoadam/4778126822/"><img
>>  src="http://farm5.static.flickr.com/4115/4778126822_555b453a1e.jpg"></a></p>
>> -<p>Csiko - Foal. - Photo: Adam Tomko @flickr (CC)</p>
>> +<p>Csik&oacute; - Foal. - Photo: Adam Tomk&oacute; @flickr (CC)</p>
>>
>>  <HR>
>>  <P>Ingo Schwarze: Mandoc in OpenBSD - page 2: INTRO I -
>> @@ -725,7 +725,7 @@
>>  <HR>
>>  <P>Ingo Schwarze: Mandoc in OpenBSD - page 22: RECURRING II -
>>  BSDCan 2011, May 13, Ottawa</P>
>> -<H1>Bogue deja vue:</H1>
>> +<H1>Bogue d&eacute;j&agrave; vue:</H1>
>>  <H2>Collecting regression tests.</H2>
>>  <UL>
>>  <LI>Slow start in 2009:
>>
>> That's it. That's all.
>
>The advantage of using pure ASCII plus HTML escapes in a page is that it
>displays the correct content regardless of declared character encoding.
>The disadvantage is that it means adding escapes *everywhere*. Can you
>imagine writing http://www.openbsd.org/cs/ in anything but native UTF-8?
>At some point we have to pick an encoding and stick with it.
>
>> So again, the complaint was that there was mojibake gibberish in
>> Ingo's presentation, because the character encoding isn't specified
>> but defaults to UTF-8 in modern browsers, while the page is actually
>> iso-8859-1 encoded.
>
>Actually, "modern" browsers do not default to a particular encoding (in
>fact, this violates the HTML standard). Instead, they attempt to autodetect
>the charset. Sometimes this works, and sometimes it doesn't -- I've seen
>UTF-8 pages incorrectly detected as ISO-8859-1, and in particularly bad
>cases, vice versa.
>
>> There were many objections to a simple addition of <HEAD><META
>> http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
>> /></HEAD> as a fix.
>
>Yes, this is pretty ugly. But the only alternative is using one encoding
>everywhere and setting the appropriate HTTP header instead of an HTML
>meta tag. Actually, that's not a bad idea, but it means using UTF-8 on all
>pages, since that's the only encoding that can handle the different
>translations on the OpenBSD website. It would also require removing or
>altering meta tags on all pages (but considering the alternative is *adding*
>meta tags to all pages...).
>
>> But then I thought, what about browsers that don't support UTF-8 yet;
>> this is going to break things for them.
>
>I challenge you to find a single browser in ports that doesn't. IE6
>supports UTF-8 properly. Even Lynx works fine when the user has a UTF-8
>locale. (And ISO-8859-* are also locale-dependent, so this is not any
>worse.)
>
>
>So, in summary, the options are:
>
>Use HTML escapes everywhere. IMO, highly impractical.
>
>Use any encoding you wish, and set a meta tag when appropriate. This is
>basically what we have now. (The front pages of /, /de/, /fr/ all use
>ISO-8859-1; /cs/ uses UTF-8; /lt/ uses ISO-8859-13.)
>
>Use UTF-8 everywhere, and enforce this either with an HTTP header or
>meta tags.


You missed one: use any encoding you wish, and configure the server to
send the proper charset value in the real HTTP headers (by encoding the
appropriate charset info in the file-name extension).  This does suffer
from combinatorial explosion if you have both lots of different
charsets and lots of different types of files to serve, but usually
isn't especially difficult.  Done properly, it _always_ works when files
are viewed through the server, though (as someone pointed out) it
doesn't help if files are opened directly from disk in a browser.
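
To make the extension idea concrete, here is a minimal sketch in Python of
how a server might derive the Content-Type header from a file name.  The
extension names, both tables and the content_type() helper are made up for
illustration; they are not taken from any particular server's configuration.

```python
# Hypothetical extension tables; a real server would read these from its
# own configuration rather than hard-coding them.
CHARSETS = {
    "utf8": "UTF-8",
    "iso8859-1": "ISO-8859-1",
    "iso8859-13": "ISO-8859-13",
}
MEDIA_TYPES = {
    "html": "text/html",
    "txt": "text/plain",
    "css": "text/css",
}


def content_type(filename):
    """Build a Content-Type header value from the extensions in a file name."""
    media_type = None
    charset = None
    # Skip the base name; every later dot-separated part is an extension,
    # so "index.html.utf8" and "index.utf8.html" are treated the same way.
    for ext in filename.lower().split(".")[1:]:
        if ext in MEDIA_TYPES:
            media_type = MEDIA_TYPES[ext]
        elif ext in CHARSETS:
            charset = CHARSETS[ext]
    if media_type is None:
        # Unknown file type: fall back to a generic binary type.
        return "application/octet-stream"
    if charset is None:
        # No charset extension: send the media type without a charset.
        return media_type
    return f"{media_type}; charset={charset}"


if __name__ == "__main__":
    for name in ("index.html.utf8", "index.iso8859-13.html", "notes.txt"):
        print(name, "->", content_type(name))
```

Looking up the file type and the charset in separate tables is one way to
keep the number of rules down; a setup that needs one rule per (type,
charset) pair is where the combinatorial growth would show up.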

        Dave

-- 
Dave Anderson
<d...@daveanderson.com>
