Re: browsers and unicode surrogates

Steffen Kamp Fri, 19 Apr 2002 14:59:39 -0700

>I have added a couple more variations of the Unicode supplementary
>characters example page, for utf-16 and utf-32.


I am not sure if your UTF-16 and UTF-32 test pages really conform to the
HTML standard. The server states a content type of "text/html" without
charset information. From the content type, a browser should therefore
expect pure ASCII, at least until the META tag defining the document's
character encoding.

From the HTML 4.01 specification
<http://www.w3.org/TR/html4/charset.html>, section 5.2.2:

"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed)."
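As a rough illustration of that rule, the following Python sketch encodes a sample META declaration in several encodings and checks whether a parser scanning for ASCII bytes could still find it. The META string and the helper name meta_visible_as_ascii are made up for this example; they are not taken from the test pages or from any browser.

```python
# Hypothetical sample declaration, used only for this illustration.
META = '<meta http-equiv="Content-Type" content="text/html; charset={}">'


def meta_visible_as_ascii(encoding: str) -> bool:
    """Return True if the encoded META tag still contains the ASCII bytes '<meta'."""
    data = META.format(encoding).encode(encoding)
    return b"<meta" in data


for enc in ("utf-8", "iso-8859-1", "utf-16", "utf-32"):
    print(enc, meta_visible_as_ascii(enc))
```

For UTF-16 and UTF-32 the check fails because every character occupies two or four bytes, so the byte sequence "<meta" never appears contiguously.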

Your documents, however, just start with a BOM, and I couldn't find
anything stating that a BOM would be a valid way of specifying the
character encoding.
Although some browsers seem to guess the character encoding from an
available BOM, I wouldn't expect them to do so when there usually are
other ways of determining this information.
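For reference, guessing the encoding from a BOM could look roughly like the Python sketch below. This is only an assumption about how such sniffing might work, not a description of what any particular browser does; the function name sniff_bom and the returned encoding labels are invented for the example.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return an encoding name guessed from a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A document without a BOM yields None, in which case the reader still has to rely on the HTTP charset parameter or the META declaration.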

To get a second opinion I asked w3.org's online validation service to
check your UTF-16 document with auto detection of the character encoding.
(<http://validator.w3.org/check?uri=http://www.i18nguy.com/unicode/plane1-utf-16.html&charset=(detect+automatically)&doctype=Inline>)
The Validator complained about the BOM as well as (not surprisingly) a
lot of ASCII zero (0x00) characters.
However, when I gave the validator an ASCII-only document with a META tag
specifying UTF-16 as the encoding (just for testing), it said that it does
not yet support this encoding, so I don't fully trust the validator in this
case.
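The zero bytes the validator reported are what one would expect if UTF-16 data is read as a single-byte encoding. A small Python sketch, using a made-up helper nul_bytes and the sample string "<html>", counts them:

```python
def nul_bytes(text: str, encoding: str = "utf-16-be") -> int:
    """Count the 0x00 bytes produced when text is encoded with the given encoding."""
    return text.encode(encoding).count(b"\x00")


# Each ASCII character becomes two bytes in UTF-16, one of which is 0x00.
print(nul_bytes("<html>"))           # six characters, six zero bytes
print(nul_bytes("<html>", "utf-8"))  # no zero bytes in UTF-8
```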

Steffen

-- 
Steffen Kamp
mailto:[EMAIL PROTECTED]
http://homepage.mac.com/earthlingsoft
