Let me choose my words more carefully.

A browser may recognize UTF-32 (e.g., in a sniffer) without supporting it
(either internally or for transcoding into a different internal encoding).

If the browser supports UTF-32, then step (2) of [1] applies.


But, if the browser does not support UTF-32, then the table in step (4) of
[1] is supposed to apply, which would interpret the initial two bytes FF FE
as UTF-16LE according to the current language of [1], and further, return a
confidence level of "certain".

I see the problem now. It seems that the table in step (4) should be
changed to interpret an initial FF FE as UTF-16LE only if the following two
bytes are not both 00 (FF FE 00 00 being the UTF-32LE BOM).
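
To make the proposed change concrete, here is a rough sketch of how the
step (4) table lookup might behave with that extra check. The function name
and the (encoding, confidence) return shape are my own for illustration and
are not taken from [1]; I am also assuming that "no row matches" simply
means falling through to the later steps.

```python
def sniff_bom(data: bytes):
    """Return (encoding, confidence) for a leading BOM, or None if no row matches.

    Hypothetical helper: models the BOM table with the proposed FF FE check.
    """
    if data.startswith(b"\xEF\xBB\xBF"):
        return ("utf-8", "certain")
    if data.startswith(b"\xFE\xFF"):
        return ("utf-16be", "certain")
    if data.startswith(b"\xFF\xFE"):
        # Proposed change: FF FE 00 00 is the UTF-32LE BOM, so do not
        # claim UTF-16LE here; let the later steps decide instead.
        if data[2:4] == b"\x00\x00":
            return None
        return ("utf-16le", "certain")
    return None
```

With this, a stream starting FF FE 3C 00 is still reported as UTF-16LE with
confidence "certain", while one starting FF FE 00 00 is no longer
mis-recognized as UTF-16LE by this step.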

On Mon, Dec 5, 2011 at 11:45 AM, Glenn Maynard <gl...@zewt.org> wrote:

> On Mon, Dec 5, 2011 at 1:00 PM, Glenn Adams <gl...@skynav.com> wrote:
>> > [2] http://www.w3.org/TR/charmod/#C030
>>> No, it wouldn't.  That doesn't say that UTF-32 must be recognized.
>> You misread me. I am not saying or supporting that UTF-32 must be
>> recognized. I am saying that MIS-recognizing UTF-32 as UTF-16 violates [2].
> It's impossible to violate that rule if the encoding isn't recognized.
> "When an IANA-registered charset name *is recognized*"; UTF-32 isn't
> recognized, so this is irrelevant.
> If a browser doesn't support UTF-32 as an incoming interchange format,
>> then it should treat it as any other character encoding it does not
>> recognize. It must not pretend it is another encoding.
> When an encoding is not recognized by the browser, the browser has full
> discretion in guessing the encoding.  (See step 7 of
> http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding.)
> It's perfectly reasonable for UTF-32 data to be detected as UTF-16.  For
> example, UTF-32 data is likely to contain null bytes when scanned bytewise,
> and UTF-16 is the only supported encoding where that's likely to happen.
> Steps 7 and 8 give browsers unrestricted freedom in selecting the encoding
> when the previous steps are unable to do so; if they choose to include "if
> the charset is declared as UTF-32, return UTF-16" as one of their
> autodetection rules, the spec allows it.
> --
> Glenn Maynard
