On Mon, 22 Apr 2002, Stefan Persson wrote:

I haven't added plane 1 characters yet (Tex let me do that, thanks!).
However, my test pages can be used to test how various web browsers
interpret the various forms of UTF-16 and UTF-32, with or without a BOM
and with or without external information (such as the MIME charset in
the HTTP Content-Type header). This is not of practical importance
(UTF-8 is much less ambiguous and better supported than UTF-16/32 by
web browsers), but it's interesting nonetheless because the way the
various forms of UTF-16/32 have to be interpreted has been discussed
recently.
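To make the "with or without BOM" variants concrete, here is a minimal Python sketch of how such test data could be produced and how a BOM can be sniffed from the first bytes. It is only an illustration, not how the actual test pages were built: the sample string and the names `SAMPLE`, `BOMS`, `encode_variant` and `sniff_bom` are assumptions made up for this example.

```python
import codecs

# Hypothetical sample text; the real test pages contain Korean text.
SAMPLE = "한글 test"

# Python codec name -> BOM bytes for each explicit-endian form.
BOMS = {
    "utf-16-le": codecs.BOM_UTF16_LE,  # FF FE
    "utf-16-be": codecs.BOM_UTF16_BE,  # FE FF
    "utf-32-le": codecs.BOM_UTF32_LE,  # FF FE 00 00
    "utf-32-be": codecs.BOM_UTF32_BE,  # 00 00 FE FF
}


def encode_variant(text, codec, with_bom):
    """Encode text in an explicit-endian form, optionally prefixed by a BOM."""
    body = text.encode(codec)  # the -le/-be codecs never emit a BOM themselves
    return BOMS[codec] + body if with_bom else body


def sniff_bom(data):
    """Return the codec implied by a leading BOM, or None if there is none."""
    # UTF-32 must be checked first: its LE BOM begins with the UTF-16LE BOM.
    for codec in ("utf-32-le", "utf-32-be", "utf-16-le", "utf-16-be"):
        if data.startswith(BOMS[codec]):
            return codec
    return None


if __name__ == "__main__":
    for codec in BOMS:
        for with_bom in (True, False):
            data = encode_variant(SAMPLE, codec, with_bom)
            print(codec, "BOM" if with_bom else "no BOM",
                  data[:8].hex(" "), "->", sniff_bom(data))
```

A receiver that sees no BOM has nothing in the byte stream to go on, which is why the external charset label (or a default byte order) matters for those cases.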
> ----- Original Message -----
> From: <[EMAIL PROTECTED]>
> Sent: den 22 april 2002 20:24
>
> > Thank you for this tip. I didn't know this and ended up
> > 'cluttering' my filenames with charset suffixes at
> > <http://jshin.net/i18n/utftest>.
>
> The following pages display Korean text:
>
> * All UTF-16 with BOM
> * All UTF-32LE with BOM
> * UTF-16LE without BOM, encoding specified as UTF-16
>
> The following pages are displayed as Latin-1 gibberish, ASCII displayed
> properly:
> * UTF-16 without BOM, encoding specified as UTF-16LE, UTF-16BE, or not
>   specified at all
> * All UTF-32BE
> * All UTF-32LE without BOM
>
> This page is misinterpreted as UTF-16LE without line breaking:
> * UTF-16BE without BOM, encoding specified as UTF-16
>
> I'm using IE 5.5 under Windows 98.

Thank you for your test result. MS IE 5.5 seems to *ignore* the MIME
charset specified in the HTTP header. It appears to rely *solely* on
the presence of a BOM. If no BOM is present, it assumes the platform
byte order. Is this behavior compatible with what Mark and Ken
described last week, and again this week, as to how the various forms
of UTF-16 and UTF-32 should be interpreted? It doesn't seem to be. The
way Mozilla interprets the various forms of UTF-16|32 appears to be
more in line with what Mark and Ken have written, although there are
some issues to be resolved there as well. It'll be interesting to see
how Opera does.

Here's the test result with Mozilla 0.9.9 on ix86 Linux (that is, the
platform byte order is the same as in your case).

* The following pages always get displayed as intended:
  - All UTF-16 and UTF-32 pages with the MIME charset (*with* the
    endianness at the end, i.e. UTF-32(LE|BE), UTF-16(LE|BE))
    specified in the HTTP header, regardless of the endianness and the
    presence of a BOM (in UTF-32 pages, the BOM is NOT ignored and is
    rendered as 'ZWNBS' enclosed by a dotted square): 8 cases
  - UTF-16BE with BOM but without MIME charset specified: 1 case
  - UTF-16BE and UTF-32BE without BOM but with MIME charset specified
    as UTF-16 and UTF-32: 2 cases
  - UTF-16BE and UTF-32BE with BOM but with MIME charset specified as
    UTF-16 and UTF-32: 2 cases

* For the following pages, auto-detection sometimes works but not
  always:
  - UTF-16LE and UTF-32LE with BOM but without MIME charset
    specified: 2 cases
  - UTF-32BE with BOM but without MIME charset specified: 1 case

* The following pages are recognized as Latin-1. US-ASCII characters
  are rendered correctly, with one or three hollow boxes before or
  after each of them depending on the endianness (BE/LE) and the
  size (16/32):
  - UTF-16LE and UTF-32LE without BOM and without MIME charset:
    2 cases
  - UTF-16BE and UTF-32BE without BOM and without MIME charset:
    2 cases

* The following pages are recognized as UTF-16BE and UTF-32BE:
  - UTF-16LE and UTF-32LE without BOM but with MIME charset specified
    as UTF-16 and UTF-32: 2 cases
  - UTF-16LE and UTF-32LE with BOM but with MIME charset specified as
    UTF-16 and UTF-32: 2 cases

Jungshik Shin