Doug responded to Mark's clarification:
> > The UTF-16M sequence <0x1234, 0x0061, 0xD800, 0xDC00> is represented
> > as one of:
> > <0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOMless
> > <0xFE 0xFF 0x12 0x34 0x00 0x61 0xD8 0x00 0xDC 0x00> // BOM
> > <0xFF 0xFE 0x34 0x12 0x61 0x00 0x00 0xD8 0x00
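The three byte serializations listed in the quoted message can be reproduced with a short sketch. The helper name `serialize` and its parameters are hypothetical, introduced here only for illustration; the code unit values are the ones from the message.

```python
# Hypothetical helper (not from the thread): serialize 16-bit code units
# to bytes in a chosen byte order, optionally prefixed with a BOM (U+FEFF).
def serialize(code_units, big_endian=True, bom=False):
    order = "big" if big_endian else "little"
    units = ([0xFEFF] if bom else []) + list(code_units)
    return b"".join(u.to_bytes(2, order) for u in units)

# The in-memory ("UTF-16M") code unit sequence from the message.
units = [0x1234, 0x0061, 0xD800, 0xDC00]

bomless_be = serialize(units)                         # big-endian, no BOM
bom_be = serialize(units, bom=True)                   # big-endian with BOM
bom_le = serialize(units, big_endian=False, bom=True) # little-endian with BOM

for label, data in [("BOMless", bomless_be), ("BOM, BE", bom_be), ("BOM, LE", bom_le)]:
    print(label, data.hex(" "))
```

The same code units yield different byte sequences depending only on the byte order and on whether a BOM is written, which is the distinction the thread is about.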
Sent: Sunday, April 14, 2002 15:28
Subject: Re: Default endianness of Unicode, or not
> Mark Davis <[EMAIL PROTECTED]> wrote:
>
> > Part of the problem is that the term "UTF-16" means two different
> > things. Let me see if I can make it clearer.
>
Mark Davis <[EMAIL PROTECTED]> wrote:
> Part of the problem is that the term "UTF-16" means two different
> things. Let me see if I can make it clearer.
>
> Let "UTF-16M" refer to the in-memory form, which is a sequence of 16-bit
> code units. The byte ordering is logically immaterial, since it
>
http://www.macchiato.com
- Original Message -
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, April 13, 2002 11:42
Subject: Re: Default endiannes
On Wednesday 2002-04-10, Kenneth Whistler <[EMAIL PROTECTED]> wrote:
> There, feel better?
Not really. I'm getting the sense on one hand that UTF-16, sans BOM,
can be big-endian or little-endian depending on the platform, on the
other hand that little-endian UTF-16 isn't "legal" unless it has a
> > So same semantics as before.
>
> Yep. The editorial committee wouldn't be doing its job right
> if it were changing the semantics of the standard.
Agreed! Is there any mention that the non-BOM byte sequence is most
significant byte first anywhere else? You know, for the newbies?
> Joshua 1.
Yves,
> So same semantics as before.
Yep. The editorial committee wouldn't be doing its job right
if it were changing the semantics of the standard. The intent
here is to rewrite everything so that the semantics intended
all along will finally be revealed to everyone!
It really is a little like
And of course, I have been complaining about ICU's UTF-16 converter
behavior, but glibc's converter makes the same assumption that "UTF-16"
is in the local endianness:
gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%
So fixing one but
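For comparison, here is a minimal sketch of the policy argued for elsewhere in this thread: honor a BOM if one is present, otherwise assume big-endian. The function name `decode_utf16_default_be` is hypothetical, and this is not a description of how ICU or glibc actually behave.

```python
import codecs

# Hypothetical decoder (an assumption for illustration, not ICU's or glibc's
# code): use the BOM when present, otherwise fall back to big-endian.
def decode_utf16_default_be(data: bytes) -> str:
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: assume most significant byte first.
    return data.decode("utf-16-be")

# Same kind of input as in the transcript: "hello" as BOM-less big-endian bytes.
raw = "hello\n".encode("utf-16-be")
print(repr(decode_utf16_default_be(raw)))
```

Under this policy the BOM-less big-endian input from the transcript decodes on any platform, whereas a converter that assumes the local byte order would reject or misread it on a little-endian machine.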
> "D43 UTF-16 character encoding scheme: the Unicode
> CES that serializes a UTF-16 code unit sequence as a byte sequence
> in either big-endian or little-endian format.
>
> * In UTF-16 (the CES), the UTF-16 code unit sequence
> <004D 0430 4E8C D800 DF02> is serialized as
> or
> o
Yves wrote, in response to Doug:
> > > The last time I read the Unicode standard UTF-16 was big endian
> > > unless a BOM was present, and that's what I expected from a UTF-16
> > > converter.
> >
> > Conformance requirement C2 (TUS 3.0, p. 37) says:
> >
> > "The Unicode Standard does not speci
> > The last time I read the Unicode standard UTF-16 was big endian
> > unless a BOM was present, and that's what I expected from a UTF-16
> > converter.
>
> Conformance requirement C2 (TUS 3.0, p. 37) says:
>
> "The Unicode Standard does not specify any order of bytes inside a
> Unicode value."
> > The last time I read the Unicode standard UTF-16 was big endian
> > unless a BOM was present, and that's what I expected from a UTF-16
> > converter.
>
> Conformance requirement C2 (TUS 3.0, p. 37) says:
>
[And many other good references where TUS does *not* say that :)]
OK, maybe in 2.0, o