My collection of test pages and of surveys of fonts and programs is becoming
too popular for my ISP's "free" Web space, so I am moving it to a proper URL
on a faster server. The new address is:
http://www.alanwood.net/unicode/
Please update any links or bookmarks you may have for the old addres
At 23:17 -0400 2002-04-09, ÇÎÅZÅZÅZÅZ ÇÎÅZÅZÅZ wrote:
>I wonder if Michael Everson will make a Gaelic kana font? Probably not.
Only if commissioned to do so, but it seems to me that the ductus of
Latin and Kana are not very related. One doesn't write Gaelic with a
brush.
--
Michael Everson ***
Hello, experts!
Every time I read the following passage in
http://www.unicode.org/unicode/uni2book/ch03.pdf
I get confused:
- A single abstract character may correspond to more than one code
value - ...
- Multiple code values may be required to represent a single abstract
character. For exam
Sorry for the empty message; I didn't mean to reply to Alan's message
(but thanks for the updated URL; I updated my UTF-8 sampler page at
http://www.columbia.edu/kermit/utf8.html).
- Frank
Антон Тагунов <[EMAIL PROTECTED]> wrote regarding Definition D5:
> Every time I read the following passage in
> http://www.unicode.org/unicode/uni2book/ch03.pdf
> I get confused:
>
> - A single abstract character may correspond to more than one code
> value - ...
> - Multiple code values may be
> > The last time I read the Unicode standard UTF-16 was big endian
> > unless a BOM was present, and that's what I expected from a UTF-16
> > converter.
>
> Conformance requirement C2 (TUS 3.0, p. 37) says:
>
[And many other good references where TUS does *not* say that :)]
OK, maybe in 2.0, o
> > The last time I read the Unicode standard UTF-16 was big endian
> > unless a BOM was present, and that's what I expected from a UTF-16
> > converter.
>
> Conformance requirement C2 (TUS 3.0, p. 37) says:
>
> "The Unicode Standard does not specify any order of bytes inside a
> Unicode value."
So the original statement was correct. If the file starts with FF FE, it
must be a little-endian encoding; but you can't tell whether it's UTF-16 or
UTF-32.
- rick cameron
-Original Message-
From: Mark Davis [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 9 April 2002 20:36
To: Kenneth Whistl
The reason for ICU's "UTF-16" converter not trying to auto-detect the BOM is that this
seems to be something that the _application_ has to decide, not the _converter_ that
the application instantiates.
This converter name is (currently) only a convenience alias for "use the UTF-16 byte
serializ
> The reason for ICU's "UTF-16" converter not trying to auto-detect the BOM
> is that this seems to be something that the _application_ has to decide,
> not the _converter_ that the application instantiates.
> This converter name is (currently) only a convenience alias for "use the
> UTF-16 byte s
> Антон Тагунов <[EMAIL PROTECTED]> wrote regarding Definition D5:
>
> > Every time I read the following passage in
> > http://www.unicode.org/unicode/uni2book/ch03.pdf
> > I get confused:
> >
> > - A single abstract character may correspond to more than one code
> > value - ...
> > - Multiple
Rick Cameron wrote:
> So the original statement was correct. If the file starts with FF FE, it
> must be a little-endian encoding; but you can't tell whether it's UTF-16 or
> UTF-32.
If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE
is unambiguous.
If you
> So the original statement was correct. If the file starts with FF
FE,
> it must be a little-endian encoding; but you can't tell whether it's
> UTF-16 or UTF-32.
The original statement was:
> > A Unicode text file beginning with FEFF is
> > big-endian, and a file beginning with FFFE (not a lega
> If you look for any Unicode signature, then you look for FF
> FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE).
FF FE 00 00 could be the UTF-32LE BOM, but it could also be UTF-16LE BOM
followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as "text",
but there's no know
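
A minimal Python sketch of the signature check discussed above, testing the longer signatures first so that FF FE 00 00 is tried before FF FE. The names `SIGNATURES`, `sniff_signature` and `is_ambiguous` are made up for illustration; they are not part of ICU, glibc or any other library mentioned in this thread. The result is only a guess: as noted, FF FE 00 00 can also be read as a UTF-16LE BOM followed by U+0000.

```python
# Byte signatures, longest first, so FF FE 00 00 is tested before FF FE.
SIGNATURES = [
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xff\xfe", "UTF-16LE"),
    (b"\xfe\xff", "UTF-16BE"),
]


def sniff_signature(data: bytes):
    """Return (guessed_encoding, signature_length), or (None, 0) if no match."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def is_ambiguous(data: bytes) -> bool:
    """True when the leading bytes fit both UTF-32LE and UTF-16LE + U+0000."""
    return data.startswith(b"\xff\xfe\x00\x00")


if __name__ == "__main__":
    sample = "hi".encode("utf-16-le")
    print(sniff_signature(b"\xff\xfe" + sample))
    print(is_ambiguous(b"\xff\xfe\x00\x00"))
```

A caller that already knows the text is UTF-16 would skip the UTF-32 entries altogether, which is the case where FF FE is unambiguous.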
Yves wrote, in response to Doug:
> > > The last time I read the Unicode standard UTF-16 was big endian
> > > unless a BOM was present, and that's what I expected from a UTF-16
> > > converter.
> >
> > Conformance requirement C2 (TUS 3.0, p. 37) says:
> >
> > "The Unicode Standard does not speci
Here is what I think the FAQ ought to say:
Suppose you know that the text is Unicode.
- Unicode can be represented in a number of different forms (UTFs)
- some of them *may* start with a BOM (a byte sequence that would
correspond to U+FEFF).
- some cannot (in that case, a byte sequence that w
> "D43 UTF-16 character encoding scheme: the Unicode
> CES that serializes a UTF-16 code unit sequence as a byte sequence
> in either big-endian or little-endian format.
>
> * In UTF-16 (the CES), the UTF-16 code unit sequence
> <004D 0430 4E8C D800 DF02> is serialized as
> or
> o
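
For readers who want to see the two byte orders side by side, here is a small Python sketch that serializes the code unit sequence from the quoted example both ways. It does not reproduce the wording of the draft (which is cut off above); the helper names `serialize` and `spaced_hex` are invented for this illustration, and no BOM is written.

```python
import struct

# The UTF-16 code unit sequence from the quoted example.
CODE_UNITS = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]


def serialize(units, big_endian: bool) -> bytes:
    """Pack 16-bit code units as bytes in the requested byte order."""
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)


def spaced_hex(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    print("big-endian:   ", spaced_hex(serialize(CODE_UNITS, True)))
    print("little-endian:", spaced_hex(serialize(CODE_UNITS, False)))
```

The last two code units are a surrogate pair, so the five code units encode four characters.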
And of course, I have been complaining about ICU's UTF-16 converter
behavior, but glibc's one makes the same assumption that "UTF-16" is in the
local endianness:
gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%
So fixing one but
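
As an illustration of the behavior being asked for, here is a hedged Python sketch of a "UTF-16" decoder that honors a BOM when one is present and otherwise falls back to big-endian. The big-endian default is the reading argued for in this thread, taken here as an assumption; the function name `decode_utf16` and its `default` parameter are hypothetical and do not describe what ICU or glibc actually do.

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 bytes, using the BOM if present, else `default`.

    Assumption for this sketch: without a BOM the data is big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")   # strip FE FF
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")   # strip FF FE
    return data.decode(default)


if __name__ == "__main__":
    # Same input as the shell example: "hello\n" as BOM-less UTF-16BE.
    raw = "hello\n".encode("utf-16-be")
    print(repr(decode_utf16(raw)))
```

Passing `default="utf-16-le"` gives the other policy, which makes the point that the choice belongs to whoever calls the converter.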
Yves,
> So same semantics as before.
Yep. The editorial committee wouldn't be doing its job right
if it were changing the semantics of the standard. The intent
here is to rewrite everything so that the semantics intended
all along will finally be revealed to everyone!
It really is a little like
> > So same semantics as before.
>
> Yep. The editorial committee wouldn't be doing its job right
> if it were changing the semantics of the standard.
Agreed! Is there any mention that the non-BOM byte sequence is most
significant byte first anywhere else? You know, for the newbies?
> Joshua 1.
Mark Davis <[EMAIL PROTECTED]> wrote:
> - when one of the BOM-allowing UTFs starts with a BOM, you know the
> encoding*, and you strip off the BOM when you get the content.
>
> *assuming that no UTF-16 file has U+0000 as the first character.
In the real world, this is a pretty good assumption --