Alan Wood's Unicode Resources is moving

2002-04-10 Thread Alan Wood
My collection of test pages and of surveys of fonts and programs is becoming too popular for my ISP's "free" Web space, so I am moving it to a proper URL on a faster server. The new address is: http://www.alanwood.net/unicode/ Please update any links or bookmarks you may have for the old addres

Re: Gaelic, etc., Unicode fonts

2002-04-10 Thread Michael Everson
At 23:17 -0400 2002-04-09, ÇÎÅZÅZÅZÅZ ÇÎÅZÅZÅZ wrote: >I wonder if Michael Everson will make a Gaelic kana font? Probably not. Only if commissioned to do so, but it seems to me that the ductus of Latin and Kana are not very related. One doesn't write Gaelic with a brush. -- Michael Everson ***

Discrepancy in ch03.pdf?

2002-04-10 Thread Anton Tagunov
Hello, experts! Every time I read the following passage in http://www.unicode.org/unicode/uni2book/ch03.pdf I get confused: - A single abstract character may correspond to more than one code value - ... - Multiple code values may be required to represent a single abstract character. For exam

Re: Alan Wood's Unicode Resources is moving

2002-04-10 Thread Frank da Cruz

Re: Alan Wood's Unicode Resources is moving

2002-04-10 Thread Frank da Cruz
Sorry for the empty message; I didn't mean to reply to Alan's message (but thanks for the updated URL; I updated my UTF-8 sampler page at http://www.columbia.edu/kermit/utf8.html). - Frank

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Doug Ewell
Антон Тагунов <[EMAIL PROTECTED]> wrote regarding Definition D5: > Every time I read the following passage in > http://www.unicode.org/unicode/uni2book/ch03.pdf > I get confused: > > - A single abstract character may correspond to more than one code > value - ... > - Multiple code values may be

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
> > The last time I read the Unicode standard UTF-16 was big endian > > unless a BOM was present, and that's what I expected from a UTF-16 > > converter. > > Conformance requirement C2 (TUS 3.0, p. 37) says: > [And many other good references where TUS does *not* say that :)] OK, maybe in 2.0, o

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
> > The last time I read the Unicode standard UTF-16 was big endian > > unless a BOM was present, and that's what I expected from a UTF-16 > > converter. > > Conformance requirement C2 (TUS 3.0, p. 37) says: > > "The Unicode Standard does not specify any order of bytes inside a > Unicode value."

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Rick Cameron
So the original statement was correct. If the file starts with FF FE, it must be a little-endian encoding; but you can't tell whether it's UTF-16 or UTF-32. - rick cameron -Original Message- From: Mark Davis [mailto:[EMAIL PROTECTED]] Sent: Tuesday, 9 April 2002 20:36 To: Kenneth Whistl

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
The reason for ICU's "UTF-16" converter not trying to auto-detect the BOM is that this seems to be something that the _application_ has to decide, not the _converter_ that the application instantiates. This converter name is (currently) only a convenience alias for "use the UTF-16 byte serializ

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye
> The reason for ICU's "UTF-16" converter not trying to auto-detect the BOM > is that this seems to be something that the _application_ has to decide, > not the _converter_ that the application instantiates. > This converter name is (currently) only a convenience alias for "use the > UTF-16 byte s

Re: Discrepancy in ch03.pdf?

2002-04-10 Thread Kenneth Whistler
> Антон Тагунов <[EMAIL PROTECTED]> wrote regarding Definition D5: > > > Every time I read the following passage in > > http://www.unicode.org/unicode/uni2book/ch03.pdf > > I get confused: > > > > - A single abstract character may correspond to more than one code > > value - ... > > - Multiple

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Markus Scherer
Rick Cameron wrote: > So the original statement was correct. If the file starts with FF FE, it > must be a little-endian encoding; but you can't tell whether it's UTF-16 or > UTF-32. If you know that it's UTF-16 and you just try to figure out the byte order, then FF FE is unambiguous. If you

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
> So the original statement was correct. If the file starts with FF FE, > it must be a little-endian encoding; but you can't tell whether it's > UTF-16 or UTF-32. The original statement was: > > A Unicode text file beginning with FEFF is > > big-endian, and a file beginning with FFFE (not a lega

RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread jarkko . hietaniemi
> If you look for any Unicode signature, then you look for FF > FE 00 00 (UTF-32LE) before you check for FF FE (UTF-16LE). FF FE 00 00 could be the UTF-32LE BOM, but it could also be a UTF-16LE BOM followed by a UTF-16 U+0000. Yes, the NULL is usually not thought of as "text", but there's no know
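
A minimal Python sketch of the check order quoted above, and of the ambiguity it leaves open. The function name `sniff_signature` and the table `SIGNATURES` are made up for illustration; only the BOM constants come from Python's standard `codecs` module. It is not how any particular converter is implemented.

```python
import codecs

# Longer signatures are listed first so that FF FE 00 00 is tried before FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def sniff_signature(data: bytes):
    """Return the name of the first matching signature, or None if none matches."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None


# UTF-16LE data: a BOM, then U+0000, then "A". Its first four bytes are FF FE 00 00.
ambiguous = codecs.BOM_UTF16_LE + "\x00A".encode("utf-16-le")
print(sniff_signature(ambiguous))
```

With this ordering the last call reports UTF-32LE even though the bytes were produced as UTF-16LE, which is exactly the case raised in the message: the signature alone cannot tell the two apart when the first UTF-16 character is U+0000.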

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler
Yves wrote, in response to Doug: > > > The last time I read the Unicode standard UTF-16 was big endian > > > unless a BOM was present, and that's what I expected from a UTF-16 > > > converter. > > > > Conformance requirement C2 (TUS 3.0, p. 37) says: > > > > "The Unicode Standard does not speci

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Mark Davis
Here is what I think the FAQ ought to say: Suppose you know that the text is Unicode. - Unicode can be represented in a number of different forms (UTFs) - some of them *may* start with a BOM (a byte sequence that would correspond to U+FEFF). - some cannot (in that case, a byte sequence that w

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
> "D43 UTF-16 character encoding scheme: the Unicode > CES that serializes a UTF-16 code unit sequence as a byte sequence > in either big-endian or little-endian format. > > * In UTF-16 (the CES), the UTF-16 code unit sequence > <004D 0430 4E8C D800 DF02> is serialized as > or > o
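
The byte sequences in the quoted definition were lost from this archive copy and are not reconstructed here. As a rough illustration only, the following Python sketch shows what big-endian and little-endian serialization of the quoted code unit sequence produce; the helper `serialize` and its parameters are hypothetical names, not part of any standard or library.

```python
import struct

# The UTF-16 code unit sequence quoted in the definition above.
CODE_UNITS = [0x004D, 0x0430, 0x4E8C, 0xD800, 0xDF02]


def serialize(units, big_endian=True, with_bom=False):
    """Pack 16-bit code units into bytes in the requested byte order."""
    units = list(units)
    if with_bom:
        units = [0xFEFF] + units  # U+FEFF written in the same byte order
    fmt = (">" if big_endian else "<") + "%dH" % len(units)
    return struct.pack(fmt, *units)


print(serialize(CODE_UNITS, big_endian=True).hex())
print(serialize(CODE_UNITS, big_endian=False).hex())
print(serialize(CODE_UNITS, big_endian=True, with_bom=True).hex())
```

The same code units come out as different byte sequences depending on byte order, which is why a reader of the bytes needs either a BOM or an agreed default.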

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
And of course, I have been complaining about ICU's UTF-16 converter behavior, but glibc's one makes the same assumption that "UTF-16" is in the local endianness: gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii iconv: illegal input sequence at position 0 gabier% So fixing one but
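
For comparison, a small Python sketch of the behavior argued for in this thread: honor a BOM if there is one, otherwise read the bytes most significant byte first. The function `decode_utf16` is a hypothetical helper written for this illustration; it does not describe what ICU or glibc actually do.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode "UTF-16" bytes: a BOM decides the byte order, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian rather than the local byte order.
    return data.decode("utf-16-be")


# Stand-in for BOM-less big-endian input like that in the shell example above.
big_endian_hello = "hello\n".encode("utf-16-be")
print(repr(decode_utf16(big_endian_hello)))
```

Under this assumption the result no longer depends on the byte order of the machine doing the conversion.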

RE: Default endianness of Unicode, or not

2002-04-10 Thread Kenneth Whistler
Yves, > So same semantics as before. Yep. The editorial committee wouldn't be doing its job right if it were changing the semantics of the standard. The intent here is to rewrite everything so that the semantics intended all along will finally be revealed to everyone! It really is a little like

RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye
> > So same semantics as before. > > Yep. The editorial committee wouldn't be doing its job right > if it were changing the semantics of the standard. Agreed! Is there any mention that the non-BOM byte sequence is most significant byte first anywhere else? You know, for the newbies? > Joshua 1.

Re: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Doug Ewell
Mark Davis <[EMAIL PROTECTED]> wrote: > - when one of the BOM-allowing UTFs starts with a BOM, you know the > encoding*, and you strip off the BOM when you get the content. > > *assuming that no UTF-16 file has U+0000 as the first character. In the real world, this is a pretty good assumption --