On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
> Thanks. The Perl implementors and you have done a very good job. I have a
> few suggestions and one complaint.
>
> The most important issue is chr().
>
> >Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
> >return an eight-bit character for backward compatibility with older
> >Perls (in ISO 8859-1 platforms it can be argued to be producing
> >Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
> >is equivalent to the first 256 characters of Unicode). For C<chr()>
> >arguments of 0x100 or more, Unicode will always be produced.
>
> My complaint: There should be a pure Unicode alternative to this kludge.

You mean chr() producing UTF-8? There has been talk about uchr() or the
like. Maybe I'll just implement it in some module.

> Obviously, it is not hard to write one in Perl, but it should be part of the
> implementation.
> ISO Latin-1 characters encoded as 80-FF in single bytes are not Unicode.
> There is no Unicode transformation format or other encoding that permits
> this. The code point range is actually x000080-x0000FF, and the encodings
> are
>
> 0000000010000000 0000000011111111 UTF-16 Big Endian
> 1000000000000000 1111111100000000 UTF-16 Little Endian
> 00000000000000000000000010000000 00000000000000000000000011111111 UCS-4 BE
> 10000000000000000000000000000000 11111111000000000000000000000000 UCS-4 LE
> 1100001010000000 1100001110111111 UTF-8

Okay.

> >Character ranges in regular expression character classes [a-z]
> >and in the tr///, aka y///, operator are not affected by Unicode.
>
> This could mean that they extend gracefully to Unicode, for example
> something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
> 00-FF range (or would it be 00-7F?). Clarification is needed.

Hmmm. They extend but they may not do what people are expecting them
to do: [a-z] will most certainly not mean "alphabetic characters".

> >Unicode is a standard that defines a unique number for every character.
>
> Unique: Some characters are encoded in Unicode twice. Examples include
> A-ring, also encoded as the Angstrom symbol, and a number of
> full-width/half-width variants from Japanese standards.

Argh. This has been the most contested point of the document :-)
My take is that too many buts, ifs, and furthermores muddle the message.

> Number: Please say "code point" rather than number.

http://www.unicode.org/unicode/standard/WhatIsUnicode.html

> Every character: Unicode and ISO/IEC 10646 are coordinated standards that
> provide code points for the characters in almost all modern character set
> standards, covering more than 30 writing systems and hundreds of languages,
> including all commercially important modern languages. All characters in the
> largest Chinese, Japanese, and Korean dictionaries are also encoded. The
> standards will eventually cover almost all characters in more than 250
> writing systems and thousands of languages, but will not include proprietary
> characters, personal-use characters, and some others.

Nice chunk of text. Can I borrow? Though the 'proprietary characters'
part is a bit debatable. What is a proprietary character? Is, say,
HP's roman-8 proprietary? All its characters are in Unicode (AFAIK).

> Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
> capability for all of the writing systems defined in Unicode, even where
> appropriate fonts are available. The greatest deficits are in Armenian,
> Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
> Mongolia, Sri Lanka, Burma, and Cambodia.

Hmmm. I probably have to mention something about the display of Unicode
but I'd rather keep it short and just refer to nice URLs.

> >Since Unicode 3.1 Unicode characters have been defined all the way
> >up to 21 bits...
>
> Unicode 1.0 began as a 16-bit character set, defining code points in the
> range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
> 00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
> 2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
> more planes. This is often described as a 20.5 bit encoding. A set of
> language tag characters was defined in Plane 14. Their use is highly
> deprecated.
>
> In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
> plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
> vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.

Uhhh, that's quite an information overload for an introductory document.
Remember, this is not intended as a comprehensive retelling of the Unicode
FAQ, just the bare essentials to start learning more. But saying a bit
more about the history of Unicode is probably a good idea.

> Some mention should be made of surrogates. They do not appear in UTF-8, but
> many people are unclear on this point. They are also not characters.

In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
constantly updated) I mention surrogates, but I just point to perlunicode
(the actual reference).

> Mention should be made of the rule requiring the use of shortest-length
> UTF-8 representations. Violations of this rule constitute a security hazard
> in communications. I hope that Perl observes this rule.

Yes, we have a regression test in our test suite that uses Markus Kuhn's
appropriate tests. Perl generates only shortest-length, and non-shortest
UTF-8 will generate a warning.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
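
For anyone who wants to poke at the shortest-length rule and the
surrogate point discussed above, here is a minimal sketch. It is in
Python rather than Perl and says nothing about how Perl's own check is
implemented; it only shows what a strict UTF-8 decoder is expected to
accept. The helper name decodes_as_utf8 and the three sample byte
strings are assumptions made up for this illustration.

```python
# Illustrative only: decodes_as_utf8 is a hypothetical helper, not part
# of Perl or of any test suite mentioned in this thread.
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")  # the default error handler is "strict"
    except UnicodeDecodeError:
        return False
    return True


# U+002F '/' in its one-byte, shortest form.
shortest_slash = b"\x2f"
# The same code point padded out to two bytes (a non-shortest form).
overlong_slash = b"\xc0\xaf"
# U+D800, a surrogate code point, written out as if it were a character.
surrogate_bytes = b"\xed\xa0\x80"

for label, sample in [
    ("shortest form", shortest_slash),
    ("overlong form", overlong_slash),
    ("surrogate", surrogate_bytes),
]:
    print(label, decodes_as_utf8(sample))
```

Only the first sample should decode. The overlong '/' is the sort of
input that makes non-shortest forms a security concern: a filter that
scans raw bytes for 0x2F would not see it, while a lenient decoder
further down the line would turn it back into a slash.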