Thanks. The Perl implementors and you have done a very good job. I have a few suggestions and one complaint.
The most important issue is chr().

>Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
>return an eight-bit character for backward compatibility with older
>Perls (in ISO 8859-1 platforms it can be argued to be producing
>Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
>is equivalent to the first 256 characters of Unicode). For C<chr()>
>arguments of 0x100 or more, Unicode will always be produced.

My complaint: There should be a pure Unicode alternative to this kludge. Obviously, it is not hard to write one in Perl, but it should be part of the implementation.

ISO Latin-1 characters encoded as 80-FF in single bytes are not Unicode. There is no Unicode transformation format or other encoding that permits this. The code point range is actually x000080-x0000FF, and the encodings of its first and last code points are

First (U+0080)                     Last (U+00FF)                      Encoding
0000000010000000                   0000000011111111                   UTF-16 Big Endian
1000000000000000                   1111111100000000                   UTF-16 Little Endian
00000000000000000000000010000000   00000000000000000000000011111111   UCS-4 BE
10000000000000000000000000000000   11111111000000000000000000000000   UCS-4 LE
1100001010000000                   1100001110111111                   UTF-8

>Character ranges in regular expression character classes [a-z]
>and in the tr///, aka y///, operator are not affected by Unicode.

This could mean that they extend gracefully to Unicode, for example something like [\x{0300}-\x{03FF}], or that they cannot be used outside the 00-FF range (or would it be 00-7F?). Clarification is needed.

>Unicode is a standard that defines a unique number for every character.

Unique: Some characters are encoded in Unicode twice. Examples include A-ring, which is also encoded as the Angstrom sign, and a number of full-width/half-width variants from Japanese standards.

Number: Please say "code point" rather than "number".
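Two of the points above can be checked mechanically: the encoded forms of U+0080 and U+00FF, and the A-ring/Angstrom duplication. Here is a minimal sketch in Python rather than Perl, purely for illustration. The helper names `encoded_forms` and `is_valid_utf8` are my own, and I am assuming Python's `utf-32` codecs as a stand-in for UCS-4, which gives the same bytes for these code points.

```python
import unicodedata

# Encoding forms from the table, keyed by Python codec name.
# Assumption: the utf-32 codecs stand in for UCS-4 here.
CODECS = {
    "UTF-16 BE": "utf-16-be",
    "UTF-16 LE": "utf-16-le",
    "UCS-4 BE": "utf-32-be",
    "UCS-4 LE": "utf-32-le",
    "UTF-8": "utf-8",
}


def encoded_forms(code_point):
    """Return {label: hex string} for one code point in each encoding form."""
    ch = chr(code_point)
    return {label: ch.encode(codec).hex() for label, codec in CODECS.items()}


def is_valid_utf8(data):
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for cp in (0x80, 0xFF):
        print(f"U+{cp:04X}", encoded_forms(cp))

    # A lone Latin-1 byte in the 80-FF range is not well-formed UTF-8.
    print(is_valid_utf8(b"\xe5"))

    # ANGSTROM SIGN (U+212B) and A WITH RING ABOVE (U+00C5) are
    # canonically equivalent, so NFC maps the former to the latter.
    print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")
```

None of the five encoding forms produces the single byte 80 or FF on its own, which is the substance of the complaint.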
Every character: Unicode and ISO/IEC 10646 are coordinated standards that provide code points for the characters in almost all modern character set standards, covering more than 30 writing systems and hundreds of languages, including all commercially important modern languages. All characters in the largest Chinese, Japanese, and Korean dictionaries are also encoded. The standards will eventually cover almost all characters in more than 250 writing systems and thousands of languages, but will not include proprietary characters, personal-use characters, and some others.

Note that no platform today (Java, Unix, Mac, Windoze) includes rendering capability for all of the writing systems defined in Unicode, even where appropriate fonts are available. The greatest deficits are in Armenian, Georgian, Ethiopic, and writing systems of Asia, including those of India, Tibet, Mongolia, Sri Lanka, Burma, and Cambodia.

>Since Unicode 3.1 Unicode characters have been defined all the way
>up to 21 bits...

Unicode 1.0 began as a 16-bit character set, defining code points in the range 0000-FFFF. ISO/IEC 10646 defines its corresponding region 00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode 2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16 more planes. This is often described as a 20.5 bit encoding. A set of language tag characters was defined in Plane 14. Their use is strongly discouraged. In Unicode 3.1 characters were defined in Planes 1 and 2, and there are plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.

Some mention should be made of surrogates. Surrogate code points (D800-DFFF) do not appear in UTF-8, but many people are unclear on this point. They are also not characters.

Mention should be made of the rule requiring the use of shortest-length UTF-8 representations. Violations of this rule constitute a security hazard in communications: a decoder that accepts an overlong form can let a character such as "/" slip past a filter that checks only for its shortest form.
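To make the shortest-form rule concrete, here is a minimal sketch in Python, chosen only for illustration. The names `shortest_utf8_length`, `decode_lenient`, and `is_overlong` are hypothetical helpers of mine, not part of any library. `decode_lenient` deliberately imitates a naive decoder that trusts the bit patterns and performs no range checks, which is the behaviour the rule is meant to forbid.

```python
def shortest_utf8_length(code_point):
    """Number of bytes in the shortest UTF-8 form of a code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def decode_lenient(seq):
    """Decode ONE UTF-8-style byte sequence by bit pattern alone.

    This is what a careless decoder does: it accepts overlong forms.
    """
    first = seq[0]
    if first < 0x80:
        value, length = first, 1
    elif first >> 5 == 0b110:
        value, length = first & 0x1F, 2
    elif first >> 4 == 0b1110:
        value, length = first & 0x0F, 3
    elif first >> 3 == 0b11110:
        value, length = first & 0x07, 4
    else:
        raise ValueError("not a valid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("not a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


def is_overlong(seq):
    """True if seq uses more bytes than the shortest form requires."""
    return len(seq) > shortest_utf8_length(decode_lenient(seq))


if __name__ == "__main__":
    # C0 AF is a two-byte spelling of "/" (U+002F); the shortest form is 2F.
    print(hex(decode_lenient(b"\xc0\xaf")), is_overlong(b"\xc0\xaf"))
    print(is_overlong(b"/"), is_overlong(b"\xc2\x80"))
```

A conforming decoder should reject C0 AF outright instead of returning "/"; Python's own strict `utf-8` codec does so.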
I hope that Perl observes this rule.

> -----Original Message-----
> From: Jarkko Hietaniemi [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, November 10, 2001 10:54 AM
> To: [EMAIL PROTECTED]
> Cc: Markus Kuhn; [EMAIL PROTECTED]
> Subject: perlunitut - feedback appreciated
>
>
> For the upcoming Perl 5.8.0 release I just recently wrote the
> following little introductory text. Any feedback appreciated.
>
> http://www.iki.fi/jhi/perlunitut.pod
>
> --
> $jhi++; # http://www.iki.fi/jhi/
> # There is this special biologist word we use for 'stable'.
> # It is 'dead'. -- Jack Cohen