----- Forwarded message from Edward Cherlin <[EMAIL PROTECTED]> -----
Subject: RE: perlunitut - feedback appreciated
From: Edward Cherlin <[EMAIL PROTECTED]>
Date: Sun, 11 Nov 2001 23:54:17 -0800
Message-id: <004701c16b4f$3b7caf20$1e00a8c0@mcp>
To: "'Jarkko Hietaniemi'" <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
In-reply-to: <[EMAIL PROTECTED]>
Importance: Normal

I am unable to post to [EMAIL PROTECTED] Please forward.

> -----Original Message-----
> From: Jarkko Hietaniemi [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, November 11, 2001 1:24 PM
> To: Edward Cherlin
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
>
> On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
> > Thanks. The Perl implementors and you have done a very good job. I have a
> > few suggestions and one complaint.
> >
> > The most important issue is chr().
> >
> > >Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
> > >return an eight-bit character for backward compatibility with older
> > >Perls (in ISO 8859-1 platforms it can be argued to be producing
> > >Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
> > >is equivalent to the first 256 characters of Unicode). For C<chr()>
> > >arguments of 0x100 or more, Unicode will always be produced.
> >
> > My complaint: There should be a pure Unicode alternative to this kludge.
>
> You mean chr() producing UTF-8? There has been talk about uchr() or
> the like. Maybe I'll just implement it in some module.

Good. Thanks.

> > >Character ranges in regular expression character classes [a-z]
> > >and in the tr///, aka y///, operator are not affected by Unicode.
> >
> > This could mean that they extend gracefully to Unicode, for example
> > something like [\x{0300}-\x{03FF}], or that they cannot be used outside
> > the 00-FF range (or would it be 00-7F?). Clarification is needed.
>
> Hmmm. They extend but they may not do what people are expecting them
> to do: [a-z] will most certainly not mean "alphabetic characters".

Definitely.
They will have to include characters in Latin 1, Latin Extended A, Latin
Extended B, at least.

> > >Unicode is a standard that defines a unique number for every character.

Just say: Unicode is a character set standard with plans to cover all of
the writing systems of the world, plus many other symbols.

> > Unique: Some characters are encoded in Unicode twice. Examples include
> > A-ring, also encoded as the Angstrom symbol, and a number of
> > full-width/half-width variants from Japanese standards.
>
> Argh. This has been the most contested point of the document :-)
> My take is that too many buts, ifs, and furthermores muddle the
> message.
>
> > Number: Please say "code point" rather than number.
>
> http://www.unicode.org/unicode/standard/WhatIsUnicode.html
>
> > Every character: Unicode and ISO/IEC 10646 are coordinated standards that
> > provide code points for the characters in almost all modern character set
> > standards, covering more than 30 writing systems and hundreds of languages,
> > including all commercially important modern languages. All characters in
> > the largest Chinese, Japanese, and Korean dictionaries are also encoded.
> > The standards will eventually cover almost all characters in more than 250
> > writing systems and thousands of languages, but will not include
> > proprietary characters, personal-use characters, and some others.
>
> Nice chunk of text. Can I borrow?

Certainly.

> Though the 'proprietary characters'
> part is a bit debatable. What is a proprietary character? Is, say,
> HP's roman-8 proprietary? All its characters are in the
> Unicode (AFAIK).

The Apple Open-Apple character is proprietary. Roman-8 is just an
arrangement of pre-existing characters.

> > Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
> > capability for all of the writing systems defined in Unicode, even where
> > appropriate fonts are available.
> > The greatest deficits are in Armenian, Georgian, Ethiopic, and writing
> > systems of Asia, including India, Tibet, Mongolia, Sri Lanka, Burma,
> > and Cambodia.
>
> Hmmm. I probably have to mention something about the display of
> Unicode but I'd rather keep it short and just refer to nice URLs.

I don't know of one. Maybe I should do that.

> > >Since Unicode 3.1 Unicode characters have been defined all the way
> > >up to 21 bits...

Just say: Since Unicode 2.0, Unicode characters have been defined up to
21 bits.

> > Unicode 1.0 began as a 16-bit character set, defining code points in the
> > range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
> > 00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
> > 2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
> > more planes. This is often described as a 20.5 bit encoding. A set of
> > language tag characters was defined in Plane 14. Their use is highly
> > deprecated.
> >
> > In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
> > plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
> > vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.
>
> Uhhh, that's quite an information overload for an introductory
> document. Remember, this is not intended as comprehensive retelling
> of the Unicode FAQ, just the bare essential to start learning more.
> But saying a bit more about the history of Unicode is probably a good
> idea.
>
> > Some mention should be made of surrogates. They do not appear in UTF-8,
> > but many people are unclear on this point. They are also not characters.
>
> In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
> constantly updated) I mention surrogates, but I just point to
> perlunicode (the actual reference).
>
> > Mention should be made of the rule requiring the use of shortest-length
> > UTF-8 representations.
> > Violations of this rule constitute a security hazard in communications.
> > I hope that Perl observes this rule.
>
> Yes, we have a regression test in our test suite that uses Markus
> Kuhn's appropriate tests. Perl generates only shortest-length, and
> non-shortest UTF-8 will generate a warning.

Excellent.

> --
> $jhi++; # http://www.iki.fi/jhi/
>         # There is this special biologist word we use for 'stable'.
>         # It is 'dead'. -- Jack Cohen

----- End forwarded message -----

--
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
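
For readers who want to see the shortest-length rule from the thread in action, here is a minimal sketch. It is written in Python purely for illustration and says nothing about how Perl implements its own check. The byte strings and the helper names `decode_strict` and `is_well_formed_utf8` are assumptions made up for this example. The hazard is that a filter scanning raw bytes for "/" (0x2F) would not notice the two-byte overlong form 0xC0 0xAF, so a lenient decoder that later mapped it to "/" would let it slip through. A strict decoder rejects such input, and it also rejects UTF-8-encoded surrogates.

```python
# Illustrative sketch only: Python's strict UTF-8 decoder, not Perl's code.
# "/" is U+002F; its shortest UTF-8 form is the single byte 0x2F.
SHORTEST = b"\x2f"
# A two-byte "overlong" form of the same code point, forbidden by the rule.
OVERLONG = b"\xc0\xaf"
# U+D800 (a surrogate) encoded as if it were a character; also ill-formed.
SURROGATE = b"\xed\xa0\x80"


def decode_strict(data: bytes) -> str:
    """Decode UTF-8, raising UnicodeDecodeError on any ill-formed input."""
    return data.decode("utf-8", errors="strict")


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data decodes as well-formed UTF-8."""
    try:
        decode_strict(data)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for name, sample in [("shortest", SHORTEST),
                         ("overlong", OVERLONG),
                         ("surrogate", SURROGATE)]:
        print(name, is_well_formed_utf8(sample))
```

Rejecting the input outright, rather than quietly normalising it, means no later stage can see a character that an earlier byte-level check missed.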