[edward@webforhumans.com: RE: perlunitut - feedback appreciated]

Jarkko Hietaniemi Mon, 12 Nov 2001 05:54:50 -0800

----- Forwarded message from Edward Cherlin <[EMAIL PROTECTED]> -----


Subject: RE: perlunitut - feedback appreciated
From: Edward Cherlin <[EMAIL PROTECTED]>
Date: Sun, 11 Nov 2001 23:54:17 -0800
Message-id: <004701c16b4f$3b7caf20$1e00a8c0@mcp>
To: "'Jarkko Hietaniemi'" <[EMAIL PROTECTED]>
Cc: [EMAIL PROTECTED]
In-reply-to: <[EMAIL PROTECTED]>
Importance: Normal

I am unable to post to [EMAIL PROTECTED] Please forward.

> -----Original Message-----
> From: Jarkko Hietaniemi [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, November 11, 2001 1:24 PM
> To: Edward Cherlin
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]>
>
> On Sun, Nov 11, 2001 at 12:57:27PM -0800, Edward Cherlin wrote:
> > Thanks. The Perl implementors and you have done a very good job. I have a
> > few suggestions and one complaint.
> >
> > The most important issue is chr().
> >
> > >Note that C<chr(...)> for arguments less than 0x100 (decimal 256) will
> > >return an eight-bit character for backward compatibility with older
> > >Perls (in ISO 8859-1 platforms it can be argued to be producing
> > >Unicode even then, just not Unicode encoded in UTF-8 -- the ISO 8859-1
> > >is equivalent to the first 256 characters of Unicode).  For C<chr()>
> > >arguments of 0x100 or more, Unicode will always be produced.
> >
> > My complaint: There should be a pure Unicode alternative to this kludge.
>
> You mean chr() producing UTF-8?  There has been talk about uchr() or
> the like.  Maybe I'll just implement it in some module.

Good. Thanks.
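
A minimal sketch of the chr() behaviour under discussion, assuming Perl 5.8
or later. The name uchr() is hypothetical here (it is not a Perl builtin);
the helper only shows one way such a "pure Unicode" variant might be written
on top of chr() and utf8::upgrade().

```perl
use strict;
use warnings;

# Hypothetical helper: uchr() is NOT a Perl builtin. It returns the same
# character as chr(), but always stored internally as UTF-8.
sub uchr {
    my ($code) = @_;
    my $char = chr($code);
    utf8::upgrade($char);    # switch the internal representation to UTF-8
    return $char;
}

my $legacy = chr(0xE9);     # below 0x100: eight-bit character
my $wide   = chr(0x100);    # 0x100 or more: always Unicode
my $pure   = uchr(0xE9);    # same character as $legacy, UTF-8 internally

printf "chr(0xE9)  utf8 flag: %s\n", utf8::is_utf8($legacy) ? "on" : "off";
printf "chr(0x100) utf8 flag: %s\n", utf8::is_utf8($wide)   ? "on" : "off";
printf "uchr(0xE9) utf8 flag: %s\n", utf8::is_utf8($pure)   ? "on" : "off";

# The two strings still compare equal: only the internal storage differs.
printf "same character: %s\n", $legacy eq $pure ? "yes" : "no";
```

The comparison at the end is the point of the sketch: the eight-bit result
and the upgraded one are the same character, stored differently.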


> > >Character ranges in regular expression character classes [a-z]
> > >and in the tr///, aka y///, operator are not affected by Unicode.
> >
> > This could mean that they extend gracefully to Unicode, for example
> > something like [\x{0300}-\x{03FF}], or that they cannot be used outside the
> > 00-FF range (or would it be 00-7F?). Clarification is needed.
>
> Hmmm.  They extend but they may not do what people are expecting them
> to do: [a-z] will most certainly not mean "alphabetic characters".

Definitely. They will have to include characters in Latin 1, Latin Extended
A, Latin Extended B, at least.
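
A short sketch of the range behaviour described above, assuming Perl 5.8 or
later. The sample characters are my own choice for illustration: ranges
written with \x{...} do extend past 0xFF, [a-z] stays the 26 ASCII letters,
and a Unicode property such as \p{L} is what actually means "any letter".

```perl
use strict;
use warnings;

my $greek_alpha = "\x{03B1}";    # GREEK SMALL LETTER ALPHA
my $a_macron    = "\x{0101}";    # LATIN SMALL LETTER A WITH MACRON (Latin Extended-A)

# A character-class range written with \x{...} escapes works above 0xFF.
my $in_range = ($greek_alpha =~ /^[\x{0300}-\x{03FF}]$/) ? 1 : 0;

# tr/// accepts the same kind of range; with an empty replacement list it
# only counts the matching characters.
my $tr_count = ($greek_alpha =~ tr/\x{0370}-\x{03FF}//);

# [a-z] does not grow to cover accented or non-ASCII letters.
my $in_a_to_z = ($a_macron =~ /^[a-z]$/) ? 1 : 0;

# The Unicode "Letter" property matches it.
my $is_letter = ($a_macron =~ /^\p{L}$/) ? 1 : 0;

print "alpha in [\\x{0300}-\\x{03FF}]: $in_range\n";
print "alpha counted by tr:          $tr_count\n";
print "a-macron in [a-z]:            $in_a_to_z\n";
print "a-macron matches \\p{L}:       $is_letter\n";
```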

> > >Unicode is a standard that defines a unique number for every character.

Just say: Unicode is a character set standard with plans to cover all of the
writing systems of the world, plus many other symbols.

> > Unique: Some characters are encoded in Unicode twice. Examples include
> > A-ring, also encoded as the Angstrom symbol, and a number of
> > full-width/half-width variants from Japanese standards.
>
> Argh.  This has been the most contested point of the document :-)
> My take is that too many buts, ifs, and furthermores muddle the
> message.
>
> > Number: Please say "code point" rather than number.
>
> http://www.unicode.org/unicode/standard/WhatIsUnicode.html
>
> > Every character: Unicode and ISO/IEC 10646 are coordinated standards that
> > provide code points for the characters in almost all modern character set
> > standards, covering more than 30 writing systems and hundreds of languages,
> > including all commercially important modern languages. All characters in the
> > largest Chinese, Japanese, and Korean dictionaries are also encoded. The
> > standards will eventually cover almost all characters in more than 250
> > writing systems and thousands of languages, but will not include proprietary
> > characters, personal-use characters, and some others.
>
> Nice chunk of text.  Can I borrow?

Certainly.
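
On the "encoded twice" point raised earlier in this message, a small sketch
using the core Unicode::Normalize module (assuming Perl 5.8 or later). It
only illustrates the examples named above: A-ring and the Angstrom sign are
distinct code points that normalize to the same string, while a full-width
letter is only compatibility-equivalent to its ASCII counterpart.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFKC);

my $a_ring      = "\x{00C5}";    # LATIN CAPITAL LETTER A WITH RING ABOVE
my $angstrom    = "\x{212B}";    # ANGSTROM SIGN
my $fullwidth_a = "\x{FF21}";    # FULLWIDTH LATIN CAPITAL LETTER A

# As raw code points the two "A-ring" characters differ.
my $raw_equal = ($a_ring eq $angstrom) ? 1 : 0;

# After canonical normalization (NFC) they are the same string.
my $nfc_equal = (NFC($a_ring) eq NFC($angstrom)) ? 1 : 0;

# Full-width forms survive NFC and are only folded by NFKC.
my $nfc_folds_fullwidth  = (NFC($fullwidth_a)  eq "A") ? 1 : 0;
my $nfkc_folds_fullwidth = (NFKC($fullwidth_a) eq "A") ? 1 : 0;

print "raw equal:             $raw_equal\n";
print "equal after NFC:       $nfc_equal\n";
print "full-width A via NFC:  $nfc_folds_fullwidth\n";
print "full-width A via NFKC: $nfkc_folds_fullwidth\n";
```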


> Though the 'proprietary characters'
> part is a bit debatable.  What is a proprietary character?  Is, say,
> HP's roman-8 proprietary?  All its characters are in the Unicode (AFAIK).

The Apple Open-Apple character is proprietary. Roman-8 is just an
arrangement of pre-existing characters.


> > Note that no platform today (Java, Unix, Mac, Windoze) includes rendering
> > capability for all of the writing systems defined in Unicode, even where
> > appropriate fonts are available. The greatest deficits are in Armenian,
> > Georgian, Ethiopic, and writing systems of Asia, including India, Tibet,
> > Mongolia, Sri Lanka, Burma, and Cambodia.
>
> Hmmm.  I probably have to mention something about the display of
> Unicode but I'd rather keep it short and just refer to nice URLs.

I don't know of one. Maybe I should do that.


> > >Since Unicode 3.1 Unicode characters have been defined all the way
> > >up to 21 bits...

Just say: Since Unicode 2.0, Unicode characters have been defined up to 21
bits.

> > Unicode 1.0 began as a 16-bit character set, defining code points in the
> > range 0000-FFFF. ISO/IEC 10646 defines its corresponding region
> > 00000000-0000FFFF as the Basic Multilingual Plane (Plane 0). Since Unicode
> > 2.0, the Unicode code space has been defined to be 000000-10FFFF, adding 16
> > more planes. This is often described as a 20.5 bit encoding. A set of
> > language tag characters was defined in Plane 14. Their use is highly
> > deprecated.
> >
> > In Unicode 3.1 characters were defined in Planes 1 and 2, and there are
> > plans for Plane 3, at least, to be populated in Unicode 4.0. ISO plans to
> > vote soon to restrict 10646 to the corresponding range, 00000000-0010FFFF.
>
> Uhhh, that's quite an information overload for an introductory
> document.  Remember, this is not intended as comprehensive retelling
> of the Unicode FAQ, just the bare essential to start learning more.
> But saying a bit more about the history of Unicode is probably a good
> idea.
>
> > Some mention should be made of surrogates. They do not appear in UTF-8, but
> > many people are unclear on this point. They are also not characters.
>
> In the latest version (the http://www.iki.fi/jhi/perlunitut.pod is
> constantly updated) I mention surrogates, but I just point to
> perlunicode (the actual reference).
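
The code-space and surrogate arithmetic above can be sketched as follows,
assuming Perl 5.8 or later with the core Encode module. The helper name
surrogate_pair() and the sample code point (U+1D11E, a Plane 1 character)
are my own choices for illustration. The sketch shows that surrogates are
purely a UTF-16 device: the same code point in UTF-8 is a plain four-byte
sequence with no surrogates in it.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $max_code_point = 0x10FFFF;
my $planes         = ($max_code_point + 1) / 0x10000;             # 17
my $bits_needed    = length(sprintf("%b", $max_code_point));      # 21

# Hypothetical helper: split a supplementary code point into the UTF-16
# high and low surrogate code units.
sub surrogate_pair {
    my ($cp) = @_;
    die "not a supplementary code point\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $offset = $cp - 0x10000;                  # 20-bit value
    return (0xD800 + ($offset >> 10),            # top 10 bits
            0xDC00 + ($offset & 0x3FF));         # bottom 10 bits
}

my $clef = 0x1D11E;                              # a Plane 1 code point
my ($high, $low) = surrogate_pair($clef);

# UTF-8 encodes the code point directly, without surrogates.
my $utf8_hex = uc unpack("H*", encode('UTF-8', chr($clef)));

printf "planes: %d, bits needed: %d\n", $planes, $bits_needed;
printf "UTF-16: %04X %04X\n", $high, $low;
print  "UTF-8:  $utf8_hex\n";
```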


> > Mention should be made of the rule requiring the use of shortest-length
> > UTF-8 representations. Violations of this rule constitute a security hazard
> > in communications. I hope that Perl observes this rule.
>
> Yes, we have a regression test in our test suite that uses Markus
> Kuhn's appropriate tests.  Perl generates only shortest-length, and
> non-shortest UTF-8 will generate a warning.

Excellent.
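
For illustration, a minimal sketch of the shortest-length rule using the core
Encode module (assuming Perl 5.8 or later; this is not the regression test
mentioned above). The bytes C0 AF are the classic two-byte non-shortest form
of "/", and a strict UTF-8 decoder is expected to refuse them.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $shortest = encode('UTF-8', "/");    # the only valid form: one byte, 0x2F
my $overlong = "\xC0\xAF";              # two-byte non-shortest form of "/"

# FB_CROAK makes the strict 'UTF-8' decoder die on malformed input.
# decode() may modify its input when a CHECK value is given, so use a copy.
my $copy = $overlong;
my $accepted = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

printf "shortest form: %s\n", uc unpack("H*", $shortest);
printf "overlong form accepted: %s\n", $accepted ? "yes" : "no";
```

Rejecting the overlong form matters because a filter that scans bytes for
"/" would never see 0x2F in C0 AF, while a lenient decoder would still turn
those bytes into a slash afterwards.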


> --
> $jhi++; # http://www.iki.fi/jhi/
>         # There is this special biologist word we use for 'stable'.
>         # It is 'dead'. -- Jack Cohen

----- End forwarded message -----

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
