Tom Christiansen wrote:

Certainly it's perfectly well known amongst people who deal with
letters--including with the Unicode standard.

"Accent" does have a colloquial meaning that maps correctly,
but sadly that colloquial definition does not correspond to
the technical definition, so in being clear, you become less
accurate. There is, as far as I'm aware, no good middle
ground, here.

One doesn't *have* to make up play-words.  There's nothing wrong with the
correct terminology.  Calling a mark a mark is pretty darned simple.

Well, scientists are not always happy with Unicode terms, e.g. 'ideograph'
for Han characters, or 'Latin' for Roman scripts. But the terms should be
used as defined by the standard--as names/identifiers of properties.

Unicode has blocks for diacritic marks, and a Diacritic property for
testing whether something is one.  There are 1328 code points whose
canonical decompositions have both \p{Diacritic} and \pM in them,
946 code points that have only \pM but not \p{Diacritic}, and 197
that have \p{Diacritic} but not \pM.
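
A minimal Perl 5 sketch of that kind of tally (the exact counts depend on
the Unicode version your perl carries, so they may not match the figures
above):

    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD);

    # Classify every code point by whether its canonical decomposition
    # contains Mark (\pM) and/or Diacritic (\p{Diacritic}) characters.
    my %tally;
    for my $cp (0 .. 0x10FFFF) {
        next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
        my $nfd = NFD(chr $cp);
        my $mark      = $nfd =~ /\pM/           ? 1 : 0;
        my $diacritic = $nfd =~ /\p{Diacritic}/ ? 1 : 0;
        $tally{"mark=$mark diacritic=$diacritic"}++;
    }
    printf "%-22s %d\n", $_, $tally{$_} for sort keys %tally;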

If someone really uses Unicode, there is no way around deep knowledge of
the properties. Such code will use Unicode properties directly, and Perl 6
should therefore support all the properties; a sketch of what that direct
access looks like follows below.
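
As a point of reference, newer Perl 5 releases expose property lookup by
name through Unicode::UCD's charprop; a small sketch (property names are
the standard Unicode ones):

    use strict;
    use warnings;
    use Unicode::UCD qw(charprop);

    # Look up arbitrary properties by name for U+0301 COMBINING ACUTE ACCENT.
    for my $prop (qw(Name General_Category Script Diacritic)) {
        printf "%-18s %s\n", $prop, charprop(0x0301, $prop);
    }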

I still think resorting to talking about "accent marks" is a bad idea. I had somebody the other day thinking that "throwing out the accent marks"
meant deleting all characters whose code points were over 0x7F--and this
was a recent CompSci major, too.
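
The difference matters in code: deleting everything above 0x7F throws away
the base letters, while stripping marks means decomposing first and then
removing \pM (or only \p{Diacritic}, depending on what you are after). A
rough sketch:

    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw(NFD);

    my $word = "façade naïve";

    # The misreading: delete every code point above 0x7F -- the base
    # letters vanish along with their marks.
    (my $ascii_only = $word) =~ s/[^\x00-\x7F]//g;     # "faade nave"

    # Stripping marks: decompose canonically, then drop the \pM characters.
    (my $no_marks = NFD($word)) =~ s/\pM//g;           # "facade naive"

    print "$ascii_only\n$no_marks\n";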

I know this sort of person. They also believe that UTF-8 is a 2-byte encoding.

But that's nothing.  The more you look into it, the weirder it can get,
especially with collation and canonical equivalence, both of which really
require locale knowledge outside the charset itself.
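
Collation in particular cannot be read off the code points alone: the same
strings sort differently under different locale tailorings. A small sketch
with Unicode::Collate::Locale (locale names as shipped with that module):

    use strict;
    use warnings;
    use utf8;
    use open ':std', ':encoding(UTF-8)';
    use Unicode::Collate::Locale;

    my @words = qw(ad äb af);

    # German phonebook order treats a-umlaut like "ae"; Swedish sorts it
    # as a separate letter after "z".
    my $de = Unicode::Collate::Locale->new(locale => 'de__phonebook');
    my $sv = Unicode::Collate::Locale->new(locale => 'sv');

    print "de__phonebook: @{[ $de->sort(@words) ]}\n";   # ad äb af
    print "sv:            @{[ $sv->sort(@words) ]}\n";   # ad af äb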

Sure. The Perl 6 specs still need a lot of work on the Unicode part.

Helmut Wollmersdorfer
