Tom Christiansen wrote:
Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:
# Perl 6
/ < <alpha> - [A-Za-z] >+ / # All alphabetics except A-Z or a-z
# (i.e. the accented alphabetics)
[Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
with "Texas quotes", and because we want to reserve whitespace as the first
character inside the angles for other uses.]
Explicit character classes were deliberately made a little less convenient
in Perl 6, because they're generally a bad idea in a Unicode world. For
example, the [A-Za-z] character class in the above examples won't even
match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
Cherokee, or Klingon.
First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!
Of course. If the author intended to match "special" (= non-ASCII) Latin
letters, it should be something like
    use charnames ();
    binmode(STDOUT, ':encoding(UTF-8)');    # %c prints wide characters
    for my $codepoint ( 1 .. 0xffff ) {
        my $char = chr($codepoint);
        if (
               $char =~ /\p{L}/
            && $char =~ /\p{Latin}/
            && $char !~ /[A-Za-z]/) {
            printf("%c %04X %s\n", $codepoint, $codepoint, charnames::viacode($codepoint));
        }
    }
Code like /[^\P{Alpha}A-Za-z]/ matches not just things like
[...]
but also of course:
[...]
00C6 LATIN CAPITAL LETTER AE
00D0 LATIN CAPITAL LETTER ETH
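A quick way to check what that class accepts is to run it over a few code points. This is only a sketch: the sample code points and the variable name $non_ascii_alpha are picked here for illustration and are not from the original post.

```perl
use strict;
use warnings;
use charnames ();

# Negated class: excludes non-alphabetics, A-Z and a-z,
# so it accepts alphabetic characters outside the ASCII letters.
my $non_ascii_alpha = qr/[^\P{Alpha}A-Za-z]/;

# Sample code points chosen for illustration only.
for my $codepoint (0x41, 0xC6, 0xD0, 0xE9, 0x436) {
    my $verdict = chr($codepoint) =~ $non_ascii_alpha ? 'matches' : 'no match';
    printf "%04X %-40s %s\n",
        $codepoint, charnames::viacode($codepoint), $verdict;
}
```

The class accepts unaccented letters such as AE and ETH, and it is not limited to the Latin script either.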
Good examples.
Neither can be decomposed. Depending on your needs, 'LETTER AE' can be
seen as a ligature. For example, current botanical Latin allows (AFAIK)
'LETTER AE' but also 'LETTER A' + 'LETTER E'. If someone needs to match
both variants, there is no way around a locale-specific transliteration.
'LATIN CAPITAL LETTER ETH' looks like an accented character (compare 0110 LATIN
CAPITAL LETTER D WITH STROKE). Unicode policy does not (did not) allow
(de-)composition of overlays, which is the case, for example, for all
characters 'WITH STROKE'. Thus ':ignoremark' and ':samemark' will be
useless if someone needs similarity matching such as
unmark('ø') =~ /o/
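To illustrate, here is a minimal Perl 5 sketch of what such an unmark() could look like if it were built on canonical decomposition. The function is hypothetical (it is not a built-in of Perl 5 or Perl 6), and the approach via Unicode::Normalize is an assumption made for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper, not a Perl 5 or Perl 6 built-in:
# decompose canonically, then delete every combining mark.
sub unmark {
    my ($string) = @_;
    my $decomposed = NFD($string);
    $decomposed =~ s/\p{M}+//g;
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');
# U+00E9 e acute, U+00F8 o stroke, U+00D0 ETH, U+00C6 AE
for my $codepoint (0xE9, 0xF8, 0xD0, 0xC6) {
    my $char     = chr($codepoint);
    my $stripped = unmark($char);
    printf "U+%04X %s -> %s (%s)\n", $codepoint, $char, $stripped,
        $stripped eq $char ? 'unchanged' : 'mark removed';
}
```

Only U+00E9 loses its mark. The other three have no canonical decomposition, so a mark-based unmark('ø') never yields 'o'.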
[...]
It's not. Accent is not a synonym for any of those. Not all marks are
accents, and not all accents are marks.
I believe what is meant by "accent" is NFD($char) =~ /\pM/. Fine: then
say "with diacritics", not "with accents".
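The test above can be sketched in Perl 5 as follows. The helper name has_mark() and the sample strings are assumptions made for this illustration. The Devanagari sample is there to show a mark that is not an accent.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if the canonical decomposition contains
# at least one character of general category M (Mark).
sub has_mark {
    my ($string) = @_;
    return NFD($string) =~ /\p{M}/ ? 1 : 0;
}

# Samples chosen for illustration (not from the original post).
my @samples = (
    [ "\x{E9}",         'e with acute: an accent encoded as a mark' ],
    [ "\x{F8}",         'o with stroke: looks accented, has no mark' ],
    [ "\x{915}\x{93F}", 'Devanagari KA + vowel sign I: a mark, not an accent' ],
);
printf "%d  %s\n", has_mark($_->[0]), $_->[1] for @samples;
```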
Agreed. Everything related to Unicode should use Unicode terms, at least
in the definition. And if a Unicode term is used, it should mean exactly
what is specified in the Unicode standard. E.g. it would be a mistake if
graphemes were defined by '\X' or '(?>\PM\pM*)', as Unicode provides the
properties 'Grapheme_Base' and 'Grapheme_Extend' (unfortunately they are
not supported by Perl 5 or Perl 6).
Helmut Wollmersdorfer