Tom Christiansen wrote:
Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

  # Perl 6
  / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
                                # (i.e. the accented alphabetics)

    [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
    with "Texas quotes", and because we want to reserve whitespace as the first
    character inside the angles for other uses.]

    Explicit character classes were deliberately made a little less convenient
    in Perl 6, because they're generally a bad idea in a Unicode world. For
    example, the [A-Za-z] character class in the above examples won't even
    match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
    alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
    Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!

Of course. If the author intended to match "special" (= non-ASCII) Latin letters, it should be something like:

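  use charnames ();
  binmode(STDOUT, ':utf8');   # the loop prints non-ASCII characters
  for my $codepoint ( 1 .. 0xffff ) {
    my $char = chr($codepoint);
    # letter, in the Latin script, but not plain ASCII A-Z/a-z
    if (
      $char =~ /\p{L}/
      && $char =~ /\p{Latin}/
      && $char !~ /[A-Za-z]/) {
      printf("%c %04X %s\n", $codepoint, $codepoint, charnames::viacode($codepoint));
    }
  }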

Code like /[^\P{Alpha}A-Za-z]/ matches not just things like
[...]
but also of course:

[...]

    00C6 LATIN CAPITAL LETTER AE
    00D0 LATIN CAPITAL LETTER ETH

Good examples.

Neither can be decomposed. Depending on your needs, 'LETTER AE' can be seen as a ligature. For example, current botanical Latin allows (AFAIK) 'LETTER AE' as well as 'LETTER A' + 'LETTER E'. If someone needs to match both variants, there is no way around a locale-specific transliteration.

'LATIN CAPITAL LETTER ETH' looks like an accented character (compare 0110 LATIN CAPITAL LETTER D WITH STROKE). Unicode policy does not (did not) allow (de-)composition of overlays, which covers, for example, all characters named 'WITH STROKE'. Thus ':ignoremark' and ':samemark' will be useless if someone needs similarity matching such as

  unmark('ø') =~ /o/
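
To illustrate why this fails, here is a minimal Perl 5 sketch. It assumes that 'unmark' means "decompose with NFD, then delete all combining marks"; the helper name is taken from the line above, and its implementation is my assumption, not any existing API. Characters with an overlay, such as 'ø', come back unchanged, because they have no canonical decomposition.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (an assumption for illustration):
# decompose canonically, then drop every combining mark (\p{M}).
sub unmark {
    my ($str) = @_;
    my $decomposed = NFD($str);
    $decomposed =~ s/\p{M}+//g;
    return $decomposed;
}

binmode(STDOUT, ':encoding(UTF-8)');

# U+00E9 (e with acute) decomposes to "e" + U+0301, so the mark can be removed.
# U+00F8, U+00D0 and U+00C6 have no canonical decomposition and stay as they are.
for my $char ("\x{E9}", "\x{F8}", "\x{D0}", "\x{C6}") {
    printf "U+%04X %s -> %s\n", ord($char), $char, unmark($char);
}

print "unmark(U+00F8) matches /o/: ",
    (unmark("\x{F8}") =~ /o/ ? "yes" : "no"), "\n";
```

So a match like the one above needs an extra mapping table for overlaid characters; mark stripping alone does not get there.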

[...]

It's not.  Accent is not a synonym for any of those.  Not all marks are
accents, and not all accents are marks.

I believe what is meant by "accent" is NFD($char) =~ /\pM/.  Fine: then
say "with diacritics", not "with accents".
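
A small sketch of that definition in Perl 5, assuming "with diacritics" means exactly NFD($char) =~ /\pM/ as stated above. The helper name 'has_diacritic' is hypothetical. Run over the letters in U+00C0..U+00FF, it lists those that carry no mark under this definition, which includes U+00C6 and U+00D0.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true if the canonical decomposition (NFD)
# of $char contains at least one combining mark.
sub has_diacritic {
    my ($char) = @_;
    return NFD($char) =~ /\p{M}/ ? 1 : 0;
}

# Letters in U+00C0..U+00FF that have no mark under this definition.
my @unmarked = grep { chr($_) =~ /\p{L}/ && !has_diacritic(chr($_)) }
               0xC0 .. 0xFF;

printf "U+%04X\n", $_ for @unmarked;
```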

Agreed. Everything related to Unicode should use Unicode terms, at least in definitions. And where a Unicode term is used, it should mean exactly what the Unicode standard specifies. E.g. it would be a mistake to define graphemes by '\X' or '(?>\PM\pM*)', as Unicode provides the properties 'Grapheme_Base' and 'Grapheme_Extend' (unfortunately they are not supported by Perl 5 or Perl 6).

Helmut Wollmersdorfer
