I have two points. First, this excerpt from Synopsis 5:
The :m (or :ignoremark) modifier scopes exactly like :ignorecase except
that it ignores marks (accents and such) instead of case. It is equivalent
to taking each grapheme (in both target and pattern), converting both to
NFD (maximally decomposed) and then comparing the two base characters
(Unicode non-mark characters) while ignoring any trailing mark characters.
The mark characters are ignored only for the purpose of determining the
truth of the assertion; the actual text matched includes all ignored
characters, including any that follow the final base character.
The :mm (or :samemark) variant may be used on a substitution to change the
substituted string to the same mark/accent pattern as the matched string.
Mark info is carried across on a character by character basis. If the right
string is longer than the left one, the remaining characters are
substituted without any modification. (Note that NFD/NFC distinctions are
usually immaterial, since Perl encapsulates that in grapheme mode.) Under
:sigspace the preceding rules are applied word by word. In perl5, one must
manually run two matches on all data.
First: I notice that ignoring marks (and such) and ignoring case are both
effects of the Unicode Collation Algorithm at different comparison
strengths. What about simply allowing folks to specify which of the four
(or more, I guess) levels of UCA equivalence/folding they want?
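To make the levels concrete, here is a rough perl5 sketch using the Unicode::Collate module and its default collation table. The test strings and the %eq hash are just my own illustration, not anything from the Synopsis.

```perl
use strict;
use warnings;
use Unicode::Collate;

my %eq;   # results keyed by "level:what", 1 if the two strings compare equal
for my $level (1 .. 3) {
    my $collator = Unicode::Collate->new(level => $level);
    # Differ only in marks (accents).
    $eq{"${level}:marks"} = $collator->eq("resume", "r\x{E9}sum\x{E9}") ? 1 : 0;
    # Differ only in case.
    $eq{"${level}:case"}  = $collator->eq("resume", "Resume") ? 1 : 0;
    # Canonically equivalent: precomposed vs. "e" + COMBINING ACUTE ACCENT.
    $eq{"${level}:canon"} = $collator->eq("\x{E9}", "e\x{301}") ? 1 : 0;
}

printf "%-8s %d\n", $_, $eq{$_} for sort keys %eq;
```

Level 1 ignores both marks and case, level 2 ignores only case, and level 3 ignores neither, while the canonically equivalent pair compares equal at every level. That is roughly the dial I'd like to see exposed on patterns.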
Second: I'm not altogether reassured by the parenned bit about NFD/NFC
being immaterial. That's because I've been pretty annoyed lately in perl5
with having to manually run *everything* through a double match every time,
and I can't avoid it by prenormalizing. I'm just hoping that perl6 will
handle this better.
It's usually like this:
    NFD($data) =~ $pattern
    NFC($data) =~ $pattern
Or if you know your data is NFD:
    $data =~ $pattern
    NFC($data) =~ $pattern
Or if you know your data is NFC:
    NFD($data) =~ $pattern
    $data =~ $pattern
That's because even if your data is in a known state with respect to
normalization, if your pattern admits both NFD and NFC forms, which it
would if read in from a file etc, then you have to run them both.
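Spelled out as runnable perl5, the double match might look like the sketch below. It assumes Unicode::Normalize; matches_either_form is a name I made up for illustration, not a core function.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper (not a core function): true if $re matches either
# the decomposed or the precomposed form of $data.
sub matches_either_form {
    my ($data, $re) = @_;
    return (NFD($data) =~ $re || NFC($data) =~ $re) ? 1 : 0;
}

my $precomposed = qr/caf\x{E9}/;      # pattern written with U+00E9
my $decomposed  = qr/cafe\x{301}/;    # pattern written with "e" + U+0301

# Try every combination of data form and pattern form.
for my $data ("caf\x{E9}", "cafe\x{301}") {
    for my $re ($precomposed, $decomposed) {
        printf "single match: %s, double match: %s\n",
            ($data =~ $re ? "yes" : "no"),
            (matches_either_form($data, $re) ? "yes" : "no");
    }
}
```

The single match succeeds only when data and pattern happen to be in the same form; the double match succeeds in all four combinations, at the cost of normalizing the data twice.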
For example, suppose you read a pattern whose characters are specified
indirectly/symbolically:
    $pattern = q<\xE9>; # LATIN SMALL LETTER E WITH ACUTE
or
    $pattern = q<e\x{301}>; # "e" + COMBINING ACUTE ACCENT
It would be ok if those were literal characters, because you
could just NFD the patterns and be done. But they're not. So
in order for
    $data =~ $pattern
to work properly with both, you really have to do a guaranteed
double-convert/match each time. This is rather unfortunate, to put it
mildly. What you really want is a pattern compile flag that imposes
canonical matching, and does this correctly even when faced with named
characters, etc.
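Here is a small sketch of why normalizing the pattern text doesn't help, again assuming Unicode::Normalize. The @sources list and the %hits hash are only illustrative names.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Pattern sources as they might arrive from a file: escape sequences,
# not literal accented characters. Single quotes keep the backslashes.
my @sources = ('\xE9', 'e\x{301}');

my %hits;   # "<source>/NFC" or "<source>/NFD" => 1 if matched, else 0
for my $src (@sources) {
    # The source text is pure ASCII, so normalizing it is a no-op.
    my $normalized = NFD($src);
    my $re = qr/$normalized/;
    $hits{"$src/NFC"} = NFC("\xE9") =~ $re ? 1 : 0;
    $hits{"$src/NFD"} = NFD("\xE9") =~ $re ? 1 : 0;
}

printf "%-14s %d\n", $_, $hits{$_} for sort keys %hits;
```

Each pattern matches only the data form it was spelled in, because NFD() never sees the accented character, only the backslash escape that the regex compiler expands later.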
My read of S05 suggests that this will not be an issue. I do wonder
what happens when you want to match just the combining part. Does
that fail in grapheme mode? It shouldn't: you *can* have standalones.
But then we're back to partial matches in the middle of things, which
is something that plagues us with full Unicode case-folding. This is
the
    "\N{LATIN SMALL LIGATURE FFI}" =~ /(f)(f)/i
problem, amongst others. Seems that you are going to get into the
same dilemma if you allow matching partial graphemes in grapheme mode.
Hm.
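For anyone who wants to poke at the ligature case, here is a sketch that assumes perl 5.16 or later for the fc() function. I'm deliberately not stating what the two matches print, since as far as I can tell that has varied between perl versions.

```perl
use strict;
use warnings;
use feature qw(fc);   # fc() needs perl 5.16 or later

my $ligature = "\x{FB03}";      # LATIN SMALL LIGATURE FFI
my $folded   = fc($ligature);   # full case fold: one character becomes three

printf "target length: %d, folded length: %d\n",
    length($ligature), length($folded);

# The target holds a single character, so there is no real substring
# for each capture group to claim.
my $whole = $ligature =~ /ffi/i    ? "matches" : "no match";
my $split = $ligature =~ /(f)(f)/i ? "matches" : "no match";
print "/ffi/i:    $whole\n";
print "/(f)(f)/i: $split\n";
```

The fold turns one character into "ffi", so any match that splits it across groups has to report captures that don't correspond to substrings of the target. A partial-grapheme match would face the same question.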
--tom