UCA and NFC/NFD issues in pattern matching

Tom Christiansen Wed, 23 Feb 2011 18:52:17 -0800

I have two points about this excerpt from Synopsis 5:

    The :m (or :ignoremark) modifier scopes exactly like :ignorecase except
    that it ignores marks (accents and such) instead of case. It is equivalent
    to taking each grapheme (in both target and pattern), converting both to
    NFD (maximally decomposed) and then comparing the two base characters
    (Unicode non-mark characters) while ignoring any trailing mark characters.
    The mark characters are ignored only for the purpose of determining the
    truth of the assertion; the actual text matched includes all ignored
    characters, including any that follow the final base character.


    The :mm (or :samemark) variant may be used on a substitution to change the
    substituted string to the same mark/accent pattern as the matched string.
    Mark info is carried across on a character by character basis. If the right
    string is longer than the left one, the remaining characters are
    substituted without any modification. (Note that NFD/NFC distinctions are
    usually immaterial, since Perl encapsulates that in grapheme mode.) Under
    :sigspace the preceding rules are applied word by word.  In perl5, one must
    manually run two matches on all data.

First: I notice that ignoring marks (and such) and ignoring case are both
effects of the Unicode Collation Algorithm, just at different strengths.
What about simply allowing folks to specify which of the four (or more, I
guess) levels of UCA equivalence/folding they want?
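
As a rough illustration of what "choosing a UCA level" could look like, here is a minimal Perl 5 sketch using the core Unicode::Collate module. It only compares whole strings for equality at a given level; it is not the proposed regex modifier. The sample strings and the %at_level hash are made up for the example.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Strings are written with escapes so the example does not depend on the
# encoding of the source file.
my $accented = "r\x{E9}sum\x{E9}";   # "resume" with two acute accents
my $plain    = "resume";
my $upper    = "RESUME";

# level => 1: primary differences only (base letters); marks and case ignored
# level => 2: adds secondary differences (marks); case still ignored
# level => 3: adds tertiary differences (case)
my %at_level = map { $_ => Unicode::Collate->new(level => $_) } 1 .. 3;

for my $level (1 .. 3) {
    my $c = $at_level{$level};
    printf "level %d: marks ignored=%s case ignored=%s\n", $level,
        ($c->eq($accented, $plain) ? "yes" : "no"),
        ($c->eq($plain,    $upper) ? "yes" : "no");
}
```

At level 1 both differences vanish, at level 2 only case is ignored, and at level 3 neither is, which is the sense in which :ignoremark and :ignorecase look like two points on the same scale.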

Second: I'm not altogether reassured by the parenned bit about NFD/NFC
being immaterial.  That's because I've been pretty annoyed lately in perl5
with having to manually run *everything* through a double match every time,
and I can't avoid it by prenormalizing.  I'm just hoping that perl6 will
handle this better.

It's usually like this:

    NFD($data) =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFD:

        $data  =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFC:

    NFD($data) =~ $pattern
        $data  =~ $pattern

That's because even if your data is in a known state with respect to
normalization, if your pattern admits both NFD and NFC forms, which it
would if read in from a file etc, then you have to run them both.
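
The double match described above can be wrapped up roughly as follows. This is only a sketch, assuming the core Unicode::Normalize module; matches_either_form is a hypothetical helper name, and the sample data and patterns are invented for the example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: try the pattern against both normalization forms of
# the data, since we cannot know which form the pattern was written for.
sub matches_either_form {
    my ($data, $pattern) = @_;
    return (NFD($data) =~ $pattern || NFC($data) =~ $pattern) ? 1 : 0;
}

my $data        = "caf\x{E9}";       # NFC: precomposed e-acute
my $precomposed = qr/\xE9/;          # pattern written for NFC text
my $decomposed  = qr/e\x{301}/;      # pattern written for NFD text

# A single match only works when data and pattern happen to agree.
my $single = ($data =~ $decomposed) ? 1 : 0;

print "single match, decomposed pattern: $single\n";
print "double match, decomposed pattern: ",
    matches_either_form($data, $decomposed), "\n";
print "double match, precomposed pattern: ",
    matches_either_form($data, $precomposed), "\n";
```

The cost is two normalizations and up to two regex runs per match, which is exactly the overhead being complained about.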

For example, suppose you read a pattern whose characters are specified
indirectly/symbolically:

    $pattern = q<\xE9>;         # LATIN SMALL LETTER E WITH ACUTE

or 

    $pattern = q<e\x{301}>;     # "e" + COMBINING ACUTE ACCENT

It would be ok if those were literal characters, because you
could just NFD the patterns and be done.  But they're not.  So
in order for


    $data =~ $pattern

to work properly with both, you really have to do a guaranteed
double-convert/match each time.  This is rather unfortunate, to put it
mildly.  What you really want is a pattern compile flag that imposes
canonical matching, and does this correctly even when faced with named
characters, etc.
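
To make the point about symbolic escapes concrete, here is a small sketch (again assuming Unicode::Normalize; the variable names are invented) showing that normalizing the pattern's source text does nothing when the accented character exists only as an escape sequence.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Pattern source as it might arrive from a file: plain ASCII text holding
# an escape sequence, not the accented character itself.
my $pattern_src = q<\xE9>;

# NFD sees only ASCII characters here, so the source comes back unchanged.
my $src_unchanged = (NFD($pattern_src) eq $pattern_src) ? 1 : 0;

my $re       = qr/$pattern_src/;     # the regex compiler expands \xE9 here
my $nfc_data = "caf\x{E9}";          # precomposed form
my $nfd_data = NFD($nfc_data);       # "cafe" followed by U+0301

my $hits_nfc = ($nfc_data =~ $re) ? 1 : 0;
my $hits_nfd = ($nfd_data =~ $re) ? 1 : 0;

print "pattern source unchanged by NFD: $src_unchanged\n";
print "matches NFC data: $hits_nfc, matches NFD data: $hits_nfd\n";
```

The escape is only resolved inside the regex compiler, after any chance to normalize the text has passed, so canonical equivalence would have to be handled there.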

My read of S05 suggests that this will not be an issue.  I do wonder
what happens when you want to match just the combining part.  Does
that fail in grapheme mode?  It shouldn't: you *can* have standalones.
But then we're back to partial matches in the middle of things, which
is something that plagues us with full Unicode case-folding.  This is
the 

    "\N{LATIN SMALL LIGATURE FFI}" =~ /(f)(f)/i

problem, amongst others.  Seems that you are going to get into the
same dilemma if you allow matching partial graphemes in grapheme mode.
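
A short Perl 5 sketch of why the ligature case is awkward, assuming Perl 5.16 or later for the fc function. One character folds to three, so there is no obvious way to divide it between two capture groups. The two regex results are printed rather than stated here because they depend on the Perl version; perlre documents the analogous split-group case for LATIN SMALL LIGATURE FI as not matching.

```perl
use strict;
use warnings;
use feature 'fc';                    # needs Perl 5.16 or later

my $ligature = "\x{FB03}";           # LATIN SMALL LIGATURE FFI
my $folded   = fc($ligature);        # full case folding expands it to "ffi"

printf "length before folding: %d, after: %d\n",
    length($ligature), length($folded);

# Matching the fold as a whole versus splitting it across capture groups.
my $whole = ($ligature =~ /ffi/i)    ? "match" : "no match";
my $split = ($ligature =~ /(f)(f)/i) ? "match" : "no match";

print "/ffi/i:    $whole\n";
print "/(f)(f)/i: $split\n";
```

Matching only the combining mark of a grapheme raises the same question: what the captures should contain when the match covers part of a unit.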

Hm.

--tom
