Tom Christiansen wrote:
I have two points.  First, this excerpt from Synopsis 6:

    The :m (or :ignoremark) modifier scopes exactly like :ignorecase except
    that it ignores marks (accents and such) instead of case. It is equivalent
    to taking each grapheme (in both target and pattern), converting both to
    NFD (maximally decomposed) and then comparing the two base characters
    (Unicode non-mark characters) while ignoring any trailing mark characters.
    The mark characters are ignored only for the purpose of determining the
    truth of the assertion; the actual text matched includes all ignored
    characters, including any that follow the final base character.

    The :mm (or :samemark) variant may be used on a substitution to change the
    substituted string to the same mark/accent pattern as the matched string.
    Mark info is carried across on a character by character basis. If the right
    string is longer than the left one, the remaining characters are
    substituted without any modification. (Note that NFD/NFC distinctions are
    usually immaterial, since Perl encapsulates that in grapheme mode.) Under
    :sigspace the preceding rules are applied word by word.  In perl5, one must
    manually run two matches on all data.

First: I notice that ignoring marks (and such) and ignoring case are both
effects, of different strengths, of the Unicode Collation Algorithm.  What
about simply allowing folks to specify which of the four (or more, I guess)
levels of UCA equivalence/folding they want?

Draft Unicode Technical Report #30
Character Foldings
http://www.unicode.org/reports/tr30/tr30-4.html

This one?

IMHO this should not be specified in the core of Perl6. Even the
existence of :ignoremark and :samemark is not necessary, because they
cannot fulfill the expectations: 'LATIN SMALL LETTER O WITH STROKE' (and
other characters with e.g. overlays) is not decomposable, and will not
match 'LATIN SMALL LETTER O' under :ignoremark.
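That difference is easy to check with Unicode::Normalize (a small Perl 5 sketch):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# U+00E9 LATIN SMALL LETTER E WITH ACUTE canonically decomposes
# to 'e' plus U+0301 COMBINING ACUTE ACCENT, so mark-stripping works.
my $e_acute  = NFD("\x{E9}");

# U+00F8 LATIN SMALL LETTER O WITH STROKE has no canonical
# decomposition, so no amount of mark-stripping reduces it to 'o'.
my $o_stroke = NFD("\x{F8}");

printf "e-acute NFD length:  %d\n", length $e_acute;   # 2
printf "o-stroke NFD length: %d\n", length $o_stroke;  # 1
```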

Even :ignorecase is reliably usable only within the ASCII range, but it is
needed for backwards compatibility.

E.g. the German 'SHARP S' can be written in uppercase as 'SS' or 'SZ'. And Swiss orthography doesn't use 'SHARP S', they always use 'ss'.

If someone wants to match the 'SHARP S' across all orthographic and typographic variants, there is no other way than to write something like this by hand:

  $string =~ m/(ß|ss|sz)/i;

Language and text processing is full of such examples, which cannot be solved by Unicode in a general way. Here I agree with Larry that Perl6 should only support the general part of Unicode.
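A runnable Perl 5 sketch of that manual approach (the word list and the \A/\z anchoring are only illustrative):

```perl
use strict;
use warnings;
use utf8;

# Match the sharp s across its orthographic variants:
# 'ß' itself, Swiss 'ss', and the old uppercase form 'SZ'.
my $sharp_s = qr/(?:ß|ss|sz)/i;

for my $word ('Straße', 'Strasse', 'STRASZE') {
    print "$word matches\n" if $word =~ /\Astra${sharp_s}e\z/i;
}
```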

Language/locale-specific processing (including orthography and typography) is the task of Unicode localisation, which should IMHO be implemented in modules. The more I think about it, the less I can imagine a general solution using tailored Unicode properties for localisation.

Second: I'm not altogether reassured by the parenned bit about NFD/NFC
being immaterial.  That's because I've been pretty annoying lately in perl5
with having to manually run *everything* through a double match every time,
and I can't avoid it by prenormalizing.  I'm just hoping that perl6 will
handle this better.

It's usually like this:

    NFD($data) =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFD:

        $data  =~ $pattern
    NFC($data) =~ $pattern

Or if you know your data is NFC:

    NFD($data) =~ $pattern
        $data  =~ $pattern

That's because even if your data is in a known state with respect to
normalization, if your pattern admits both NFD and NFC forms (which it
would if read in from a file, etc.), then you have to run them both.
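The three variants above can be wrapped in one helper; a minimal sketch, where canon_match is an illustrative name, not an existing API:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# Try the pattern against both normalization forms of the data,
# so it matches whether the pattern was written in NFD or NFC.
sub canon_match {
    my ($data, $pattern) = @_;
    return NFD($data) =~ $pattern
        || NFC($data) =~ $pattern;
}
```

For example, canon_match("cafe\x{301}", qr/\x{E9}/) and canon_match("caf\x{E9}", qr/e\x{301}/) both succeed, where a single plain match would fail for one of them.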

Mixing different levels of normalisation isn't a good idea. Just bring everything involved (including the patterns) to the same level.

In Perl5 a similar problem exists if someone mixes byte mode and character mode. Then, AFAIK, a regex like

        $byte_string =~ m/\p{Letter}/;

crashes.

For example, suppose you read a pattern whose characters are specified
indirectly/symbolically:

    $pattern = q<\xE9>;           # LATIN SMALL LETTER E WITH ACUTE

or
    $pattern = q<e\x{301}>;       # "e" + COMBINING ACUTE ACCENT

It would be ok if those were literal characters, because you
could just NFD the patterns and be done.  But they're not.  So
in order for


    $data =~ $pattern

to work properly with both, you really have to do a guaranteed
double-convert/match each time.  This is rather unfortunate, to put it
mildly.  What you really want is a pattern compile flag that imposes
canonical matching, and does this correctly even when faced with named
characters, etc.

My read of S06 suggests that this will not be an issue.

In grapheme mode the pattern q<e\x{301}> normalizes to a single grapheme. That's why graphemes are so convenient. Graphemes are also compatible with future versions of Unicode: your code will keep working if, say, a future version of Unicode assigns a single codepoint to 'LATIN SMALL LETTER A WITH DOT ABOVE AND DOT BELOW' and your code contains something like 'a' + 'COMBINING DOT ABOVE' + 'COMBINING DOT BELOW'.
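Perl 5 can already approximate this today with \X, which matches one extended grapheme cluster:

```perl
use strict;
use warnings;

# \X matches one extended grapheme cluster, so a base character plus
# its combining marks counts as a single unit regardless of NFC/NFD.
my @clusters = ("cafe\x{301}" =~ /\X/g);
printf "%d clusters\n", scalar @clusters;  # 4: c, a, f, e+acute
```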

I do wonder
what happens when you want to match just the combining part.  Does
that fail in grapheme mode?  It shouldn't: you *can* have standalones.

In grapheme mode 'standalones' can only occur at the beginning of a string, or, more precisely, wherever there is no base character before them.
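This, too, is visible with \X in Perl 5: a leading combining mark with no base character forms a degenerate cluster of its own.

```perl
use strict;
use warnings;

# A combining acute with no preceding base character is its own
# (degenerate) grapheme cluster, so this string has 3 clusters.
my @clusters = ("\x{301}ab" =~ /\X/g);
printf "%d clusters\n", scalar @clusters;  # 3: standalone mark, a, b
```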

But then we're back to partial matches in the middle of things, which
is something that plagues us with full Unicode case-folding.  This is
the
    "\N{LATIN SMALL LIGATURE FFI}" =~ /(f)(f)/i

problem, amongst others.  Seems that you are going to get into the
same dilemma if you allow matching partial graphemes in grapheme mode.

We can dream of :ignoreorthography or :ignoretypography, but they should not be implemented in a regex engine.
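For reference, the full case folding behind that ligature example can be seen directly in Perl 5.16+ with fc:

```perl
use strict;
use warnings;
use feature 'fc';

# U+FB03 LATIN SMALL LIGATURE FFI full-case-folds to the three
# characters "ffi"; a case-insensitive match can therefore succeed
# against the whole ligature, but a partial match like /(f)(f)/i
# would have to stop in the middle of a single character.
my $folded = fc("\x{FB03}");
print "folds to: $folded\n";  # folds to: ffi
```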

Helmut Wollmersdorfer
