Re: [icu-support] Semantic issues with case-insensitive regex matching

Mark Davis ☕ Sun, 07 Nov 2010 16:27:29 -0800

UTS#18 has a discussion of matching with normalization and case-folding. In
practice, however, it turns out to be difficult to implement efficiently.
The matching itself is not that difficult to define:


Given a pattern P and a text transform T, *P is T-insensitive* IFF

   - for any two strings S and Z, T(S) = T(Z) if and only if P(S) = P(Z)

So, for example, a case-insensitive pattern like (?i)(d)z should match in:

DZUR
Dzur
ǲur
dzur

(The third string uses U+01F2 ( ǲ ) LATIN CAPITAL LETTER D WITH SMALL
LETTER Z; see <http://unicode.org/cldr/utility/character.jsp?a=01F2>.)
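
As a rough illustration of the definition, here is a Python sketch that
takes T to be compatibility caseless folding, approximated with the standard
library as NFKD(casefold(NFKD(casefold(NFD(s))))). The choice of T and the
helper name compat_caseless_key are assumptions made for this example, not
something UTS#18 or ICU prescribes. Plain case folding alone would not be
enough here, since U+01F2 folds to the single character U+01F3 rather than
to "dz".

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """Hypothetical transform T: approximate compatibility caseless folding."""
    s = unicodedata.normalize("NFD", s)
    s = unicodedata.normalize("NFKD", s.casefold())
    s = unicodedata.normalize("NFKD", s.casefold())
    return s


# The four sample strings; the third starts with U+01F2.
samples = ["DZUR", "Dzur", "\u01F2ur", "dzur"]

# If T(S) is the same for all of them, a T-insensitive pattern must treat
# them alike.
keys = {compat_caseless_key(s) for s in samples}
print(keys)
```

Under this T all four strings collapse to one key, so a T-insensitive
pattern has to give the same answer for each of them.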

But what is tricky is defining what the capture groups capture (when there
is reordering/growth/shrinkage), like the "(d)" in the pattern above, and
doing the matching in an efficient way. If you have some suggestions for
defining such operations in a way that can be implemented efficiently,
it would be useful to start the discussion.
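
To make the capture problem concrete, here is a minimal Python sketch of one
possible policy: fold the text while recording which original character each
folded character came from, match against the folded text, then widen every
capture to the original characters that cover it. The helper names
fold_with_offsets and original_span are hypothetical, and this is not a
description of what ICU does; it only shows why the definition is awkward.

```python
import re


def fold_with_offsets(text):
    """Case-fold text; origin[j] is the index in text that produced folded[j]."""
    folded, origin = [], []
    for i, ch in enumerate(text):
        for f in ch.casefold():
            folded.append(f)
            origin.append(i)
    return "".join(folded), origin


def original_span(origin, start, end):
    """Map a non-empty [start, end) span of the folded text back to the
    smallest span of original characters that covers it."""
    return origin[start], origin[end - 1] + 1


text = "\uFB01sh"  # LATIN SMALL LIGATURE FI followed by "sh"
folded, origin = fold_with_offsets(text)

m = re.match(r"(f)(i)", folded)
spans = [original_span(origin, *m.span(g)) for g in (1, 2)]
print(folded, origin, spans)
```

With this policy both groups map back to the same original span, the whole
ligature, so the two captures overlap completely. Any other policy (empty
captures, or returning folded text) has its own drawbacks.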

Mark

*— Il meglio è l’inimico del bene —*


On Sun, Nov 7, 2010 at 13:57, karl williamson <[email protected]> wrote:

> I submitted the text below to the unicore mailing list, and got no good
> answer, except recently, to try this list instead.  I couldn't find in
> the ICU documentation how the issues that this message raises are dealt
> with.  I'm hopeful someone here will respond.
> ----
>
> It would be good if TR18 were enhanced with more discussion of case
> insensitive matching.  Chapter 3 of the standard defines the Default
> Caseless Matching algorithm, but it applies only to two strings, and
> extending it to apply to patterns is not trivial, and is totally
> unspecified, as far as I have seen.
>
> In particular, the use of a property in a regular expression pattern
> with caseless matching introduces a number of issues that I don't
> believe are addressed anywhere in the standard.
>
> For example, should 'N' =~ /\p{Gc=Lowercase_Letter}/i match?
> Should 'n' =~ /\p{Gc=Uppercase_Letter}/i match?
>
> I thought the answer was yes to both of these, but then, what about
> "\N{MICRO SIGN}" =~ /\p{Block=Greek}/i
> "\N{MICRO SIGN}" =~ /\p{Script=Greek}/i
>
> because the fold of MICRO SIGN is in the Greek block and script?  It
> doesn't seem right to me that a character should match a different
> script than the one it's in under caseless matching.  Similarly, there
> are a number of characters whose fold has a different Age, Soft_Dotted,
> East_Asian_Width, Math, Decomposition_Type, Line_Break, or
> Full_Composition_Exclusion property value, besides the ones I would
> expect, like Changes_When_Case_Folded, and General_Category.  The
> YPOGEGRAMMENI, as always, introduces even more.
>
> So perhaps caseless matching shouldn't apply to some properties?  If so,
> which ones should be spelled out.  Certainly, some properties should
> have caseless matching rules.  For example, I believe,
>
> "A" =~ /\p{Name=Latin Small Letter A}/i
>
> should match.  Here's another example where allowing the property to
> match any case can lead to problems.
>
> "\N{LATIN SMALL LIGATURE FF}" =~
> /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i
>
> The pattern seems to indicate that only ASCII hex digits are desired; yet
> it could match something non-ASCII, potentially leading to a spoofing
> attack.
>
> TR18 is also silent on another issue I've brought up before, and gotten
> no response to.  A number of languages, including ICU I believe, allow
> for regular expression capture buffers.  These allow for saving some
> portion(s) of the original string that matched some sub-part of the
> pattern.  But when you convert the string into something else for
> matching, such as normalizing it, and then match against that, and you
> have capture buffers, those buffers should return not some portion of
> the converted string, but the corresponding portion of the original,
> which you may not be able to get back to.  This can happen even without
> normalization if the string folds to more than one character:
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i
>
> should match, as should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i
>
> Hence, so should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i
>
> But, the parentheses mean capture buffers, and there is no 1-to-1
> correspondence between either of these buffers and any atomic part of
> the string.  I don't know what should happen here, and I think TR18
> should address this.
>
> So how do I go about getting someone or someones thinking about these
> issues to add to TR18?
>
>

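The property questions in the quoted message come down to the fact that a
character and its case fold can differ in their properties. A small Python
sketch, using only the standard library, shows this for the MICRO SIGN and
the FF ligature examples. The helper is_ascii_hex is a hypothetical stand-in
for \p{ASCII_Hex_Digit=Y}; the standard library has no Script or Block
lookup, so the character name is used here as a proxy.

```python
import string
import unicodedata


def is_ascii_hex(s):
    """Hypothetical stand-in for \\p{ASCII_Hex_Digit=Y} applied to each char."""
    return all(c in string.hexdigits for c in s)


micro = "\u00B5"                  # MICRO SIGN
folded_micro = micro.casefold()   # its full case fold
names = (unicodedata.name(micro), unicodedata.name(folded_micro))

lig = "\uFB00"                    # LATIN SMALL LIGATURE FF
folded_lig = lig.casefold()

print(names)
print(is_ascii_hex(lig), is_ascii_hex(folded_lig))
```

The fold of MICRO SIGN is a Greek letter, and the fold of the ligature is
two ASCII hex digits even though the ligature itself is not ASCII. A matcher
that tests properties on folded text therefore accepts both.
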