UTS#18 has a discussion of matching with normalization and case-folding. In practice, however, it turns out to be difficult to implement efficiently. The matching itself is not that difficult to define:
Given a pattern P and a text transform T, *P is T-insensitive* IFF, for any two strings S and Z, T(S) = T(Z) if and only if P(S) = P(Z).

So, for example, a case-insensitive pattern like (?i)(d)z should match in:

DZUR Dzur Dzur dzur

(The middle string uses U+01F2<http://unicode.org/cldr/utility/character.jsp?a=01F2> (ǲ) LATIN CAPITAL LETTER D WITH SMALL LETTER Z.)

But what is tricky is defining what the capture groups capture when there is reordering, growth, or shrinkage (like the "(d)" in the pattern above), and doing the matching in an efficient way.

If you have suggestions for defining such operations so that they can be implemented efficiently, it would be useful to start the discussion.

Mark

*— Il meglio è l’inimico del bene —*

On Sun, Nov 7, 2010 at 13:57, karl williamson <[email protected]> wrote:

> I submitted the text below to the unicore mailing list, and got no good
> answer, except recently, to try this list instead. I couldn't find in
> the ICU documentation how the issues that this message raises are dealt
> with. I'm hopeful someone here will respond.
> ----
>
> It would be good if TR18 were enhanced with more discussion of case
> insensitive matching. Chapter 3 of the standard defines the Default
> Caseless Matching algorithm, but it applies only to two strings, and
> extending it to apply to patterns is not trivial, and is totally
> unspecified, as far as I have seen.
>
> In particular, the use of a property in a regular expression pattern
> with caseless matching introduces a number of issues that I don't
> believe are addressed anywhere in the standard.
>
> For example, should 'N' =~ /\p{Gc=Lowercase_Letter}/i
> should 'n' =~ /\p{Gc=Uppercase_Letter}/i
>
> I thought the answer was true to both these, but then, what about
> "\N{MICRO SIGN}" =~ /\p{Block=Greek}/i
> "\N{MICRO SIGN}" =~ /\p{Script=Greek}/i
>
> because the fold of MICRO SIGN is in the Greek block and script?
> It doesn't seem right to me that a character should match a different
> script than the one it's in under caseless matching. Similarly, there
> are a number of characters whose fold has a different Age, Soft_Dotted,
> East_Asian_Width, Math, Decomposition_Type, Line_Break, or
> Full_Composition_Exclusion property value, besides the ones I would
> expect, like Changes_When_Case_Folded, and General_Category. The
> YPOGEGRAMMENI, as always, introduces even more.
>
> So perhaps caseless matching shouldn't apply to some properties? If so,
> which ones should be spelled out. Certainly, some properties should
> have caseless matching rules. For example, I believe,
>
> "A" =~ /\p{Name=Latin Small Letter A}/i
>
> should match. Here's another example where allowing the property to
> match any case can lead to problems.
>
> "\N{LATIN SMALL LIGATURE FF}" =~
> /\p{ASCII_Hex_Digit=Y}\p{ASCII_Hex_Digit=Y}/i
>
> The pattern seems to indicate that only ASCII digits are desired; yet it
> could match something non-ASCII, potentially leading to a spoofing attack.
>
> TR18 is also silent on another issue I've brought up before, and gotten
> no response to. A number of languages, including ICU I believe, allow
> for regular expression capture buffers. These allow for saving some
> portion(s) of the original string that matched some sub-part of the
> pattern. But when you convert the string into something else for
> matching, such as normalizing it, and then match against that, and you
> have capture buffers, those buffers should return not some portion of
> the converted string, but the corresponding portion of the original,
> which you may not be able to get back to.
> This can happen even without
> normalization if the string folds to more than one character:
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /fi/i
>
> should match, as should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /[f][i]/i
>
> Hence, so should
>
> "\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i
>
> But, the parentheses mean capture buffers, and there is no 1-to-1
> correspondence between either of these buffers and any atomic part of
> the string. I don't know what should happen here, and I think TR18
> should address this.
>
> So how do I go about getting someone or someones thinking about these
> issues to add to TR18?
>
> _______________________________________________
> icu-support mailing list - [email protected]
> To Un/Subscribe: https://lists.sourceforge.net/lists/listinfo/icu-support
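
The capture-buffer problem raised in the quoted message can be made concrete with a small sketch. It is written in Python purely for illustration (it is not ICU or Perl code), and the helper names fold_with_offsets and original_span are hypothetical. It assumes that matching is done by case-folding the text with Python's str.casefold(), one character at a time, while recording which original character each folded character came from, so that capture spans in the folded text can be mapped back to the original.

```python
import re


def fold_with_offsets(text):
    """Case-fold text one character at a time.

    Returns the folded string and a list giving, for each folded
    character, the index of the original character it came from.
    (Hypothetical helper; an assumption for illustration only.)
    """
    folded = []
    origin = []
    for i, ch in enumerate(text):
        f = ch.casefold()          # full case folding, may grow: U+FB01 -> "fi"
        folded.append(f)
        origin.extend([i] * len(f))
    return "".join(folded), origin


def original_span(origin, start, end):
    """Map a non-empty [start, end) span in the folded text to the
    smallest span of whole original characters that covers it."""
    return origin[start], origin[end - 1] + 1


text = "\ufb01sh"                  # LATIN SMALL LIGATURE FI followed by "sh"
folded, origin = fold_with_offsets(text)

# The pattern is already lowercase, so it is matched against the folded text.
m = re.match(r"(f)(i)", folded)
spans = [original_span(origin, *m.span(g)) for g in (1, 2)]
print(folded, origin, spans)
```

For the ligature, both (f) and (i) map back to the same original span, (0, 1), which is the ambiguity described above: neither group corresponds to an atomic part of the original string. The sketch ignores normalization, which can also reorder characters and would need a more elaborate offset map.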

