2015-05-14 20:13 GMT+02:00 Richard Wordingham < [email protected]>:
> If the interval list is compacted, at most one of the intervals will
> contain a character properly having combining class 0.

This is not a sufficient condition: there is also the case where two intervals contain combining characters with the same combining class. Their relative order is significant because one blocks the other (it limits the allowed reorderings that are canonically equivalent). And if the replacement string also adds its own blockers, the situation is worse.

There is no simple way to determine what to do by just returning a replacement string that the regexp engine inserts itself into the output text. The best that can be done is for the regexp engine to give a full view not only of the characters within matches, but also of the characters in the middle that are not part of the match. Instead of letting the engine perform the insertion itself (from a single expression specified for the replacement text), you provide a callback function that also analyses the non-matched characters in the middle and decides what to do with them. You should then be able to choose between several replacement patterns, including placeholders for the unmatched intervals, such as numbered placeholders with negative values ($-1, $-2, ...), with zero and positive numbers kept for the classical array of matched captures ($0, $1, ...). But for these additional captures that are not part of the match, you need a way to indicate their placement within the true matched captures: not all positive captures share the same set of negative captures, nor at the same positions.
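To illustrate the blocking case, here is a minimal Python sketch using only the standard `unicodedata` module. The helper name `canonically_equivalent` is invented for this example; it simply compares NFD forms. Two marks with different combining classes can be swapped freely, while two marks with the same class cannot.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DIAERESIS = "\u0308"  # COMBINING DIAERESIS, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    when their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Different combining classes (230 vs 220): canonical ordering sorts them,
# so both orders denote the same text.
free = canonically_equivalent("a" + ACUTE + DOT_BELOW, "a" + DOT_BELOW + ACUTE)

# Same combining class (230 vs 230): each mark blocks the other,
# so swapping them produces a different text.
blocked = canonically_equivalent("a" + ACUTE + DIAERESIS, "a" + DIAERESIS + ACUTE)

print(free, blocked)
```

A replacement that moves or inserts a mark of class 230 next to another mark of class 230 therefore changes the text, which is why the unmatched characters in the middle cannot be ignored.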
Note that to make sure we can perform safe replacements within normalized text, and to make sure that the result will also be normalized, we need to include in the negative captures some characters that are not in the middle of a match: all the other combining characters with non-zero combining class that are before the matched string (if the matched string does not start with a character with combining class 0), and those after it that have a higher combining class than the last character in the positive capture. If the positive capture is an empty string, the first negative capture will include all combining characters with distinct non-0 combining classes before the insertion point of that empty positive capture, and the second one will include all non-0 combining characters after the insertion point that have distinct non-0 combining classes. (These two negative captures are bounded in length to at most 255 characters, just like the negative captures added for parts of the input that are in the middle of a positive capture.)

So far I have never seen any regexp engine supporting the concept of "negative captures": all of them return only positive ones, including when they allow the replacement to be a callback and not just a static string with optional placeholders.

> If there is such an interval, it will be
> replaced and the others simply deleted. If there is no such interval,
> then the choice of insertion point may be more difficult. Indeed, in
> some cases, it could be appropriate to reject the replacement command
> as undefined in the context. On the other hand, if the text buffer is
> normalised, then one would be able to have well-defined behaviour, as
> one does when splitting text into UCA collating elements.
>
> Richard.
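The extra context around a match that the reply describes can be sketched roughly as follows. This is only an illustration under stated assumptions: `context_captures` and `MAX_CONTEXT` are hypothetical names, no existing regexp engine exposes such captures, and the sketch simplifies the rule by taking every adjacent character with non-zero combining class instead of filtering on the class of the last matched character.

```python
import unicodedata
from typing import Tuple

MAX_CONTEXT = 255  # bound on each context capture, as stated in the message


def context_captures(text: str, start: int, end: int) -> Tuple[str, str]:
    """Hypothetical helper: return (before, after), the runs of characters
    with non-zero combining class adjacent to the match text[start:end]."""
    ccc = unicodedata.combining

    left = start
    # Extend to the left only when the match is empty or does not begin
    # with a character of combining class 0.
    if start == end or ccc(text[start]) != 0:
        while left > 0 and start - left < MAX_CONTEXT and ccc(text[left - 1]) != 0:
            left -= 1

    right = end
    # Simplification (assumption): take all following non-zero marks.
    while right < len(text) and right - end < MAX_CONTEXT and ccc(text[right]) != 0:
        right += 1

    return text[left:start], text[end:right]


# "a" + dot below (220) + acute (230) + diaeresis (230) + "b", already in NFD
sample = "a\u0323\u0301\u0308b"

around_acute = context_captures(sample, 2, 3)   # match is the acute accent
around_empty = context_captures(sample, 2, 2)   # empty match before the acute
around_base = context_captures(sample, 0, 1)    # match is the base letter "a"
```

A replacement callback given these two strings in addition to the usual match could then decide where to place its output relative to the surrounding marks, or reject the replacement when no placement keeps the result normalized.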

