On Sun, 13 Oct 2019 21:28:34 -0700 Mark Davis ☕️ via Unicode <unicode@unicode.org> wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AB.
>
>    - A query /X/ can match the separated A and C in the target string
>    "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how
>    should it behave? "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.  If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant.

>    If the input was in NFD (for
>    example), should the output be rearranged/decomposed so that it is
>    NFD? and so on.

That is not a new issue.  It exists already.

>    - A query /A/ can match *part* of the X in the target string
>    "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
>    should result: "apqBb"?

Yes, unless raising an exception is appropriate (see above).

> The syntax and APIs for regex engines are not built to handle these
> features. It introduces a enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).

The potential requirement should be restored, but not subsumed in
Levels 1 to 3.  It is a sufficiently different level of endeavour.

Richard.
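
For readers who want a concrete instance of the discontiguity discussed above, here is a minimal sketch using Python's standard unicodedata module. The characters are an illustrative assumption, not taken from the thread: the query X is U+00E2 (a with circumflex), which decomposes to "a" + U+0302, and the intervening character is U+0323 (combining dot below). The sketch only shows that a plain substring search on NFD text misses a match that canonical equivalence arguably requires; it does not implement discontiguous matching.

```python
import unicodedata


def nfd(text):
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


# Illustrative characters (an assumption, not from the thread).
query = "\u00E2"          # X: LATIN SMALL LETTER A WITH CIRCUMFLEX
target = "a\u0323\u0302"  # "a", COMBINING DOT BELOW, COMBINING CIRCUMFLEX ACCENT

# The two marks have different non-zero combining classes, so canonical
# ordering may move one past the other.
ccc_dot_below = unicodedata.combining("\u0323")
ccc_circumflex = unicodedata.combining("\u0302")

# A contiguous substring search on NFD text does not find the query,
# because the dot below sits between "a" and the circumflex.
found_contiguously = nfd(query) in nfd(target)

# The target is nevertheless canonically equivalent to the query followed
# by the dot below, so the query matches the target discontiguously.
equivalent = nfd(target) == nfd(query + "\u0323")

print(ccc_dot_below, ccc_circumflex, found_contiguously, equivalent)
```

Any replacement of that match has to decide what happens to the stranded dot below, which is the question raised in the quoted text.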