2015-05-18 20:35 GMT+02:00 Richard Wordingham <richard.wording...@ntlworld.com>:
> The algorithm itself should be tractable - Mark Davis has published
> an algorithm to generate all strings canonically equivalent to a
> Unicode string, and what we need might not be so complex.

Even that algorithm from Mark Davis will fail in this case:

- You can use it easily to transform a regexp containing (\u0F73) into a
  regexp containing (\u0F73|\u0F71\u0F72|\u0F72\u0F71) - see the sketch at
  the end of this message.
- But this leaves the same problem for unbounded repetitions with the "+"
  or "*" or "{m,}" operators.
- However, you can use it for bounded repetitions with "{m,n}", provided
  that "n" is not too large, because the total number of expanded
  alternatives (once the repetition operators are removed) explodes
  exponentially with a power proportional to "n" (the base of the exponent
  depends on the basic non-repeated string and on the number of canonical
  equivalents it has).

Now the whole problem is how to do the backtracking, whether it works, and
how to expose the matched captures (which will still be discontiguous,
including $0), and then how to perform a safe find & replace operation: it
is hard to specify the replacement with simple "$n" placeholders. You need
more complex placeholders for handling discontiguous matches: $n has to
become not just a string, but an object whose default "tostring" property
is the exact content of the match, with other properties needed to expose
the interleaved characters, or some context before and after the match
(notably when those contexts contain combining characters that are NOT
blocked by the match itself). A possible shape for such an object is
sketched below.

Backtracking is an internal matter handled before matches are even
reported: it occurs while there is still NO match to return, so even if
the regexp engine offers a way to use a callback instead of a basic
replacement string containing "$n" placeholders, that callback would not
be called during backtracking.
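
For the first point, here is a brute-force sketch (in Python, standard
library only) of the expansion step. This is NOT Mark Davis's published
algorithm (which, as far as I know, is what ICU's CanonicalIterator
implements far more efficiently); it simply enumerates candidate strings
and keeps those whose NFD matches, which is only practical for very short
inputs, but it shows the kind of alternation you would produce:

import itertools
import re
import unicodedata

def canonical_equivalents(s):
    """Return all strings canonically equivalent to s (short s only)."""
    nfd = unicodedata.normalize('NFD', s)
    nfd_chars = set(nfd)
    # Candidate alphabet: the NFD characters themselves, plus every
    # character whose own NFD uses only those characters
    # (e.g. U+0F73 for the pair U+0F71 U+0F72).
    alphabet = set(nfd_chars)
    for cp in range(0x110000):
        if 0xD800 <= cp < 0xE000:        # skip surrogate code points
            continue
        ch = chr(cp)
        d = unicodedata.normalize('NFD', ch)
        if d != ch and set(d) <= nfd_chars:
            alphabet.add(ch)
    # No equivalent string can be longer than the NFD form, so try every
    # string over that alphabet up to that length and keep the ones whose
    # NFD matches: those are exactly the canonical equivalents.
    out = set()
    for n in range(1, len(nfd) + 1):
        for combo in itertools.product(sorted(alphabet), repeat=n):
            cand = ''.join(combo)
            if unicodedata.normalize('NFD', cand) == nfd:
                out.add(cand)
    return out

equivalents = canonical_equivalents('\u0F73')
# -> {'\u0f73', '\u0f71\u0f72', '\u0f72\u0f71'}
pattern = '(' + '|'.join(re.escape(e) for e in sorted(equivalents)) + ')'
print(ascii(pattern))   # '(\u0f71\u0f72|\u0f72\u0f71|\u0f73)'

As the third point above says, feeding the result of such an expansion
into a bounded repetition multiplies the alternatives at each repetition,
which is why it blows up for large "n".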
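
And for the capture objects, here is one hypothetical shape such a result
could take. Every name below is invented purely for illustration (no
existing regexp engine exposes this interface): str() still yields the
exact matched text, while the extra fields expose what a plain "$n" string
cannot.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DiscontiguousCapture:
    pieces: List[Tuple[int, str]]        # (offset, text) of each matched fragment
    interleaved: List[Tuple[int, str]]   # characters lying between the fragments
    context_before: str = ''             # unblocked combining marks before the match
    context_after: str = ''              # unblocked combining marks after the match

    def __str__(self):
        # Default "tostring": the exact matched content, fragments joined.
        return ''.join(text for _, text in self.pieces)

# Example: searching for \u0F71\u0F74 (ccc 129, 132) in the text
# \u0F71\u0F72\u0F74 matches discontiguously, because \u0F72 (ccc 130)
# sits between the two matched characters without blocking them.
cap = DiscontiguousCapture(pieces=[(0, '\u0F71'), (2, '\u0F74')],
                           interleaved=[(1, '\u0F72')])
print(ascii(str(cap)))   # '\u0f71\u0f74' -- the matched text only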