https://bugs.exim.org/show_bug.cgi?id=2131
Philip Hazel <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |WONTFIX
             Status|NEW                         |RESOLVED

--- Comment #1 from Philip Hazel <[email protected]> ---
There are philosophical and practical problems here. I have always argued that
PCRE is a processor of strings of characters, with very little interpretation
of how successive characters might interact with each other (minor exceptions
are CRLF and \X sequences). I also feel that you should be able to look for
U+00E8 and find *only* the U+00E8 character, not the U+0065/U+0300 pair. After
all, you may actually be interested in searching for these kinds of differences
of usage in a text.

The practical issue is that when one regex item might match a variable number
of code points, the length of the matched string is unknown. This makes
lookbehinds impossible, at least in the way that they are implemented in PCRE2,
which is to move back N characters and then match forwards.

If an application is dealing with texts that have a mixture of ways of
representing (for example) accented characters, and wants a simple way of
searching for them, it can normalize the text (and possibly also the pattern)
beforehand. It looks as if that's what your Perl examples do, and as there's
already a library available from the utf8proc project it should be
straightforward for other applications. I can't see any advantage of packaging
this inside PCRE2.

I'm going to mark this issue "won't fix".

--
You are receiving this mail because:
You are on the CC list for the bug.
--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
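
The normalize-before-matching approach suggested in the comment can be sketched as follows. This is only an illustration under stated assumptions: Python's standard `unicodedata` and `re` modules stand in for utf8proc and PCRE2, and the helper name `normalized_search` is hypothetical, not part of either library.

```python
import re
import unicodedata


def normalized_search(pattern, text, form="NFC"):
    """Hypothetical helper: bring pattern and text to the same Unicode
    normalization form, then run an ordinary regex search."""
    norm_pattern = unicodedata.normalize(form, pattern)
    norm_text = unicodedata.normalize(form, text)
    return re.search(norm_pattern, norm_text)


precomposed = "caf\u00e8"   # ends with U+00E8
decomposed = "cafe\u0300"   # ends with U+0065 U+0300

# The plain engine matches code points literally, so U+00E8 is found
# only in the precomposed string.
assert re.search("\u00e8", precomposed) is not None
assert re.search("\u00e8", decomposed) is None

# After normalizing both sides, either representation is found.
assert normalized_search("\u00e8", decomposed) is not None
assert normalized_search("e\u0300", precomposed) is not None
```

Two caveats with this sketch: match offsets refer to the normalized copy of the text rather than the original, and normalizing a pattern is only safe when it does not place combining marks next to regex metacharacters.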
