Dear PCRE developers,

I'm writing on behalf of the Julia programming language [1] developers to ask about the handling of invalid UTF-8 strings when the PCRE2_UTF and PCRE2_NO_UTF_CHECK flags are set.

For context, Julia has taken the stance that strings are stored in UTF-8 but are not required to contain valid UTF-8. This is needed in order to work with arbitrary contents, such as a filename or an invalid text file, without throwing errors or modifying the input data. This approach is similar to the one adopted by Go (see [2] for an example of the problems that arise when strings are required to be valid Unicode, as in Python 3).
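To make the point about lossless handling concrete, here is a minimal Python sketch. It is illustrative only: it is neither Julia's nor PCRE's implementation, and the name `utf8_units` is hypothetical. It assumes one possible policy, namely that every well-formed UTF-8 sequence is yielded as a character, and every byte that is not part of a well-formed sequence is yielded on its own as a one-byte unit, so nothing is dropped or replaced.

```python
def utf8_units(data: bytes):
    """Yield a str for each well-formed UTF-8 character and a 1-byte
    bytes object for each byte outside any well-formed sequence."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        # Expected sequence length, derived from the lead byte.
        if b < 0x80:
            length = 1
        elif 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0  # continuation byte or invalid lead byte

        chunk = data[i:i + length]
        try:
            if length == 0 or len(chunk) < length:
                raise ValueError("not a complete sequence")
            # Strict decoding rejects overlong forms, surrogates and
            # code points above U+10FFFF (UnicodeDecodeError is a ValueError).
            yield chunk.decode("utf-8")
            i += length
        except ValueError:
            # Pass the offending byte through unchanged and resynchronise.
            yield data[i:i + 1]
            i += 1


# Usage: valid text, a stray 0xFF byte, then a truncated 3-byte sequence.
sample = b"caf\xc3\xa9\xff\xe2\x82"
units = list(utf8_units(sample))
rebuilt = b"".join(u.encode("utf-8") if isinstance(u, str) else u for u in units)
```

Re-encoding the units gives back the original bytes, which is the property we care about: no error is raised and the input is not altered.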
Of course, this stance is harder to hold in the context of regular expression matching. The PCRE documentation very clearly states that both the regular expression and the string must be valid Unicode when PCRE2_UTF is set, and that the behavior is undefined if that is not the case and PCRE2_NO_UTF_CHECK is also set.

However, we have been wondering whether it would be possible to allow the string (not the regex) to contain invalid UTF-8 when PCRE2_NO_UTF_CHECK is set. In that situation, invalid sequences would simply be treated as a series of one-byte "characters" for which all Unicode predicates are false, and they would be returned as-is (see [3]). This is how Julia treats invalid UTF-8 strings, and it appears to work well. By default, valid UTF-8 would still be required, but instead of declaring the behavior undefined when the string is invalid and PCRE2_NO_UTF_CHECK is set, a well-defined behavior would be implemented.

Let me stress that we do not suggest supporting invalid regexes, as it appears difficult to give them a clear and meaningful definition. We are also aware that we could avoid setting PCRE2_UTF, but the resulting behavior would not match what is generally expected for strings that are supposed to contain (possibly invalid) Unicode text.

Do you think such a behavior would make sense? Could it be implemented without dramatically impacting performance? Julia could use a custom patch if this feature is not deemed useful for PCRE.

Thanks in advance for your help.

1: http://julialang.org
2: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
3: https://github.com/JuliaLang/julia/pull/26731#issuecomment-379580049

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
