On Thu, 12 Apr 2018, Milan Bouchet-Valat wrote:

> I'm writing on behalf of the Julia programming language [1] developers
> in order to get some information regarding the handling of invalid
> UTF-8 strings when PCRE2_UTF and PCRE2_NO_UTF_CHECK flags are set.

Milan,

I understand what you are suggesting (treating invalid UTF-8 as one-byte 
characters) because I have implemented exactly that in other software 
I've written where performance is not critical.

However, in regex matching, performance *is* critical, which is why PCRE 
insists on working only with valid UTF strings. Checking each sequence
for validity every time a character is inspected would degrade
performance. (Also, in a backtracking algorithm, the same character may
be inspected multiple times during the course of a match, which only 
makes matters worse.)

The code in the PCRE2 library that checks a UTF-8 string for validity is
non-trivial. (It's in the source file src/pcre2_valid_utf.c if you want
to take a look.) Admittedly, it does identify very specific errors in
invalid sequences, but, for example, checking a 3-byte sequence involves
seven "if" tests of various kinds plus a switch and a table lookup.
(That's from a quick visual scan of the code; hope I counted right.)
Ignoring some of the less serious errors (overlong sequences or
surrogate codes) would simplify this a bit, but not much.
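
To give a feel for the kind of tests involved, here is a rough sketch of
a check for one 3-byte sequence. It is an illustration only, not the
code in src/pcre2_valid_utf.c: the function name check_3byte and the
error codes are made up for this example, and the real code reports a
different set of errors using a different structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error codes for this sketch; PCRE2 defines its own. */
enum utf8_err {
    UTF8_OK = 0,
    UTF8_ERR_TRUNCATED,   /* fewer than 3 bytes available */
    UTF8_ERR_BAD_LEAD,    /* first byte is not 0xE0..0xEF */
    UTF8_ERR_BAD_CONT2,   /* second byte is not of the form 10xxxxxx */
    UTF8_ERR_BAD_CONT3,   /* third byte is not of the form 10xxxxxx */
    UTF8_ERR_OVERLONG,    /* encodes a value below U+0800 */
    UTF8_ERR_SURROGATE    /* encodes a value in U+D800..U+DFFF */
};

/* Check that p points at one well-formed 3-byte UTF-8 sequence. */
enum utf8_err check_3byte(const unsigned char *p, size_t avail)
{
    assert(p != NULL);
    if (avail < 3)                       return UTF8_ERR_TRUNCATED;
    if ((p[0] & 0xF0) != 0xE0)           return UTF8_ERR_BAD_LEAD;
    if ((p[1] & 0xC0) != 0x80)           return UTF8_ERR_BAD_CONT2;
    if ((p[2] & 0xC0) != 0x80)           return UTF8_ERR_BAD_CONT3;
    if (p[0] == 0xE0 && p[1] < 0xA0)     return UTF8_ERR_OVERLONG;
    if (p[0] == 0xED && p[1] >= 0xA0)    return UTF8_ERR_SURROGATE;
    return UTF8_OK;
}
```

Even this reduced version needs six tests per character. Dropping the
overlong and surrogate tests removes only the last two.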

My view on this has always been that the most efficient approach
overall, in the sense of giving the "best" behaviour across all
applications, is for applications to handle non-standard character
strings outside PCRE so that it can work as efficiently as possible.
One possible approach for strings of unknown provenance is to run
without PCRE2_NO_UTF_CHECK and, if any of the "invalid UTF" errors
occur, to convert the string (according to whatever rules you want) into
a valid UTF-8 string and then try again.
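
As one possible version of the "convert" step, the sketch below copies
valid sequences unchanged and re-encodes each invalid byte as the
code point with the same value (that is, it treats the byte as a
one-byte Latin-1 character). The names utf8_seq_len and sanitize_utf8
are hypothetical, and the conversion rule is only one choice among
many. The retry itself is not shown because it needs the PCRE2
library; the assumption is that the caller runs this only after a match
without PCRE2_NO_UTF_CHECK has failed with an invalid-UTF error.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Length of the valid UTF-8 sequence at s, or 0 if it is invalid. */
static size_t utf8_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF)
        return (avail >= 2 && is_cont(s[1])) ? 2 : 0;
    if (c >= 0xE0 && c <= 0xEF) {
        if (avail < 3 || !is_cont(s[1]) || !is_cont(s[2])) return 0;
        if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong */
        if (c == 0xED && s[1] > 0x9F) return 0;   /* surrogate */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (avail < 4 || !is_cont(s[1]) || !is_cont(s[2]) ||
            !is_cont(s[3])) return 0;
        if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong */
        if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
        return 4;
    }
    return 0;   /* 0x80..0xC1 and 0xF5..0xFF can never start a sequence */
}

/* Write a valid UTF-8 copy of src[0..n) to dst and return its length.
   dst must have room for 2*n bytes, the worst case. */
size_t sanitize_utf8(const unsigned char *src, size_t n,
                     unsigned char *dst)
{
    size_t i = 0, o = 0;
    assert(src != NULL && dst != NULL);
    while (i < n) {
        size_t len = utf8_seq_len(src + i, n - i);
        if (len > 0) {
            memcpy(dst + o, src + i, len);
            o += len;
            i += len;
        } else {
            /* An invalid byte is always >= 0x80, so the code point
               with the same value needs exactly two bytes. */
            dst[o++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[o++] = (unsigned char)(0x80 | (src[i] & 0x3F));
            i++;
        }
    }
    return o;
}
```

Because the output can differ in length from the input, any match
offsets from the retry refer to the converted string, so the
application has to map them back itself if it needs positions in the
original bytes.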

> Do you think such a behavior would make sense? Could it be implemented
> without dramatically impacting performance? Julia could use a custom
> patch if this feature is not deemed useful for PCRE.

It certainly makes sense, but I don't think it could be implemented 
without a serious performance hit. If you want to hack and try, note 
that the macros whose names start with GETCHAR (in pcre2_intmodedep.h) 
are used for character handling. In the case of UTF-8 these make use of 
GETUTF8, GETUTF8INC, and GETUTF8LEN, which are defined in 
pcre2_internal.h. However, there are also BACKCHAR, FORWARDCHAR, and 
ACROSSCHAR for moving around. These macros are used for compilation as 
well as for matching by the interpreter functions pcre2_match() and 
pcre2_dfa_match(). I don't know what happens in the JIT matcher, as I do 
not maintain that code, but it too assumes valid UTF-8. To be honest, I 
don't really advise trying to hack in this way. I think it makes more 
sense to fix bad strings externally.
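
For anyone wondering why these macros are cheap, the sketch below shows
the general shape of a decoder that trusts its input. It is not the
text of GETUTF8 or BACKCHAR; decode_unchecked and back_char are
hypothetical functions written for this illustration, on the assumption
that the real macros follow the same principle of doing no validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode the code point at p, assuming valid UTF-8, and store the
   number of bytes used in *len. No validity checks are made. */
uint32_t decode_unchecked(const unsigned char *p, size_t *len)
{
    uint32_t c;
    assert(p != NULL && len != NULL);
    c = p[0];
    if (c < 0x80) { *len = 1; return c; }
    if ((c & 0x20) == 0) {               /* 110xxxxx: 2 bytes */
        *len = 2;
        return ((c & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    }
    if ((c & 0x10) == 0) {               /* 1110xxxx: 3 bytes */
        *len = 3;
        return ((c & 0x0F) << 12) | ((uint32_t)(p[1] & 0x3F) << 6) |
               (uint32_t)(p[2] & 0x3F);
    }
    *len = 4;                            /* taken to be 11110xxx */
    return ((c & 0x07) << 18) | ((uint32_t)(p[1] & 0x3F) << 12) |
           ((uint32_t)(p[2] & 0x3F) << 6) | (uint32_t)(p[3] & 0x3F);
}

/* Move back to the start of the previous character by skipping
   continuation bytes. The caller must ensure one exists before p. */
const unsigned char *back_char(const unsigned char *p)
{
    do { p--; } while ((*p & 0xC0) == 0x80);
    return p;
}
```

With valid input this costs at most three tests per character. With
invalid input it misbehaves: a stray 0xFF byte, for example, is taken as
the start of a 4-byte sequence and the next three bytes are consumed
whatever they are. Every place that decodes or moves over a character
would need guarding to change that.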

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev 
