> In the default case, PCRE does not crash: it returns PCRE_ERROR_BADUTF8.

This result is not useful when the main application needs to analyze the input stream no matter what. To do so, the main application is currently forced to:

1. have its own built-in UTF-8 parser;
2. reparse the input stream with this parser to find invalid UTF-8 characters;
3. make them valid, remembering the changes so that they can be restored later;
4. re-execute pcre_exec() on the now-valid UTF-8 stream;
5. rebuild the output stream, restoring the replaced invalid UTF-8 characters.

The cost of this work is very high.
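
To make the workaround concrete, here is a minimal sketch of the scan, replace and restore steps. Everything in it is an assumption for illustration: the names utf8_seq_len, utf8_sanitize, utf8_restore and patch_t are hypothetical and not part of PCRE, the validity test is a strict RFC 3629 style check that may differ from PCRE's own, and the choice of '?' as the replacement byte is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one replaced byte (not a PCRE type). */
typedef struct {
    size_t offset;        /* position of the byte in the buffer */
    unsigned char byte;   /* original (invalid) value */
} patch_t;

/* Length of the valid UTF-8 sequence starting at p (n >= 1 bytes
   available), or 0 if the bytes at p do not form a valid sequence. */
size_t utf8_seq_len(const unsigned char *p, size_t n)
{
    size_t len, i;
    unsigned long cp;

    if (p[0] < 0x80) return 1;                                   /* ASCII */
    else if ((p[0] & 0xE0) == 0xC0) { len = 2; cp = p[0] & 0x1F; }
    else if ((p[0] & 0xF0) == 0xE0) { len = 3; cp = p[0] & 0x0F; }
    else if ((p[0] & 0xF8) == 0xF0) { len = 4; cp = p[0] & 0x07; }
    else return 0;            /* stray continuation byte or 0xF8..0xFF */

    if (len > n) return 0;    /* sequence is truncated */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (len == 2 && cp < 0x80) return 0;
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF))) return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    return len;
}

/* Put the original bytes back after matching. */
void utf8_restore(unsigned char *buf, const patch_t *patches, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        buf[patches[i].offset] = patches[i].byte;
}

/* Replace every invalid byte with '?' and record it in patches[].
   Returns the number of patches, or (size_t)-1 (with buf left unchanged)
   if more than max_patches bytes would have to be replaced. */
size_t utf8_sanitize(unsigned char *buf, size_t n,
                     patch_t *patches, size_t max_patches)
{
    size_t i = 0, count = 0;

    assert(buf != NULL || n == 0);
    while (i < n) {
        size_t len = utf8_seq_len(buf + i, n - i);
        if (len > 0) { i += len; continue; }   /* valid sequence: skip it */
        if (count == max_patches) {            /* table full: undo, give up */
            utf8_restore(buf, patches, count);
            return (size_t)-1;
        }
        patches[count].offset = i;
        patches[count].byte = buf[i];
        count++;
        buf[i] = '?';                          /* one byte in, one byte out */
        i++;
    }
    return count;
}
```

Because each invalid byte is replaced by exactly one byte, the offsets that pcre_exec() returns for the sanitized buffer refer to the same positions in the original one, so the output can be rebuilt by calling utf8_restore() on the buffer. Even in this reduced form the application has to scan the whole subject once more and keep a patch table, which is the overhead described above.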

Situations where the analysis must succeed regardless of whether the input UTF-8 stream is erroneous are widespread. The errors are accidental in some cases and deliberate in others. At present PCRE cannot offer an efficient solution.

> I think it would penalize the normal running of PCRE too much.

I wrote that this behaviour may be OPTIONAL.

> I also think one could argue about how to interpret a sequence of
> invalid bytes whose values are greater than 127. How many characters does
> such a string encode? For example, suppose the first byte indicates that
> there are three more bytes in a UTF-8 character, two of them are OK, but
> the third one has an invalid value (less than 128, say). Is that a mangled
> UTF-8 character followed by an ASCII byte, or is it four single-byte
> characters?

IMHO a sequence of invalid bytes may be interpreted as one character of type "invalid" per byte.

--
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
