Re: [pcre-dev] Processing of invalid UTF-8 characters

ND Fri, 11 Feb 2011 09:25:44 -0800

> In the default case, PCRE does not crash: it returns PCRE_ERROR_BADUTF8.
This result is not useful when the main application has to analyze the
input stream no matter what. To do so, the application is currently
forced to:
   have its own built-in UTF-8 parser;
   reparse the input stream with that parser to find invalid UTF-8
characters;
   make them valid, remembering the changes so that they can be restored
later;
   re-run pcre_exec() on the now-valid UTF-8 stream;
   rebuild the output stream, restoring the replaced invalid UTF-8
characters.
The cost of all this work is very high.
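
To make that cost concrete, here is a minimal sketch of the "find, replace,
remember, restore" part of the workaround. It is only an illustration under
stated assumptions: the names struct patch, seq_len(), sanitize_utf8() and
restore_utf8() are hypothetical and not part of PCRE, and the choice of '?'
as the placeholder byte is arbitrary.

```c
#include <stddef.h>

/* Hypothetical record of one replaced byte, kept so it can be restored. */
struct patch { size_t pos; unsigned char orig; };

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if
   the bytes at s do not begin one. n is the number of bytes available.
   The ranges follow the well-formed sequences of RFC 3629. */
static size_t seq_len(const unsigned char *s, size_t n)
{
    unsigned char c;
    size_t len, i;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) len = 2;
    else if (c >= 0xE0 && c <= 0xEF) len = 3;
    else if (c >= 0xF0 && c <= 0xF4) len = 4;
    else return 0;              /* stray continuation, 0xC0/0xC1, > 0xF4 */
    if (n < len) return 0;      /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (c == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogates */
    if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
    return len;
}

/* Replace each byte that does not start a well-formed sequence with '?'
   and log it in patches[] (capacity max). *count receives the number of
   log entries written. Returns 0 on success, -1 if the log was too small
   (the entries already written are still valid for restore_utf8). */
int sanitize_utf8(unsigned char *buf, size_t n,
                  struct patch *patches, size_t max, size_t *count)
{
    size_t pos = 0, used = 0;

    while (pos < n) {
        size_t len = seq_len(buf + pos, n - pos);
        if (len == 0) {
            if (used == max) { *count = used; return -1; }
            patches[used].pos = pos;
            patches[used].orig = buf[pos];
            used++;
            buf[pos] = '?';     /* placeholder is a valid ASCII byte */
            len = 1;
        }
        pos += len;
    }
    *count = used;
    return 0;
}

/* Put the original bytes back after matching is done. */
void restore_utf8(unsigned char *buf, const struct patch *patches,
                  size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        buf[patches[i].pos] = patches[i].orig;
}
```

Because each invalid byte is replaced by exactly one byte, the offsets that
pcre_exec() returns for the sanitized buffer still apply to the original
one. Even so, the application pays for a full extra pass over the subject
plus the log, and a pattern that matches '?' can now match a placeholder by
accident.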


Situations where the analysis must succeed whether or not the input UTF-8
stream contains errors are widespread. In some cases the errors are
accidental, in others deliberate. At present PCRE cannot offer an
efficient solution.

> I think it would penalize the normal running of PCRE too much.
As I wrote, this behaviour could be OPTIONAL.

> I also think one could argue about how to interpret a sequence of
> invalid bytes whose values are greater than 127. How many characters does
> such a string encode? For example, suppose the first byte indicates that
> there are three more bytes in a UTF-8 character, two of them are OK, but
> the third one has an invalid value (less than 128, say). Is that a mangled
> UTF-8 character followed by an ASCII byte, or is it four single-byte
> characters?
IMHO a sequence of invalid bytes may be interpreted as one character of
type "invalid" per byte.
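
A minimal sketch of that rule, to show that it gives an unambiguous answer
for the example above. This is an illustration only, not PCRE code: the
function names utf8_valid_len() and count_chars_lenient() are hypothetical,
and "valid" here is assumed to mean well-formed according to RFC 3629.

```c
#include <stddef.h>

/* Length (1..4) of the well-formed UTF-8 sequence starting at s, or 0 if
   the bytes at s do not begin one. n is the number of bytes available. */
static size_t utf8_valid_len(const unsigned char *s, size_t n)
{
    unsigned char c;
    size_t len, i;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) len = 2;
    else if (c >= 0xE0 && c <= 0xEF) len = 3;
    else if (c >= 0xF0 && c <= 0xF4) len = 4;
    else return 0;              /* stray continuation, 0xC0/0xC1, > 0xF4 */
    if (n < len) return 0;      /* truncated sequence */
    for (i = 1; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    if (c == 0xE0 && s[1] < 0xA0) return 0;   /* overlong 3-byte form */
    if (c == 0xED && s[1] > 0x9F) return 0;   /* UTF-16 surrogates */
    if (c == 0xF0 && s[1] < 0x90) return 0;   /* overlong 4-byte form */
    if (c == 0xF4 && s[1] > 0x8F) return 0;   /* above U+10FFFF */
    return len;
}

/* Count characters in s[0..n). A well-formed sequence is one character;
   any byte that does not begin a well-formed sequence is one character of
   type "invalid", and scanning resumes at the very next byte.
   If invalid is not NULL it receives the number of invalid characters. */
size_t count_chars_lenient(const unsigned char *s, size_t n,
                           size_t *invalid)
{
    size_t pos = 0, chars = 0, bad = 0;

    while (pos < n) {
        size_t len = utf8_valid_len(s + pos, n - pos);
        if (len == 0) {         /* one "invalid" character per byte */
            bad++;
            len = 1;
        }
        chars++;
        pos += len;
    }
    if (invalid != NULL) *invalid = bad;
    return chars;
}
```

For the bytes F0 90 80 41 (a four-byte lead, two good continuation bytes,
then an ASCII 'A'), this counts four characters: three of type "invalid"
followed by 'A'. So under this rule the answer is four characters, but the
last one is an ordinary ASCII character, not an invalid one.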

-- 
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
