> This output is non-useful when the main application needs to analyze
> input stream no matter what. To do this the main application now is
> forced to:
>   have its own built-in UTF8-parser;
>   reparse the input stream by this built-in parser to find invalid
>   UTF-8 characters;
>   make them valid and remember changes to have possibility to
>   restore them later;
>   reexecute pcre_exec() with valid UTF-8 stream;
>   rebuild output stream with restoring of replaced invalid UTF-8
>   characters.
> And cost of this work is very high.

And just why on Earth do you feel it's up to PCRE to carry the burden of processing nonsensical input strings? If you happen to have to use, say, SQLite or some other DB engine, will you ask their devs to carry the very same (useless) burden as well? And the OS as well? And what else?

That simply doesn't make any sense. Whatever the application is, it's its responsibility to conform to APIs as they are defined, or to place an intermediate layer in front of them to that effect. PCRE is asking for either valid UTF-8 or random strings of bytes and offers an option to catch invalid input in both cases. IMHO this is very permissive already. I would find it natural for a library like PCRE to specify undefined behavior (crash included) in case of invalid UTF-8 input. Checking UTF-8 conformance and possibly correcting things is just not its business.

It's the same as validation of user input, for example personal data entered at a website. The data entry module needs to perform validation once and reject the input until correct data is entered. It would be plain crazy to accept, for instance, random text as a ZIP code, leaving every deeper application to check and correct the wrong data on its own each time it needs to process a ZIP code.

Write your own random-to-UTF8 code in a way which suits your needs and let PCRE process pure UTF-8. That you can easily find invalid UTF-8 sources is no excuse to rewrite gazillions of libraries with redundant, useless code in order to cope with them (with doubtful results anyway).

I wouldn't like to see the valuable time and energy of benevolent PCRE devs wasted on making PCRE able to process random binary data. If you need that, I'm sure you can write such a module yourself.

--
## List details at http://lists.exim.org/mailman/listinfo/pcre-dev
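
P.S. For what it's worth, such an intermediate layer need not be large. Below is a minimal sketch in Python (chosen only for brevity, a real layer in front of PCRE would more likely be C). The names sanitize_utf8, restore and REPLACEMENT are hypothetical, and replacing each invalid sequence with U+FFFD is just one possible policy, not something PCRE prescribes. It re-decodes the remaining tail after every error, so it is quadratic in the worst case and meant as an illustration only.

```python
# Assumption: each invalid byte sequence is replaced by U+FFFD (3 bytes in UTF-8).
REPLACEMENT = "\ufffd".encode("utf-8")


def sanitize_utf8(raw: bytes):
    """Return (clean, patches).

    clean   -- valid UTF-8, safe to hand to a library that requires it
    patches -- list of (offset in clean, original invalid bytes), enough
               to undo every replacement later
    """
    out = bytearray()
    patches = []
    pos = 0
    while pos < len(raw):
        try:
            # Succeeds only if the whole remaining tail is valid UTF-8.
            raw[pos:].decode("utf-8")
            out += raw[pos:]
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice being decoded.
            start, end = pos + err.start, pos + err.end
            out += raw[pos:start]                 # copy the valid prefix
            patches.append((len(out), raw[start:end]))
            out += REPLACEMENT                    # stand-in for the bad bytes
            pos = end
    return bytes(out), patches


def restore(clean: bytes, patches):
    """Undo sanitize_utf8 on an unmodified clean buffer."""
    out = bytearray()
    pos = 0
    for offset, original in patches:
        out += clean[pos:offset]
        out += original
        pos = offset + len(REPLACEMENT)
    out += clean[pos:]
    return bytes(out)
```

Because the patches are keyed by offset, a genuine U+FFFD already present in the input is left alone on restore. If the buffer is edited between the two calls (a substitution, say), the offsets have to be adjusted accordingly, which this sketch does not attempt.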
