On 5/7/2011 11:43 AM, Philip Hazel wrote:
On Wed, 9 Feb 2011, ND wrote:

Putting something other than a match start into the offsets vector rather
breaks the philosophy of PCRE.

It's useful to give the main application information about the position
when PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8 occurs. It need not
necessarily be returned in the offsets vector; it could go in another
memory block. This information can help the main application analyze and
fix an erroneous stream.
OK, I've changed my mind and decided that the offsets vector can be
used. I also decided that if this was happening, I should do the job
*properly*. I have just committed a patch which behaves like this:

If the size of the ovector is at least 2, then, for PCRE_ERROR_BADUTF8
or PCRE_ERROR_SHORTUTF8,

   ovector[0] is set to the byte offset of the first byte of the invalid
              character
   ovector[1] is set to a reason code

There are 21 different reason codes, documented in the pcreapi man page.
They include codes for "short by n bytes" (where n is 1-5), so in fact
PCRE_ERROR_SHORTUTF8 is no longer needed. However, I have not removed
it because that would break backwards compatibility.

Philip

I'm somewhat concerned about the possible scope of incompatibility of this change. Might some existing applications blindly treat an execution that filled the offset vector as a match? Also, might some applications collect offset vectors while doing multiple match operations and subsequently act on the accumulated vectors independently of the executions? Keep in mind that for applications that expose regular expressions to end users via PCRE, end users can activate UTF-8 without notifying the host application by including (*UTF8) at the beginning of the pattern. Up until now such users would have seen a BADUTF8 behave like a non-match.
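
For illustration only, here is a rough sketch of how a host application might guard against that: it looks at the return code first and reads ovector[0] and ovector[1] as an offset and reason code only for the two UTF-8 errors. The names exec_kind, exec_result and classify_exec are hypothetical, not part of PCRE. The #define lines are stand-ins so the sketch compiles without pcre.h; the values are what I believe pcre.h uses, but treat them as assumptions and include the real header in practice.

```c
#include <stddef.h>

/* Stand-ins so this compiles without pcre.h (assumed values; with the
   real header included first, these definitions are skipped). */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH   (-1)
#endif
#ifndef PCRE_ERROR_BADUTF8
#define PCRE_ERROR_BADUTF8   (-10)
#endif
#ifndef PCRE_ERROR_SHORTUTF8
#define PCRE_ERROR_SHORTUTF8 (-25)
#endif

/* Hypothetical application-side view of one pcre_exec() call. */
typedef enum {
    EXEC_MATCH,
    EXEC_NO_MATCH,
    EXEC_BAD_UTF8,
    EXEC_OTHER_ERROR
} exec_kind;

typedef struct {
    exec_kind kind;
    int has_detail;  /* 1 only if offset/reason below were filled in */
    int bad_offset;  /* byte offset of the invalid character */
    int reason;      /* reason code, see the pcreapi man page */
} exec_result;

/* Classify the return code before touching the ovector, so that a
   vector filled in on a UTF-8 error is never mistaken for a match. */
exec_result classify_exec(int rc, const int *ovector, int ovecsize)
{
    exec_result r = { EXEC_OTHER_ERROR, 0, -1, 0 };

    if (rc >= 0) {                      /* 0 means ovector was too small */
        r.kind = EXEC_MATCH;
    } else if (rc == PCRE_ERROR_NOMATCH) {
        r.kind = EXEC_NO_MATCH;
    } else if (rc == PCRE_ERROR_BADUTF8 || rc == PCRE_ERROR_SHORTUTF8) {
        r.kind = EXEC_BAD_UTF8;
        /* The extra information is only present if ovecsize >= 2. */
        if (ovector != NULL && ovecsize >= 2) {
            r.has_detail = 1;
            r.bad_offset = ovector[0];
            r.reason = ovector[1];
        }
    }
    return r;
}
```

An application that calls something like classify_exec(rc, ovector, ovecsize) right after pcre_exec(), and stores the result rather than the raw vector, would not be affected by the ovector contents changing on error. An application that keeps raw vectors and inspects them later without the return code is the case I am worried about.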

Regards,
Sheri


--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
