Re: [pcre-dev] First slot of the offset vector have a wrong value when PCRE_ERROR_SHORTUTF8 rises

Sheri Mon, 09 May 2011 07:00:10 -0700

On 5/7/2011 11:43 AM, Philip Hazel wrote:

On Wed, 9 Feb 2011, ND wrote:

Putting something other than a match start into the offsets vector rather
breaks the philosophy of PCRE.

It's useful to give the main application information about the position
when PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8 occurs. It need not
necessarily be returned in the offsets vector; it could be in another
memory block. This information can help the main application analyze
and fix an erroneous stream.

OK, I've changed my mind and decided that the offsets vector can be
used. I also decided that if this was happening, I should do the job
*properly*. I have just committed a patch which behaves like this:

If the size of the ovector is at least 2, then, for PCRE_ERROR_BADUTF8
or PCRE_ERROR_SHORTUTF8,

   ovector[0] is set to the byte offset of the first byte of the invalid
              character
   ovector[1] is set to a reason code

There are 21 different reason codes, documented in the pcreapi man page.
They include codes for "short by n bytes" (where n is 1-5), so in fact
PCRE_ERROR_SHORTUTF8 is no longer needed. However, I have not removed
it because that would break backwards compatibility.

Philip
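
For illustration, the behaviour described above could be consumed by a
caller roughly as follows. This is only a sketch: interpret_exec() and
the exec_result struct are hypothetical names, not part of PCRE, and the
two error values are written out as I believe they appear in the PCRE 8.x
pcre.h (guarded so that the real header takes precedence if included).
The return code is checked first and the ovector is read as an error
report only for the two UTF-8 errors.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values from PCRE 8.x pcre.h; the guards defer to the real
   header when it has already been included. */
#ifndef PCRE_ERROR_BADUTF8
#define PCRE_ERROR_BADUTF8   (-10)
#endif
#ifndef PCRE_ERROR_SHORTUTF8
#define PCRE_ERROR_SHORTUTF8 (-25)
#endif

/* Hypothetical result type, for illustration only. */
typedef struct {
    int is_match;         /* 1 if pcre_exec() reported a match */
    int has_utf8_detail;  /* 1 if bad_offset and reason are valid */
    int bad_offset;       /* byte offset of the invalid character */
    int reason;           /* reason code (see the pcreapi man page) */
} exec_result;

/* Interpret the return code of pcre_exec() together with its ovector.
   rc is the value returned by pcre_exec(); ovecsize is the number of
   ints in ovector, as passed to pcre_exec(). */
static exec_result interpret_exec(int rc, const int *ovector, int ovecsize)
{
    exec_result r = {0, 0, -1, 0};

    if (rc >= 0) {
        /* A match; rc == 0 means the ovector was too small to hold
           all the captured substrings. */
        r.is_match = 1;
        return r;
    }

    if ((rc == PCRE_ERROR_BADUTF8 || rc == PCRE_ERROR_SHORTUTF8) &&
        ovecsize >= 2) {
        assert(ovector != NULL);
        /* Here the ovector holds error details, not match offsets. */
        r.has_utf8_detail = 1;
        r.bad_offset = ovector[0];
        r.reason = ovector[1];
    }

    /* Any other negative rc: no match or another error; the ovector
       contents are not interpreted. */
    return r;
}
```

A caller would pass the pcre_exec() return value and the same ovector and
size it gave to pcre_exec(), and use the ovector as match offsets only
when is_match is set.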

I'm somewhat concerned about the possible scope of incompatibility of
this change. Might some existing applications blindly treat an execution
that filled the offset vector as a match? Also, might some applications
collect offset vectors while doing multiple match operations and
subsequently act on the accumulated vectors independently of the
executions? Keep in mind that for users of applications that extend
regular expressions to end users via PCRE, end users can activate UTF-8
without notifying the host application by including (*UTF8) at the
beginning of the pattern. Up till now such users would have seen a
BADUTF8 behave like a non-match.


Regards,
Sheri


--

## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
