Re: [pcre-dev] Partial match at end of subject

ph10 Wed, 17 Jul 2019 09:57:33 -0700

On Mon, 15 Jul 2019, ND via Pcre-dev wrote:

> This option is added ten years ago EXACTLY for multisegment matching.
> Please read a very first proposal post and thread about it. Thats how
> partial_hard is born:
> https://lists.exim.org/lurker/message/20090524.142622.cb850f3a.en.html


Your memory is much better than mine. :-) However, this does mean that 
partial hard matching has been the way it is for 10 years. For this 
reason alone, I am very reluctant to make a change, because somebody, 
somewhere is probably relying on the current behaviour.

> With multisegment matching we want that matching result be exactly the same as
> if we match a whole subject at once!

That is a good point, but I'm not sure it can always be achieved, and I 
am not happy with your example:

/c*+(?<=[bc])/aftertext
abc
 0: c
 0+ 
ab\=ph
 0: 
 0+ 

Your pattern encodes the requirement "find zero or more c's preceded by
b or c". When the input is "ab" that is exactly what it has done, which
is why it shows a complete match. It would make no sense never to give a 
full match when \=ph is used. This is surely correct:

/abc/
xyzabcdef\=ph
 0: abc

I'm trying to understand what the conditions are for a change in
behaviour. The issue is what should happen when a possible "partial
match" situation occurs, but no characters have been inspected. This 
happens in this example:

/c*+(?<=[bc])/aftertext
ab\=ph
 0: 
 0+ 
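
As a rough cross-check of the complete-match results (not of partial matching itself), the same comparison can be sketched with Python's standard `re` module. This is only an illustration under assumptions: Python's `re` has no partial matching, and the possessive `c*+` is written as plain `c*` here because older Python versions lack possessive quantifiers; for these two subjects the outcome is the same. The helper name `first_match` is made up for the example.

```python
import re

# Assumption: plain c* stands in for the possessive c*+ of the original
# pattern; for the subjects "abc" and "ab" both forms behave identically.
PATTERN = re.compile(r"c*(?<=[bc])")

def first_match(subject):
    """Return (matched text, start offset) of the first match, or None."""
    m = PATTERN.search(subject)
    return (m.group(0), m.start()) if m else None

whole = first_match("abc")    # whole subject matched at once
segment = first_match("ab")   # first segment only, "c" not yet seen

print(whole)
print(segment)
```

The whole subject yields the one-character match "c", while the first segment alone yields an empty match at the end of "ab", which is the discrepancy under discussion.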

If the subject is "abc" (again with \=ph) you get a partial match, because 
one character has been inspected. So what should happen when a possible 
partial match has not inspected any characters? These are the choices:

1. Backtrack, maybe find a complete match (happens now).
2. Give an immediate "no match" - wrong for the test above.
3. Return a partial match of no inspected characters - also wrong.

The last one is wrong in general because it means that all unanchored 
patterns would give partial matches. For example:

/abc/
xyz\=ph

I don't think it is right for that to yield a partial match. (I'm 
assuming no start-up optimizations.)
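
To make the difference between the choices concrete, here is a toy model in Python. It is not PCRE2 code and handles only literal patterns; the function name `literal_hard_partial` and the flag `allow_empty_partial` are invented for this sketch. With the flag off it behaves like choice 1 (no partial match unless at least one character has been inspected); with the flag on it behaves like choice 3.

```python
def literal_hard_partial(needle, subject, allow_empty_partial=False):
    """Toy unanchored search for a literal with hard-partial semantics.

    Returns ("match", text), ("partial", text) or ("no match", "").
    """
    for start in range(len(subject) + 1):
        matched = 0
        # Compare character by character from this start position.
        while (matched < len(needle)
               and start + matched < len(subject)
               and subject[start + matched] == needle[matched]):
            matched += 1
        if matched == len(needle):
            return ("match", subject[start:start + matched])
        # The attempt stopped because the subject ran out, not on a mismatch.
        hit_end = start + matched == len(subject)
        if hit_end and (matched > 0 or allow_empty_partial):
            return ("partial", subject[start:])
    return ("no match", "")

print(literal_hard_partial("abc", "xyzabcdef"))
print(literal_hard_partial("abc", "xyzab"))
print(literal_hard_partial("abc", "xyz"))
print(literal_hard_partial("abc", "xyz", allow_empty_partial=True))
```

In this model, turning the flag on makes "xyz" report an empty partial match for /abc/, and the same would happen for any unanchored literal, which is the objection to choice 3.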

> I'll be very happy if you try to reconsider your approach to PCRE_PARTIAL_HARD
> and totally associate this option with multisegment matching purposes. Because
> it's what it originally intended for.
> But tell please if you know about another practical use of this option that
> force you to change it's original aims.

The problem is that I don't know what people use it for. It's been 
around for 10 years, and all my previous experience is that people use 
software in all kinds of strange ways not foreseen by its creator.

I think the only compromise here is perhaps to add a new option, to give 
changed behaviour, but I do not know what the change should be.

Aha! I think I have spotted how to distinguish between /c*+(?<=[bc])/
and /abc/. The first one will have a non-zero max lookbehind. So, 
something like this is needed:

Invent a new option called PCRE2_NOTEOS. This would do two things:

(1) $, \z, and \Z would never match. This would deal with the /\z/ 
example.

(2) If a hard partial match is possible, but no characters have been 
inspected, AND max lookbehind is non-zero, return PCRE2_ERROR_PARTIAL. 
Otherwise backtrack as now.

I am not sure that I like that; it seems messy and hard to explain, but 
at the moment it's the best I can come up with. More thought is needed, 
especially to see if (2) is all that is required, and if it would 
adversely affect other patterns.
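
Rule (2) on its own can be written down as a small decision function. This Python sketch only restates the proposal; PCRE2_NOTEOS does not exist, and the function and parameter names are hypothetical. The max lookbehind values used in the example (1 for /c*+(?<=[bc])/, 0 for /abc/) follow from the patterns themselves.

```python
def empty_partial_outcome(noteos, max_lookbehind):
    """What to do when a hard partial match is possible but no
    characters have been inspected (proposed rule (2), hypothetical)."""
    if noteos and max_lookbehind > 0:
        return "partial"      # report a partial match to the caller
    return "backtrack"        # current behaviour: keep looking

# /c*+(?<=[bc])/ has a one-character lookbehind; /abc/ has none.
print(empty_partial_outcome(noteos=True, max_lookbehind=1))
print(empty_partial_outcome(noteos=True, max_lookbehind=0))
print(empty_partial_outcome(noteos=False, max_lookbehind=1))
```

Without the new option the outcome is always to backtrack, so existing users of hard partial matching would see no change.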

Philip

-- 
Philip Hazel

-- 
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
