Re: [Python-Dev] Zero-width matching in regexes

2017-12-13 Thread Serhiy Storchaka

05.12.17 01:21, MRAB пише:
I've finally come to a conclusion as to what the "correct" behaviour of 
zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width match""".


If it's about to return a zero-width match that's joined to a previous 
zero-width match, then backtrack and keep on looking for a match.


Example:

 >>> print([m.span() for m in re.finditer(r'|.', 'a')])
[(0, 0), (0, 1), (1, 1)]

re.findall, re.split and re.sub should work accordingly.

If re.finditer finds n matches, then re.split should return a list of 
n+1 strings and re.sub should make n replacements (excepting maxsplit, 
etc.).


We now have a good opportunity of changing a long standing behavior of 
re.sub(). Currently empty matches are prohibited if adjacent to a 
previous match. For consistency with re.finditer() and re.findall(), 
with regex.sub() with VERSION1 flag, and with Perl, PCRE and other 
engines they should be prohibited only if adjacent to a previous *empty* 
match. Currently re.sub('x*', '-', 'abxc') returns '-a-b-c-', but will 
return '-a-b--c-' if change the behavior.


This behavior already was unintentionally temporary changed between 2.1 
and 2.2, when the underlying implementation of re was changed from PCRE 
to SRE. But the former behavior was quickly restored (see 
https://bugs.python.org/issue462270). Ironically the behavior of the 
current PCRE is different.


Possible options:

1. Change the behavior right now.
2. Start emitting a FutureWarning and change the behavior in future version.
3. Keep the status quo forever.

We need to make a decision right now since in the first two cases we 
should to change the behavior of re.split() right now. Its behavior is 
changed in 3.7 in any case, and it is better to change the behavior once 
than break the behavior in two different releases.


The changed detail is so subtle that no regular expressions in the 
stdlib and tests are affected, except the special purposed test added 
for guarding the current behavior.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-06 Thread Serhiy Storchaka

06.12.17 15:37, Paul Moore пише:

Behaviour (1) means that we'd get

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION1)
'xx xx'

(because \w* matches the empty string after each word, as well as each
word itself). I just tested in Perl, and that is indeed what happens
there as well.


Yes, because in this case you need to use `\w+`, not `\w*`.

No CPython tests will be failed if change re.sub() to behaviour (2) 
except just added in 3.7 tests and the one test specially purposed to 
guard the old behavior. But I don't know how much third party code will 
be broken if made this change.



On that basis, I have to say that I find behaviour (2) more intuitive
and (arguably) "correct":

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION0)
'x x'
>>> re.sub(r'\w*', 'x', 'hello world')
'x x'


The actual behavior of re.sub() and regex.sub() in the VERSION0 mode was 
a weird behavior (4).


>>> regex.sub(r'(\b|\w+)', r'[\1]', 'hello world', flags=regex.VERSION0)
'[]h[ello] []w[orld]'
>>> regex.sub(r'(\b|\w+)', r'[\1]', 'hello world', flags=regex.VERSION1)
'[][hello][] [][world][]'
>>> re.sub(r'(\b|\w+)', r'[\1]', 'hello world')  # 3.6, behavior (4)
'[]h[ello] []w[orld]'
>>> re.sub(r'(\b|\w+)', r'[\1]', 'hello world')  # 3.7, behavior (2)
'[][hello] [][world]'

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-06 Thread Paul Moore
On 6 December 2017 at 13:13, Serhiy Storchaka  wrote:
> 05.12.17 22:26, Terry Reedy пише:
>>
>> On 12/4/2017 6:21 PM, MRAB wrote:
>>>
>>> I've finally come to a conclusion as to what the "correct" behaviour of
>>> zero-width matches should be: """always return the first match, but never a
>>> zero-width match that is joined to a previous zero-width match""".
>>
>>
>> Is this different from current re or regex?
>
>
> Partially. There are different ways of handling the problem of repeated
> zero-width searching.
>
> 1. The one formulated by Matthew. This is the behavior of findall() and
> finditer() in regex in both VERSION0 and VERSION1 modes, sub() in regex in
> the VERSION1 mode, and findall() and finditer() in re since 3.7.
>
> 2. Prohibit a zero-width match that is joined to a previous match
> (independent from its width). This is the behavior of sub() in re and in
> regex in the VERSION0 mode, and split() in regex in the VERSION1 mode. This
> is the only correctly documented and explicitly tested behavior in re.
>
> 3. Prohibit a zero-width match (always). This is the behavior of split() in
> re in 3.4 and older (deprecated since 3.5) and in regex in VERSION0 mode.
>
> 4. Exclude the character following a zero-width match from following
> matches. This is the behavior of findall() and finditer() in 3.6 and older.
>
> The case 4 is definitely incorrect. It leads to excluding characters from
> matching. re.findall(r'^|\w+', 'two words') returns ['', 'wo', 'words'].
>
> The case 3 is pretty useless. It disallow splitting on useful zero-width
> patterns like `\b` and makes `\s*` just equal to `\s+`.
>
> The difference between cases 1 and 2 is subtle. The case 1 looks more
> logical and matches the behavior of Perl and PCRE, but the case 2 is
> explicitly documented and tested. This behavior is kept for compatibility
> with an ancient re implementation.

Behaviour (1) means that we'd get

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION1)
'xx xx'

(because \w* matches the empty string after each word, as well as each
word itself). I just tested in Perl, and that is indeed what happens
there as well.

On that basis, I have to say that I find behaviour (2) more intuitive
and (arguably) "correct":

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION0)
'x x'
>>> re.sub(r'\w*', 'x', 'hello world')
'x x'

Paul
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-06 Thread Serhiy Storchaka

05.12.17 22:26, Terry Reedy пише:

On 12/4/2017 6:21 PM, MRAB wrote:
I've finally come to a conclusion as to what the "correct" behaviour 
of zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width 
match""".


Is this different from current re or regex?


Partially. There are different ways of handling the problem of repeated 
zero-width searching.


1. The one formulated by Matthew. This is the behavior of findall() and 
finditer() in regex in both VERSION0 and VERSION1 modes, sub() in regex 
in the VERSION1 mode, and findall() and finditer() in re since 3.7.


2. Prohibit a zero-width match that is joined to a previous match 
(independent from its width). This is the behavior of sub() in re and in 
regex in the VERSION0 mode, and split() in regex in the VERSION1 mode. 
This is the only correctly documented and explicitly tested behavior in re.


3. Prohibit a zero-width match (always). This is the behavior of split() 
in re in 3.4 and older (deprecated since 3.5) and in regex in VERSION0 mode.


4. Exclude the character following a zero-width match from following 
matches. This is the behavior of findall() and finditer() in 3.6 and older.


The case 4 is definitely incorrect. It leads to excluding characters 
from matching. re.findall(r'^|\w+', 'two words') returns ['', 'wo', 
'words'].


The case 3 is pretty useless. It disallow splitting on useful zero-width 
patterns like `\b` and makes `\s*` just equal to `\s+`.


The difference between cases 1 and 2 is subtle. The case 1 looks more 
logical and matches the behavior of Perl and PCRE, but the case 2 is 
explicitly documented and tested. This behavior is kept for 
compatibility with an ancient re implementation.


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-06 Thread Serhiy Storchaka

05.12.17 01:21, MRAB пише:
I've finally come to a conclusion as to what the "correct" behaviour of 
zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width match""".


If it's about to return a zero-width match that's joined to a previous 
zero-width match, then backtrack and keep on looking for a match.


Isn't this how sub(), findall() and finditer() work in regex with 
VERSION1? I agree that this behavior looks most logical and self-consistent.


Unfortunately the different behavior of re.sub() is documented explicitly:

"Empty matches for the pattern are replaced only when not adjacent to a 
previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'."


And there a special purposed test for this. One time the behavior was 
changed when the re implementation was changed from pre to sre, but the 
older behavior was restored. [1] [2]


[1] https://bugs.python.org/issue462270
[2] 
https://github.com/python/cpython/commit/21009b9c6fc40b25fcb30ee60d6108f235733e40


___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-05 Thread MRAB

On 2017-12-05 20:26, Terry Reedy wrote:

On 12/4/2017 6:21 PM, MRAB wrote:
I've finally come to a conclusion as to what the "correct" behaviour of 
zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width match""".


Is this different from current re or regex?


Sometimes yes.

It's difficult to know how a zero-width match should be handled.

The normal way that, say, findall works is that it searches for a match 
and then continues from where it left off.


If at any point it matched an empty string, it would stall because the 
end of the match is also the start of the match.


How should that be handled?

The old buggy behaviour of the re module was to just advance by one 
character after a zero-width match, which can result in a character 
being skipped and going missing.


A solution is to prohibit a zero-width match that's joined to the 
previous match, but I'm not convinced that that's correct.


If it's about to return a zero-width match that's joined to a previous 
zero-width match, then backtrack and keep on looking for a match.


Example:

 >>> print([m.span() for m in re.finditer(r'|.', 'a')])
[(0, 0), (0, 1), (1, 1)]

re.findall, re.split and re.sub should work accordingly.

If re.finditer finds n matches, then re.split should return a list of 
n+1 strings and re.sub should make n replacements (excepting maxsplit, 
etc.).



___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Zero-width matching in regexes

2017-12-05 Thread Terry Reedy

On 12/4/2017 6:21 PM, MRAB wrote:
I've finally come to a conclusion as to what the "correct" behaviour of 
zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width match""".


Is this different from current re or regex?

If it's about to return a zero-width match that's joined to a previous 
zero-width match, then backtrack and keep on looking for a match.


Example:

 >>> print([m.span() for m in re.finditer(r'|.', 'a')])
[(0, 0), (0, 1), (1, 1)]

re.findall, re.split and re.sub should work accordingly.

If re.finditer finds n matches, then re.split should return a list of 
n+1 strings and re.sub should make n replacements (excepting maxsplit, 
etc.).



--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


[Python-Dev] Zero-width matching in regexes

2017-12-04 Thread MRAB
I've finally come to a conclusion as to what the "correct" behaviour of 
zero-width matches should be: """always return the first match, but 
never a zero-width match that is joined to a previous zero-width match""".


If it's about to return a zero-width match that's joined to a previous 
zero-width match, then backtrack and keep on looking for a match.


Example:

>>> print([m.span() for m in re.finditer(r'|.', 'a')])
[(0, 0), (0, 1), (1, 1)]

re.findall, re.split and re.sub should work accordingly.

If re.finditer finds n matches, then re.split should return a list of 
n+1 strings and re.sub should make n replacements (excepting maxsplit, 
etc.).

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com