Re: [Python-Dev] Regular expressions: splitting on zero-width patterns

MRAB Tue, 28 Nov 2017 12:48:50 -0800

On 2017-11-28 20:04, Serhiy Storchaka wrote:

The two largest problems in the re module [1] are splitting on zero-width
patterns and complete and correct support of the Unicode standard. These
problems are solved in regex [2]. regex has many other features, but they
are less important.


I want to describe the problem of splitting on zero-width patterns. It
was already discussed on Python-Dev 13 years ago [3] and maybe later.
See also issues: [4], [5], [6], [7], [8].

In short, it doesn't work. Splitting on the pattern r'\b' doesn't split
the text at word boundaries, and splitting on the pattern
r'\s+|(?<=-)' will split the text on whitespace, but will not split
words with hyphens as expected.

In Python 3.4 and earlier:

  >>> re.split(r'\b', 'Self-Defence Class')
['Self-Defence Class']
  >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-Defence', 'Class']
  >>> re.split(r'\s*', 'Self-Defence Class')
['Self-Defence', 'Class']

Note that splitting on r'\s*' (0 or more whitespace characters) actually
split on r'\s+' (1 or more whitespace characters). Splitting on patterns
that can only match the empty string (like r'\b' or r'(?<=-)') never
worked, while
splitting

Since Python 3.5, splitting on a pattern that can only match the empty
string raises a ValueError (this never worked), and splitting on a
pattern that can match the empty string, but not only the empty string,
emits a FutureWarning. This gave developers time to replace their
r'\s*' patterns with r'\s+', as they should be.
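The migration mentioned above can be sketched as follows. This is an illustrative example, not code from the patch, and the variable names are made up here. Because r'\s+' cannot match the empty string, its result should be the same before and after the change.

```python
import re

text = 'Self-Defence Class'

# r'\s+' requires at least one whitespace character, so it can never
# match the empty string and is unaffected by zero-width splitting.
words = re.split(r'\s+', text)
print(words)  # ['Self-Defence', 'Class']
```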

Now I have created a final patch [9] that makes re.split() split on
zero-width patterns.

  >>> re.split(r'\b', 'Self-Defence Class')
['', 'Self', '-', 'Defence', ' ', 'Class', '']
  >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-', 'Defence', 'Class']
  >>> re.split(r'\s*', 'Self-Defence Class')
['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C',
'l', 'a', 's', 's', '']

In the latter case the result differs greatly from the previous result,
and this is likely not what the author wanted. But users have had two
Python releases to fix their code. FutureWarning is not silenced by
default.

Because these patterns produced errors or warnings in the two most
recent releases, we don't need an additional parameter for compatibility.
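A minimal sketch of working with the new re.split() results, assuming an interpreter that includes this change (Python 3.7 or later). The filtering step and the names `pieces` and `tokens` are illustrative only and are not part of the patch.

```python
import re

text = 'Self-Defence Class'

# Zero-width split at every word boundary. The empty strings at both
# ends come from the boundaries at the very start and end of the text.
pieces = re.split(r'\b', text)
print(pieces)  # ['', 'Self', '-', 'Defence', ' ', 'Class', '']

# Hypothetical post-processing step: drop the empty strings.
tokens = [piece for piece in pieces if piece]
print(tokens)  # ['Self', '-', 'Defence', ' ', 'Class']
```

Whether to keep or drop the empty strings depends on the caller; they are kept by re.split() so that joining the pieces reproduces the original text.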

But the problem was not just with re.split(). Other functions also
did not work well with patterns that can match the empty string.

  >>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'elf', 'Defence', 'Class']
  >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1,
4), match='elf'>, <re.Match object; span=(5, 12), match='Defence'>,
<re.Match object; span=(13, 18), match='Class'>]
  >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<>S<elf>-<Defence> <Class>'

After matching the empty string, the following character is skipped
and is not included in the next match. My patch fixes these
functions too.

  >>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'Self', 'Defence', 'Class']
  >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0,
4), match='Self'>, <re.Match object; span=(5, 12), match='Defence'>,
<re.Match object; span=(13, 18), match='Class'>]
  >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<><Self>-<Defence> <Class>'
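The fixed behavior can be checked by looking at match spans rather than at Match reprs. This is a small sketch, assuming Python 3.7 or later; the names `spans` and `marked` are made up for illustration.

```python
import re

text = 'Self-Defence Class'

# The empty match at position 0 is followed by a non-empty match that
# also starts at position 0, so the leading 'S' is no longer skipped.
spans = [m.span() for m in re.finditer(r'^|\w+', text)]
print(spans)  # [(0, 0), (0, 4), (5, 12), (13, 18)]

# re.sub() sees the same matches: an empty one, then 'Self'.
marked = re.sub(r'(^|\w+)', r'<\1>', text)
print(marked)  # <><Self>-<Defence> <Class>
```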

I think this change doesn't need preliminary warnings, because it changes
the behavior of more rarely used patterns. No re tests were broken;
I needed to add new tests to detect the behavior change.

But there is one spoonful of tar in a barrel of honey. I didn't expect
this, but this change has broken a pattern used with re.sub() in the
doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with
r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think
this is an avoidable breakage.
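To illustrate what the replacement pattern does, here is a small sketch. The sample text is invented for this example and is not taken from doctest. The class [^\S\n] matches any whitespace character except a newline, and +? requires at least one character, so the pattern cannot match the empty string.

```python
import re

# A whitespace-only line between two lines of text.
sample = 'first\n   \t\nsecond\n'

# Strip the contents of whitespace-only lines, leaving newlines alone.
cleaned = re.sub(r'(?m)^[^\S\n]+?$', '', sample)
print(repr(cleaned))  # 'first\n\nsecond\n'
```

Since this pattern never produces an empty match, its result does not depend on how empty matches are handled.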

The new behavior of re.split() matches the behavior of regex.split()
with the VERSION1 flag, and the new behavior of re.findall() and
re.finditer() matches the behavior of the corresponding functions in the
regex module (independently of the version flag). But the new behavior
of re.sub() doesn't exactly match the behavior of regex.sub() with any
version flag. It differs from the old behavior, as you can see in the
example above, but is closer to it than to regex.sub() with VERSION1.
This made it possible to avoid breaking existing tests for re.sub().

  >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'
  >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self::Defence:Class'
  >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'

Like re.split(), it never matches the empty string adjacent to the
previous match. re.findall() and re.finditer() refuse only an empty
match that is adjacent to a previous empty match. In the regex module
regex.sub() is mutually consistent with regex.findall() and
regex.finditer() (with the VERSION1 flag), but regex.split() is not
consistent with them. In the re module re.split() and re.sub() will be
mutually consistent, as will re.findall() and re.finditer(). This is
more backward compatible. And I don't know of any reason to prefer the
behavior of re.findall() and re.finditer() over the behavior of
re.split() in this corner case.

FTR, you could make an argument for either behaviour. For regex, I went
with what Perl does.

It would be nice to get this change into 3.7.0a3 for wider testing.
Please review the patch [9] or share your thoughts about this change.

[1] https://docs.python.org/3/library/re.html
[2] https://pypi.python.org/pypi/regex/
[3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
[4] https://bugs.python.org/issue852532
[5] https://bugs.python.org/issue988761
[6] https://bugs.python.org/issue1647489
[7] https://bugs.python.org/issue3262
[8] https://bugs.python.org/issue25054
[9] https://github.com/python/cpython/pull/4471
