Serhiy Storchaka added the comment:

It is possible to change this behavior (see example patch). With this patch:

>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

But unfortunately this is a backward-incompatible change and will likely break
existing code (it also breaks tests). Consider the following example:
re.split('(:*)', 'ab'). Currently the result is ['ab'], but with the patch it
is ['', '', 'a', '', 'b', '', ''].

In the third-party regex module [1] there is the V1 flag, which switches on
this incompatible behavior:

>>> regex.split('(:*)', 'ab')
['ab']
>>> regex.split('(?V1)(:*)', 'ab')
['', '', 'a', '', 'b', '', '']
>>> regex.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCAGCTGAAACCCCAGCTGACGTACGT']
>>> regex.split(r'(?V1)(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> regex.split(r'\b', "the quick, brown fox")
['the quick, brown fox']
>>> regex.split(r'(?V1)\b', "the quick, brown fox")
['', 'the', ' ', 'quick', ', ', 'brown', ' ', 'fox', '']

I don't know how to solve this issue without introducing such a flag (or adding
a special boolean argument to re.split()).

As a workaround I suggest using the regex module.
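
If installing a third-party module is not an option, splitting at zero-width
matches can probably also be approximated in pure Python with re.finditer(),
which does report empty matches. The following is only a minimal sketch: the
helper name split_at_matches is hypothetical (it is not part of re or regex),
and it ignores capture groups and maxsplit, so it is not a drop-in replacement
for re.split().

```python
import re

def split_at_matches(pattern, string):
    """Split string at every match of pattern, empty matches included.

    Hypothetical helper for illustration only: unlike re.split() it does
    not return captured groups and has no maxsplit argument.
    """
    pieces = []
    last = 0  # end offset of the previous match
    for m in re.finditer(pattern, string):
        pieces.append(string[last:m.start()])
        last = m.end()
    pieces.append(string[last:])  # tail after the final match
    return pieces

print(split_at_matches(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'))
print(split_at_matches(r'\b', "the quick, brown fox"))
```

For the two zero-width patterns above this should give the same lists as the
patched re.split().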

[1] https://pypi.python.org/pypi/regex

----------
keywords: +patch
Added file: http://bugs.python.org/file37147/re_split_zero_width.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22817>
_______________________________________