Serhiy Storchaka added the comment:
re.split() with the r'(?<=CA)(?=GCTG)' pattern raises a ValueError in 3.5 (see
issue22818). In future releases it could be changed to work with zero-width
patterns (such as lookaround assertions).
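For reference, re.split() has accepted patterns that can match an empty string since Python 3.7. The following is only a small check under that assumption (Python 3.7 or later), using the short 'CAGCTG' sample from this thread:

```python
import re
import sys

# Assumption: this runs on Python 3.7 or later, where re.split()
# splits on zero-width matches such as lookaround assertions.
assert sys.version_info >= (3, 7)

# Lookbehind for CA, lookahead for GCTG: a zero-width cut site.
parts = re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
print(parts)  # ['CA', 'GCTG']
```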
--
resolution: -> wont fix
stage: -> resolved
status:
Serhiy Storchaka added the comment:
It is possible to change this behavior (see example patch). With this patch:
>>> re.split(r'(?<=CA)(?=GCTG)', 'ACGTCAGCTGAAACCCCAGCTGACGTACGT')
['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
>>> re.split(r'\b', 'the quick, brown fox')
['', 'the', ' ', 'quick', ', ',
Serhiy Storchaka added the comment:
Previous attempts to solve this issue: issue852532, issue988761, issue3262.
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22817
___
New submission from Rex Dwyer:
I would like to split a DNA sequence with a restriction enzyme.
A restriction enzyme can be described as, e.g., r'(?<=CA)(?=GCTG)'.
I cannot get re.split to split on this pattern as Perl 5 does.
--
components: Regular Expressions
messages: 230831
nosy:
Ezio Melotti added the comment:
Can you provide a sample DNA sequence (or part of it), the exact code you used,
the output you got, and what you expected?
Serhiy Storchaka added the comment:
>>> re.split(r'(?<=CA)(?=GCTG)', 'CAGCTG')
['CAGCTG']
I think the expected output is ['CA', 'GCTG'].
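Until re.split() handles such patterns, the same result can be obtained by collecting the match positions with re.finditer() and slicing the string by hand. This is a minimal sketch, not part of any patch; the helper name split_at_matches is made up for illustration:

```python
import re

def split_at_matches(pattern, text):
    """Split text at every match of pattern, including zero-width ones.

    Hypothetical helper: behaves like re.split() for patterns without
    capturing groups, but does not skip empty matches.
    """
    pieces = []
    start = 0
    for m in re.finditer(pattern, text):
        # Everything between the previous match and this one is a piece.
        pieces.append(text[start:m.start()])
        start = m.end()
    # The remainder after the last match is the final piece.
    pieces.append(text[start:])
    return pieces

print(split_at_matches(r'(?<=CA)(?=GCTG)', 'CAGCTG'))  # ['CA', 'GCTG']
```

Because the cut positions come from re.finditer(), the sketch does not depend on how a given Python version treats empty matches in re.split().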
--
nosy: +serhiy.storchaka
Rex Dwyer added the comment:
Sorry if I wasn't clear.
s = 'ACGTCAGCTGAAACCCCAGCTGACGTACGT'
re.split(r'(?<=CA)(?=GCTG)', s)
Expected output is:
acgtCA|GCTGaaacccCA|GCTGacgtacgt
-> ['ACGTCA', 'GCTGAAACCCCA', 'GCTGACGTACGT']
I would also like to be able to split a text on word boundaries:
re.split(r'\b', the
Serhiy Storchaka added the comment:
This looks like one of the existing issues about zero-length matches
(issue1647489, issue10328).