[issue22491] Support Unicode line boundaries in regular expression

2019-10-27 Thread Lewis Gaul


Lewis Gaul  added the comment:

Hi there, I'm running 'EnHackathon' in a couple of weeks, and was wondering if 
this could be a good issue for a small team of first-time contributors with 
experience in C to work on.

Would anyone be able to offer any guidance for where to start in Modules/_sre.c?

--
nosy: +Lewis Gaul

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22491] Support Unicode line boundaries in regular expression

2019-07-22 Thread Zackery Spytz


Zackery Spytz  added the comment:

> To meet Unicode standard requirement RL1.6 [1] all Unicode line separators 
> should be supported:

It seems that large portions of Modules/_sre.c would have to be rewritten in 
order to do this.

--
nosy: +ZackerySpytz

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22491] Support Unicode line boundaries in regular expression

2014-09-25 Thread Serhiy Storchaka

New submission from Serhiy Storchaka:

Currently regular expressions support on '\n' as line boundary. To meet Unicode 
standard requirement RL1.6 [1] all Unicode line separators should be supported: 
'\n', '\r', '\v', '\f', '\x85', '\u2028', '\u2029' and two-character '\r\n'. 
Also it is recommended that '.' in dotall mode matches '\r\n'. Also strongly 
recommended to support the '\R' pattern which matches all line separators 
(equivalent to '(?:\\r\n|(?!\r\n)[\n\v\f\r\x85\u2028\u2029]').

 [m.start() for m in re.finditer('$', '\r\n\n\r', re.M)]
[1, 2, 4]  # should be [0, 2, 3, 4]
 [m.start() for m in re.finditer('^', '\r\n\n\r', re.M)]
[0, 2, 3]  # should be [0, 2, 3, 4]
 [m.group() for m in re.finditer('.', '\r\n\n\r', re.M|re.S)]
['\r', '\n', '\n', '\r']  # should be ['\r\n', '\n', '\r']
 [m.group() for m in re.finditer(r'\R', '\r\n\n\r')]
[]  # should be ['\r\n', '\n', '\r']

[1] http://www.unicode.org/reports/tr18/#RL1.6

--
components: Extension Modules, Regular Expressions
messages: 227508
nosy: ezio.melotti, mrabarnett, pitrou, serhiy.storchaka
priority: normal
severity: normal
stage: needs patch
status: open
title: Support Unicode line boundaries in regular expression
type: enhancement
versions: Python 3.5

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22491
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue22491] Support Unicode line boundaries in regular expression

2014-09-25 Thread Matthew Barnett

Matthew Barnett added the comment:

For reference, the regex module normally considers the line ending to be '\n', 
but it has a WORD flag ('(?w)') that turns on the Unicode definition of a 
'word' character as well as Unicode line separator.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue22491
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com