New submission from Serhiy Storchaka:

Currently the re module supports only simple sets. They can include literal
characters, character ranges and some simple character classes, and they
support negation. The Unicode standard [1] defines set operations (union,
intersection, difference and symmetric difference) and nested sets. Some
regular expression engines have implemented these features; for example, the
regex module supports all TR18 features except non-nested POSIX character
classes.

If we replace the re module with the regex module, or add support for these
features to the re module and enable the new syntax by default, some existing
code will break. It is very unlikely that a regular expression contains the
doubled characters ('--', '||', '&&' or '~~'), but nested sets use just '[',
and an unescaped '[' does occur in character sets in regular expressions (even
the stdlib contains several occurrences).
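As a small illustration of the compatibility concern, the sketch below shows
how an unescaped '[' inside a set behaves under the current re syntax, next to
an escaped spelling that does not depend on whether nested sets are ever
enabled. The patterns and the sample string are made up for this example.

```python
import re

# Under the current syntax an unescaped '[' inside a set is just a literal,
# so this set matches either 'a' or '['.
current = re.compile(r'[a[]')

# Escaping the bracket keeps the same meaning today and would stay
# unambiguous if '[' started a nested set in some future syntax.
portable = re.compile(r'[a\[]')

sample = 'x[a]'
print(current.findall(sample))
print(portable.findall(sample))
```

Both patterns find the same characters today; only the escaped one avoids
relying on '[' being a plain literal inside a set.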

The proposed patch emits a FutureWarning when a set construct that may break
('--', '||', '&&', '~~' or '[') occurs in a regular expression. We need one or
two releases with the warning before changing the syntax. The patch also makes
re.escape() escape '&' and '~' and fixes several regular expressions in the
stdlib.
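A minimal sketch of how code could check its own patterns for such warnings is
given below. It assumes an interpreter in which a change of this kind is
present (to my knowledge Python 3.7 and later); the helper name
future_warnings is hypothetical and not part of the patch.

```python
import re
import warnings

def future_warnings(pattern):
    """Compile `pattern` and return the FutureWarning messages it triggers."""
    # Clear the compiled-pattern cache so the pattern is really re-parsed.
    re.purge()
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        re.compile(pattern)
    return [str(w.message) for w in caught
            if issubclass(w.category, FutureWarning)]

# An ordinary set, a set that starts with an unescaped '[', and a set whose
# contents were passed through re.escape().
for pattern in (r'[a-z]', r'[[a-z]]', '[' + re.escape('&&') + ']'):
    print(repr(pattern), future_warnings(pattern))

# re.escape() escapes '&' and '~', so escaped text cannot form '&&' or '~~'.
print(re.escape('a&b~c'))
```

Building sets from user-supplied text through re.escape() keeps them free of
the doubled operators regardless of which syntax is in effect.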

Alternatively, support for the new set syntax could be enabled by a special
flag.

I'm not sure that support for set operations and nested sets is necessary.
It complicates the syntax of regular expressions (which is already not
simple). Currently set operations can be emulated with alternation and
lookarounds:

[set1||set2] -- (?:[set1]|[set2])
[set1&&set2] -- [set1](?<=[set2]) or (?=[set1])[set2]
[set1--set2] -- [set1](?<![set2]) or [set1](?<=[^set2]) or (?=[set1])[^set2]
[set1~~set2] -- recursively expand [[set1||set2]--[set1&&set2]]
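
The emulations listed above can be tried with the current re module. The
sketch below builds the lookahead variants for two example sets and checks
which lowercase letters each one matches. The helper names emulate and
matched_chars are hypothetical, and the set bodies are assumed not to start
with '^'.

```python
import re
import string

def emulate(set1, set2):
    """Build lookaround emulations for two set bodies given without brackets.

    Assumes neither body starts with '^', since set2 is negated by prefixing.
    """
    union = f'(?:[{set1}]|[{set2}])'
    intersection = f'(?=[{set1}])[{set2}]'
    difference = f'(?=[{set1}])[^{set2}]'
    # Symmetric difference: in the union but not in the intersection.
    symmetric = f'(?!{intersection}){union}'
    return {
        'union': union,
        'intersection': intersection,
        'difference': difference,
        'symmetric difference': symmetric,
    }

def matched_chars(pattern, alphabet=string.ascii_lowercase):
    """Return the characters of `alphabet` that `pattern` matches, in order."""
    return ''.join(re.findall(pattern, alphabet))

patterns = emulate('a-f', 'd-i')
for name, pattern in patterns.items():
    print(f'{name:21} {pattern:35} {matched_chars(pattern)}')
```

With the sets a-f and d-i the four patterns match a-i, d-f, a-c and the
letters a-c plus g-i respectively.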

[1] http://unicode.org/reports/tr18/#Subtraction_and_Intersection

----------
assignee: serhiy.storchaka
components: Library (Lib), Regular Expressions
messages: 293532
nosy: ezio.melotti, mrabarnett, r.david.murray, rhettinger, serhiy.storchaka
priority: normal
severity: normal
stage: patch review
status: open
title: Preparation for advanced set syntax in regular expressions
type: enhancement
versions: Python 3.7

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30349>
_______________________________________