[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Bob Kline


Bob Kline  added the comment:

In fact, I suppose it's possible that the warning as I worded it is still not 
restrictive enough. There may be subtle dependencies between the fixers, such 
that the action of one of them renders the code no longer safely fixable as 
Python 2 code by the other fixers. In that case the real warning should say 
something like, "You can only run this tool once in write-in-place mode on a 
given code set. You can run it without the -w option as many times as you 
want, with as many combinations of fixers as you want, to determine which of 
the fixers you will enable for the final write-in-place run." Are there such 
dependencies between fixers?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Bob Kline


Bob Kline  added the comment:

Thanks, I understand. However, this highlights something which had slipped 
under my radar. You get one shot at running a code set through the tool. You 
can't do what I was doing, which was to run the tool in "don't write" mode, 
then fix by hand some of the things it says will need to be done, then run it 
again in the same mode, fix, etc., until I got to the point where I felt like I 
could trust it (except for things like adding unnecessary `list()` wrappers, 
for which I learned how to use the option for suppressing certain default 
fixers), and then run the tool in write mode to fix what was left. I now 
totally get why the tool did what it did, and why the approach I was using was 
inappropriate, but was there a warning to this effect that I missed in the 
documentation? Something like "you can only run this tool once per fixer (or 
set of fixers) in write mode, and you cannot run a fixer on code for which you 
have performed any of the needed conversions for that fixer yourself"? Of 
course, it's always possible I'm the only developer clueless enough not to have 
figured this out without such a warning. :-)

Partly in my (lame) defense, I had lured myself into the frame of mind where 
what I was doing seemed to make sense by having just come out of a similar 
exercise with pylint, where iterative "fixing" works just fine. I guess I 
should take this as a good sign, that my brain has moved so far into the Python 
3 world that "..." was no longer recognizable as a bytestring.

Again, thanks for the gentle explanation. :-)

--




[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Ned Deily


Change by Ned Deily :


--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Matthew Barnett


Matthew Barnett  added the comment:

You wrote "the u had already been removed by hand". By removing the u in the 
_Python 2_ code, you changed that string from a Unicode string to a bytestring.

In a bytestring, \u is not an escape; b"\u" == b"\\u".
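A minimal sketch of that point, run under Python 3 with a b prefix standing in for a Python 2 plain string (an assumption made for illustration). The helper parse_literal is hypothetical; it evaluates each literal from source text only so that the compiler's invalid-escape warning for \u in a bytes literal can be silenced.

```python
import ast
import warnings


def parse_literal(source):
    """Evaluate a literal given as source text, ignoring escape warnings."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# Source text b"\u00B7": \u is not an escape in a bytestring,
# so the backslash is kept as an ordinary byte.
plain_bytes = parse_literal(r'b"\u00B7"')

# Source text b"\\u00B7": an explicitly doubled backslash.
doubled_bytes = parse_literal(r'b"\\u00B7"')

# Source text "\u00B7": in a Unicode string, \u00B7 is a single character.
text = parse_literal(r'"\u00B7"')

print(plain_bytes == doubled_bytes)  # both spell the same six bytes
print(len(plain_bytes), len(text))
```

Seen this way, doubling the backslash in a bytestring preserves its value, which is consistent with the conversion 2to3 proposed once the u prefix was gone.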

--
nosy: +mrabarnett




[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Bob Kline


Bob Kline  added the comment:

Ah, this is worse than I first thought. It's not just converting code by adding 
extra backslashes to regular expression strings, where at least the regular 
expression engine will do what the original code was asking the Python parser 
to do (unless user code checks for and enforces limits on regular expression 
string lengths, so even that case is broken), but 2to3 is also mangling strings 
in places where the behavior is changed (that is, broken). 2to3 wants to change

if c not in ".-_:\u00B7\u0e87":

to

if c not in ".-_:\\u00B7\\u0e87":

Not the same thing at all, as illustrated here:

$ python
Python 3.7.3 (default, Jun 19 2019, 07:38:49)
[Clang 10.0.1 (clang-1001.0.46.4)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> len("\u00B7")
1
>>> len("\\u00B7")
6
>>>

That breaks the original code. This is a serious bug.
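The same difference can be seen in the membership test itself. This is a small sketch under Python 3; the names ORIGINAL, CONVERTED and middle_dot are chosen here for illustration and are not from the code being converted.

```python
# The string as written in the source, read as Python 3 text.
ORIGINAL = ".-_:\u00B7\u0e87"

# The string as 2to3 proposed to rewrite it.
CONVERTED = ".-_:\\u00B7\\u0e87"

middle_dot = "\u00B7"

# The real character is found only in the original string.
print(middle_dot in ORIGINAL)
print(middle_dot in CONVERTED)

# The converted string contains literal backslash and "u" characters instead.
print("u" in ORIGINAL, "u" in CONVERTED)
print(len(ORIGINAL), len(CONVERTED))
```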

--




[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Bob Kline


Bob Kline  added the comment:

The original string had u"""...""" and the u had already been removed by hand 
in preparation for moving to Python 3.

--




[issue37996] 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions

2019-08-31 Thread Bob Kline


New submission from Bob Kline :

-UNWANTED = re.compile("""['".,?!:;()[\]{}<>\u201C\u201D\u00A1\u00BF]+""")
+UNWANTED = re.compile("""['".,?!:;()[\]{}<>\\u201C\\u201D\\u00A1\\u00BF]+""")

The non-ASCII characters in the original string are perfectly legitimate str 
characters, using valid standard escapes recognized and handled by the Python 
parser. It is unnecessary to lengthen the string argument passed to 
re.compile() and leave the doubled escapes for the regular expression engine 
to convert.
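To illustrate, here is a sketch under Python 3 using a shortened, hypothetical stand-in for the pattern above (the names DIRECT, DOUBLED and sample are not from the report). It relies on the re module accepting \uXXXX escapes in str patterns, so both spellings should match the same characters while the doubled one hands a longer string to the engine.

```python
import re

# Escapes resolved by the Python parser before re.compile() sees them.
DIRECT = re.compile("[\u201C\u201D\u00A1\u00BF]+")

# Doubled backslashes: the regular expression engine resolves the escapes.
DOUBLED = re.compile("[\\u201C\\u201D\\u00A1\\u00BF]+")

sample = "\u00A1\u201CHola\u201D!"

direct_result = DIRECT.sub("", sample)
doubled_result = DOUBLED.sub("", sample)

# Both patterns strip the same characters from the sample.
print(direct_result == doubled_result)

# The pattern strings themselves differ in length.
print(len(DIRECT.pattern), len(DOUBLED.pattern))
```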

--
components: 2to3 (2.x to 3.x conversion tool)
messages: 350922
nosy: bkline
priority: normal
severity: normal
status: open
title: 2to3 introduces unwanted extra backslashes for unicode characters in regular expressions
type: behavior
versions: Python 3.7
