[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-11-20 Thread Tal Einat


Tal Einat  added the comment:

Alright, so let's push through the existing PR for rejecting such octal escapes 
in byte strings.

We'll also need another PR for raising a deprecation warning for such octal 
escapes in strings.

--
keywords: +easy, easy (C)

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-11-19 Thread Guido van Rossum


Guido van Rossum  added the comment:

For sure the Python tokenizer/parser should reject octal escapes that produce 
values >= 256. Certainly in bytes strings. Probably also in text strings 
(nobody using Unicode thinks in octal). This is ancient behavior though (it's 
the same in 2.7) so it may require a deprecation for the text string case. (For 
byte strings, it should become an error in 3.9 -- dropping the top bit is 
nonsensical.)

The regex parser should not be changed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-11-19 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

I can't find who wrote this block treating octal escapes beginning with 4-7 the 
same as those beginning with 0-3. 

case '0': case '1': case '2': case '3':
case '4': case '5': case '6': case '7':
c = s[-1] - '0';
if (s < end && '0' <= *s && *s <= '7') {
c = (c<<3) + *s++ - '0';
if (s < end && '0' <= *s && *s <= '7')
c = (c<<3) + *s++ - '0';
}
*p++ = c;
break;

Antoine Pitrou merged from somewhere in 2010 and Christiqn Heimes renamed 
something in 2008 and before that, ???. Guido wrote the initial bytesobject.c 
in 2006.

Guido, do you agree that the current behavior, treating the same int 
differently when input into a bytes in different ways, is a bug, and if so, 
should we backport the fix?

>>> b'\407'
b'\x07'
>>> bytes((0o407,))
Traceback (most recent call last):
  File "", line 1, in 
bytes((0o407,))
ValueError: bytes must be in range(0, 256)

--
nosy: +gvanrossum

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-11-17 Thread Tal Einat


Tal Einat  added the comment:

The PR looks pretty good, but the question is whether we want to break 
backwards compatibility in the name of correctness.

In this case, the silent buggy behavior of keeping the (mod 256) of the value 
seems worth fixing to me.

--
nosy: +taleinat

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-07-08 Thread Jeffrey Kintscher


Jeffrey Kintscher  added the comment:

The PR is ready for review.  It raises ValueError if the escaped octal value in 
a byte string is greater than 255.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-07-08 Thread Jeffrey Kintscher


Change by Jeffrey Kintscher :


--
keywords: +patch
pull_requests: +14461
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/14654

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-07-05 Thread Jeffrey Kintscher


Jeffrey Kintscher  added the comment:

Here is the problematic code in _PyBytes_DecodeEscape in Objects/bytesobject.c:

c = s[-1] - '0';
if (s < end && '0' <= *s && *s <= '7') {
c = (c<<3) + *s++ - '0';
if (s < end && '0' <= *s && *s <= '7')
c = (c<<3) + *s++ - '0';
}
*p++ = c;

c is an int, and p is a char pointer to the new bytes object's string buffer.  
For b'\407', c gets correctly calculated as 263 (0x107), but the upper bits are 
lost when it gets recast as a char and stored in the location pointed to by p.  
Hence, b'\407' becomes b'\x07' when the object is created.

IMO, this should raise "ValueError: bytes must be in range(0, 256)" instead of 
silently throwing away the upper bits.  I will work on a PR.

I also took a look at how escaped hex values are handled by the same function.  
It may seem at first glance that

>>> b'\x107'
b'\x107'

is returning the hex value 0x107, but in reality it is returning '\x10' as the 
first character and '7' as the second character.  While visually misleading, it 
is syntactically and semantically correct.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-07-04 Thread Jeffrey Kintscher


Jeffrey Kintscher  added the comment:

>>> b'\407'
b'\x07'
>>> ord(b'\407')
7

This is the object structure passed to builtin_ord():

(lldb) p *((PyBytesObject *)(c))
(PyBytesObject) $19 = {
  ob_base = {
ob_base = {
  ob_refcnt = 4
  ob_type = 0x0001003cb0b0
}
ob_size = 1
  }
  ob_shash = 8685212186264880044
  ob_sval = {
[0] = '\a'
  }
}

If two bytes were stored (0x107), I would expect ob_sval[0] to be 7 ('\a') and 
ob_sval[1] to be 1 on a little endian system, but ob_sval[1] is 0:

(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[0])
(long) $23 = 7
(lldb) p (long)(unsigned char) (((PyBytesObject *)(c))->ob_sval[1])
(long) $24 = 0

This means the truncation to a single byte is happening when the byte string 
object is created.

--
nosy: +Jeffrey.Kintscher

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-06-28 Thread Terry J. Reedy


Terry J. Reedy  added the comment:

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
says, for \ooo, "In a bytes literal, hexadecimal and octal escapes denote the 
byte with the given value. In a string literal, these escapes denote a Unicode 
character with the given value."

I agree that sometimes truncating an invalid integer instead of always raising 
ValueError is strange.

>>> ord(b'\407')
7
>>> bytes((0o407,))
...
ValueError: bytes must be in range(0, 256)

I don't know is there was an intentional back-compatibility reason for this.

Without an example of re raising, I don't understand the re complaint.

--
nosy: +terry.reedy

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-06-21 Thread SilentGhost


Change by SilentGhost :


--
components: +Regular Expressions
nosy: +ezio.melotti, mrabarnett, serhiy.storchaka
versions: +Python 3.9

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue37367] octal escapes applied inconsistently throughout the interpreter and lib

2019-06-21 Thread Dan Snider


New submission from Dan Snider :

At present, the bytecode compiler can generate 512 different unicode 
characters, one for each integral from the range [0-511), 512 being the total 
number of syntactically valid permutations of 3 octal digits preceded by a 
backslash. However, this does not match the regex compiler, which raises an 
error regardless of the input type when it encounters an an octal escape 
character with a decimal value greater than 255. On the other hand... the bytes 
literal:

>>> b'\407'

is somehow valid, and can lead to extremely difficult bugs to track down, such 
as this nonsense:

>>> re.compile(b'\407').search(b'\a')


I propose that the regex parser be augmented, enabling for unicode patterns the 
interpretation of three character octal escapes from the range(256, 512), while 
the bytecode parser be adjusted to match the behavior of the regex parser, 
raising an error for bytes literals > b"\400", rather than truncating the 9th 
bit.

--
messages: 346246
nosy: bup
priority: normal
severity: normal
status: open
title: octal escapes applied inconsistently throughout the interpreter and lib

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com