[issue42974] tokenize reports incorrect end col offset and line string when input ends without explicit newline
Brian Romanowski added the comment:

I took a look at Parser/tokenizer.c. From what I can tell, the C tokenizer does fake a newline character when the input buffer does not end with an actual newline, and the returned NEWLINE token has an effective length of 1 because of that faked character; that is, tok->cur - tok->start == 1 when the NEWLINE token is returned. If this part of the C tokenizer is meant to be modeled exactly by the Python tokenize module, then the current code is correct. If there is some wiggle room because tok->start and tok->cur are converted into line numbers and column offsets, then maybe it's acceptable to change them. If not, then the current documentation is misleading, because the newline_token_2.end[1] element from my original example is not "... column where the token ends in the source"; there is no such column.

I'm not sure whether the C tokenizer exposes anything like newline_token_2.string directly. If it does, does it hold the faked newline character, or the empty string like the current tokenize module does? I'm also not sure whether the C tokenizer exposes anything like newline_token_2.line. If it does, I'd be surprised if the faked newline caused it to become the empty string instead of the actual line content.

So I'm guessing that the current tokenize module's behavior here is still a real bug? If not, then this is another case that might benefit from some edge-case documentation.

--
Python tracker <https://bugs.python.org/issue42974>
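The tokenize-side behavior being questioned above can be checked directly. A minimal sketch (fields such as end and line for the synthetic NEWLINE vary across Python versions depending on whether this fix landed, so only the stable fields are asserted):

```python
import io
import tokenize

# Tokenize a buffer that does NOT end with a newline character.
toks = list(tokenize.tokenize(io.BytesIO(b"x").readline))

# The second-to-last token is the synthetic NEWLINE emitted for the
# missing line ending (the last token is ENDMARKER).
nl = toks[-2]
print(nl)

assert nl.type == tokenize.NEWLINE
assert nl.string == ""     # the synthetic NEWLINE carries no text
assert nl.start == (1, 1)  # it starts just past the 'x'
```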
Brian Romanowski added the comment:

Shoot, I just realized that consistency isn't the important thing here; the most important thing is that the tokenize module exactly matches the output of the Python tokenizer. It's possible that my changes violate that constraint, so I'll take a look at that.
Change by Brian Romanowski:

keywords: +patch
pull_requests: +23085
stage: -> patch review
pull_request: https://github.com/python/cpython/pull/24260
New submission from Brian Romanowski:

The tokenize module's tokenizer functions output incorrect (or at least misleading) information when the content being tokenized does not end in a line-ending character. This is related to the fix for issue 33899, which added the NEWLINE tokens for this case but did not fill out the whole token tuple correctly.

The bug can be seen by running a version of the test in Lib/test/test_tokenize.py:

    import io, tokenize

    newline_token_1 = list(tokenize.tokenize(io.BytesIO("x\n".encode('utf-8')).readline))[-2]
    newline_token_2 = list(tokenize.tokenize(io.BytesIO("x".encode('utf-8')).readline))[-2]
    print(newline_token_1)
    print(newline_token_2)

    # Prints:
    # TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
    # TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 2), line='')  # bad "end" and "line"!

Notice that len(newline_token_2.string) == 0 but newline_token_2.end[1] - newline_token_2.start[1] == 1. It seems more consistent for newline_token_2.end to be (1, 1). Also, newline_token_2.line should hold the physical line rather than the empty string; that would make it consistent with newline_token_1.line.

I'll add a PR shortly with a change so that the output from the two cases is:

    TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 1), end=(1, 2), line='x\n')
    TokenInfo(type=4 (NEWLINE), string='', start=(1, 1), end=(1, 1), line='x')

If this looks reasonable, I can backport it for the other branches. Thanks!

components: Library (Lib)
messages: 385313
nosy: romanows
priority: normal
severity: normal
status: open
title: tokenize reports incorrect end col offset and line string when input ends without explicit newline
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9