[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-02-20 Thread R. David Murray

R. David Murray added the comment:

The error message could indeed be made clearer by turning it into a message 
that tokenize itself requires bytes input.  Or, more likely, the additional 
error handling needs to be in detect_encoding.
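
A minimal sketch of what such a clearer message might look like, written as a wrapper around the caller's readline rather than as a change to tokenize itself. The helper name `bytes_only` and the wording of the message are assumptions for illustration; nothing like it exists in the tokenize module.

```python
import io
import tokenize


def bytes_only(readline):
    """Wrap a readline callable so that a str line fails with a clear message.

    Hypothetical helper for illustration; not part of the tokenize module.
    """
    def checked():
        line = readline()
        if not isinstance(line, bytes):
            raise TypeError(
                "readline() must return bytes, not %s; "
                "open the source in binary mode" % type(line).__name__
            )
        return line
    return checked


# Bytes input passes through unchanged.
encoding, lines = tokenize.detect_encoding(
    bytes_only(io.BytesIO(b"x = 1\n").readline)
)
print(encoding, lines)

# Text input now fails with a message that names the real problem.
try:
    tokenize.detect_encoding(bytes_only(io.StringIO("x = 1\n").readline))
except TypeError as exc:
    print(exc)
```

The check sits outside detect_encoding here only so the sketch runs against an unmodified standard library; the same isinstance test could equally live inside the function.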

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-02-12 Thread Pod

Pod added the comment:

Not the OP, but I consider this message a bug because it is confusing from the 
perspective of a user of the tokenize() function. If you give tokenize a 
readline that returns a str, you get an error message stating that something 
inside tokenize must be a str and NOT bytes, even though the user supplied a 
readline that returns a str, not bytes. It looks like an internal bug.

Turns out it's because the contract changed from python2 to 3.

Personally, I'd been accidentally reading the python2 page for the tokenize 
library instead of python3, and had been using tokenize.generate_tokens in my 
python 3 code which accepts an io.StringIO just fine. When I realised my 
mistake and switched to the python3 version of the page I noticed 
generate_tokens is no longer supported, even though the code I had was working, 
and I noticed that the definition of tokenize had changed to match the old 
generate_tokens (along with a subtle change in the definition of the acceptable 
readlines function). 

So when I switched from tokenize.generate_tokens to tokenize.tokenize to try 
and use the library as intended, I got the same error as OP. Perhaps OP made a 
similar mistake?



To actually hit the error in question:

$ cat -n temp.py
 1  import tokenize
 2  import io
 3
 4
 5  byte_reader = io.BytesIO(b"test bytes generate_tokens")
 6  tokens = tokenize.generate_tokens(byte_reader.readline)
 7
 8  byte_reader = io.BytesIO(b"test bytes tokenize")
 9  tokens = tokenize.tokenize(byte_reader.readline)
10
11  byte_reader = io.StringIO("test string generate")
12  tokens = tokenize.generate_tokens(byte_reader.readline)
13
14  str_reader = io.StringIO("test string tokenize")
15  tokens = tokenize.tokenize(str_reader.readline)
16
17

$ python3 temp.py
Traceback (most recent call last):
  File "temp.py", line 15, in <module>
    tokens = tokenize.tokenize(str_reader.readline)
  File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 467, in tokenize
    encoding, consumed = detect_encoding(readline)
  File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 409, in detect_encoding
    if first.startswith(BOM_UTF8):
TypeError: startswith first arg must be str or a tuple of str, not bytes
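
For comparison, a sketch of the pairing that avoids the error: bytes for tokenize.tokenize, str for tokenize.generate_tokens. The sample source line is an assumption chosen for illustration, and the tokens are consumed with list() so that any error would surface immediately.

```python
import io
import tokenize

source = "x = 1\n"  # sample input, chosen for illustration

# tokenize.tokenize() expects a readline that returns bytes.
byte_tokens = list(
    tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline)
)

# tokenize.generate_tokens() expects a readline that returns str.
str_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The bytes variant starts with an extra ENCODING token; after that the
# token strings from both calls can be compared side by side.
print(tokenize.tok_name[byte_tokens[0].type])
print([tok.string for tok in byte_tokens[1:]])
print([tok.string for tok in str_tokens])
```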

--
nosy: +Pod




[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-01-22 Thread STINNER Victor

STINNER Victor added the comment:

I don't understand why you consider this issue a bug. Can you show an 
example where detect_encoding() raises an exception?

--
nosy: +haypo




[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-01-21 Thread R. David Murray

R. David Murray added the comment:

bytes does support startswith:

>>> b'abc'.startswith(b'a')
True
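
For contrast, a small sketch of the mixed-type call that does fail: a str receiver with a bytes prefix, which is the situation the quoted TypeError describes. The sample string is an assumption for illustration, and the exact wording of the message may differ between Python versions.

```python
from codecs import BOM_UTF8  # the same bytes constant tokenize compares against

# bytes receiver, bytes prefix: works.
assert b'abc'.startswith(b'a')
assert not b'abc'.startswith(BOM_UTF8)

# str receiver, bytes prefix: raises TypeError.
error = None
try:
    'abc'.startswith(BOM_UTF8)
except TypeError as exc:
    error = exc

print(type(error).__name__)
```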

--
nosy: +r.david.murray




[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-01-21 Thread Ben Finney

Ben Finney added the comment:

Possibly related to issue9969.

--




[issue23297] ‘tokenize.detect_encoding’ is confused between text and bytes: no ‘startswith’ method on a byte string

2015-01-21 Thread Ben Finney

New submission from Ben Finney:

In `tokenize.detect_encoding` is the following code::

    first = read_or_stop()
    if first.startswith(BOM_UTF8):
        # …

The `read_or_stop` function is defined as::

    def read_or_stop():
        try:
            return readline()
        except StopIteration:
            return b''

So, on catching ``StopIteration``, the return value will be a byte string. The 
`detect_encoding` code then immediately calls `startswith`, which fails::

    File "/usr/lib/python3.4/tokenize.py", line 409, in detect_encoding
      if first.startswith(BOM_UTF8):
    TypeError: startswith first arg must be str or a tuple of str, not bytes

One or both of those locations in the code is wrong. Either `read_or_stop` 
should never return a byte string; or `detect_encoding` should not assume it 
can call `startswith` on the result.
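
A quick way to exercise the ``StopIteration`` path in isolation is sketched below. The `exhausted_readline` function is a hypothetical stand-in for a source with no lines left; it is not taken from the report.

```python
import tokenize


def exhausted_readline():
    # Hypothetical readline with nothing left to read.
    raise StopIteration


# read_or_stop() turns the StopIteration into b'', and the BOM check is then
# made on a bytes object with a bytes prefix.
encoding, lines = tokenize.detect_encoding(exhausted_readline)
print(encoding, lines)  # utf-8 []
```

In this sketch the call completes without raising, which suggests the ``b''`` fallback is consistent with bytes input and that the TypeError depends on what `readline` itself returns.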

--
components: Library (Lib)
messages: 234471
nosy: bignose
priority: normal
severity: normal
status: open
title: ‘tokenize.detect_encoding’ is confused between text and bytes: no 
‘startswith’ method on a byte string
type: crash
versions: Python 3.4
