Pod added the comment:

Not the OP, but I consider this message a bug because it's confusing from the 
perspective of a user of the tokenize() function. If you give tokenize() a 
readline callable that returns a str, you get an error message that confusingly 
claims that something inside tokenize must be a str and NOT bytes, even 
though the user gave readline a str, not bytes. It looks like an internal 
bug.

It turns out this is because the contract changed from Python 2 to Python 3.

Personally, I'd accidentally been reading the Python 2 page for the tokenize 
library instead of the Python 3 one, and had been using tokenize.generate_tokens 
in my Python 3 code, which accepts an io.StringIO just fine. When I realised my 
mistake and switched to the Python 3 version of the page, I noticed 
generate_tokens was no longer supported, even though the code I had was working, 
and that the definition of tokenize had changed to match the old 
generate_tokens (along with a subtle change in the definition of the acceptable 
readline function). 

So when I switched from tokenize.generate_tokens to tokenize.tokenize to try 
to use the library as intended, I got the same error as the OP. Perhaps the OP 
made a similar mistake?



To actually hit the error in question:

        $ cat -n temp.py
             1  import tokenize
             2  import io
             3
             4
             5  byte_reader = io.BytesIO(b"test bytes generate_tokens")
             6  tokens = tokenize.generate_tokens(byte_reader.readline)
             7
             8  byte_reader = io.BytesIO(b"test bytes tokenize")
             9  tokens = tokenize.tokenize(byte_reader.readline)
            10
            11  byte_reader = io.StringIO("test string generate")
            12  tokens = tokenize.generate_tokens(byte_reader.readline)
            13
            14  str_reader = io.StringIO("test string tokenize")
            15  tokens = tokenize.tokenize(str_reader.readline)
            16
            17
        
        $ python3 temp.py
        Traceback (most recent call last):
          File "temp.py", line 15, in <module>
            tokens = tokenize.tokenize(str_reader.readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 467, in tokenize
            encoding, consumed = detect_encoding(readline)
          File "C:\work\env\python\Python34_64\Lib\tokenize.py", line 409, in detect_encoding
            if first.startswith(BOM_UTF8):
        TypeError: startswith first arg must be str or a tuple of str, not bytes
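In other words, the two entry points now take opposite contracts: tokenize.tokenize wants a readline callable that returns bytes (it runs detect_encoding on the raw bytes first), while tokenize.generate_tokens wants one that returns str. A minimal sketch of the working combinations (the sample text here is just illustrative):

```python
import io
import tokenize

# tokenize.tokenize: readline must return bytes.
# The first token it yields is the detected ENCODING token.
byte_tokens = list(tokenize.tokenize(io.BytesIO(b"x = 1\n").readline))

# tokenize.generate_tokens: readline must return str.
str_tokens = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))

print([t.string for t in byte_tokens])
print([t.string for t in str_tokens])
```

Swapping either reader for the other type reproduces the TypeError above (or a corresponding failure), which is why the error message points at tokenize's internals rather than at the caller's mistake.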

----------
nosy: +Pod

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue23297>
_______________________________________