New submission from Alexey Umnov:

I execute the following code on the attached file 'text.txt':

    import tokenize
    import codecs

    with open('text.txt', 'r') as f:
        reader = codecs.getreader('utf-8')(f)
        tokens = tokenize.generate_tokens(reader.readline)

The file 'text.txt' has the following structure: a first line with some text, then a '\f' symbol (0x0c) on the second line, and then some text on the last line. The result is that 'generate_tokens' ignores everything after the '\f'.

I did some debugging and found the following. If the file is read without codecs (in ASCII mode), it is considered to have 3 lines: 'text1\n', '\f\n', 'text2\n'. In unicode mode, however, there are 4 lines: 'text1\n', '\f', '\n', 'text2\n'. I guess this has been the intended behaviour since 2.7.x, but it causes a bug in the tokenize module. Consider lines 317-329 in tokenize.py:

    ...
    column = 0
    while pos < max:                   # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column//tabsize + 1)*tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    if pos == max:
        break
    ...

The last 'break' belongs to the main parsing loop and makes the parsing stop. Thus any line that consists only of ' ', '\t' and '\f' characters and does not end with '\n' is treated as the end of the file.

----------
components: Library (Lib)
files: tokens.txt
messages: 197899
nosy: Alexey.Umnov
priority: normal
severity: normal
status: open
title: tokenize.generate_tokens treat '\f' symbol as the end of file (when reading in unicode)
type: behavior
versions: Python 2.7

Added file: http://bugs.python.org/file31796/tokens.txt

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue19035>
_______________________________________
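
For reference, the line-splitting difference described in the report can be reproduced without the attached file. The following is only a sketch and rests on several assumptions: it targets Python 3, where bytes stand in for the 2.7 "ascii-mode" strings and str stands in for 2.7 unicode objects; the sample data is reconstructed from the 'text1' / '\f' / 'text2' description; and 'is_whitespace_only' is a hypothetical helper that mirrors the whitespace-measuring loop quoted from tokenize.py, not code taken from the module.

```python
# Sample content reconstructed from the report: text, a lone form feed, text.
data = b'text1\n\x0c\ntext2\n'

# Byte strings split only on '\n' and '\r', so the form feed stays inside a line.
byte_lines = data.splitlines(True)

# Text strings also treat '\f' as a line boundary, giving one extra line.
text_lines = data.decode('utf-8').splitlines(True)

print(byte_lines)   # [b'text1\n', b'\x0c\n', b'text2\n']
print(text_lines)   # ['text1\n', '\x0c', '\n', 'text2\n']


def is_whitespace_only(line):
    """Hypothetical helper: scan leading ' ', '\\t', '\\f' like the quoted loop.

    Returns True when the scan reaches the end of the line, which is the
    'pos == max' case that makes the quoted code leave its main loop.
    """
    pos, end = 0, len(line)
    while pos < end and line[pos] in ' \t\f':
        pos += 1
    return pos == end


# Only the bare '\f' line (no trailing newline) reaches the end of the scan.
print([is_whitespace_only(line) for line in text_lines])
# [False, True, False, False]
```

With the byte-oriented split, the form feed line still ends in '\n', so the scan stops before the end of the line and the end-of-file branch is never taken.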