Gareth Rees <g...@garethrees.org> added the comment:

I'm having a look to see if I can make tokenize.py better match the real 
tokenizer, but I need some feedback on a couple of design decisions. 

First, how to handle tokenization errors? There are three possibilities:

1. Generate an ERRORTOKEN, resynchronize, and continue to tokenize from after 
the error. This is what tokenize.py currently does in the two cases where it 
detects an error.

2. Generate an ERRORTOKEN and stop tokenizing. This is what tokenizer.c does.

3. Raise an exception (IndentationError, SyntaxError, or TabError). This is 
what the user sees when the parser is invoked from pythonrun.c.
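
For illustration, here is what option (3) looks like from the user's point of 
view in current Python 3 (an illustrative snippet, not taken from this issue):

    >>> compile("if x:\n\tpass\n        pass\n", "<string>", "exec")
    Traceback (most recent call last):
      ...
    TabError: inconsistent use of tabs and spaces in indentation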

Since the documentation for tokenize.py says, "It is designed to match the 
working of the Python tokenizer exactly", I think that implementing option (2) 
is best here. (This will mean changing the behaviour of tokenize.py in the two 
cases where it currently detects an error, so that it stops tokenizing.)
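
For concreteness, here is a minimal sketch of what option (2) would look like 
in a generator-based tokenizer. The generator and the error condition are 
made-up stand-ins, not the actual tokenize.py internals:

    from tokenize import TokenInfo, ERRORTOKEN

    def tokenize_sketch(lines):
        for row, line in enumerate(lines, start=1):
            for col, char in enumerate(line):
                if char == '\x00':  # stand-in for any tokenization error
                    # Option (2): emit a single ERRORTOKEN, then stop
                    # tokenizing instead of resynchronizing.
                    yield TokenInfo(ERRORTOKEN, char,
                                    (row, col), (row, col + 1), line)
                    return
            # ... normal tokenization of the line would go here ...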

Second, how to record the cause of the error? The real tokenizer records the 
cause of the error in the 'done' field of the 'tok_state' structure, but 
tokenize.py loses this information. I propose to add fields to the TokenInfo 
structure (which is a namedtuple) to record it. The real tokenizer uses 
numeric constants from errcode.h (E_TOODEEP, E_TABSPACE, E_DEDENT, etc.), and 
pythonrun.c converts these to English-language error messages (for example, 
E_TOODEEP becomes "too many levels of indentation"). Both pieces of 
information will be useful, so I propose to add two fields: "error" 
(containing a string like "TOODEEP") and "errormessage" (containing the 
English-language error message).
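
Here is a sketch of the extended structure (field names as proposed above; 
everything else is illustrative, not committed API):

    import collections

    TokenInfo = collections.namedtuple(
        'TokenInfo',
        ['type', 'string', 'start', 'end', 'line', 'error', 'errormessage'])

    # An indentation error might then be reported as something like:
    # TokenInfo(type=ERRORTOKEN, string='', start=(3, 0), end=(3, 0),
    #           line='            pass\n', error='TOODEEP',
    #           errormessage='too many levels of indentation')

For ordinary tokens the two new fields would presumably be None, so the extra 
information only appears on the final ERRORTOKEN.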

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue12675>
_______________________________________