On 7/17/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > > When a source file contains a string literal with an out-of-range \U > > escape (e.g. "\U12345678"), instead of a syntax error pointing to the > > offending literal, I get this, without any indication of the file or > > line: > > > > UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in > > position 0-9: illegal Unicode character > > > > This is quite hard to track down. > > I think the fundamental flaw is that a codec is used to implement > the Python syntax (or, rather, lexical rules). > > Not quite sure what the rationale for this design was; doing it on > the lexical level is (was) tricky because \u escapes were allowed > only for Unicode literals, and the lexer had no knowledge of the > prefix preceding a literal. (In 3k, it's still similar, because > \U escapes have no effect in bytes and raw literals). > > Still, even if it is "only" handled at the parsing level, I > don't see why it needs to be a codec. Instead, implementing > escapes in the compiler would still allow for proper diagnostics > (notice that in the AST the original lexical form of the string > literal is gone).
> > (Both the location of the bad
> > literal in the source file, and the origin of the error in the parser.
> > :-) Can someone come up with a fix?
>
> The language definition makes it difficult to fix it where I would
> consider the "proper" place, i.e. in the tokenization:
>
> http://docs.python.org/ref/strings.html
>
> says that escapeseq is "\" <any ASCII character>. So
> "\x" is a valid shortstring.
>
> Then it becomes fuzzy: It says that any unrecognized escape
> sequences are left in the string. While that appears to be a clear
> specification, it is not implemented (and has not been since Python
> 2.0). According to the spec, '\U12345678' is well-formed,
> and denotes the same string as '\\U12345678'.
>
> I now see the following choices:
> 1. Restore implementing the spec again. Stop complaining about
>    invalid escapes for \x and \U, and just interpret the \
>    as '\\'. In this case, the current design could be left in
>    place, and the codecs would just stop raising these errors.

Sounds like a bad idea. I think \xNN (where N is not a hex digit) once
behaved this way, and it was changed to explicitly complain instead as
a service to users.

> 2. Change the spec to make it an error if \x is not followed
>    by two hex digits, \u not by four hex digits, \U not by
>    8, or the value denoted by the \U digits is out of range.
>    In this case, I would propose to move the lexical analysis
>    back into the parser, or just make an internal API that
>    will raise a proper SyntaxError (it will be tricky to
>    compute the column in the original source line, though).

I'm all in favor of this spec change. Eventually we should change the
lexer to do this right; for now, Kurt's patch is good enough.
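(Only a hypothetical sketch, not Kurt's actual patch: the "internal API"
mentioned above could be as small as a wrapper that catches the codec's
error and re-raises a SyntaxError carrying at least the file name and
line number. The names below are invented for illustration.)

    def decode_string_literal(raw, filename, lineno):
        # Delegate to the existing codec, but turn its bare
        # UnicodeDecodeError into a SyntaxError with location info.
        try:
            return raw.decode("unicode_escape")
        except UnicodeDecodeError as exc:
            raise SyntaxError("%s (%s, line %d)" % (exc, filename, lineno))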
> 3. Change the spec to constrain escapeseq, giving up
>    the rule that uninterpreted escapes silently become
>    two characters. That's difficult to write down in EBNF,
>    so should be formulated through constraints in natural
>    language. The lexer would have to keep track of what kind
>    of literal it is processing, and reject invalid escapes
>    directly at the source level.

-1

> There are probably other options as well.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)