Re: [Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?

Eric V. Smith Thu, 17 May 2018 15:41:39 -0700

On 5/17/2018 3:01 PM, Larry Hastings wrote:

I fed this into tokenize.tokenize():

    b''' x = "\u1234" '''

I was a bit surprised to see \uxxxx in the output. Particularly because the output (t.string) was a *string* and not *bytes*.

For those (like me) who have no idea how to use tokenize.tokenize's wacky interface, the test code is:

    list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))
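For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same experiment. It is slightly tidied rather than verbatim: the source is a raw bytes literal without the leading space, with a trailing newline added, and the names `source` and `string_tokens` are just illustrative.

```python
import io
import tokenize

# Raw bytes literal, so the backslash reaches the tokenizer untouched.
source = rb'x = "\u1234"' + b"\n"

# tokenize.tokenize() wants a readline callable that returns bytes.
tokens = list(tokenize.tokenize(io.BytesIO(source).readline))

# Show each token's type name and its text.
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))

# The STRING token carries the source text of the literal, quotes and
# backslash included, as a str (decoded using the detected encoding).
string_tokens = [t for t in tokens if t.type == tokenize.STRING]
```

The STRING token's text is the eight characters of the literal as written, not the single character the escape stands for.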

Maybe I'm making a parade of my ignorance, but I assumed that string literals were parsed by the parser--just like everything else is parsed by the parser, hey it seems like a good place for it--and in particular that the escape sequence substitutions would be done in the tokenizer. Having stared at it a little, I now detect a whiff of "this design solved a real problem". So... what was the problem, and how does this design solve it?

I assume the intent is to not throw away any information in the lexer, and give the parser full access to the original string. But that's just a guess.

BTW, my use case is that I hoped to use CPython's tokenizer to parse some Python-ish-looking text and handle double-quoted strings for me. *Especially* all the escape sequences--leveraging all CPython's support for funny things like \N{penguin}. The current behavior of the tokenizer makes me think it'd be easier to roll my own!


Can you feed the token text to the ast?

>>> import ast
>>> ast.literal_eval(r'"\u1234"')
'ሴ'

Eric
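Putting the two pieces together, a rough sketch of the suggestion might look like this: let the tokenizer find the string literals, then hand each STRING token's text to ast.literal_eval to resolve the escapes. The helper name `decoded_strings` and the sample input are made up for illustration, and this only covers plain STRING tokens (on newer Python versions f-strings are tokenized differently and would need separate handling).

```python
import ast
import io
import tokenize


def decoded_strings(source: bytes):
    """Yield the evaluated value of every STRING token in source."""
    for tok in tokenize.tokenize(io.BytesIO(source).readline):
        if tok.type == tokenize.STRING:
            # tok.string is the literal as written in the source;
            # literal_eval applies the usual escape-sequence rules.
            yield ast.literal_eval(tok.string)


# Raw bytes, so the escapes are still unprocessed when tokenized.
sample = rb'x = "\u1234" + "\N{PENGUIN}" + r"\n"' + b"\n"
values = list(decoded_strings(sample))
print(values)
```

Because literal_eval honours string prefixes, the raw literal in the sample keeps its backslash while the other two are decoded.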