On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull <step...@xemacs.org> wrote:
> Adam Bartoš writes:
>
> > Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
> > sys.std* streams are created with utf-8 encoding (which doesn't
> > help on Windows since they still don't use ReadConsoleW and
> > WriteConsoleW to communicate with the terminal) and after changing
> > the sys.std* streams to the fixed ones and setting readline hook,
> > it still doesn't work,
>
> I don't see why you would expect it to work: either your code is
> bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
> matter, or you're feeding already decoded text *as UTF-8* to your
> module which evidently expects something else (UTF-16LE?).
>

I'll describe my picture of the situation, which might be terribly
wrong.

On Linux, in a typical situation, we have a UTF-8 terminal,
PYTHONIOENCODING=utf-8, and GNU readline is used. When the REPL wants
input from the user, the tokenizer calls PyOS_Readline, which calls GNU
readline. The user is shown the >>> prompt, can use autocompletion and
so on while typing, and enters u'α'. PyOS_Readline returns
b"u'\xce\xb1'" (as char* or something), which is the UTF-8 encoded input
from the user. The tokenizer, parser, and evaluator process the input
and the result is u'\u03b1', which is printed as the answer.

In my case I install custom sys.std* objects and a custom readline
hook. Again, the tokenizer calls PyOS_Readline, which calls my readline
hook, which calls sys.stdin.readline(), which returns the Unicode string
the user entered (it was actually decoded from UTF-16-LE bytes). My
readline hook encodes this string to UTF-8 and returns it. So the
situation is the same: the tokenizer gets b"u'\xce\xb1'" as before, but
now it results in u'\xce\xb1'.

Why is the result different? I thought that in the first case
PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I
thought that PYTHONIOENCODING=utf-8 is the thing that also sets
PyCF_SOURCE_IS_UTF8.

> > so presumably the PyCF_SOURCE_IS_UTF8 is still not set.
>
> I don't think that flag does what you think it does. AFAICT from
> looking at the source, that flag gets unconditionally set in the
> execution context for compile, eval, and exec, and it is checked in
> the parser when creating an AST node. So it looks to me like it
> asserts that the *internal* representation of the program is UTF-8
> *after* transforming the input to an internal representation (doing
> charset decoding, removing comments and line continuations, etc).
>

I thought it might do what I want because of the behaviour of eval. I
thought that the PyUnicode_AsUTF8String call in eval just encodes the
passed unicode to UTF-8, so the situation looks as follows:

    eval(u"u'\u03b1'")
        -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
    eval(u"u'\u03b1'".encode('utf-8'))
        -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 not set) -> u'\xce\xb1'

But of course, this picture of mine might be wrong.

> > Well, the received text comes from sys.stdin and its encoding is
> > known.
>
> How? You keep asserting this. *You* know, but how are you passing
> that information to *the Python interpreter*? Guido may have a time
> machine, but nobody claims the Python interpreter is telepathic.
>

I thought that the Python interpreter knows the input comes from
sys.stdin, at least to some extent, because in
pythonrun.c:PyRun_InteractiveOneObject the encoding for the tokenizer
is inferred from sys.stdin.encoding. But this is actually the case only
in Python 3. So I was wrong.

> > Yes. In the latter case, eval has no idea how the bytes given are
> > encoded.
>
> Eval *never* knows how bytes are encoded, not even implicitly. That's
> one of the important reasons why Python 3 was necessary. I think you
> know that, but you don't write like you understand the implications
> for your current work, which makes it hard to communicate.
>

Yes, eval never knows how bytes are encoded. But I meant it in
comparison with the first case, where a Unicode string was passed.
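To make the two outcomes concrete, here is a small model of the
difference. It is only a sketch and runs on Python 3, so it does not
exercise the Python 2 tokenizer or PyCF_SOURCE_IS_UTF8 itself. It
assumes that the "flag not set" path behaves as if the UTF-8 bytes
inside the literal were decoded as Latin-1, which is consistent with the
u'\xce\xb1' result above. The helper name hook_output is made up for
illustration and is not part of any real API.

```python
# Model (Python 3) of how the same UTF-8 bytes can yield two different strings.
ALPHA = "\u03b1"  # GREEK SMALL LETTER ALPHA


def hook_output(console_bytes):
    """Hypothetical readline hook: console input arrives as UTF-16-LE,
    is decoded to text, and is handed to the tokenizer as UTF-8 bytes."""
    return console_bytes.decode("utf-16-le").encode("utf-8")


# What the tokenizer would receive for the character alone.
raw = hook_output(ALPHA.encode("utf-16-le"))

# Interpretation 1: the bytes are known to be UTF-8.
decoded_as_utf8 = raw.decode("utf-8")

# Interpretation 2 (assumed model of the failing case): each byte is
# mapped to the code point with the same number, i.e. Latin-1.
decoded_as_latin1 = raw.decode("latin-1")

print(repr(raw))
print(ascii(decoded_as_utf8))
print(ascii(decoded_as_latin1))
```

Both strings come from the same two bytes; only the assumed decoding
differs, which is why the hook returning correct UTF-8 is not sufficient
on its own.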
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev