Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-11 Thread Nick Coghlan

On 10 May 2015 at 23:28, Adam Bartoš dre...@gmail.com wrote:
 Glenn Linderman wrote:
 Is this going to get released in 3.5, I hope?  Python 3 is pretty
 limited without some solution for Unicode on the console... probably the
 biggest deficiency I have found in Python 3, since its introduction. It
 has great Unicode support for files and processing, which convinced me
 to switch from Perl, and I like so much else about it, that I can hardly
 code in Perl any more (I still support a few Perl programs, but have
 ported most of them to Python).

 I'd love to see it included in 3.5, but I doubt that will happen. For one
 thing, it's only two weeks till beta 1, which is feature freeze. And mainly,
 my package is mostly hacking into existing Python environment. A proper
 implementation would need some changes in Python someone would have to do.
 See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm
 not competent to write a patch myself and I have also no feedback to the
 proposed idea. On the other hand, using the package is good enough for me so
 I didn't further bring attention to the proposal.

Right, and while I'm interested in seeing this improved, I'm not
especially familiar with the internal details of our terminal
interaction implementation, and even less so when it comes to the
Windows terminal. Steve Dower's also had his hands full working on the
Windows installer changes, and several of our other Windows folks
aren't C programmers.

PEP 432 (the interpreter startup sequence improvements) will be back
on the agenda for Python 3.6, so the 3.6 time frame seems more
plausible at this point.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-11 Thread Glenn Linderman


On 5/11/2015 1:09 AM, Nick Coghlan wrote:

On 10 May 2015 at 23:28, Adam Bartoš dre...@gmail.com wrote:

Glenn Linderman wrote:

Is this going to get released in 3.5, I hope?  Python 3 is pretty
limited without some solution for Unicode on the console... probably the
biggest deficiency I have found in Python 3, since its introduction. It
has great Unicode support for files and processing, which convinced me
to switch from Perl, and I like so much else about it, that I can hardly
code in Perl any more (I still support a few Perl programs, but have
ported most of them to Python).

I'd love to see it included in 3.5, but I doubt that will happen. For one
thing, it's only two weeks till beta 1, which is feature freeze. And mainly,
my package is mostly hacking into existing Python environment. A proper
implementation would need some changes in Python someone would have to do.
See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm
not competent to write a patch myself and I have also no feedback to the
proposed idea. On the other hand, using the package is good enough for me so
I didn't further bring attention to the proposal.

Right, and while I'm interested in seeing this improved, I'm not
especially familiar with the internal details of our terminal
interaction implementation, and even less so when it comes to the
Windows terminal. Steve Dower's also had his hands full working on the
Windows installer changes, and several of our other Windows folks
aren't C programmers.

PEP 432 (the interpreter startup sequence improvements) will be back
on the agenda for Python 3.6, so the 3.6 time frame seems more
plausible at this point.

Cheers,
Nick.


Wow!  Another bug that'll reach a decade in age before being fixed...

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-11 Thread Nick Coghlan

On 12 May 2015 at 06:38, Glenn Linderman v+pyt...@g.nevcal.com wrote:
 On 5/11/2015 1:09 AM, Nick Coghlan wrote:
 On 10 May 2015 at 23:28, Adam Bartoš dre...@gmail.com wrote:
 I'd love to see it included in 3.5, but I doubt that will happen. For one
 thing, it's only two weeks till beta 1, which is feature freeze. And mainly,
 my package is mostly hacking into existing Python environment. A proper
 implementation would need some changes in Python someone would have to do.
 See for example my proposal http://bugs.python.org/issue17620#msg234439. I'm
 not competent to write a patch myself and I have also no feedback to the
 proposed idea. On the other hand, using the package is good enough for me so
 I didn't further bring attention to the proposal.

 Right, and while I'm interested in seeing this improved, I'm not
 especially familiar with the internal details of our terminal
 interaction implementation, and even less so when it comes to the
 Windows terminal. Steve Dower's also had his hands full working on the
 Windows installer changes, and several of our other Windows folks
 aren't C programmers.

 PEP 432 (the interpreter startup sequence improvements) will be back
 on the agenda for Python 3.6, so the 3.6 time frame seems more
 plausible at this point.

 Cheers,
 Nick.

 Wow!  Another bug that'll reach a decade in age before being fixed...

Yep, that tends to happen with complex cross-platform bugs & RFEs that
require domain expertise in multiple areas to resolve. It's one of the
areas that operating system vendors are typically best equipped to
handle, but we haven't historically had that kind of major
institutional backing for CPython core development (that *is*
changing, but it's a relatively recent phenomenon).

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-10 Thread Adam Bartoš

Glenn Linderman wrote:
 Is this going to get released in 3.5, I hope?  Python 3 is pretty
 limited without some solution for Unicode on the console... probably the
 biggest deficiency I have found in Python 3, since its introduction. It
 has great Unicode support for files and processing, which convinced me
 to switch from Perl, and I like so much else about it, that I can hardly
 code in Perl any more (I still support a few Perl programs, but have
 ported most of them to Python).

I'd love to see it included in 3.5, but I doubt that will happen. For one
thing, it's only two weeks till beta 1, which is feature freeze. And
mainly, my package mostly hacks into the existing Python environment. A
proper implementation would need some changes in Python that someone would
have to make. See for example my proposal
http://bugs.python.org/issue17620#msg234439. I'm not competent to write a
patch myself, and I have received no feedback on the proposed idea. On the
other hand, using the package is good enough for me, so I didn't draw
further attention to the proposal.

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-09 Thread Adam Bartoš

I already have a solution in Python 3 (see
https://github.com/Drekin/win-unicode-console,
https://pypi.python.org/pypi/win_unicode_console); I was just considering
adding support for Python 2 as well. I think I have a working example in
Python 2 using ctypes.

On Thu, May 7, 2015 at 9:23 PM, Martin v. Löwis mar...@v.loewis.de
wrote:

 Am 02.05.15 um 21:57 schrieb Adam Bartoš:
  Even if sys.stdin contained a file-like object with proper encoding
  attribute, it wouldn't work since sys.stdin has to be instance of type
  'file'. So the question is, whether it is possible to make a file
 instance
  in Python that is also customizable so it may call my code. For the first
  thing, how to change the value of encoding attribute of a file object.

 If, by "in Python", you mean both "in pure Python" and "in Python 2",
 then the answer is no. If you can add arbitrary C code, then you might
 be able to hack your C library's stdio implementation to delegate fread
 calls to your code.

 I recommend to use Python 3 instead.

 Regards,
 Martin



Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-09 Thread Glenn Linderman


On 5/9/2015 5:39 AM, Adam Bartoš wrote:
I already have a solution in Python 3 (see
https://github.com/Drekin/win-unicode-console,
https://pypi.python.org/pypi/win_unicode_console); I was just
considering adding support for Python 2 as well. I think I have a
working example in Python 2 using ctypes.


Is this going to get released in 3.5, I hope?  Python 3 is pretty 
limited without some solution for Unicode on the console... probably the 
biggest deficiency I have found in Python 3, since its introduction. It 
has great Unicode support for files and processing, which convinced me 
to switch from Perl, and I like so much else about it, that I can hardly 
code in Perl any more (I still support a few Perl programs, but have 
ported most of them to Python).


I wondered if all your recent questions about Py 2 were as a result of 
porting the above to Py 2... I only have one program left that I was 
forced to write in Py 2 because of library dependencies, and I think 
that library is finally being ported to Py 3, whew!  So while I laud 
your efforts, and no doubt they will benefit some folks for a few years 
yet, I hope never to use your Py 2 port myself!

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-07 Thread Martin v. Löwis

Am 02.05.15 um 21:57 schrieb Adam Bartoš:
 Even if sys.stdin contained a file-like object with proper encoding
 attribute, it wouldn't work since sys.stdin has to be instance of type
 'file'. So the question is, whether it is possible to make a file instance
 in Python that is also customizable so it may call my code. For the first
 thing, how to change the value of encoding attribute of a file object.

If, by "in Python", you mean both "in pure Python" and "in Python 2",
then the answer is no. If you can add arbitrary C code, then you might
be able to hack your C library's stdio implementation to delegate fread
calls to your code.

I recommend using Python 3 instead.

Regards,
Martin


Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-02 Thread Stephen J. Turnbull

Adam Bartoš writes:

  I'll describe my picture of the situation, which might be terribly wrong.
  On Linux, in a typical situation, we have a UTF-8 terminal,
  PYTHONIOENCODING=utf-8, GNU readline is used. When the REPL wants input
  from a user the tokenizer calls PyOS_Readline, which calls GNU readline.
  The user is prompted with ">>> ", during the input he can use
  autocompletion and everything, and he enters u'α'. PyOS_Readline returns
  b"u'\xce\xb1'" (as char* or something),

It's char*, according to Parser/myreadline.c.  It is not str in Python
2.

  which is UTF-8 encoded input from the user.

By default, it's just ASCII-compatible bytes.  I don't know offhand
where, but somehow PYTHONIOENCODING tells Python it's UTF-8 -- that's
how Python knows about it in this situation.

  The tokenizer, parser, and evaluator process the input and the result is
  u'\u03b1', which is printed as an answer.
 
  In my case I install custom sys.std* objects and a custom readline
  hook.  Again, the tokenizer calls PyOS_Readline, which calls my
  readline hook, which calls sys.stdin.readline(),

This is your custom version?

  which returns a Unicode string a user entered (it was decoded from
  UTF-16-LE bytes actually). My readline hook encodes this string to
  UTF-8 and returns it. So the situation is the same.  The tokenizer
  gets b"u'\xce\xb1'" as before, but now it results in u'\xce\xb1'.
  
  Why is the result different?

The result is different because Python doesn't learn that the actual
encoding is UTF-8.  If you have tried setting PYTHONIOENCODING=utf-8
with your setup and that doesn't work, I'm not sure where the
communication is failing.

The only other thing I can think of is to set sys.stdin.encoding
directly.  That may be read-only, though (that would explain why the
only way to set the I/O encoding is via the PYTHONIOENCODING environment
variable).  At least you could find out what it is, with and without
PYTHONIOENCODING set to 'utf-8' (or maybe it's 'utf8' or 'UTF-8' --
all work as expected with unicode.encode/str.decode on Mac OS X).
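
A minimal sketch of the two checks suggested above, written for Python 3
(the thread concerns Python 2.7, where the file type differs, so treat this
as an illustration only). It assumes nothing beyond the standard library;
the in-memory stream stands in for sys.stdin so the example does not depend
on how the interpreter was started.

```python
import codecs
import io

# The three spellings mentioned above all resolve to the same codec.
names = ["utf-8", "utf8", "UTF-8"]
canonical = {codecs.lookup(name).name for name in names}
print(canonical)

# A text stream's encoding is chosen when the stream is created.
# Here an in-memory stream stands in for sys.stdin.
stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    stream.encoding = "cp852"      # attempt to change it afterwards
    encoding_is_writable = True
except AttributeError:
    encoding_is_writable = False   # the attribute is read-only

print(stream.encoding, encoding_is_writable)
```

If the attribute cannot be assigned, the remaining options are to create a
new stream object with the desired encoding or to set PYTHONIOENCODING
before the interpreter starts.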

Or it could be unimplemented in your replacement module.

  I thought that in the first case PyCF_SOURCE_IS_UTF8 might have been
  set. And after your suggestion, I thought that
  PYTHONIOENCODING=utf-8 is the thing that also sets
  PyCF_SOURCE_IS_UTF8.

No.  PyCF_SOURCE_IS_UTF8 is set unconditionally in the functions
builtin_{eval,exec,compile}_impl in Python/bltinmodule.c in the cases that
matter AFAICS.  It's not obvious to me under what conditions it might
*not* be set.  It is then consulted in ast.c in PyAST_FromNodeObject,
and nowhere else that I can see.


Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-02 Thread Adam Bartoš

I think I have found out where the problem is. In fact, the encoding of the
interactive input is determined by sys.stdin.encoding, but only in the case
that it is a file object (see
https://hg.python.org/cpython/file/d356e68de236/Parser/tokenizer.c#l890 and
the implementation of tok_stdin_decode). For example, by default on my
system sys.stdin has encoding cp852.

>>> u'á'
u'\xe1' # correct
>>> import sys; sys.stdin = "foo"
>>> u'á'
u'\xa0' # incorrect

Even if sys.stdin contained a file-like object with a proper encoding
attribute, it wouldn't work, since sys.stdin has to be an instance of type
'file'. So the question is whether it is possible to make a file instance
in Python that is also customizable so that it may call my code. And as a
first step, how to change the value of the encoding attribute of a file object.
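
The two results in the session above are consistent with the same console
byte being decoded in two different ways. The sketch below only reproduces
the arithmetic with the standard codecs; it does not touch sys.stdin, and
the idea that the fallback acts like a Latin-1 decode is an assumption
inferred from the output shown, not something confirmed by the source.

```python
typed = u"\u00e1"                      # the character á typed by the user

# What a cp852 console hands to the interpreter for that key press.
console_bytes = typed.encode("cp852")

# Decoding with the console's real encoding recovers the character.
right = console_bytes.decode("cp852")

# Treating each byte as one Latin-1 code point gives a different character.
wrong = console_bytes.decode("latin-1")

print(repr(console_bytes), repr(right), repr(wrong))
```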

Re: [Python-Dev] Unicode literals in Python 2.7

2015-05-01 Thread Adam Bartoš

On Fri, May 1, 2015 at 6:14 AM, Stephen J. Turnbull step...@xemacs.org
wrote:

 Adam Bartoš writes:

   Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
   sys.std* streams are created with utf-8 encoding (which doesn't
   help on Windows since they still don't use ReadConsoleW and
   WriteConsoleW to communicate with the terminal) and after changing
   the sys.std* streams to the fixed ones and setting readline hook,
   it still doesn't work,

 I don't see why you would expect it to work: either your code is
 bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
 matter, or you're feeding already decoded text *as UTF-8* to your
 module which evidently expects something else (UTF-16LE?).


I'll describe my picture of the situation, which might be terribly wrong.
On Linux, in a typical situation, we have a UTF-8 terminal,
PYTHONIOENCODING=utf-8, GNU readline is used. When the REPL wants input
from a user the tokenizer calls PyOS_Readline, which calls GNU readline.
The user is prompted with ">>> ", during the input he can use autocompletion
and everything, and he enters u'α'. PyOS_Readline returns b"u'\xce\xb1'" (as
char* or something), which is UTF-8 encoded input from the user. The
tokenizer, parser, and evaluator process the input and the result is
u'\u03b1', which is printed as an answer.

In my case I install custom sys.std* objects and a custom readline hook.
Again, the tokenizer calls PyOS_Readline, which calls my readline hook,
which calls sys.stdin.readline(), which returns a Unicode string a user
entered (it was decoded from UTF-16-LE bytes actually). My readline hook
encodes this string to UTF-8 and returns it. So the situation is the same.
The tokenizer gets b"u'\xce\xb1'" as before, but now it results in
u'\xce\xb1'.
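
The chain of conversions described above can be traced with plain codec
calls. This is only a sketch of the byte-level steps (UTF-16-LE in, UTF-8
out); it does not involve PyOS_Readline or a real readline hook, and the
final Latin-1 decode is an assumption used to reproduce the observed
u'\xce\xb1', not a statement about how the tokenizer is implemented.

```python
# Bytes as the console would deliver them for the character α (UTF-16-LE).
raw = u"\u03b1".encode("utf-16-le")

# The custom stdin object decodes them to a Unicode string ...
text = raw.decode("utf-16-le")

# ... and the readline hook re-encodes that string as UTF-8 for the tokenizer.
hook_result = text.encode("utf-8")

# If the consumer knows the bytes are UTF-8, the character survives.
understood = hook_result.decode("utf-8")

# If each byte is read as one code point, two characters come out instead.
misread = hook_result.decode("latin-1")

print(repr(raw), repr(hook_result), repr(understood), repr(misread))
```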

Why is the result different? I thought that in the first case
PyCF_SOURCE_IS_UTF8 might have been set. And after your suggestion, I
thought that PYTHONIOENCODING=utf-8 is the thing that also sets
PyCF_SOURCE_IS_UTF8.



   so presumably the PyCF_SOURCE_IS_UTF8 is still not set.

 I don't think that flag does what you think it does.  AFAICT from
 looking at the source, that flag gets unconditionally set in the
 execution context for compile, eval, and exec, and it is checked in
 the parser when creating an AST node.  So it looks to me like it
 asserts that the *internal* representation of the program is UTF-8
 *after* transforming the input to an internal representation (doing
 charset decoding, removing comments and line continuations, etc).


I thought it might do what I want because of the behaviour of eval. I
thought that the PyUnicode_AsUTF8String call in eval just encodes the
passed unicode to UTF-8, so the situation looks as follows:
eval(u"u'\u03b1'") -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8 set) -> u'\u03b1'
eval(u"u'\u03b1'".encode('utf-8')) -> (b"u'\xce\xb1'", PyCF_SOURCE_IS_UTF8
not set) -> u'\xce\xb1'
But of course, my picture might be wrong.


  Well, the received text comes from sys.stdin and its encoding is
   known.

 How?  You keep asserting this.  *You* know, but how are you passing
 that information to *the Python interpreter*?  Guido may have a time
 machine, but nobody claims the Python interpreter is telepathic.


I thought that the Python interpreter knows the input comes from sys.stdin
at least to some extent because in pythonrun.c:PyRun_InteractiveOneObject
the encoding for the tokenizer is inferred from sys.stdin.encoding. But
this is actually the case only in Python 3. So I was wrong.


  Yes. In the latter case, eval has no idea how the bytes given are
   encoded.

 Eval *never* knows how bytes are encoded, not even implicitly.  That's
 one of the important reasons why Python 3 was necessary.  I think you
 know that, but you don't write like you understand the implications
 for your current work, which makes it hard to communicate.


Yes, eval never knows how bytes are encoded. But I meant it in comparison
with the first case where a Unicode string was passed.

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-30 Thread Stephen J. Turnbull

Chris Angelico writes:

  It's legal Unicode, but it doesn't mean what he typed in.

Of course, that's obvious.  My point is "Welcome to the wild wacky
world of soi-disant 'internationalized' software, where what you see
is what you get regardless of what you type."




Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-30 Thread Alexander Walters


does this not work for you?

from __future__ import unicode_literals


On 4/28/2015 16:20, Adam Bartoš wrote:

Hello,

is it possible to somehow tell Python 2.7 to compile code entered in
the interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm
considering adding support for Python 2 in my package
(https://github.com/Drekin/win-unicode-console) and I have run into
the fact that when u'α' is entered in the interactive session, it
results in u'\xce\xb1' rather than u'\u03b1'. As this seems to be a
highly specialized question, I'm asking it here.


Regards, Drekin



Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-30 Thread Stephen J. Turnbull

Adam Bartoš writes:

  Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the
  sys.std* streams are created with utf-8 encoding (which doesn't
  help on Windows since they still don't use ReadConsoleW and
  WriteConsoleW to communicate with the terminal) and after changing
  the sys.std* streams to the fixed ones and setting readline hook,
  it still doesn't work,

I don't see why you would expect it to work: either your code is
bypassing PYTHONIOENCODING=utf-8 processing, and that variable doesn't
matter, or you're feeding already decoded text *as UTF-8* to your
module which evidently expects something else (UTF-16LE?).

  so presumably the PyCF_SOURCE_IS_UTF8 is still not set.

I don't think that flag does what you think it does.  AFAICT from
looking at the source, that flag gets unconditionally set in the
execution context for compile, eval, and exec, and it is checked in
the parser when creating an AST node.  So it looks to me like it
asserts that the *internal* representation of the program is UTF-8
*after* transforming the input to an internal representation (doing
charset decoding, removing comments and line continuations, etc).

   Regarding your environment, the repeated use of "custom" is a red
   flag.  Unless you bundle your whole environment with the code you
   distribute, Python can know nothing about that.  In general, Python
   doesn't know what encoding it is receiving text in.
  
  Well, the received text comes from sys.stdin and its encoding is
  known.

How?  You keep asserting this.  *You* know, but how are you passing
that information to *the Python interpreter*?  Guido may have a time
machine, but nobody claims the Python interpreter is telepathic.

  Ideally, Python would receive the text as Unicode String object so
  there would be no problem with encoding

Forget ideal.  Python 3 was created (among other reasons) to get
closer to that ideal.  But programs in Python 2 are received as str,
which is bytes in an ASCII-compatible encoding, not unicode (unless
otherwise specified by PYTHONIOENCODING or a coding cookie in a source
file, and as far as I know those are the only ways to specify source
encoding).  This specification of "Python program" isn't going to
change in Python 2; that's one of the major unfixable reasons that
Python 2 and Python 3 will be incompatible forever.

  The custom stdio streams and readline hooks are set at runtime by a
  code in sitecustomize. It does not affect IDLE and it is compatible
  with IPython. I would like to also set PyCF_SOURCE_IS_UTF8 at
  runtime from Python e.g. via ctypes. But this may be impossible.

  Yes. In the latter case, eval has no idea how the bytes given are
  encoded.

Eval *never* knows how bytes are encoded, not even implicitly.  That's
one of the important reasons why Python 3 was necessary.  I think you
know that, but you don't write like you understand the implications
for your current work, which makes it hard to communicate.



Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-30 Thread Adam Bartoš

 does this not work for you?

 from __future__ import unicode_literals

No, with unicode_literals I just don't have to use the u'' prefix, but the
wrong interpretation persists.


On Thu, Apr 30, 2015 at 3:03 AM, Stephen J. Turnbull step...@xemacs.org
wrote:


 IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in
 the environment does what you want.


Unfortunately, it doesn't work. With PYTHONIOENCODING=utf-8, the sys.std*
streams are created with utf-8 encoding (which doesn't help on Windows
since they still don't use ReadConsoleW and WriteConsoleW to communicate
with the terminal) and after changing the sys.std* streams to the fixed
ones and setting readline hook, it still doesn't work, so presumably the
PyCF_SOURCE_IS_UTF8 is still not set.



 Regarding your environment, the repeated use of "custom" is a red
 flag.  Unless you bundle your whole environment with the code you
 distribute, Python can know nothing about that.  In general, Python
 doesn't know what encoding it is receiving text in.


Well, the received text comes from sys.stdin and its encoding is known.
Ideally, Python would receive the text as a Unicode string object so there
would be no problem with encoding (see
http://bugs.python.org/issue17620#msg234439 ).


If you *do* know, you can set PyCF_SOURCE_IS_UTF8.  So if you know
 that all of your users will have your custom stdio and readline hooks
 installed (AFAICS, they can't use IDLE or IPython!), then you can
 bundle Python built with the flag set, or perhaps you can do the
 decoding in your custom stdio module.


The custom stdio streams and readline hooks are set at runtime by code in
sitecustomize. This does not affect IDLE and it is compatible with IPython. I
would also like to set PyCF_SOURCE_IS_UTF8 at runtime from Python, e.g. via
ctypes. But this may be impossible.



 Note that even if you have a UTF-8 input source, some users are likely
 to be surprised because IIRC Python doesn't canonicalize in its
 codecs; that is left for higher-level libraries.  Linux UTF-8 is
 usually NFC normalized, while Mac UTF-8 is NFD normalized.
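
The normalization point above can be seen with the standard unicodedata
module. This is a small illustration of NFC versus NFD for one character;
it says nothing about what any particular terminal or OS actually sends.

```python
import unicodedata

composed = u"\u00e9"       # é as a single code point
decomposed = u"e\u0301"    # e followed by COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Both forms render the same, but they compare unequal unless normalized.
same_without_normalizing = composed == decomposed
same_after_nfc = nfc == unicodedata.normalize("NFC", composed)

print(len(nfc), len(nfd), same_without_normalizing, same_after_nfc)
```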


Actually, I have a UTF-16-LE source, but that is not important since it's
decoded to a Python Unicode string object. I have this Unicode string and I
have to return it from the readline hook, but I don't know how to communicate
it to the caller – the tokenizer – so that it is interpreted correctly. Note
that the following works:

>>> eval(raw_input('~~ '))
~~ u'α'
u'\u03b1'

Unfortunately, the REPL works differently than eval/exec on raw_input. It
seems that the only option is to replace the REPL with a custom REPL (e.g.
based on code.InteractiveConsole). However, wrapping up the execution of a
script, so that the custom REPL is invoked at the right place, is
complicated.
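
A rough sketch of the custom-REPL idea, assuming a subclass of
code.InteractiveConsole whose raw_input method returns already-decoded
text. The class name UnicodeConsole and the canned list of input lines are
hypothetical and exist only for illustration; a real implementation would
read from a Unicode-aware console stream instead, and this is not the code
used by win-unicode-console.

```python
import code


class UnicodeConsole(code.InteractiveConsole):
    """Hypothetical console that receives its input as decoded text."""

    def __init__(self, lines, locals=None):
        code.InteractiveConsole.__init__(self, locals)
        self._lines = iter(lines)

    def raw_input(self, prompt=""):
        # A real implementation would read from a Unicode-aware stream here.
        try:
            return next(self._lines)
        except StopIteration:
            raise EOFError  # tells interact() that the session is over


namespace = {}
console = UnicodeConsole([u"x = u'\u03b1'"], locals=namespace)
console.interact(banner="")   # runs the canned session, then returns

print(repr(namespace["x"]))
```

Because each line reaches the compiler as text rather than as bytes of
unknown encoding, the question of which flag is set for the byte path does
not arise in this arrangement.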


   Le 29 avr. 2015 10:36, Adam Bartoš dre...@gmail.com a écrit :
 Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

 Just to be clear, you accept those results as correct, right?


Yes. In the latter case, eval has no idea how the bytes given are encoded.

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Nick Coghlan

On 29 April 2015 at 06:20, Adam Bartoš dre...@gmail.com wrote:
 Hello,

 is it possible to somehow tell Python 2.7 to compile code entered in the
 interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering
 adding support for Python 2 in my package
 (https://github.com/Drekin/win-unicode-console) and I have run into the fact
 that when u'α' is entered in the interactive session, it results in
 u'\xce\xb1' rather than u'\u03b1'. As this seems to be a highly specialized
 question, I'm asking it here.

As far as I am aware, we don't have the equivalent of a coding
cookie for the interactive interpreter, so if anyone else knows how
to do it, I'll be learning something too :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš

This situation is a bit different from coding cookies. They are used when
we have bytes from a source file, but we don't know its encoding. During an
interactive session the tokenizer always knows the encoding of the bytes. I
would think that in the case of an interactive session PyCF_SOURCE_IS_UTF8
should always be set so the bytes containing encoded non-ASCII characters
are interpreted correctly. Why am I talking about PyCF_SOURCE_IS_UTF8?
eval(u"u'\u03b1'") -> u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) ->
u'\xce\xb1'. I understand that in the second case eval has no idea how
the given bytes are encoded. But the first case is actually implemented by
encoding to utf-8 and setting PyCF_SOURCE_IS_UTF8. That's why I'm talking
about the flag.
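
For comparison, here is how a coding cookie changes the interpretation of
the same source bytes. The sketch runs on Python 3, where compile() accepts
bytes and honours a cookie on the first line; Python 2.7 handles byte
sources differently, so this illustrates the cookie mechanism rather than
reproducing the 2.7 behaviour discussed in this thread. The helper name run
is made up for the example.

```python
# The same bytes for a one-line program, containing α encoded as UTF-8.
utf8_bytes = u"s = '\u03b1'\n".encode("utf-8")


def run(source_bytes):
    """Compile and execute byte source, returning the value bound to s."""
    namespace = {}
    exec(compile(source_bytes, "<demo>", "exec"), namespace)
    return namespace["s"]


# The cookie on the first line tells the compiler how to decode the rest.
as_utf8 = run(b"# -*- coding: utf-8 -*-\n" + utf8_bytes)
as_latin1 = run(b"# -*- coding: latin-1 -*-\n" + utf8_bytes)

print(repr(as_utf8), repr(as_latin1))
```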

Regards, Drekin

On Wed, Apr 29, 2015 at 9:25 AM, Nick Coghlan ncogh...@gmail.com wrote:

 On 29 April 2015 at 06:20, Adam Bartoš dre...@gmail.com wrote:
  Hello,
 
  is it possible to somehow tell Python 2.7 to compile code entered in the
  interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering
  adding support for Python 2 in my package
  (https://github.com/Drekin/win-unicode-console) and I have run into the fact
  that when u'α' is entered in the interactive session, it results in
  u'\xce\xb1' rather than u'\u03b1'. As this seems to be a highly specialized
  question, I'm asking it here.

 As far as I am aware, we don't have the equivalent of a coding
 cookie for the interactive interpreter, so if anyone else knows how
 to do it, I'll be learning something too :)

 Cheers,
 Nick.

 --
 Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Guido van Rossum

I suspect the interactive session is *not* always in UTF8. It probably
depends on the keyboard mapping of your terminal emulator. I imagine in
Windows it's the current code page.

On Wed, Apr 29, 2015 at 9:19 AM, Adam Bartoš dre...@gmail.com wrote:

 Yes, that works for eval. But I want it for code entered during an
 interactive session.

 >>> u'α'
 u'\xce\xb1'

 The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows
 it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of
 how eval works, I believe that it would work correctly if the
 PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there
 is a way to set it. Also, my naive thought is that it should be always set
 in the case of interactive session.


 On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner victor.stin...@gmail.com
 wrote:

 Le 29 avr. 2015 10:36, Adam Bartoš dre...@gmail.com a écrit :
  Why I'm talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
 u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

 There is a simple option to get this flag: call eval() with unicode, not
 with encoded bytes.

 Victor







-- 
--Guido van Rossum (python.org/~guido)

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš

Yes, that works for eval. But I want it for code entered during an
interactive session.

>>> u'α'
u'\xce\xb1'

The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows
it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of
how eval works, I believe that it would work correctly if the
PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there
is a way to set it. Also, my naive thought is that it should be always set
in the case of interactive session.


On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner victor.stin...@gmail.com
wrote:

 On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
  Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
 u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

 There is a simple option to get this flag: call eval() with unicode, not
 with encoded bytes.

 Victor


Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Adam Bartoš

I am on Windows and my terminal isn't utf-8 at the beginning, but I install
custom sys.std* objects at runtime and I also install a custom readline hook,
so the interactive loop gets the input from my stream objects via
PyOS_Readline. So when I enter u'α', the tokenizer gets b"u'\xce\xb1'",
which is the string encoded in utf-8, and sys.stdin.encoding == 'utf-8'.
However, the input is then interpreted as u'\xce\xb1' instead of u'\u03b1'.

On Wed, Apr 29, 2015 at 6:40 PM, Guido van Rossum gu...@python.org wrote:

 I suspect the interactive session is *not* always in UTF8. It probably
 depends on the keyboard mapping of your terminal emulator. I imagine in
 Windows it's the current code page.

 On Wed, Apr 29, 2015 at 9:19 AM, Adam Bartoš dre...@gmail.com wrote:

 Yes, that works for eval. But I want it for code entered during an
 interactive session.

 >>> u'α'
 u'\xce\xb1'

 The tokenizer gets b"u'\xce\xb1'" by calling PyOS_Readline and it knows
 it's utf-8 encoded. But the result of evaluation is u'\xce\xb1'. Because of
 how eval works, I believe that it would work correctly if the
 PyCF_SOURCE_IS_UTF8 was set, but it is not. That is why I'm asking if there
 is a way to set it. Also, my naive thought is that it should be always set
 in the case of interactive session.


 On Wed, Apr 29, 2015 at 4:59 PM, Victor Stinner victor.stin...@gmail.com
  wrote:

 On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
  Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
 u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

 There is a simple option to get this flag: call eval() with unicode, not
 with encoded bytes.

 Victor



 --
 --Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Oleg Broytman

On Wed, Apr 29, 2015 at 09:40:43AM -0700, Guido van Rossum gu...@python.org 
wrote:
 I suspect the interactive session is *not* always in UTF8. It probably
 depends on the keyboard mapping of your terminal emulator. I imagine in
 Windows it's the current code page.

   Even worse: in w32 it can be an OEM codepage.

Oleg.
-- 
 Oleg Broytman            http://phdru.name/            p...@phdru.name
   Programmers don't die, they just GOSUB without RETURN.

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Stephen J. Turnbull

Adam Bartoš writes:

  I am in Windows and my terminal isn't utf-8 at the beginning, but I
  install custom sys.std* objects at runtime and I also install
  custom readline hook,

IIRC, on the Linux console and in an uxterm, PYTHONIOENCODING=utf-8 in
the environment does what you want.  (Can't test at the moment, I'm on
a Mac and Terminal.app somehow fails to pass the right thing to Python
from the input methods I have available -- I get an empty string,
while I don't seem to have an uxterm, only an xterm.)  This has to be
set at interpreter startup; once the interpreter has decided its IO
encoding, you can't change it, you can only override it by
intercepting the console input and decoding it yourself.
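
A rough way to check the startup-time behaviour described above is to launch a fresh interpreter with PYTHONIOENCODING set and ask it what encoding it chose for stdin. This is only a sketch written for Python 3; the helper name child_stdin_encoding is made up for illustration, and the result on any particular terminal or Python version is not guaranteed by anything in this thread.

```python
import os
import subprocess
import sys


def child_stdin_encoding(io_encoding):
    """Start a new interpreter with PYTHONIOENCODING set and return
    the encoding it reports for sys.stdin (hypothetical helper)."""
    # The variable has to be in the environment before the child starts;
    # changing it inside an already running interpreter has no effect.
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdin.encoding)"],
        env=env,
        input="",              # give the child a pipe as stdin
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(child_stdin_encoding("utf-8"))
```

Running the same check without the variable set shows what the interpreter would have picked on its own for that environment.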

Regarding your environment, the repeated use of "custom" is a red
flag.  Unless you bundle your whole environment with the code you
distribute, Python can know nothing about that.  In general, Python
doesn't know what encoding it is receiving text in.

If you *do* know, you can set PyCF_SOURCE_IS_UTF8.  So if you know
that all of your users will have your custom stdio and readline hooks
installed (AFAICS, they can't use IDLE or IPython!), then you can
bundle Python built with the flag set, or perhaps you can do the
decoding in your custom stdio module.
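
The "do the decoding yourself" option might look roughly like the following: decode each raw input line with the encoding you know it has, and hand text (not bytes) to the compiler. This is a sketch in Python 3 using the standard code module, not the approach win-unicode-console actually takes; run_encoded_lines is a hypothetical name.

```python
import code


def run_encoded_lines(byte_lines, encoding="utf-8"):
    """Decode raw input lines ourselves, then feed the resulting text
    to an interactive console (hypothetical helper for illustration)."""
    namespace = {}
    console = code.InteractiveConsole(locals=namespace)
    for raw in byte_lines:
        # The caller asserts the encoding; the console only ever sees text.
        console.push(raw.decode(encoding))
    return namespace


if __name__ == "__main__":
    # b"\xce\xb1" is the UTF-8 encoding of GREEK SMALL LETTER ALPHA.
    ns = run_encoded_lines([b"s = '\xce\xb1'", b"n = len(s)"])
    print(ascii(ns["s"]), ns["n"])
```

The design point is that the encoding decision is made once, at the boundary where the bytes arrive, rather than left for the compiler to guess.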

Note that even if you have a UTF-8 input source, some users are likely
to be surprised because IIRC Python doesn't canonicalize in its
codecs; that is left for higher-level libraries.  Linux UTF-8 is
usually NFC normalized, while Mac UTF-8 is NFD normalized.
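
To make the normalization point concrete, here is a small Python 3 sketch with the standard unicodedata module. The sample character (e with acute accent) and the helper name same_text are chosen for illustration only.

```python
import unicodedata

composed = "\u00e9"      # NFC form: one code point, LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # NFD form: "e" followed by COMBINING ACUTE ACCENT


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(composed == decomposed)            # raw comparison of code points
    print(same_text(composed, decomposed))   # comparison after normalization
    print(len(composed), len(decomposed))
```

The codecs leave the code points exactly as they arrived, so any comparison that should treat the two forms as equal has to normalize explicitly.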

   u'\xce\xb1'

Note that that is perfectly legal Unicode.

   On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
    Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
   u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

Just to be clear, you accept those results as correct, right?



Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Chris Angelico

On Thu, Apr 30, 2015 at 11:03 AM, Stephen J. Turnbull
step...@xemacs.org wrote:
 Note that even if you have a UTF-8 input source, some users are likely
 to be surprised because IIRC Python doesn't canonicalize in its
 codecs; that is left for higher-level libraries.  Linux UTF-8 is
 usually NFC normalized, while Mac UTF-8 is NFD normalized.

u'\xce\xb1'

 Note that that is perfectly legal Unicode.

It's legal Unicode, but it doesn't mean what he typed in. This means:

'\xce' LATIN CAPITAL LETTER I WITH CIRCUMFLEX
'\xb1' PLUS-MINUS SIGN

but the original input was:

'\u03b1' GREEK SMALL LETTER ALPHA
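
The mismatch above can be reproduced by decoding the UTF-8 bytes of alpha one byte at a time, as Latin-1 does. The following is a Python 3 sketch; describe is a hypothetical helper, and Latin-1 stands in here for "each byte becomes one code point", which is the effect seen in the Python 2 session.

```python
import unicodedata

typed = "\u03b1"                          # what the user typed
utf8_bytes = typed.encode("utf-8")        # b'\xce\xb1'
misread = utf8_bytes.decode("latin-1")    # each byte taken as its own code point


def describe(text):
    """Return (code point, Unicode name) pairs for every character."""
    return [("U+%04X" % ord(ch), unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    print(describe(typed))
    print(describe(misread))
    # As long as no byte was lost, the misreading can be undone:
    print(misread.encode("latin-1").decode("utf-8") == typed)
```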

ChrisA

Re: [Python-Dev] Unicode literals in Python 2.7

2015-04-29 Thread Victor Stinner

On 29 Apr 2015 10:36, Adam Bartoš dre...@gmail.com wrote:
 Why am I talking about PyCF_SOURCE_IS_UTF8? eval(u"u'\u03b1'") ->
u'\u03b1' but eval(u"u'\u03b1'".encode('utf-8')) -> u'\xce\xb1'.

There is a simple option to get this flag: call eval() with unicode, not
with encoded bytes.
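
The text-versus-bytes difference can be imitated in Python 3, with one important caveat: this is an analogy, not the Python 2.7 code path. Python 3 decodes byte source as UTF-8 by default, so the sketch adds a latin-1 coding cookie to force the byte-per-character reading that the thread observes in 2.7. The helper evaluate_literal is hypothetical.

```python
def evaluate_literal(source):
    """Compile and run `value = <source>`, where source is str or bytes,
    and return the resulting value (hypothetical helper)."""
    namespace = {}
    if isinstance(source, bytes):
        # Assumption for the demo: the cookie makes the compiler read each
        # byte as one character, imitating source not flagged as UTF-8.
        program = b"# -*- coding: latin-1 -*-\nvalue = " + source + b"\n"
    else:
        # Text source carries no encoding question at all.
        program = "value = " + source + "\n"
    exec(compile(program, "<demo>", "exec"), namespace)
    return namespace["value"]


alpha_literal = "'\u03b1'"                       # a quoted alpha, as source text
from_text = evaluate_literal(alpha_literal)
from_bytes = evaluate_literal(alpha_literal.encode("utf-8"))

if __name__ == "__main__":
    print(ascii(from_text), ascii(from_bytes))
```

Passing text avoids the question of how the bytes should be decoded, which is the point of the suggestion above.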

Victor

[Python-Dev] Unicode literals in Python 2.7

2015-04-28 Thread Adam Bartoš

Hello,

is it possible to somehow tell Python 2.7 to compile code entered in the
interactive session with the flag PyCF_SOURCE_IS_UTF8 set? I'm considering
adding support for Python 2 in my package
(https://github.com/Drekin/win-unicode-console) and I have run into the fact
that when u'α' is entered in the interactive session, it results in
u'\xce\xb1' rather than u'\u03b1'. As this seems to be a highly specialized
question, I'm asking it here.

Regards, Drekin