[issue13216] Add cp65001 codec

2011-10-18 Thread STINNER Victor

New submission from STINNER Victor:

Thanks to #12281, it is now trivial to implement any Windows code page in 
Python. I don't know if existing code pages (e.g. cp932) should use 
codecs.code_page_encode/.code_page_decode on Windows, or continue to use the 
(portable) Python code.

Users want code page 65001. I consider it useless to set the ANSI code page to 
65001 in a console (see issue #1602), but that's a different story. The 
attached patch implements this code page.

--
components: Unicode
files: cp65001.py
messages: 145871
nosy: amaury.forgeotdarc, haypo, loewis
priority: normal
severity: normal
status: open
title: Add cp65001 codec
versions: Python 3.3
Added file: http://bugs.python.org/file23453/cp65001.py

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13216] Add cp65001 codec

2011-10-18 Thread STINNER Victor

STINNER Victor  added the comment:

> Users want the code page 65001

See issues #6058, #7441 and #10920.

--




[issue13216] Add cp65001 codec

2011-10-19 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

We shouldn't use the MS codec if we have our own, as they may differ.

As for the 65001 bug: is that actually solved by this codec?

--




[issue13216] Add cp65001 codec

2011-10-19 Thread STINNER Victor

STINNER Victor  added the comment:

> We shouldn't use the MS codec if we have our own, as they may differ.

Ok, I agree. The MS codec has a nice replacement behaviour (it searches for a 
similar glyph): for example, cp1252 encodes Ł to b'L'. Our codec raises a 
UnicodeEncodeError on u'\u0141'.encode('cp1252').
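The behaviour of the portable Python codec can be checked with a short sketch. The helper name try_encode is hypothetical. The best-fit mapping of the MS codec (Ł to b'L') is not reproduced here, because it requires the Windows API.

```python
def try_encode(text, encoding, errors="strict"):
    """Return the encoded bytes, or None if the codec rejects the text."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


# U+0141 (LATIN CAPITAL LETTER L WITH STROKE) has no mapping in cp1252.
strict_result = try_encode("\u0141", "cp1252")
replace_result = try_encode("\u0141", "cp1252", "replace")
```

With the 'replace' error handler the Python codec substitutes b'?' rather than picking a similar-looking letter.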

> As for the 65001 bug: is that actually solved by this codec?

Sorry, which bug?

See tests using CP_UTF8 in test_codecs. Depending on the Windows version, you 
don't get the same behaviour on surrogates. Before Windows Vista, surrogates 
were always encoded, whereas you can now choose the behaviour using the Python 
error handler:

    if self.vista_or_later():
        tests.append(('\udc80', 'strict', None))  # None=UnicodeEncodeError
        tests.append(('\udc80', 'ignore', b''))
        tests.append(('\udc80', 'replace', b'?'))
    else:
        tests.append(('\udc80', 'strict', b'\xed\xb2\x80'))
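For comparison, the same lone surrogate can be run through Python's built-in utf-8 codec, which works on any platform. This is only a sketch of the analogous error-handler behaviour, not a test of CP_UTF8 itself. The helper encode_or_none is a name made up for this example.

```python
SURROGATE = "\udc80"  # lone low surrogate, as in the test cases above


def encode_or_none(text, errors):
    """Encode with the built-in utf-8 codec; None stands for UnicodeEncodeError."""
    try:
        return text.encode("utf-8", errors)
    except UnicodeEncodeError:
        return None


results = {
    errors: encode_or_none(SURROGATE, errors)
    for errors in ("strict", "ignore", "replace", "surrogatepass")
}
```

The 'surrogatepass' handler yields the same three bytes as the pre-Vista behaviour, while the other handlers match the Vista-or-later expectations.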

--




[issue13216] Add cp65001 codec

2011-10-19 Thread STINNER Victor

STINNER Victor  added the comment:

> I consider that it is useless to set the ANSI code page to 65001 in a console

I did more tests on the Windows console, focused on output, see:
http://bugs.python.org/issue1602#msg145898

I was wrong, it *is* useful to change the code page to 65001. Even if we have 
fully Unicode compliant sys.stdout and sys.stderr, setting the code page to 
CP_UTF8 (65001) does still improve Unicode support in some cases:

 - if the output (stdout and/or stderr) is redirected
 - if you encode Unicode to the console code page in order to write directly 
   to sys.stdout.buffer and sys.stderr.buffer
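The second case can be sketched as follows. The function write_encoded is hypothetical, and the in-memory stream only stands in for a real console or redirected stdout so that the example runs anywhere.

```python
import io


def write_encoded(stream, text, errors="replace"):
    """Encode text to the stream's encoding and write the bytes directly."""
    encoding = stream.encoding or "utf-8"  # assumption: default to utf-8
    data = text.encode(encoding, errors)
    stream.flush()              # keep ordering with earlier text writes
    stream.buffer.write(data)   # bypass the text layer
    stream.buffer.flush()
    return data


# Stand-in for sys.stdout when the code page is 65001 (UTF-8).
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="utf-8")
written = write_encoded(fake_stdout, "h\u00e9\n")
```

With a legacy code page as the stream encoding, characters outside that code page would be replaced or rejected at the encode step, which is where UTF-8 helps.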

--




[issue13216] Add cp65001 codec

2011-10-19 Thread Martin v . Löwis

Martin v. Löwis  added the comment:

>> As for the 65001 bug: is that actually solved by this codec?
> 
> Sorry, which bug?

#6501 and friends (isn't it interesting that the issue of code page
65001 is reported as bug 6501?)

--




[issue13216] Add cp65001 codec

2011-10-19 Thread STINNER Victor

STINNER Victor  added the comment:

> > Sorry, which bug?

> #6501 and friends

Hum, this particular issue, #6501, doesn't concern code page 65001. The 
typical use case (issues #7441 and #10920) is:

C:\victor\cpython>chcp 65001
Page de codes active : 65001

C:\victor\cpython>pcbuild\python_d.exe
Fatal Python error: Py_Initialize: can't initialize sys standard streams
LookupError: unknown encoding: cp65001


The console and console output code pages may be changed by something else.

The current workaround is to set the PYTHONIOENCODING environment variable to 
utf-8, but as explained in msg132831, the workaround is not applicable if 
Python is embedded or if the program has been frozen by cx-freeze ("cx-freeze 
deliberately sets Py_IgnoreEnvironmentFlag").
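The effect of the workaround can be observed with a small sketch that starts a child interpreter with PYTHONIOENCODING set. The helper name child_stdout_encoding is made up for this example. It assumes a regular (not embedded or frozen) Python that honours environment variables and a version that provides subprocess.run with capture_output.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding="utf-8"):
    """Start a child Python with PYTHONIOENCODING set; report its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


reported = child_stdout_encoding()
```

An interpreter that ignores the environment never sees this variable, which is why the workaround does not help in the embedded and frozen cases.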

--

Issue #6501 was a bug in io.device_encoding(). I fixed it in Python 3.3 and 
I have been waiting... for 5 months... for Graham Dumpleton before backporting 
the fix. The issue also suggests not failing if the encoding cannot be found 
(I dislike this idea).

--




[issue13216] Add cp65001 codec

2011-10-26 Thread Roundup Robot

Roundup Robot  added the comment:

New changeset 0eac706d82d1 by Victor Stinner in branch 'default':
Fix the issue number of my cp65001 commit: 13247 => issue #13216
http://hg.python.org/cpython/rev/0eac706d82d1

--
nosy: +python-dev




[issue13216] Add cp65001 codec

2011-10-26 Thread STINNER Victor

STINNER Victor  added the comment:

New changeset 2cad20e2e588 by Victor Stinner in branch 'default':
Close #13247: Add cp65001 codec, the Windows UTF-8 (CP_UTF8)
http://hg.python.org/cpython/rev/2cad20e2e588

--
resolution:  -> fixed
status: open -> closed




[issue13216] Add cp65001 codec

2011-10-26 Thread STINNER Victor

STINNER Victor  added the comment:

Lib/encodings/cp65001.py uses a little trick to mark the codec as specific to 
Windows:
-
if not hasattr(codecs, 'code_page_encode'):
    raise LookupError("cp65001 encoding is only available on Windows")
-
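From the caller's side, the trick presumably means that a missing cp65001 looks like any other unknown encoding. A minimal sketch of such a check follows. The helper codec_available is a hypothetical name, and the result for "cp65001" itself depends on the platform and the Python version, so it is not shown.

```python
import codecs


def codec_available(name):
    """Return True if codecs.lookup() can resolve the encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


has_utf8 = codec_available("utf-8")
has_bogus = codec_available("no-such-codec-for-this-sketch")
```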

--
