STINNER Victor victor.stin...@haypocalc.com added the comment:
I added a cp65001 codec to Python 3.3: see issue #13216.
--
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue12281
___
Roundup Robot devn...@psf.upfronthosting.co.za added the comment:
New changeset af0800b986b7 by Victor Stinner in branch 'default':
Issue #12281: Rewrite the MBCS codec to handle correctly replace and ignore
http://hg.python.org/cpython/rev/af0800b986b7
--
nosy: +python-dev
Roundup Robot devn...@psf.upfronthosting.co.za added the comment:
New changeset 5841920d1ef6 by Victor Stinner in branch 'default':
Issue #12281: Skip code page tests on non-Windows platforms
http://hg.python.org/cpython/rev/5841920d1ef6
--
___
Roundup Robot devn...@psf.upfronthosting.co.za added the comment:
New changeset 413b89242766 by Victor Stinner in branch 'default':
Issue #12281: Fix test_codecs.test_cp932() on Windows XP
http://hg.python.org/cpython/rev/413b89242766
--
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
test_codecs passes on the Windows XP and Windows Seven buildbots.
--
resolution:  -> fixed
status: open -> closed
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
mbcs6.patch: update patch to tip.
--
Added file: http://bugs.python.org/file23430/mbcs6.patch
___
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file22374/mbcs4.patch
___
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file22389/mbcs5.patch
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
Version 7 of my patch. This patch is ready for review: I implemented all the TODO items.
Summary of the patch (of this issue):
- fix the mbcs encoding to correctly handle the ignore and replace error handlers on all
Windows versions
- the mbcs
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file23430/mbcs6.patch
___
Changes by Ezio Melotti ezio.melo...@gmail.com:
--
nosy: +ezio.melotti
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
> What about something like .decode('mbcs', errors='windows')?
Yes, we can use an error handler specific to the mbcs codec, but I would prefer
not to introduce special error handlers.
For os.fsencode(), we can keep it unchanged, or
STINNER Victor victor.stin...@haypocalc.com added the comment:
Patch version 5 fixes the encode/decode flags on Windows XP. The codecs give
different results on XP and Seven in some cases:
Seven:
- b'\x81\x00abc'.decode('cp932', 'replace') returns '\u30fb\x00abc'
- '\udc80'.encode(CP_UTF8,
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
What is the use of these code_page_encode() functions?
--
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
TODO: add more tests
CP_UTF8:
if self.vista_or_later:
tests.append(('\udc80', 'strict', None))
tests.append(('\udc80', 'ignore', b''))
tests.append(('\udc80', 'replace', b'\xef\xbf\xbd'))
STINNER Victor victor.stin...@haypocalc.com added the comment:
> What is the use of these code_page_encode() functions?
I wrote them to be able to write tests.
We can maybe use them to implement the Python code page codecs using a
custom codec register function: see msg138246. Windows codecs
Amaury Forgeot d'Arc amaur...@gmail.com added the comment:
> I don't know yet how Windows decodes bytes filenames
> (especially how it handles undecodable bytes),
> I suppose that it uses MultiByteToWideChar using cp=CP_ACP and flags=0.
It's likely, yes. But you don't need a new codec function
STINNER Victor victor.stin...@haypocalc.com added the comment:
Patch version 4 (mbcs4.patch):
- fix encode and decode flags depending on the code page and Windows version,
e.g. use WC_ERR_INVALID_CHARS instead of WC_NO_BEST_FIT_CHARS for CP_UTF8 on
Windows Vista and later
- fix usage of the
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file22282/mbcs.patch
___
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file22315/mbcs2.patch
___
Changes by STINNER Victor victor.stin...@haypocalc.com:
Removed file: http://bugs.python.org/file22340/mbcs3.patch
___
STINNER Victor victor.stin...@haypocalc.com added the comment:
Patch version 3:
- add unit tests for code pages 932, 1252, CP_UTF7 and CP_UTF8
- fix encode/decode flags for CP_UTF7/CP_UTF8
- fix the encoding name reported in UnicodeDecodeError; also support the CP_UTF7
and CP_UTF8 code page names
TODO:
-
STINNER Victor victor.stin...@haypocalc.com added the comment:
Using my patch, it is possible to create a codec for any code page on demand:
register a function that checks whether the encoding name starts with "cp" and
ends with a valid code page number.
Even if it is a bad idea to set the OEM code page to
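The on-demand registration described above might look roughly like the
following. This is only a sketch: code_page_search is a hypothetical name, it
assumes the codecs.code_page_encode() and codecs.code_page_decode() functions
from this patch (present on Windows builds only), and it does not check that
the number is a code page Windows actually knows.

```python
import codecs


def code_page_search(name):
    """Hypothetical search function mapping 'cpNNN' to a Windows code page."""
    suffix = name[2:]
    if not name.startswith("cp") or not (suffix.isascii() and suffix.isdigit()):
        return None
    # The code_page_* helpers only exist on Windows builds of Python.
    if not hasattr(codecs, "code_page_encode"):
        return None
    code_page = int(suffix)

    def encode(text, errors="strict"):
        return codecs.code_page_encode(code_page, text, errors)

    def decode(data, errors="strict"):
        # final=True: treat the input as complete, no trailing partial character.
        return codecs.code_page_decode(code_page, data, errors, True)

    return codecs.CodecInfo(encode, decode, name=name)


codecs.register(code_page_search)
```

Search functions are tried in registration order, so names that already have a
built-in codec (cp1252, for example) should still resolve to the built-in one;
this function would only be consulted for the remaining names.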
STINNER Victor victor.stin...@haypocalc.com added the comment:
Version 2 of my patch (mbcs2.patch):
- patch also the encoder: fix ignore/replace depending on the Windows version,
support any error handler: encode character per character if encoding in strict
mode fails
- Add
STINNER Victor victor.stin...@haypocalc.com added the comment:
Example on Windows Vista with ANSI=cp932:
>>> import codecs
>>> codecs.code_page_encode(1252, '\xe9')
(b'\xe9', 1)
>>> codecs.mbcs_encode('\xe9')
...
UnicodeEncodeError: 'mbcs' codec can't encode characters in position 0--1:
invalid
STINNER Victor victor.stin...@haypocalc.com added the comment:
Decode examples, ANSI=cp932:
>>> codecs.code_page_decode(1252, b'\x80')
('\u20ac', 1)
>>> codecs.code_page_decode(932, b'\x82')
...
UnicodeDecodeError: 'mbcs' codec can't decode bytes in position 0--1: No
mapping for the Unicode
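For comparison on any platform, the same two inputs can be fed to Python's
built-in cp1252 and cp932 codecs, which use their own tables rather than
MultiByteToWideChar(). A small sketch; the variable names are illustrative
only, and the error text will differ from the Windows message quoted above.

```python
# Byte 0x80 is the euro sign in code page 1252.
euro = b"\x80".decode("cp1252")

# Byte 0x82 is a lead byte in code page 932, so on its own it is incomplete
# and strict decoding is expected to fail.
try:
    b"\x82".decode("cp932")
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict decode failed:", exc.reason)
else:
    strict_failed = False

print(ascii(euro), strict_failed)
```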
STINNER Victor victor.stin...@haypocalc.com added the comment:
mbcs.patch fixes PyUnicode_DecodeMBCS():
- only use flags=0 if errors=replace on Windows >= Vista or if
errors=ignore on Windows < Vista
- support any error handler
- support any code page (but the code page is hardcoded to CP_ACP)
STINNER Victor victor.stin...@haypocalc.com added the comment:
Example with ANSI=cp932 (on Windows Seven):
- b'abc\xffdef'.decode('mbcs', 'replace') gives 'abc\uf8f3def'
- b'abc\xffdef'.decode('mbcs', 'ignore') gives 'abcdef'
--
nosy: +ocean-city
STINNER Victor victor.stin...@haypocalc.com added the comment:
Example with ANSI=cp932 (on Windows Seven):
- b'abc\xffdef'.decode('mbcs', 'replace') gives 'abc\uf8f3def'
- b'abc\xffdef'.decode('mbcs', 'ignore') gives 'abcdef'
Oh, and b'\xff'.decode('mbcs', 'surrogateescape') gives '\udcff'
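The mbcs codec is only available on Windows, but the surrogateescape behaviour
mentioned above can be illustrated with any codec that rejects the byte. The
sketch below uses utf-8 as a stand-in (an assumption made for portability, not
what the patch tests): the undecodable byte 0xff becomes the lone surrogate
U+DCFF and is restored when encoding with the same handler.

```python
data = b"abc\xffdef"

# 0xff is never valid in UTF-8, so surrogateescape maps it to U+DCFF.
text = data.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns U+DCFF back into the original byte.
round_trip = text.encode("utf-8", "surrogateescape")

print(ascii(text), round_trip == data)
```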
New submission from STINNER Victor victor.stin...@haypocalc.com:
Starting with Python 3.2, the MBCS codec uses MultiByteToWideChar() to decode
bytes using flags=MB_ERR_INVALID_CHARS by default (strict error handler),
flags=0 for the ignore error handler, and raises a ValueError for other error
STINNER Victor victor.stin...@haypocalc.com added the comment:
MBCS codec was changed by #850997. Martin von Loewis proposed solutions to
implement other error handlers in msg19180.
--
___
Changes by STINNER Victor victor.stin...@haypocalc.com:
--
nosy: +loewis
___