[issue13064] Port codecs and error handlers to the new Unicode API

2011-11-17 Thread STINNER Victor

Changes by STINNER Victor victor.stin...@haypocalc.com:


--
resolution:  -> duplicate
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue13064
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13064] Port codecs and error handlers to the new Unicode API

2011-11-16 Thread STINNER Victor

STINNER Victor victor.stin...@haypocalc.com added the comment:

Martin von Loewis implemented this. Thanks, Martin!

--




[issue13064] Port codecs and error handlers to the new Unicode API

2011-09-29 Thread STINNER Victor

New submission from STINNER Victor victor.stin...@haypocalc.com:

We need a new API for error handlers that uses Python objects instead of
Py_UNICODE* strings, and code point indexes instead of UTF-16 code unit
indexes (indexes into the Py_UNICODE* buffer). It is also inefficient to
convert the string to Py_UNICODE at the first encode/decode error.
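
To illustrate the indexing difference, here is a minimal Python sketch (not
part of the original report). It compares the code point index of a character
with its UTF-16 code unit index, which is what an index into a Py_UNICODE*
buffer means on a narrow build. The helper name utf16_unit_index is
hypothetical, and the sketch assumes Python 3.3 or later, where exception
offsets are code point indexes.

```python
def utf16_unit_index(text, cp_index):
    """Return the UTF-16 code unit offset of the code point at cp_index.

    Hypothetical helper, for illustration only.
    """
    # Each UTF-16 code unit is two bytes; non-BMP code points take two units.
    return len(text[:cp_index].encode("utf-16-le")) // 2


text = "a\U0001F600b\xe9"  # 'a', one non-BMP character, 'b', 'e' with acute

# An encode error reports where the first unencodable character starts.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    first_error_start = exc.start  # code point index on Python 3.3+

cp_index = text.index("\xe9")                  # index counted in code points
unit_index = utf16_unit_index(text, cp_index)  # index counted in UTF-16 units

print(first_error_start, cp_index, unit_index)
```

The two indexes diverge after the first non-BMP character, which is why an
error handler that receives UTF-16 unit offsets cannot be applied directly to
a string indexed by code points.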

I added private APIs; we may want to make them public:

 * _PyUnicode_AsASCIIString()
 * _PyUnicode_AsLatin1String()
 * _PyUnicode_AsUTF8String()

--

Martin answered me by mail:

Would you like to work on this? Some thoughts:

- encoding error handlers are easier than decoding, since the encoding
  error API uses Py_UNICODE* for almost no good reason (except to pass
  substrings into the exception object, which is better done with
  PyUnicode_Substring). Decoding has the issue that the error handler
  may produce a replacement string which then needs to be inserted into
  the output.

- for decoding, I suggest duplicating the error handling utility
  function into one that operates on Unicode objects only. Then port
  one codec at a time, and ultimately remove the then-unused Py_UNICODE
  function.

- adding an error handler result into a string may cause widening of the
  string. I can see two approaches:

  a) write decoders in Py_UCS4. This is perhaps best for the rarely-used
     codecs, such as UTF-7.
  b) write the codecs so that they do incremental widening. Start off
     with a Py_UCS1 buffer, and check each decoded character whether it
     is out of range. When you get an error handler result, check
     maxchar and widen the result accordingly.
  c) in principle, there is a third approach: run over the string once,
     collect all error handler results. Then allocate the output string,
     decode again, pasting the replacement strings into the output
     interleaved with regular decoded chars. This seems too complicated
     to implement.
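
Approach (b) could be sketched roughly as follows. This is a toy model in
Python, not the CPython implementation: array typecodes stand in for Py_UCS1,
Py_UCS2 and Py_UCS4 buffers, and the names KINDS, widen and
decode_ascii_widening are hypothetical. The decoder only handles ASCII and
calls on_error(byte) for every byte outside that range.

```python
from array import array

# Buffer kinds, narrowest first: (array typecode, largest code point stored).
# They only loosely mirror Py_UCS1, Py_UCS2 and Py_UCS4.
KINDS = [("B", 0xFF), ("H", 0xFFFF), ("L", 0x10FFFF)]


def widen(buf, kind, maxchar):
    """Return (buffer, kind) able to hold maxchar, copying only if needed."""
    new_kind = kind
    while KINDS[new_kind][1] < maxchar:
        new_kind += 1
    if new_kind != kind:
        buf = array(KINDS[new_kind][0], buf.tolist())
    return buf, new_kind


def decode_ascii_widening(data, on_error):
    """Toy ASCII decoder; on_error(byte) returns a replacement string."""
    buf, kind = array("B"), 0
    for byte in data:
        if byte < 0x80:
            buf.append(byte)  # ASCII always fits, whatever the current kind
            continue
        replacement = on_error(byte)
        if replacement:
            # Check the replacement's maxchar once, then widen if required.
            buf, kind = widen(buf, kind, max(map(ord, replacement)))
            buf.extend(ord(ch) for ch in replacement)
    return "".join(map(chr, buf)), KINDS[kind][0]


narrow = decode_ascii_widening(b"abc", lambda b: "?")
bmp = decode_ascii_widening(b"a\xffb", lambda b: "\ufffd")
wide = decode_ascii_widening(b"a\xffb", lambda b: "\U0001F600")
print(narrow, bmp, wide)
```

The buffer stays narrow for pure ASCII input and is copied into a wider one
only when a replacement string requires it, which is the trade-off that
distinguishes (b) from always decoding into Py_UCS4 as in (a).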

--
components: Unicode
messages: 144639
nosy: haypo
priority: normal
severity: normal
status: open
title: Port codecs and error handlers to the new Unicode API
versions: Python 3.3




[issue13064] Port codecs and error handlers to the new Unicode API

2011-09-29 Thread Ezio Melotti

Changes by Ezio Melotti ezio.melo...@gmail.com:


--
nosy: +ezio.melotti
stage:  -> needs patch
type:  -> feature request
