[issue36819] Crash during encoding using UTF-16/32 and custom error handler

Walter Dörwald Wed, 29 Sep 2021 08:17:22 -0700

Walter Dörwald <[email protected]> added the comment:

The original specification (PEP 293) required that an error handler called for 
encoding *must* return a replacement string (not bytes). This returned string 
must then be encoded again. An exception must be raised only if that fails.


Returning bytes from the encoding error handler is an extension specified by 
PEP 383:

> The error handler interface is extended to allow the encode error handler to 
> return byte strings immediately, in addition to returning Unicode strings 
> which then get encoded again (also see the discussion below).
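
To make the two return types concrete, here is a minimal sketch using the 
`ascii` codec. The handler names `demo-str` and `demo-bytes` are made up for 
this illustration and are not part of any standard registry.

```python
import codecs

def replace_with_str(exc):
    # PEP 293 style: return a str, which the codec then encodes again.
    if isinstance(exc, UnicodeEncodeError):
        return ("?", exc.end)
    raise exc

def replace_with_bytes(exc):
    # PEP 383 extension: return bytes, which are written to the output as is.
    if isinstance(exc, UnicodeEncodeError):
        return (b"\xff", exc.end)
    raise exc

# Hypothetical handler names, chosen only for this example.
codecs.register_error("demo-str", replace_with_str)
codecs.register_error("demo-bytes", replace_with_bytes)

str_result = "a\u20acb".encode("ascii", "demo-str")
bytes_result = "a\u20acb".encode("ascii", "demo-bytes")
print(str_result)    # b'a?b'
print(bytes_result)  # b'a\xffb'
```

Note that the second result is not valid ASCII, which is exactly what the 
extension permits.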

So for item 3 in Serhiy's problem list:

> 3. Incorrect exception can be raised if the error handler returns invalid 
> string/bytes: a non-ASCII string or a bytes object consisting of not a whole 
> number of units.

I get:

🐚 ~/ ❯ python
Python 3.9.7 (default, Sep  3 2021, 12:37:55)
[Clang 12.0.5 (clang-1205.0.22.9)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def bad(exc):
...  return ('\udbc0', exc.start)
...
>>> import codecs
>>> codecs.register_error('bad', bad)
>>> '\udbc0'.encode('utf-16', 'bad')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError

I would have expected an exception message that basically looks like the one 
I'd get if I had used the strict error handler.
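
For comparison, this small sketch captures what the strict handler reports for 
the same input. The variable name `strict_error` is only for illustration.

```python
strict_error = None
try:
    # A lone surrogate cannot be encoded as UTF-16.
    "\udbc0".encode("utf-16", "strict")
except UnicodeEncodeError as exc:
    strict_error = exc

# The strict handler produces a fully populated exception with a message.
print(strict_error)
print(strict_error.start, strict_error.end)
```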

Otherwise, returning a replacement that is unencodable is allowed and should 
raise an exception (which happens here, but with a missing exception message). 
Returning something unencodable might make sense when the error handler is 
able to create replacement characters for some unencodable input but not for 
others, although the error handler can of course always raise an exception 
directly.

Returning invalid bytes is not an issue: they simply get written to the output. 
That's exactly the use case of PEP 383: The bytes couldn't be decoded in the 
specified encoding, so they are "invalid", but the surrogateescape error 
handler encodes them back to the same "invalid" bytes. So the error handler is 
allowed to output bytes that can't be decoded again with the same encoding.
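
A short sketch of that round trip, using the built-in `surrogateescape` 
handler with an input chosen for illustration:

```python
raw = b"caf\xe9"  # Latin-1 bytes; the trailing 0xE9 is not valid UTF-8

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler writes the original "invalid" byte back.
roundtrip = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(roundtrip)
```

The encoder output here cannot be decoded again as strict UTF-8, and that is 
intended.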

Returning a restart position outside the valid range of the length of the 
original string should raise an IndexError according to PEP 293:

> If the callback does not raise an exception (either the one passed in, or a 
> different one), it must return a tuple: `(replacement, newpos)`
> `replacement` is a unicode object that the encoder will encode and emit 
> instead of the unencodable `object[start:end]` part, `newpos` specifies
> a new position within object, where (after encoding the replacement) the 
> encoder will continue encoding.
> Negative values for `newpos` are treated as being relative to end of object. 
> If `newpos` is out of bounds the encoder will raise an `IndexError`.
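
As a sketch of how this looks in practice with the `ascii` codec in CPython 
(the handler names `demo-from-end` and `demo-too-far` are invented for this 
example):

```python
import codecs

def restart_from_end(exc):
    # A negative newpos is interpreted relative to the end of the object.
    return ("?", -1)

def restart_too_far(exc):
    # One past len(object) is out of bounds.
    return ("?", len(exc.object) + 1)

# Hypothetical handler names, chosen only for this example.
codecs.register_error("demo-from-end", restart_from_end)
codecs.register_error("demo-too-far", restart_too_far)

# Encoding resumes at the last character, so "ab" is skipped.
skipped = "\u20acabc".encode("ascii", "demo-from-end")
print(skipped)  # b'?c'

try:
    "\u20acabc".encode("ascii", "demo-too-far")
    out_of_bounds_raised = False
except IndexError as err:
    out_of_bounds_raised = True
    print(err)
```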

Of course we could retroactively reinterpret "out of bounds" as outside of 
`range(exc.start + 1, len(object))`, instead of outside `range(0, 
len(object))`. An error handler that never advances is broken anyway. But we 
can't detect "never".

However, it would probably be OK to reject pathological error handlers, i.e. 
those that don't advance (where advancing means returning at least 
`exc.start + 1` as the restart position). But I'm not sure how that's different 
from an error handler that skips ahead much farther (i.e. returns something 
like `(exc.start+len(object))//2` or `max(exc.start+1, len(object)-10)`): the 
returned restart position leads to a certain expectation of how many bytes the 
encoder might have to output until everything is encoded, and the encoder must 
adjust accordingly.
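
One way such a rejection could look at the Python level is sketched below. 
This is not how CPython behaves; `require_progress` and the handler names are 
hypothetical, and the choice of `IndexError` is an assumption that merely 
follows the "out of bounds" wording of PEP 293.

```python
import codecs

def require_progress(handler):
    """Wrap an error handler so that a non-advancing restart is rejected."""
    def wrapper(exc):
        replacement, newpos = handler(exc)
        # Resolve negative positions relative to the end, as PEP 293 describes.
        resolved = newpos + len(exc.object) if newpos < 0 else newpos
        if resolved <= exc.start:
            raise IndexError(
                f"restart position {newpos} does not advance past {exc.start}"
            )
        return replacement, newpos
    return wrapper

def stuck(exc):
    # Pathological: always restarts at the failing character.
    return ("", exc.start)

# Hypothetical handler names, chosen only for this example.
codecs.register_error("demo-stuck", require_progress(stuck))
codecs.register_error(
    "demo-progress-ok", require_progress(codecs.lookup_error("replace"))
)

try:
    "a\u20acb".encode("ascii", "demo-stuck")
    stuck_rejected = False
except IndexError:
    stuck_rejected = True

ok_result = "a\u20acb".encode("ascii", "demo-progress-ok")
print(stuck_rejected, ok_result)
```

The check says nothing about handlers that skip far ahead, which is the open 
question raised above.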

----------

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue36819>
_______________________________________