Ezio Melotti <[email protected]> added the comment:
So to summarize a bit, there are different possible levels of strictness:
1) all the possible encodable values, including the ones >10FFFF;
2) values in range 0..10FFFF;
3) values in range 0..10FFFF except surrogates (aka scalar values);
4) values in range 0..10FFFF except surrogates and noncharacters;
and this is what is currently available in Python:
1) not available, probably it will never be;
2) available through the 'surrogatepass' error handler;
3) default behavior (i.e. with the 'strict' error handler);
4) currently not available.
(note: this refers to the utf-8 codec in Python 3, but it should be true for
the utf-16/32 codecs too once #12892 is fixed. This whole message refers to
codecs only and what they should (dis)allow. What we use internally seems to
work fine and doesn't need to be changed.)
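
For reference, the current state of the four levels can be checked with a
short sketch. It only uses the built-in utf-8 codec with the 'strict' and
'surrogatepass' handlers; the variable names are just for illustration:

```python
# Level 3 (default 'strict'): lone surrogates are rejected.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    strict_rejects_surrogate = False
except UnicodeEncodeError:
    strict_rejects_surrogate = True

# Level 2 ('surrogatepass'): surrogates are let through in both directions.
surrogate_bytes = lone_surrogate.encode("utf-8", "surrogatepass")
round_trip = surrogate_bytes.decode("utf-8", "surrogatepass")

# Level 4 is not available: a noncharacter such as U+FFFE is accepted
# even by 'strict'.
nonchar_bytes = "\ufffe".encode("utf-8")

# Level 1 is not available: the byte sequence that would stand for
# U+110000 is rejected when decoding.
try:
    b"\xf4\x90\x80\x80".decode("utf-8")
    rejects_beyond_10ffff = False
except UnicodeDecodeError:
    rejects_beyond_10ffff = True
```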
Now, assume that we don't care about option 1 and want to implement the missing
option 4 (which I'm still not 100% sure about). The possible options are:
* add a new codec (actually one for each UTF encoding);
* add a new error handler that explicitly disallows noncharacters;
* change the meaning of 'strict' to match option 4.
This depends on what should be the default behavior while dealing with
noncharacters. If they are rejected by default, then 'strict' should reject
them. However this would leave us without option 3 (something to encode all
and only scalar values), and surrogatepass will be misnamed if it also allows
noncharacters (and if it doesn't we will end up without option 2 too). This is
apparently what Perl does:
> Perl will never ever produce nor accept one of the 66 noncharacters
> on any stream marked as one of the 7 character encoding schemes.
Implementation-wise, I think the 'strict' error handler should be the strictest
one, because the codec must detect all the "problematic" chars and send them
to the error handler that might then decide what to do with them. I.e. if the
codec detects noncharacters, sends them to the error handler, and the error
handler is strict, an error will be raised; if it doesn't detect them, the
error handler won't be able to do anything with them.
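
To illustrate that point, here is a small sketch with a custom error handler
that records whatever the codec hands to it. The handler name 'record_demo'
and the function are made up for this example; codecs.register_error is the
existing registration API. With the current utf-8 codec the handler sees the
surrogate but never sees the noncharacter:

```python
import codecs

seen = []

def recording_handler(exc):
    # Remember the offending slice, then substitute "?" and resume
    # right after it.
    seen.append(exc.object[exc.start:exc.end])
    return ("?", exc.end)

# 'record_demo' is a hypothetical name used only in this sketch.
codecs.register_error("record_demo", recording_handler)

# U+D800 is a lone surrogate, U+FFFE is a noncharacter.
encoded = "a\ud800b\ufffec".encode("utf-8", "record_demo")
```

Only the surrogate ends up in `seen`; the noncharacter is encoded as usual
because the codec never reports it.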
Another option is to provide another codec that specifically detects them, but
this means re-implementing a slightly different version of each codec (or
possibly adding an extra argument to the PyUnicode_{Encode,Decode}UTF* functions).
We could also decide to leave the handling of noncharacters as it is -- after
all the Unicode standard doesn't seem to explicitly forbid them as it does with
e.g. surrogates.
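
For comparison, the check that option 4 needs can be sketched in pure Python
on top of the existing codec. This is only an illustration of the rule (the
66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of
the 17 planes), not a proposal for how the codecs should implement it; both
function names are hypothetical:

```python
def is_noncharacter(code_point):
    """Return True for the 66 Unicode noncharacters."""
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True
    # U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ..., U+10FFFE/U+10FFFF
    return code_point <= 0x10FFFF and (code_point & 0xFFFE) == 0xFFFE

def encode_rejecting_noncharacters(text, encoding="utf-8"):
    """Encode text, raising UnicodeEncodeError on the first noncharacter."""
    for index, char in enumerate(text):
        if is_noncharacter(ord(char)):
            raise UnicodeEncodeError(
                encoding, text, index, index + 1, "noncharacter not allowed"
            )
    return text.encode(encoding)

noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```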
> We have a flavor of non-strict utf8, spelled "utf8" instead of "UTF-8",
> that can produce and accept illegal characters, although by default it
> is still going to generate a warning
How did Perl implement this? With two (or more) slightly different versions of
the same codec?
And how does Perl handle errors? With some global option that turns (possibly
specific) warnings into errors (like python -We)?
Python has different codecs that encode/decode str/bytes and whenever they find
a "problematic" char they send it to the error handler that might decide to
raise an error, remove the char, replace it with something else, send it
back unchanged, generate a warning and so on. In this way you can have
different combinations of codecs and error handlers to get the desired
behaviors. (and FWIW in Python 'utf8' is an alias for 'UTF-8'.)
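
A quick sketch of how one codec combines with different built-in error
handlers, and of the 'utf8' alias. The sample string and variable names are
arbitrary:

```python
import codecs

text = "abc\u20ac"  # the euro sign cannot be encoded as ASCII

# Same codec, different handlers, different outcomes.
results = {
    handler: text.encode("ascii", handler)
    for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
}

# 'strict' raises instead of producing output.
try:
    text.encode("ascii", "strict")
    strict_raised = False
except UnicodeEncodeError:
    strict_raised = True

# 'utf8' and 'UTF-8' resolve to the same codec.
same_codec = codecs.lookup("utf8").name == codecs.lookup("UTF-8").name
```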
> I think a big problem here is that the Python culture doesn't use stream
> encodings enough. People are always making their own repeated and tedious
> calls to encode and then sending stuff out a byte stream, by which time it
> is too late to check.
> [...]
> Anything that deals with streams should have an encoding argument. But
> often/many? things in Python don't.
Several objects have .encoding and .errors attributes (e.g. sys.stdin/out),
and they are used to encode/decode the text/bytes sent/read to/from them. In
other places we prefer the "explicit is better than implicit" approach and
require the user (or some other higher-level layer) to encode/decode manually.
I'm not sure why you are saying that it's too late to check, and since the
encoding/decoding happens only in a few places I don't think it's tedious at
all (and often it's automatic too).
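
As a sketch of the automatic case, a text stream created with an explicit
encoding and error handler does the checking at write time. An in-memory
io.BytesIO stands in for a real byte stream here, and the variable names are
only for illustration:

```python
import io

raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="utf-8", errors="strict")

# The stream encodes for us; no manual .encode() call is needed.
stream.write("caf\u00e9")
stream.flush()
written = raw.getvalue()

# A lone surrogate is rejected by the stream's own codec and handler.
try:
    stream.write("\ud800")
    stream.flush()
    stream_rejected_surrogate = False
except UnicodeEncodeError:
    stream_rejected_surrogate = True
```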
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue12729>
_______________________________________