David Watson <bai...@users.sourceforge.net> added the comment:

OK, here are new versions of the original patches.
I've tweaked the docs to make clear that ASCII-compatible encodings actually *are* ASCII, and point to an explanation as soon as they're mentioned.

You're right that PyUnicode_AsEncodedString() is the preferable interface for the argument converter (I think I got PyUnicode_AsEncodedObject() from an old version of PyUnicode_FSConverter() :/), but for the ASCII step I've just short-circuited it and used PyUnicode_EncodeASCII() directly, since the converter has already checked that the object is of Unicode type. For the IDNA step, PyUnicode_AsEncodedString() should result in a less confusing error message if the codec returns some non-bytes object one day.

However, the PyBytes_Check() isn't there to check up on the codec, but to check for a bytes argument, which the converter also supports. For that reason, I think encode_hostname would be a misleading name, but I've renamed it hostname_converter after the example of PyUnicode_FSConverter(), and renamed unicode_from_hostname to decode_hostname.

I've also made the converter check for UnicodeEncodeError in the ASCII step, but the end result really is UnicodeError if the IDNA step fails, because the "idna" codec does not use UnicodeEncodeError or UnicodeDecodeError. Complain about that if you wish :)

I think the example I gave in the previous comment was also confusing, so just to be clear...
In /etc/hosts (in UTF-8 encoding):

    127.0.0.2 €
    127.0.0.3 xn--lzg

Without patches:

    >>> from socket import *
    >>> getnameinfo(("127.0.0.3", 0), 0)
    ('xn--lzg', '0')
    >>> getnameinfo(("127.0.0.2", 0), 0)
    ('€', '0')
    >>> getaddrinfo(*_)
    [(2, 1, 6, '', ('127.0.0.3', 0)), (2, 2, 17, '', ('127.0.0.3', 0)), (2, 3, 0, '', ('127.0.0.3', 0))]
    >>> '€'.encode("idna")
    b'xn--lzg'

With patches:

    >>> from socket import *
    >>> getnameinfo(("127.0.0.3", 0), 0)
    ('xn--lzg', '0')
    >>> getnameinfo(("127.0.0.2", 0), 0)
    ('\udce2\udc82\udcac', '0')
    >>> getaddrinfo(*_)
    [(2, 1, 6, '', ('127.0.0.2', 0)), (2, 2, 17, '', ('127.0.0.2', 0)), (2, 3, 0, '', ('127.0.0.2', 0))]
    >>> '\udce2\udc82\udcac'.encode("idna")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 167, in encode
        result.extend(ToASCII(label))
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 76, in ToASCII
        label = nameprep(label)
      File "/home/david/python-patches/python-3/Lib/encodings/idna.py", line 38, in nameprep
        raise UnicodeError("Invalid character %r" % c)
    UnicodeError: Invalid character '\udce2'

The exception at the end demonstrates why surrogateescape strings don't get confused with IDNs.

----------
Added file: http://bugs.python.org/file18272/ascii-surrogateescape-2.diff

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue9377>
_______________________________________
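For anyone who wants to poke at the behaviour above without applying the patches, here is a rough pure-Python sketch of the logic. The real patch is C code in the socket module. The function names are borrowed from the comment, but the bodies are only an assumption about what the patch does (in particular, that the ASCII step uses the surrogateescape error handler), not a copy of it.

```python
def hostname_converter(obj):
    """Hypothetical Python equivalent of the argument converter.

    Assumed logic: bytes pass through unchanged; str is tried as
    ASCII with surrogateescape first, then falls back to IDNA.
    """
    if isinstance(obj, bytes):
        return obj
    if isinstance(obj, str):
        try:
            # ASCII step: lone surrogates U+DC80..U+DCFF map back to raw bytes.
            return obj.encode("ascii", "surrogateescape")
        except UnicodeEncodeError:
            # IDNA step: may raise plain UnicodeError on failure.
            return obj.encode("idna")
    raise TypeError("hostname must be str or bytes, not %s" % type(obj).__name__)


def decode_hostname(raw):
    """Hypothetical equivalent of decode_hostname for names returned by the resolver."""
    return raw.decode("ascii", "surrogateescape")


# The UTF-8 bytes for the euro sign, as they would appear in /etc/hosts.
raw_name = "€".encode("utf-8")

# Decoding yields three lone surrogates rather than the euro sign.
escaped = decode_hostname(raw_name)

# Passing the escaped string back in recovers the original bytes...
round_tripped = hostname_converter(escaped)

# ...whereas a real non-ASCII string goes through the IDNA step.
idna_name = hostname_converter("€")

# The "idna" codec refuses lone surrogates, so an escaped name
# can never be mistaken for an IDN.
try:
    escaped.encode("idna")
    idna_rejects_surrogates = False
except UnicodeError:
    idna_rejects_surrogates = True
```

The point of trying ASCII/surrogateescape before IDNA is that the two input spaces do not overlap: a string either contains only ASCII and escaped bytes, or it contains real non-ASCII characters and is handed to the "idna" codec.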