Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info>:

> Of course we have no idea what Marko's software is, or what it is doing,

Correct, you don't, but the link Paul Rubin posted gives you an idea:

   Python 3 says: everything is Unicode (by default, except in certain
   situations, and except if we send you crazy reencoded data, and even
   then it's sometimes still unicode, albeit wrong unicode). Filenames
   are Unicode, Terminals are Unicode, stdin and out are Unicode, there
   is so much Unicode! And because UNIX is not Unicode, Python 3 now has
   the stance that it's right and UNIX is wrong

   <URL: http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/>

> [Marko]
>>> No, as a large number of Python3 facilities require str objects as
>>> arguments. Consider urllib.request.urlopen(), for example, which
>>> requires a URL to be an str object.
>
> That's because URLs are fundamentally text strings.

<URL: https://tools.ietf.org/html/rfc1738>:

   In most URL schemes, the sequences of characters in different parts
   of a URL are used to represent sequences of octets used in Internet
   protocols. For example, in the ftp scheme, the host name, directory
   name and file names are such sequences of octets, represented by
   parts of the URL.

(RFC 3986 says the same thing in a more roundabout way.)

A URL consists of ASCII-only characters that represent an octet string.
Of course, ASCII characters *are* Unicode characters.

> Quick quiz: which of the following are real URLs?
> (a) http://правительство.рф

On the face of it, that is not a valid URL. However, hostnames can be
dealt with somewhat bijectively using punycode.
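
To make the "ASCII characters that represent an octet string" point concrete, here is a minimal sketch using only the standard library. It assumes the path octets are UTF-8 (the URL itself does not say so), and it uses Python's built-in "idna" codec, which implements the older IDNA 2003 rules; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote_to_bytes

host = "правительство.рф"
path = "/ä"

# Hostname: the stdlib "idna" codec turns each label into an ASCII
# "xn--" punycode form, and decoding gives the original name back.
ascii_host = host.encode("idna")
assert ascii_host.decode("idna") == host

# Path: percent-encode the octets. UTF-8 is an assumption made here;
# nothing in the URL records which encoding produced the octets.
ascii_path = quote(path, encoding="utf-8")
print(ascii_path)                    # /%C3%A4
print(unquote_to_bytes(ascii_path))  # b'/\xc3\xa4'

# Both results are pure ASCII, so an encode("ascii") step succeeds.
ascii_path.encode("ascii")
```

In both cases the str that ends up in the URL is only a carrier for octets, and the caller has to do the conversion explicitly.
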
But try this:

   >>> import http.client
   >>> conn = http.client.HTTPConnection("example.com")
   >>> conn.request("GET", "/ä")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python3.5/http/client.py", line 1107, in request
       self._send_request(method, url, body, headers)
     File "/usr/lib64/python3.5/http/client.py", line 1142, in _send_request
       self.putrequest(method, url, **skips)
     File "/usr/lib64/python3.5/http/client.py", line 984, in putrequest
       self._output(request.encode('ascii'))
   UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 5: ordinal not in range(128)
   >>> conn = http.client.HTTPConnection("example.com")
   >>> conn.request("GÄT", "/")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python3.5/http/client.py", line 1107, in request
       self._send_request(method, url, body, headers)
     File "/usr/lib64/python3.5/http/client.py", line 1142, in _send_request
       self.putrequest(method, url, **skips)
     File "/usr/lib64/python3.5/http/client.py", line 984, in putrequest
       self._output(request.encode('ascii'))
   UnicodeEncodeError: 'ascii' codec can't encode character '\xc4' in position 1: ordinal not in range(128)

IOW, the method and URL path given to conn.request are str objects but
they are really just thinly veiled containers for ASCII bytes objects.
That approach is very similar to mine.


Marko