[Python-Dev] PEP 383 and GUI libraries
(sent only to python-dev, as I am not a subscriber of tahoe-dev)

Zooko wrote:
> [Tahoe] currently uses utf-8 for its internal storage (note: nothing to
> do with reading or writing files from external sources -- only for
> storing filenames in the decentralized storage system which is
> accessed by Tahoe clients), and we can't start putting non-utf-8-valid
> sequences in the "filename" slot because other Tahoe clients would
> then get a UnicodeDecodeError exception when trying to read those
> directories.

So what do you do when someone has an existing file whose name is supposed to be in utf-8, but whose actual bytes are not valid utf-8?

If you have somehow solved that problem, then you're already done -- the PEP's encoding is a no-op on anything that isn't already invalid unicode.

If you have not solved that problem, then those clients will already be getting a UnicodeDecodeError; all the PEP does is make it at least possible for them to recover.

...

> Requirement 1 (unicode): Each filename that you see needs to be valid
> unicode (it is stored internally in utf-8).

(repeating) What does Tahoe do if this is violated? Do you throw an exception right there and not let them copy the file to tahoe? If so, then that same error correction means that utf8b will never differ from utf-8, and you have nothing to worry about.

> Requirement 2 (faithful if unicode):

Doesn't the PEP meet this?

> Requirement 3 (no file left behind):

Doesn't the PEP also meet this? I thought the concern was just that the name used would not be valid unicode, unless the original name was itself valid unicode.

> Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
> "round-tripping"):

Doesn't the PEP also support this? (Only) the invalid bytes get escaped and therefore must be unescaped, but the escapement is reversible.

> 3. (handling collisions) In either case 2.a or 2.b the resulting
> unicode string may already be present in the directory.

This collision is what the use of half-surrogates (as the escape characters) avoids. Such collisions can't be present unless the data was invalid unicode, in which case it was the result of an escapement (unless something other than python is creating new invalid filenames).

-jJ
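For readers following along, here is a minimal sketch of the round-trip property described above. It uses the error handler as it eventually shipped in Python 3.1, where PEP 383's proposed 'python-escape' handler was named 'surrogateescape'; the file names are made-up examples.

    # The escape is a no-op on valid utf-8, and reversible on invalid bytes.
    valid = "Mot\u00f6rhead".encode("utf-8")      # a name that is valid utf-8
    invalid = b"Mot\xf6rhead"                     # the same name as latin-1 bytes

    # Valid utf-8 decodes exactly as it would with 'strict'.
    assert valid.decode("utf-8", "surrogateescape") == "Mot\u00f6rhead"

    # Invalid bytes are mapped to lone surrogates instead of raising...
    name = invalid.decode("utf-8", "surrogateescape")   # 'Mot\udcf6rhead'
    # ...and encoding with the same handler restores the original bytes.
    assert name.encode("utf-8", "surrogateescape") == invalid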
Re: [Python-Dev] PEP 383 and GUI libraries
[cross-posting to python-dev and tahoe-dev]

On Fri, May 1, 2009 at 8:12 PM, James Y Knight wrote:
> If I were designing a new system such as this, I'd probably just go for
> utf8b *always*.

Ah, this would be a very tempting possibility -- abandon all unix users who are slow to embrace our utf-8b future! However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it.

It currently uses utf-8 for its internal storage (note: nothing to do with reading or writing files from external sources -- only for storing filenames in the decentralized storage system which is accessed by Tahoe clients), and we can't start putting non-utf-8-valid sequences in the "filename" slot because other Tahoe clients would then get a UnicodeDecodeError exception when trying to read those directories.

We *could* create a new metadata entry to hold things other than utf-8. Current Tahoe clients would never look at that entry (the metadata is a JSON-serialized dictionary, so we can add a new key name into it without disturbing the existing clients), but future Tahoe clients could look for that new key. That is where it is possible that future versions of Tahoe might be able to benefit from utf-8b or PEP 383, although what PEP 383 offers for this use case remains unclear to me.

> But if you don't do that, then, I still don't see what purpose your
> requirements serve. If I have two systems: one with a UTF-8 locale, and one
> with a Latin-1 locale, why should transmitting filenames from system 1 to
> system 2 through tahoe preserve the raw bytes, but doing the reverse *not*
> preserve the raw bytes? (all byte-sequences are valid in latin-1, remember,
> so they'll all decode into unicode without error, and then be reencoded in
> utf-8...). This seems rather a useless behavior to me.

I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself. Hopefully the following will help.

There are two different things stored in Tahoe for each directory entry: the filename and the metadata.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.

Requirement 1 (unicode): Each filename that you see needs to be valid unicode (it is stored internally in utf-8). This eliminates utf-8b and PEP 383 from being directly applicable to the filename part, although perhaps they could be useful for the metadata part (about which more below).

Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory, if that bytestring is the valid encoding of some string in your stated locale, then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, because it could be that the locale was wrong for this byte string and the resulting successfully-decoded unicode name is gibberish. This is especially acute if the locale is an 8-bit encoding such as latin-1 or windows-1252. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way? It seems like we're not going to do better than requirement 2 (faithful if unicode).

Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, that file will be added into the Tahoe filesystem under *some* name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility). I have heard some developers say that they don't want to support this requirement and would rather tell the users to fix their filenames before they can back up or share those files through Tahoe. On the other hand, users have said that they require this and they are not going to go mucking about with all their filenames just so that they can use my backup and filesharing tool.

Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file. Therefore these three requirements imply that we have to detect such collisions and deal with them somehow. (Thanks to Martin v. Löwis f
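For concreteness, here is a minimal sketch (not Tahoe's actual code) of the decode-with-fallback behaviour described by requirements 2 and 3, plus the collision handling those requirements force. The function name, the latin-1 mojibake fallback, and the "~" collision-resolution scheme are illustrative assumptions only.

    def tahoe_name(raw, locale_encoding, seen):
        """Map one on-disk byte name to a unicode name for a Tahoe directory."""
        try:
            name = raw.decode(locale_encoding)   # Requirement 2: faithful if unicode
        except UnicodeDecodeError:
            name = raw.decode("latin-1")         # Requirement 3: mojibake fallback
        # Requirements 1 and 3 together can collide with a correctly-encoded
        # sibling, so the collision has to be detected and resolved somehow.
        while name in seen:
            name += "~"                          # placeholder resolution strategy
        seen.add(name)
        return name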
Re: [Python-Dev] PEP 383 and GUI libraries
On May 1, 2009, at 9:42 PM, Zooko O'Whielacronx wrote:
> Yep, I reversed the order of encode() and decode(). However, my whole
> statement was utterly wrong and shows that I still didn't fully get it
> yet. I have flip-flopped again and currently think that PEP 383 is
> useless for this use case and that my original plan [1] is still the way
> to go. Please let me know if you spot a flaw in my plan or a
> ridiculousity in my requirements, or if you see a way that PEP 383 can
> help me.

If I were designing a new system such as this, I'd probably just go for utf8b *always*. That is, set the filesystem encoding to utf-8b. The end. All files always keep the same bytes transferring between unix systems. Thus, for the 99% of the world that uses either windows or a utf-8 locale, they get useful filenames inside tahoe. The other 1% of the world that uses something like latin-1, EUC_JP, etc. on their local system sees mojibake filenames in tahoe, but will see the same filename that they put in when they take it back out. Gnome has already used only utf-8 for filename display for a few years now, for example, so this isn't exactly an unheard-of position to take...

But if you don't do that, then I still don't see what purpose your requirements serve. If I have two systems, one with a UTF-8 locale and one with a Latin-1 locale, why should transmitting filenames from system 1 to system 2 through tahoe preserve the raw bytes, but doing the reverse *not* preserve the raw bytes? (All byte sequences are valid in latin-1, remember, so they'll all decode into unicode without error, and then be re-encoded in utf-8...) This seems a rather useless behavior to me.

James
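A small illustration of the asymmetry James is pointing at (the byte strings are made up): every byte string decodes successfully under latin-1, so a name coming from a latin-1 system is silently transcoded rather than preserved byte-for-byte.

    raw = b"caf\xe9"                      # 'cafe' with an accent, as latin-1 bytes
    as_text = raw.decode("latin-1")       # never raises, whatever the bytes are
    stored = as_text.encode("utf-8")      # what ends up in the utf-8 store
    print(stored)                         # b'caf\xc3\xa9' -- not the original bytes

    # Going the other direction, an invalid-utf-8 name would raise (or need
    # escaping), which is why the bytes-preserving behaviour only happens one way.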
Re: [Python-Dev] PEP 383 and GUI libraries
Folks:

Being new to the use of gmail, I accidentally sent the following only to MvL and not to the list. He promptly replied with a helpful counterexample showing that my design can suffer collisions. :-)

Regards,

Zooko

On Fri, May 1, 2009 at 10:38 AM, "Martin v. Löwis" wrote:
>> Requirement: either the unicode string or the bytes are faithfully
>> transmitted from one system to another.
>
> I don't understand this requirement very well, in particular not
> the "faithfully" part.
>
>> That is: if you read a filename from the filesystem, and transmit that
>> filename to another system and use it, then there are two cases:
>
> What do you mean by "use it"? Things like opening files? How does
> that work? In general, a file name valid on one system is invalid
> on a different system - or, at least, refers to a different file
> over there. This is independent of encodings.

Tahoe is a backup and filesharing program, so you might, for example, execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of your "Motörhead" directory to your Tahoe filesystem. Later you, or a friend, might execute "tahoe cp -r tahoe:Motörhead ." to copy everything from that directory within your Tahoe filesystem to your local filesystem. So in this case the flow of information is local_system_1 -> Tahoe -> local_system_2.

Requirement 1 is that for each filename encountered which is a valid encoding in local_system_1, the resulting (unicode) name is transmitted through the Tahoe filesystem and then written out into local_system_2 in the expected way (i.e. just by using the Python unicode APIs and passing the unicode object to them).

Requirement 2 is that for each filename encountered which is not a valid encoding in local_system_1, the original bytes are transmitted through the Tahoe filesystem and then, if the target system is a byte-oriented system such as Linux, the original bytes are written into the target filesystem. (If the target is not Linux then mojibake! But we don't have to go into that now.)

Does that make sense?

> In all your descriptions, I'm puzzled as to where exactly you get
> the source bytes from. If you use the PEP 383 interfaces, you will
> start with character strings, not byte strings, always.

On Mac and Windows, we use the Python unicode APIs, e.g. os.listdir(u"Motörhead"). On Linux and Solaris, we use the Python bytestring APIs, e.g. os.listdir("Motörhead".encode(sys.getfilesystemencoding())).

>> Okay, I find it surprisingly easy to make subtle errors in this encoding
>> stuff, so please let me know if you spot one. Is it true that
>> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
>> 'python-escape') will always produce srcbytes ?
>
> I think you mixed up bytes and unicode here: if srcbytes is indeed
> a bytes object, then you can't apply .encode to it.

Yep, I reversed the order of encode() and decode(). However, my whole statement was utterly wrong and shows that I still didn't fully get it yet. I have flip-flopped again and currently think that PEP 383 is useless for this use case and that my original plan [1] is still the way to go. Please let me know if you spot a flaw in my plan or a ridiculousity in my requirements, or if you see a way that PEP 383 can help me.

Thank you very much.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:47
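A rough sketch of the two idioms described above, not Tahoe's actual code; the platform test and the 'utf-8' fallback are illustrative assumptions.

    import os
    import sys

    def list_names(dirname_unicode):
        if sys.platform in ('win32', 'darwin'):
            # Unicode API: the OS itself stores names as (some form of) Unicode.
            return os.listdir(dirname_unicode)
        else:
            # Byte API on Linux/Solaris: undecodable names stay visible as bytes.
            fse = sys.getfilesystemencoding() or 'utf-8'
            return os.listdir(dirname_unicode.encode(fse))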
Re: [Python-Dev] PEP 383 and GUI libraries
On 01May2009 18:38, Martin v. Löwis wrote:
| > Okay, I am wrong about this. Having a flag to remember whether I had to
| > fall back to the utf-8b trick is one method to implement my requirement,
| > but my actual requirement is this:
| >
| > Requirement: either the unicode string or the bytes are faithfully
| > transmitted from one system to another.
|
| I don't understand this requirement very well, in particular not
| the "faithfully" part.
|
| > That is: if you read a filename from the filesystem, and transmit that
| > filename to another system and use it, then there are two cases:
|
| What do you mean by "use it"? Things like opening files? How does
| that work? In general, a file name valid on one system is invalid
| on a different system - or, at least, refers to a different file
| over there. This is independent of encodings.

I think he's doing a file transfer of some kind and needs to preserve the names. Or I would guess the two systems are not both UNIX, or there is some subtlety not yet mentioned, or he'd just use tar or some other byte-level UNIX tool.

| > Requirement 1: the byte string was valid in the encoding of source
| > system, in which case the unicode name is faithfully transmitted
| > (i.e. the bytes that finally land on the target system are the result of
| > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).
|
| In all your descriptions, I'm puzzled as to where exactly you get
| the source bytes from. If you use the PEP 383 interfaces, you will
| start with character strings, not byte strings, always.

But if both systems do present POSIX layers, it's bytes underneath and the system tools will natively use bytes. He wants to ensure that he can read using python, using listdir, and elsewhere, when he writes using python, preserve the bytes layer. I think. In fact it sounds like he may be translating valid unicode and carefully not altering byte names that don't decode. That in turn implies that the codec may be different on the two systems.

| > Okay, I find it surprisingly easy to make subtle errors in this encoding
| > stuff, so please let me know if you spot one. Is it true that
| > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
| > 'python-escape') will always produce srcbytes ?
|
| I think you mixed up bytes and unicode here: if srcbytes is indeed
| a bytes object, then you can't apply .encode to it.

I think he has encode/decode swapped (I did too back in the uber-thread; if your mapping is one-to-one the distinction is almost arbitrary). However, his assertion/hope is true only if srcencoding == 'utf-8'. The PEP itself says that it works if the decode and encode use the same mapping.

--
Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

"How do you know I'm Mad?" asked Alice.
"You must be," said the Cat, "or you wouldn't have come here."
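A short demonstration of Cameron's last point, using 'surrogateescape', the name under which PEP 383's handler shipped in Python 3.1: the round trip holds when the decode and encode use the same encoding, and the bytes can change when they differ. The example byte strings are made up.

    src = b"abc\xff"

    # Same mapping on both ends: the bytes survive.
    assert src.decode("utf-8", "surrogateescape") \
              .encode("utf-8", "surrogateescape") == src

    # Different mappings: the valid part gets transcoded, so the bytes change.
    mixed = b"caf\xe9"                               # latin-1 bytes
    out = mixed.decode("latin-1", "surrogateescape") \
               .encode("utf-8", "surrogateescape")
    print(out)                                       # b'caf\xc3\xa9', not b'caf\xe9'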
Re: [Python-Dev] PEP 383 and GUI libraries
Zooko O'Whielacronx wrote:
> Following-up to my own post to correct a major error:
>
> Is it true that srcbytes.encode(srcencoding,
> 'python-escape').decode('utf-8', 'python-escape') will always produce
> srcbytes ? That is my Requirement 2.

If you start with bytes, decode with utf-8b to unicode (possibly 'invalid'), and encode the result back to bytes with utf-8b, you should get the original bytes, regardless of what they were. That is the point of PEP 383 -- to reliably roundtrip file 'names' that start as bytes and must end as the same bytes, but which may not otherwise have a unicode decoding.

If you start with invalid unicode text, encode to bytes with utf-8b, and decode back to unicode, you might instead get a different and valid unicode text. An example was given in the discussion. I believe this would be hard to avoid. In any case, it does not matter for the use case of starting with bytes that one wants to temporarily but surely work with as text.

Terry Jan Reedy
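Both directions Terry describes can be shown in a few lines with Python 3's 'surrogateescape' (the handler PEP 383 calls 'python-escape'); the byte and character strings here are illustrative.

    # bytes -> str -> bytes: always round-trips, even for invalid utf-8.
    raw = b"\xc3\xa9 plus junk \xff"
    assert raw.decode("utf-8", "surrogateescape") \
              .encode("utf-8", "surrogateescape") == raw

    # str -> bytes -> str: an 'invalid' string need not round-trip. Here two
    # escape surrogates happen to form valid utf-8, so decoding yields plain text.
    weird = "\udcc3\udca9"                           # escapes for bytes 0xC3 0xA9
    back = weird.encode("utf-8", "surrogateescape") \
                .decode("utf-8", "surrogateescape")
    print(repr(back))                                # 'é', not the original surrogates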
Re: [Python-Dev] PEP 383 and GUI libraries
> Okay, I am wrong about this. Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
>
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.

I don't understand this requirement very well, in particular not the "faithfully" part.

> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:

What do you mean by "use it"? Things like opening files? How does that work? In general, a file name valid on one system is invalid on a different system - or, at least, refers to a different file over there. This is independent of encodings.

> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).

In all your descriptions, I'm puzzled as to where exactly you get the source bytes from. If you use the PEP 383 interfaces, you will start with character strings, not byte strings, always.

> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one. Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ?

I think you mixed up bytes and unicode here: if srcbytes is indeed a bytes object, then you can't apply .encode to it.

Regards,
Martin
Re: [Python-Dev] PEP 383 and GUI libraries
Zooko O'Whielacronx wrote:
> Following-up to my own post to correct a major error:
>
> On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx wrote:
>> Folks:
>>
>> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
>> binary names from the filesystem and store them so that I can regenerate
>> the same byte string later, but it also requires that I *know* whether
>> what I got was a valid string in the expected encoding (which might be
>> utf-8) or whether it was not and I need to fall back to storing the
>> bytes.
>
> Okay, I am wrong about this. Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
>
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.
>
> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:
>
> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).
>
> Requirement 2: the byte string was not valid in the encoding of source
> system, in which case the bytes are faithfully transmitted (i.e. the
> bytes that finally land on the target system are the same as the bytes
> that originated in the source system).
>
> Now I finally understand how fiendishly clever MvL's PEP 383
> generalization of Markus Kuhn's utf-8b trick is! The only thing necessary
> to achieve both of those requirements above is that the 'python-escape'
> error handler is used on the target system .encode() as well as on the
> source system .decode()!
>
> Well, I'm going to have to let this sink in and maybe write some code to
> see if I really understand it.
>
> But if this is right, then I can do away with some of the mechanism that
> I've built up, and instead: Backport PEP 383 to Python 2. And, document
> the PEP 383 trick in some generic, widely respected format such as an
> Internet Draft so that I can explain to other users of the Tahoe data
> (many of whom use other languages than Python) what they have to do if
> they find invalid utf-8 in the data. Oh good, I just realized that Tahoe
> emits only utf-8, so all I have to do is point them to the utf-8b
> documents (such as they are) and explain that to read filenames produced
> by Tahoe they have to implement utf-8b. That's really good that they
> don't have to implement MvL's generalization of that trick to other
> encodings, since utf-8b is already understood by some folks.
>
> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one. Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ? That is my Requirement 2.

No, but srcbytes.encode('utf-8', 'python-escape').decode('utf-8', 'python-escape') == srcbytes. The encodings on both ends need to be the same.

For example:

    >>> b'\x80'.decode('windows-1252')
    u'\u20ac'
    >>> u'\u20ac'.encode('utf-8')
    '\xe2\x82\xac'

Currently:

    >>> b'\x80'.decode('utf-8')
    Traceback (most recent call last):
      File "", line 1, in
        b'\x80'.decode('utf-8')
      File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

But under this PEP:

    >>> b'\x80'.decode('utf-8', 'python-escape')
    u'\udc80'
    >>> u'\udc80'.encode('utf-8', 'python-escape')
    '\x80'
Re: [Python-Dev] PEP 383 and GUI libraries
Following-up to my own post to correct a major error:

On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx wrote:
> Folks:
>
> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
> binary names from the filesystem and store them so that I can regenerate
> the same byte string later, but it also requires that I *know* whether
> what I got was a valid string in the expected encoding (which might be
> utf-8) or whether it was not and I need to fall back to storing the
> bytes.

Okay, I am wrong about this. Having a flag to remember whether I had to fall back to the utf-8b trick is one method to implement my requirement, but my actual requirement is this:

Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.

That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:

Requirement 1: the byte string was valid in the encoding of source system, in which case the unicode name is faithfully transmitted (i.e. the bytes that finally land on the target system are the result of sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).

Requirement 2: the byte string was not valid in the encoding of source system, in which case the bytes are faithfully transmitted (i.e. the bytes that finally land on the target system are the same as the bytes that originated in the source system).

Now I finally understand how fiendishly clever MvL's PEP 383 generalization of Markus Kuhn's utf-8b trick is! The only thing necessary to achieve both of those requirements above is that the 'python-escape' error handler is used on the target system .encode() as well as on the source system .decode()!

Well, I'm going to have to let this sink in and maybe write some code to see if I really understand it. But if this is right, then I can do away with some of the mechanism that I've built up, and instead: Backport PEP 383 to Python 2. And, document the PEP 383 trick in some generic, widely respected format such as an Internet Draft so that I can explain to other users of the Tahoe data (many of whom use other languages than Python) what they have to do if they find invalid utf-8 in the data.

Oh good, I just realized that Tahoe emits only utf-8, so all I have to do is point them to the utf-8b documents (such as they are) and explain that to read filenames produced by Tahoe they have to implement utf-8b. That's really good that they don't have to implement MvL's generalization of that trick to other encodings, since utf-8b is already understood by some folks.

Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes ? That is my Requirement 2.

Regards,

Zooko
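A minimal sketch of the trick described above, written with the handler's eventual Python 3 name ('surrogateescape'); the encodings and byte strings are illustrative assumptions.

    source_bytes = b"ok-\xc3\xa9-bad-\xff"          # mixed valid and invalid utf-8

    # Source system: decode with the escape handler instead of failing.
    name = source_bytes.decode("utf-8", "surrogateescape")

    # Target system: encode with the same handler. When both ends use the same
    # encoding, the undecodable bytes are restored exactly.
    target_bytes = name.encode("utf-8", "surrogateescape")
    assert target_bytes == source_bytes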
Re: [Python-Dev] PEP 383 and GUI libraries
On Thu, 30 Apr 2009 at 23:44, Zooko O'Whielacronx wrote:
> Would it be possible for Python unicode objects to have a flag indicating
> whether the 'python-escape' error handler was present? That

Unless I'm misunderstanding something, couldn't you implement what you need by looking in a given string for the half surrogates? If you find one, you have a string python-escape modified; if you don't, it didn't.

What does Tahoe do on Windows when it gets a filename that is not valid Unicode?

You might not even have to conditionalize the above code on platform (i.e. instead you have a generalized is_valid_unicode test function that you always use).

--David
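A sketch of the test David suggests; the function name is made up, and the code assumes Python 3's 'surrogateescape' spelling of the PEP 383 handler, which represents each undecodable byte as a lone surrogate in U+DC80..U+DCFF.

    def came_from_invalid_bytes(name):
        """Return True if 'name' contains PEP 383 escape characters."""
        return any('\udc80' <= ch <= '\udcff' for ch in name)

    assert not came_from_invalid_bytes("Mot\u00f6rhead")
    assert came_from_invalid_bytes(b"Mot\xf6rhead".decode("utf-8", "surrogateescape"))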
Re: [Python-Dev] PEP 383 and GUI libraries
Zooko O'Whielacronx wrote:
> [snip...]
>
> Would it be possible for Python unicode objects to have a flag indicating
> whether the 'python-escape' error handler was present? That would serve
> the same purpose as my "failed_decode" flag above, and would basically
> allow me to use the Python APIs directly and make all this work-around
> code disappear. Failing that, I can't see any way to use the os.listdir()
> in its unicode-oriented mode to satisfy Tahoe's requirements.
>
> If you take the above code and then add the fact that you want to use the
> failed_decode flag when *encoding* the d argument to os.listdir(), then
> you get this code: [2].
>
> Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this:
>
>     def listdir(d):
>         fse = sys.getfilesystemencoding()
>         if fse == 'utf-8b':
>             fse = 'utf-8'
>         ns = []
>         for fn in os.listdir(d):
>             bytes = fn.encode(fse, 'python-escape')
>             try:
>                 ns.append(FName(bytes.decode(fse, 'strict')))
>             except UnicodeDecodeError:
>                 ns.append(FName(bytes.decode('utf-8', 'python-escape'),
>                                 failed_decode=True))
>         return ns
>
> (And I guess I could define listdir() like this only on the
> non-unicode-safe platforms, as above.) However, that strikes me as even
> more horrible than the previous "listdir()" work-around, in part because
> it means decoding, re-encoding, and re-decoding every name, so I think I
> would stick with the previous version.

The current unicode mode would skip the filenames you are interested in (those that fail to decode correctly) - so you would have been forced to use the bytes mode. If you need access to the original bytes then you should continue to do this. PEP-383 is entirely neutral for your use case as far as I can see.

Michael

> Oh, one more note: for Tahoe's purposes you can, in all of the code above,
> replace ".decode('utf-8', 'python-escape')" with ".decode('windows-1252')"
> and it works just as well. While UTF-8b seems like a really cool hack, and
> it would produce more legible results if utf-8-encoded strings were
> partially corrupted, I guess I should just use 'windows-1252' which is
> already implemented in Python 2 (as well as in all other software in the
> world).
>
> I guess this means that PEP 383, which I have approved of and liked so far
> in this discussion, would actually not help Tahoe at all and would in fact
> harm Tahoe -- I would have to remember to detect and work-around the
> automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3. If
> anyone else has a concrete, real use case which would be helped by PEP 383,
> I would like to hear about it. Perhaps Tahoe can learn something from it.
>
> Oh, if this PEP could be extended to add a flag to each unicode object
> indicating whether it was created with the python-escape handler or not,
> then it would be useful to me.
>
> Regards,
>
> Zooko
>
> [1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
> [2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py

--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog
Re: [Python-Dev] PEP 383 and GUI libraries
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary binary names from the filesystem and store them so that I can regenerate the same byte string later, but it also requires that I *know* whether what I got was a valid string in the expected encoding (which might be utf-8) or whether it was not and I need to fall back to storing the bytes.

So far, it looks like PEP 383 doesn't provide both of these requirements, so I am going to have to continue working-around the Python API even after PEP 383. In fact, it might actually increase the amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be changed by PEP 383. A new error handler is added, so .decode('utf-8', 'python-escape') performs the utf-8b decoding. Am I right so far?

Therefore if I have a string of bytes, I can attempt to decode it with 'strict', and if that fails I can set the flag showing that it was not a valid byte string in the expected encoding, and then I can invoke .decode('utf-8', 'python-escape') on it. So far, so good. (Note that I never want to do .decode(expected_encoding, 'python-escape') -- if it wasn't a valid bytestring in the expected_encoding, then I want to decode it with utf-8b, regardless of what the expected encoding was.)

Anyway, I can use it like this:

    class FName:
        def __init__(self, name, failed_decode=False):
            self.name = name
            self.failed_decode = failed_decode

    def fs_to_unicode(bytes):
        try:
            return FName(bytes.decode(sys.getfilesystemencoding(), 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()? Uh-oh, the PEP says that on systems with locale 'utf-8', it will automatically be changed to 'utf-8b'. This means I can't reliably find out whether the entries in the directory *were* named with valid encodings in utf-8? That's not acceptable for my use case. I would have to refrain from using the unicode-oriented os.listdir() on POSIX, and instead do something like this:

    if platform.system() in ('Windows', 'Darwin'):
        def listdir(d):
            return [FName(n) for n in os.listdir(d)]
    elif platform.system() in ('Linux', 'SunOS'):
        def listdir(d):
            bytesd = d.encode(sys.getfilesystemencoding())
            return [fs_to_unicode(n) for n in os.listdir(bytesd)]
    else:
        raise NotImplementedError("Please classify platform.system() == %s "
                                  "as either unicode-safe or unicode-unsafe."
                                  % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when *decoding* as well as encoding, then I would have to change my fs_to_unicode() function to check for that and make sure to use strict utf-8 in the first attempt:

    def fs_to_unicode(bytes):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        try:
            return FName(bytes.decode(fse, 'strict'))
        except UnicodeDecodeError:
            return FName(bytes.decode('utf-8', 'python-escape'),
                         failed_decode=True)

Would it be possible for Python unicode objects to have a flag indicating whether the 'python-escape' error handler was present? That would serve the same purpose as my "failed_decode" flag above, and would basically allow me to use the Python APIs directly and make all this work-around code disappear. Failing that, I can't see any way to use os.listdir() in its unicode-oriented mode to satisfy Tahoe's requirements.

If you take the above code and then add the fact that you want to use the failed_decode flag when *encoding* the d argument to os.listdir(), then you get this code: [2].

Oh, I just realized that I *could* use the PEP 383 os.listdir(), like this:

    def listdir(d):
        fse = sys.getfilesystemencoding()
        if fse == 'utf-8b':
            fse = 'utf-8'
        ns = []
        for fn in os.listdir(d):
            bytes = fn.encode(fse, 'python-escape')
            try:
                ns.append(FName(bytes.decode(fse, 'strict')))
            except UnicodeDecodeError:
                ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                                failed_decode=True))
        return ns

(And I guess I could define listdir() like this only on the non-unicode-safe platforms, as above.) However, that strikes me as even more horrible than the previous "listdir()" work-around, in part because it means decoding, re-encoding, and re-decoding every name, so I think I would stick with the previous version.

Oh, one more note: for Tahoe's purposes you can, in all of the code above, replace ".decode('utf-8', 'python-escape')" with ".decode('windows-1252')" and it works just as well. While UTF-8b seems like a really cool hack, and it would produce more legible results if utf-8-encoded strings were partially corrupted, I guess I should just use 'windows-1252' which is already implemented in Python 2 (as well as in all other software in the world).

I guess this means that PEP 383, which I have approved of and liked so far in this discussion, would actually not help Tahoe at all and would in fact harm Tahoe -- I would have to remember to detect and work-around the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python 3. If anyone else has a concrete, real use case which would be helped by PEP 383, I would like to hear about it. Perhaps Tahoe can learn something from it.

Oh, if this PEP could be extended to add a flag to each unicode object indicating whether it was created with the python-escape handler or not, then it would be useful to me.

Regards,

Zooko

[1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
[2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py
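For comparison, a hypothetical variant of the listdir() above (not code from the thread) that relies on a failed decode being visible directly as lone surrogates in the PEP 383 result, so no re-encoding pass is needed. FName is the class defined earlier in this message; the range test uses the U+DC80..U+DCFF escape range from the PEP, spelled 'surrogateescape' in Python 3.1.

    import os

    def listdir_pep383(d):
        ns = []
        for fn in os.listdir(d):
            # Undecodable bytes show up as escape characters in U+DC80..U+DCFF,
            # marking names whose strict decode would have failed.
            failed = any('\udc80' <= ch <= '\udcff' for ch in fn)
            ns.append(FName(fn, failed_decode=failed))
        return ns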
Re: [Python-Dev] PEP 383 and GUI libraries
On 30-Apr-09, at 7:39 AM, Guido van Rossum wrote:
> FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted).
> Martin, you can update the PEP and start the implementation.

+1

Kudos to Martin for seeing this through with (imo) considerable patience and dignity.

-Mike
Re: [Python-Dev] PEP 383 and GUI libraries
FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation.

On Thu, Apr 30, 2009 at 2:12 AM, "Martin v. Löwis" wrote:
>> Did you use a name with other characters? Were they displayed? Both
>> before and after the surrogates?
>
> Yes, yes, and yes (IOW, I put the surrogate in the middle).
>
>> Did you use one or three half surrogates, to produce the three crossed
>> boxes?
>
> Only one, and it produced three boxes - probably one for each UTF-8 byte
> that pango considered invalid.
>
>> Did you use one or three half surrogates, to produce the single square box?
>
> Again, only one. Apparently, PyQt passes the Python Unicode string to Qt
> in a character-by-character representation, rather than going through UTF-8.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] PEP 383 and GUI libraries
> Did you use a name with other characters? Were they displayed? Both
> before and after the surrogates?

Yes, yes, and yes (IOW, I put the surrogate in the middle).

> Did you use one or three half surrogates, to produce the three crossed
> boxes?

Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid.

> Did you use one or three half surrogates, to produce the single square box?

Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8.

Regards,
Martin
Re: [Python-Dev] PEP 383 and GUI libraries
On approximately 4/30/2009 1:48 AM, came the following characters from the keyboard of Martin v. Löwis:
> I checked how GUI libraries deal with half surrogates.
>
> In pygtk, a warning gets issued to the console
>
>     /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()
>       self.window.show()
>
> and then the widget contains three crossed boxes.
>
> wxpython (in its wxgtk version) behaves the same way.
>
> PyQt displays a single square box.

Interesting.

Did you use a name with other characters? Were they displayed? Both before and after the surrogates?

Did you use one or three half surrogates, to produce the three crossed boxes?

Did you use one or three half surrogates, to produce the single square box?

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
[Python-Dev] PEP 383 and GUI libraries
I checked how GUI libraries deal with half surrogates.

In pygtk, a warning gets issued to the console

    /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text()
      self.window.show()

and then the widget contains three crossed boxes.

wxpython (in its wxgtk version) behaves the same way.

PyQt displays a single square box.

Regards,
Martin
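For context, one hypothetical way such a name arises (this is not necessarily Martin's test code): an undecodable byte becomes a lone "half surrogate" under PEP 383 (spelled 'surrogateescape' in Python 3.1), and that string is what the toolkit is then asked to render.

    # One undecodable byte in the middle of an otherwise valid name.
    name = b"hello-\xff.txt".decode("utf-8", "surrogateescape")
    print(repr(name))   # 'hello-\udcff.txt' -- the half surrogate the widgets must display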