[Python-Dev] PEP 383 and GUI libraries

2009-05-03 Thread Jim Jewett
(sent only to python-dev, as I am not a subscriber of tahoe-dev)

Zooko wrote:

> [Tahoe] currently uses utf-8 for its internal storage (note: nothing to
> do with reading or writing files from external sources -- only for
> storing filenames in the decentralized storage system which is
> accessed by Tahoe clients), and we can't start putting non-utf-8-valid
> sequences in the "filename" slot because other Tahoe clients would
> then get a UnicodeDecodeError exception when trying to read those
> directories.

So what do you do when someone has an existing file whose name is
supposed to be in utf-8, but whose actual bytes are not valid utf-8?

If you have somehow solved that problem, then you're already done --
the PEP's encoding is a no-op on anything that isn't already invalid
unicode.

If you have not solved that problem, then those clients will already
be getting a UnicodeDecodeError; all the PEP does is make it at least
possible for them to recover.

...

> Requirement 1 (unicode):  Each filename that you see needs to be valid
> unicode (it is stored internally in utf-8).

(repeating) What does Tahoe do if this is violated?  Do you throw an
exception right there and not let them copy the file to tahoe?  If so,
then that same error correction means that utf8b will never differ
from utf-8, and you have nothing to worry about.

> Requirement 2 (faithful if unicode):

Doesn't the PEP meet this?

> Requirement 3 (no file left behind):

Doesn't the PEP also meet this?  I thought the concern was just that
the name used would not be valid unicode, unless the original name was
itself valid unicode.

> Possible Requirement 4 (faithful bytes if not unicode, a.k.a.
> "round-tripping"):

Doesn't the PEP also support this?  (Only) the invalid bytes get
escaped and therefore must be unescaped, but the escapement is
reversible.
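
To illustrate the round trip described above, here is a minimal sketch in current Python 3, where the handler this thread calls 'python-escape' shipped under the name `surrogateescape`. The byte strings are made up for the example; they are not taken from Tahoe.

```python
# Sketch: PEP 383 escaping is a no-op on valid UTF-8 and reversible otherwise.
valid = "Motörhead".encode("utf-8")
invalid = b"Mot\xf6rhead"  # latin-1 bytes; 0xF6 is not valid in UTF-8

# Valid input decodes identically with or without the handler.
same_as_strict = valid.decode("utf-8", "surrogateescape") == valid.decode("utf-8")

# Each undecodable byte becomes a lone surrogate in U+DC80..U+DCFF ...
escaped = invalid.decode("utf-8", "surrogateescape")

# ... and encoding with the same handler restores the original bytes.
restored = escaped.encode("utf-8", "surrogateescape")
```

Only the undecodable byte is escaped; the rest of the name stays readable.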

> 3. (handling collisions)  In either case 2.a or 2.b the resulting
> unicode string may already be present in the directory.

This collision is what the use of half-surrogates (as the escape
characters) avoids.  Such collisions can't be present unless the data
was invalid unicode, in which case it was the result of an escapement
(unless something other than python is creating new invalid
filenames).
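
A small sketch of why the escaped name cannot collide with a validly encoded one, again assuming Python 3's `surrogateescape` handler and two made-up directory entries:

```python
# Two hypothetical directory entries: valid UTF-8 "café" and latin-1 "café".
raw_names = [b"caf\xc3\xa9", b"caf\xe9"]

decoded = [n.decode("utf-8", "surrogateescape") for n in raw_names]

# Strings decoded from valid UTF-8 never contain code points in the
# escape range, so the two entries stay distinct after decoding.
uses_escape = [any(0xDC80 <= ord(ch) <= 0xDCFF for ch in s) for s in decoded]
distinct = len(set(decoded)) == len(raw_names)
```

As noted above, this holds only as long as nothing else creates names that already contain such lone surrogates.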

-jJ
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-02 Thread Zooko O'Whielacronx
[cross-posting to python-dev and tahoe-dev]

On Fri, May 1, 2009 at 8:12 PM, James Y Knight  wrote:
>
> If I were designing a new system such as this, I'd probably just go for
> utf8b *always*.

Ah, this would be a very tempting possibility -- abandon all unix
users who are slow to embrace our utf-8b future!

However, it is moot because Tahoe is not a new system. It is currently
at v1.4.1, has a strong policy of backwards-compatibility, and already
has lots of data, lots of users, and programmers building on top of
it. It currently uses utf-8 for its internal storage (note: nothing to
do with reading or writing files from external sources -- only for
storing filenames in the decentralized storage system which is
accessed by Tahoe clients), and we can't start putting non-utf-8-valid
sequences in the "filename" slot because other Tahoe clients would
then get a UnicodeDecodeError exception when trying to read those
directories.

We *could* create a new metadata entry to hold things other than
utf-8. Current Tahoe clients would never look at that entry (the
metadata is a JSON-serialized dictionary, so we can add a new key name
into it without disturbing the existing clients), but future Tahoe
clients could look for that new key. That is where it is possible that
future versions of Tahoe might be able to benefit from utf-8b or PEP
383, although what PEP 383 offers for this use case remains unclear to
me.
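
One way the extra metadata entry could look is sketched below. The key names (`filename`, `metadata`, `original_bytes_b64`) and the helper are hypothetical and are not Tahoe's actual schema; the latin-1 fallback name is just one of the candidates mentioned in this thread.

```python
import base64
import json

def make_entry(raw_name: bytes, fs_encoding: str = "utf-8") -> dict:
    """Build a directory entry; key names are illustrative, not Tahoe's."""
    try:
        # Valid in the expected encoding: store the unicode name only.
        return {"filename": raw_name.decode(fs_encoding), "metadata": {}}
    except UnicodeDecodeError:
        # Invalid: store a fallback display name, and keep the original
        # bytes under a new key that older clients would simply ignore.
        return {
            "filename": raw_name.decode("latin-1"),
            "metadata": {
                "original_bytes_b64": base64.b64encode(raw_name).decode("ascii")
            },
        }

entry = make_entry(b"caf\xe9")
serialized = json.dumps(entry)  # the name slot remains valid unicode
```

A newer client could check for the extra key and write the original bytes back out; an older one would see only the valid unicode name.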

> But if you don't do that, then, I still don't see what purpose your
> requirements serve. If I have two systems: one with a UTF-8 locale, and one
> with a Latin-1 locale, why should transmitting filenames from system 1 to
> system 2 through tahoe preserve the raw bytes, but doing the reverse *not*
> preserve the raw bytes? (all byte-sequences are valid in latin-1, remember,
> so they'll all decode into unicode without error, and then be reencoded in
> utf-8...). This seems rather a useless behavior to me.

I see I'm not explaining the Tahoe requirements clearly. It's probably
that I'm not understanding them clearly myself. Hopefully the
following will help.

There are two different things stored in Tahoe for each directory
entry: the filename and the metadata.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system
and then you inspect the files in the Tahoe filesystem, such as by
examining the web interface [1] or by running "tahoe ls", either of
which you could do either from the same machine where you ran "tahoe
cp" or from a different machine (which could be using any operating
system). We have the following requirements about what ends up in your
Tahoe directory after that cp -r.

Requirement 1 (unicode):  Each filename that you see needs to be valid
unicode (it is stored internally in utf-8). This eliminates utf-8b and
PEP 383 from being directly applicable to the filename part, although
perhaps they could be useful for the metadata part (about which more
below).

Requirement 2 (faithful if unicode):  For each filename (byte string)
in your myfiles directory, if that bytestring is the valid encoding of
some string in your stated locale, then the resulting filename in
Tahoe is that (unicode) string. Nobody ever doesn't want this, right?
Well, maybe some people don't want this sometimes, because it could be
that the locale was wrong for this byte string and the resulting
successfully-decoded unicode name is gibberish. This is especially
acute if the locale is an 8-bit encoding such as latin-1 or
windows-1252. However, what's the alternative?  Guessing that their
locale shouldn't be set to latin-1 and instead decoding their bytes
some other way?  It seems like we're not going to do better than
requirement 2 (faithful if unicode).

Requirement 3 (no file left behind):  For each filename (byte string)
in your myfiles directory, whether or not that byte string is the
valid encoding of anything in your stated locale, then that file will
be added into the Tahoe filesystem under *some* name (a good candidate
would be mojibake, e.g. decode the bytes with latin-1, but that is not
the only possibility). I have heard some developers say that they
don't want to support this requirement and would rather tell the users
to fix their filenames before they can back up or share those files
through Tahoe. On the other hand, users have said that they require
this and they are not going to go mucking about with all their
filenames just so that they can use my backup and filesharing tool.

Now already we can say that these three requirements mean that there
can be collisions -- for example a directory could have two entries,
one of which is not a valid encoding in the locale, and whatever
unicode string we invent to name it with in order to satisfy
requirements 3 (no file left behind) and 1 (unicode) might happen to
be the same as the (correctly-encoded) name of the other file.
Therefore these three requirements imply that we have to detect such
collisions and deal with them somehow. (Thanks to Martin v. Löwis f

Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread James Y Knight

On May 1, 2009, at 9:42 PM, Zooko O'Whielacronx wrote:

> Yep, I reversed the order of encode() and decode().  However, my whole
> statement was utterly wrong and shows that I still didn't fully get it
> yet.  I have flip-flopped again and currently think that PEP 383 is
> useless for this use case and that my original plan [1] is still the
> way to go.  Please let me know if you spot a flaw in my plan or a
> ridiculousity in my requirements, or if you see a way that PEP 383 can
> help me.


If I were designing a new system such as this, I'd probably just go  
for utf8b *always*. That is, set the filesystem encoding to utf-8b.  
The end. All files always keep the same bytes transferring between  
unix systems. Thus, for the 99% of the world that uses either windows  
or a utf-8 locale, they get useful filenames inside tahoe. The other  
1% of the world that uses something like latin-1, EUC_JP, etc. on  
their local system sees mojibake filenames in tahoe, but will see the  
same filename that they put in when they take it back out.


Gnome has used only utf-8 for filename display for a few years
now, for example, so this isn't exactly an unheard-of position to
take...


But if you don't do that, then, I still don't see what purpose your  
requirements serve. If I have two systems: one with a UTF-8 locale,  
and one with a Latin-1 locale, why should transmitting filenames from  
system 1 to system 2 through tahoe preserve the raw bytes, but doing  
the reverse *not* preserve the raw bytes? (all byte-sequences are  
valid in latin-1, remember, so they'll all decode into unicode without  
error, and then be reencoded in utf-8...). This seems rather a useless  
behavior to me.
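
The asymmetry can be checked directly. This is a sketch in Python 3; the cp1252 comparison is an aside about Python's own codec tables, not something stated in this message.

```python
all_bytes = bytes(range(256))

# Every byte value decodes under latin-1, so decoding never fails ...
as_text = all_bytes.decode("latin-1")

# ... and re-encoding as UTF-8 changes every byte above 0x7F into two bytes.
reencoded = as_text.encode("utf-8")

# Python's cp1252 codec, by contrast, leaves a few byte values undefined.
undefined_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_cp1252.append(value)
```

So a latin-1 source never triggers the "invalid bytes" path, and its raw bytes are never the ones that arrive on a utf-8 target.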


James


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Zooko O'Whielacronx
Folks:

Being new to the use of gmail, I accidentally sent the following only
to MvL and not to the list.  He promptly replied with a helpful
counterexample showing that my design can suffer collisions.  :-)

Regards,

Zooko


On Fri, May 1, 2009 at 10:38 AM, "Martin v. Löwis"  wrote:
>>
>> Requirement: either the unicode string or the bytes are faithfully
>> transmitted from one system to another.
>
> I don't understand this requirement very well, in particular not
> the "faithfully" part.
>
>> That is: if you read a filename from the filesystem, and transmit that
>> filename to another system and use it, then there are two cases:
>
> What do you mean by "use it"? Things like opening files? How does
> that work? In general, a file name valid on one system is invalid
> on a different system - or, at least, refers to a different file
> over there. This is independent of encodings.

Tahoe is a backup and filesharing program, so you might for example,
execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of
your "Motörhead" directory to your Tahoe filesystem.  Later you or a
friend, might execute "tahoe cp -r tahoe:Motörhead ." to copy
everything from that directory within your Tahoe filesystem to your
local filesystem.  So in this case the flow of information is
local_system_1 -> Tahoe -> local_system_2.

The Requirement 1 is that for each filename encountered which is a
valid encoding in local_system_1, then the resulting (unicode) name is
transmitted through the Tahoe filesystem and then written out into
local_system_2 in the expected way (i.e. just by using the Python
unicode APIs and passing the unicode object to them).

Requirement 2 is that for each filename encountered which is not a
valid encoding in local_system_1, then the original bytes are
transmitted through the Tahoe filesystem and then, if the target
system is a byte-oriented system such as Linux, the original bytes are
written into the target filesystem.  (If the target is not Linux then
mojibake! but we don't have to go into that now.)

Does that make sense?

> In all your descriptions, I'm puzzled as to where exactly you get
> the source bytes from. If you use the PEP 383 interfaces, you will
> start with character strings, not byte strings, always.

On Mac and Windows, we use the Python unicode APIs e.g.
os.listdir(u"Motörhead").  On Linux and Solaris, we use the Python
bytestring APIs e.g.
os.listdir("Motörhead".encode(sys.getfilesystemencoding())).
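
For comparison, a sketch of the two `os.listdir` modes in current Python 3: passing `str` returns `str` names, passing `bytes` returns `bytes` names, and `os.fsencode` converts between them using the filesystem encoding (with `surrogateescape` on POSIX). The temporary directory and file name are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "plain.txt"), "w"):
        pass  # create an empty file to list

    text_names = os.listdir(d)               # str in, str out
    byte_names = os.listdir(os.fsencode(d))  # bytes in, bytes out

    # fsencode maps each str name back to the bytes the OS reported.
    roundtrip = [os.fsencode(n) for n in text_names]
```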

>> Okay, I find it surprisingly easy to make subtle errors in this encoding
>> stuff, so please let me know if you spot one.  Is it true that
>> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
>> 'python-escape') will always produce srcbytes ?
>
> I think you mixed up bytes and unicode here: if srcbytes is indeed
> a bytes object, then you can't apply .encode to it.

Yep, I reversed the order of encode() and decode().  However, my whole
statement was utterly wrong and shows that I still didn't fully get it
yet.  I have flip-flopped again and currently think that PEP 383 is
useless for this use case and that my original plan [1] is still the
way to go.  Please let me know if you spot a flaw in my plan or a
ridiculousity in my requirements, or if you see a way that PEP 383 can
help me.

Thank you very much.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:47


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Cameron Simpson
On 01May2009 18:38, Martin v. Löwis  wrote:
| > Okay, I am wrong about this.  Having a flag to remember whether I had to
| > fall back to the utf-8b trick is one method to implement my requirement,
| > but my actual requirement is this:
| > 
| > Requirement: either the unicode string or the bytes are faithfully
| > transmitted from one system to another.
| 
| I don't understand this requirement very well, in particular not
| the "faithfully" part.
| 
| > That is: if you read a filename from the filesystem, and transmit that
| > filename to another system and use it, then there are two cases:
| 
| What do you mean by "use it"? Things like opening files? How does
| that work? In general, a file name valid on one system is invalid
| on a different system - or, at least, refers to a different file
| over there. This is independent of encodings.

I think he's doing a file transfer of some kind and needs to preserve
the names. Or I would guess the two systems are not both UNIX or there
is some subtlety not yet mentioned, or he'd just use tar or some other
byte-level UNIX tool.

| > Requirement 1: the byte string was valid in the encoding of source
| > system, in which case the unicode name is faithfully transmitted
| > (i.e. the bytes that finally land on the target system are the result of
| > sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).
| 
| In all your descriptions, I'm puzzled as to where exactly you get
| the source bytes from. If you use the PEP 383 interfaces, you will
| start with character strings, not byte strings, always.

But if both systems do present POSIX layers, it's bytes underneath and
the system tools will natively use bytes. He wants to ensure that he can
read using python, using listdir, and elsewhere, when he is writing using
python, preserve the bytes layer. I think.

In fact it sounds like he may be translating valid unicode and carefully not
altering byte names that don't decode. That in turn implies that the codec
may be different on the two systems.

| > Okay, I find it surprisingly easy to make subtle errors in this encoding
| > stuff, so please let me know if you spot one.  Is it true that
| > srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
| > 'python-escape') will always produce srcbytes ? 
| 
| I think you mixed up bytes and unicode here: if srcbytes is indeed
| a bytes object, then you can't apply .encode to it.

I think he has encode/decode swapped (I did too back in the uber-thread;
if your mapping is one-to-one the distinction is almost arbitrary).

However, his assertion/hope is true only if srcencoding == 'utf-8'.
The PEP itself says that it works if the decode and encode use the same
mapping.
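
A short sketch of that condition, using Python 3's `surrogateescape` in place of 'python-escape' and a made-up byte string:

```python
raw = b"\x80abc"

# Same codec and handler on both sides: the bytes survive.
same = raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")

# Different codecs: 0x80 is a valid character in cp1252, so nothing is
# escaped, and the UTF-8 encoder emits different bytes.
mixed = raw.decode("cp1252", "surrogateescape").encode("utf-8", "surrogateescape")
```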
-- 
Cameron Simpson  DoD#743
http://www.cskk.ezoshosting.com/cs/

"How do you know I'm Mad?" asked Alice.
"You must be," said the Cat, "or you wouldn't have come here."


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Terry Reedy

Zooko O'Whielacronx wrote:

> Following-up to my own post to correct a major error:
>
> Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ?  That is my Requirement


If you start with bytes, decode with utf-8b to unicode (possibly 
'invalid'), and encode the result back to bytes with utf-8b, you should 
get the original bytes, regardless of what they were.  That is the point 
of PEP 383 -- to reliably roundtrip file 'names' that start as bytes and 
must end as the same bytes but which may not otherwise have a unicode 
decoding.


If you start with invalid unicode text, encode to bytes with utf-8b, and 
decode back to unicode, you might instead get a different and valid 
unicode text.  An example was given in the discussion.  I believe this 
would be hard to avoid.  In any case, it does not matter for the use
case of starting with bytes that one wants to temporarily but surely 
work with as text.
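
A sketch of the second case, assuming Python 3's `surrogateescape` handler; the starting string is constructed by hand for the example:

```python
# "Invalid" text: three lone surrogates that happen to escape the
# bytes E2 82 AC, which together form the UTF-8 encoding of U+20AC.
lone = "\udce2\udc82\udcac"

as_bytes = lone.encode("utf-8", "surrogateescape")
back = as_bytes.decode("utf-8", "surrogateescape")
```

The text-to-bytes-to-text direction yields a different, valid string here, while bytes-to-text-to-bytes still round-trips.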


Terry Jan Reedy



Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Martin v. Löwis
> Okay, I am wrong about this.  Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
> 
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.

I don't understand this requirement very well, in particular not
the "faithfully" part.

> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:

What do you mean by "use it"? Things like opening files? How does
that work? In general, a file name valid on one system is invalid
on a different system - or, at least, refers to a different file
over there. This is independent of encodings.

> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).

In all your descriptions, I'm puzzled as to where exactly you get
the source bytes from. If you use the PEP 383 interfaces, you will
start with character strings, not byte strings, always.

> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one.  Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ? 

I think you mixed up bytes and unicode here: if srcbytes is indeed
a bytes object, then you can't apply .encode to it.

Regards,
Martin



Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread MRAB

Zooko O'Whielacronx wrote:

> Following-up to my own post to correct a major error:
>
> On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx  wrote:
>> Folks:
>>
>> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
>> binary names from the filesystem and store them so that I can regenerate
>> the same byte string later, but it also requires that I *know* whether
>> what I got was a valid string in the expected encoding (which might be
>> utf-8) or whether it was not and I need to fall back to storing the
>> bytes.
>
> Okay, I am wrong about this.  Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
>
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.
>
> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:
>
> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).
>
> Requirement 2: the byte string was not valid in the encoding of source
> system, in which case the bytes are faithfully transmitted (i.e. the
> bytes that finally land on the target system are the same as the bytes
> that originated in the source system).
>
> Now I finally understand how fiendishly clever MvL's PEP 383
> generalization of Markus Kuhn's utf-8b trick is!  The only thing
> necessary to achieve both of those requirements above is that the
> 'python-escape' error handler is used on the target system .encode() as
> well as on the source system .decode()!
>
> Well, I'm going to have to let this sink in and maybe write some code to
> see if I really understand it.
>
> But if this is right, then I can do away with some of the mechanism that
> I've built up, and instead:
>
> Backport PEP 383 to Python 2.
>
> And, document the PEP 383 trick in some generic, widely respected format
> such as an Internet Draft so that I can explain to other users of the
> Tahoe data (many of whom use other languages than Python) what they have
> to do if they find invalid utf-8 in the data.  Oh good, I just realized
> that Tahoe emits only utf-8, so all I have to do is point them to the
> utf-8b documents (such as they are) and explain that to read filenames
> produced by Tahoe they have to implement utf-8b.  That's really good
> that they don't have to implement MvL's generalization of that trick to
> other encodings, since utf-8b is already understood by some folks.
>
> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one.  Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ?  That is my Requirement
> 2.


No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8',
'python-escape') == srcbytes. The encodings on both ends need to be the
same.

For example:

>>> b'\x80'.decode('windows-1252')
u'\u20ac'
>>> u'\u20ac'.encode('utf-8')
'\xe2\x82\xac'

Currently:

>>> b'\x80'.decode('utf-8')

Traceback (most recent call last):
  File "", line 1, in 
    b'\x80'.decode('utf-8')
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte


But under this PEP:

>>> b'\x80'.decode('utf-8', 'python-escape')
u'\udc80'
>>> u'\udc80'.encode('utf-8', 'python-escape')
'\x80'


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Zooko O'Whielacronx
Following-up to my own post to correct a major error:


On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx  wrote:
> Folks:
>
> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
> binary names from the filesystem and store them so that I can regenerate
> the same byte string later, but it also requires that I *know* whether
> what I got was a valid string in the expected encoding (which might be
> utf-8) or whether it was not and I need to fall back to storing the
> bytes.

Okay, I am wrong about this.  Having a flag to remember whether I had to
fall back to the utf-8b trick is one method to implement my requirement,
but my actual requirement is this:

Requirement: either the unicode string or the bytes are faithfully
transmitted from one system to another.

That is: if you read a filename from the filesystem, and transmit that
filename to another system and use it, then there are two cases:

Requirement 1: the byte string was valid in the encoding of source
system, in which case the unicode name is faithfully transmitted
(i.e. the bytes that finally land on the target system are the result of
sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).

Requirement 2: the byte string was not valid in the encoding of source
system, in which case the bytes are faithfully transmitted (i.e. the
bytes that finally land on the target system are the same as the bytes
that originated in the source system).

Now I finally understand how fiendishly clever MvL's PEP 383
generalization of Markus Kuhn's utf-8b trick is!  The only thing
necessary to achieve both of those requirements above is that the
'python-escape' error handler is used on the target system .encode() as
well as on the source system .decode()!

Well, I'm going to have to let this sink in and maybe write some code to
see if I really understand it.

But if this is right, then I can do away with some of the mechanism that
I've built up, and instead:

Backport PEP 383 to Python 2.

And, document the PEP 383 trick in some generic, widely respected format
such as an Internet Draft so that I can explain to other users of the
Tahoe data (many of whom use other languages than Python) what they have
to do if they find invalid utf-8 in the data.  Oh good, I just realized
that Tahoe emits only utf-8, so all I have to do is point them to the
utf-8b documents (such as they are) and explain that to read filenames
produced by Tahoe they have to implement utf-8b.  That's really good
that they don't have to implement MvL's generalization of that trick to
other encodings, since utf-8b is already understood by some folks.


Okay, I find it surprisingly easy to make subtle errors in this encoding
stuff, so please let me know if you spot one.  Is it true that
srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
'python-escape') will always produce srcbytes ?  That is my Requirement
2.


Regards,

Zooko


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread R. David Murray

On Thu, 30 Apr 2009 at 23:44, Zooko O'Whielacronx wrote:

> Would it be possible for Python unicode objects to have a flag
> indicating whether the 'python-escape' error handler was present?  That


Unless I'm misunderstanding something, couldn't you implement what you
need by looking in a given string for the half surrogates?  If you find
one, the string was modified by python-escape; if you don't, it wasn't.

What does Tahoe do on Windows when it gets a filename that is not valid
Unicode?  You might not even have to conditionalize the above code
on platform (ie: instead you have a generalized is_valid_unicode test
function that you always use).
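
A minimal sketch of such a test in Python 3. The function names are made up for illustration, and the escape range U+DC80..U+DCFF is the one PEP 383 uses for undecodable bytes.

```python
def escaped_bytes(name: str) -> bytes:
    """Return the raw bytes hidden in name as PEP 383 half surrogates."""
    return bytes(
        ord(ch) - 0xDC00 for ch in name if 0xDC80 <= ord(ch) <= 0xDCFF
    )

def is_valid_unicode(name: str) -> bool:
    """True if name has no lone surrogates, i.e. it encodes as strict UTF-8."""
    try:
        name.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

With these, `is_valid_unicode(name)` plays the role of the flag asked for above: it is false exactly when decoding had to escape something (or the string contains some other lone surrogate).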

--David


Re: [Python-Dev] PEP 383 and GUI libraries

2009-05-01 Thread Michael Foord

Zooko O'Whielacronx wrote:

> [snip...]
> Would it be possible for Python unicode objects to have a flag
> indicating whether the 'python-escape' error handler was present?  That
> would serve the same purpose as my "failed_decode" flag above, and would
> basically allow me to use the Python APIs directly and make all this
> work-around code disappear.
>
> Failing that, I can't see any way to use the os.listdir() in its
> unicode-oriented mode to satisfy Tahoe's requirements.
>
> If you take the above code and then add the fact that you want to use
> the failed_decode flag when *encoding* the d argument to os.listdir(),
> then you get this code: [2].
>
> Oh, I just realized that I *could* use the PEP 383 os.listdir(), like
> this:
>
> def listdir(d):
>     fse = sys.getfilesystemencoding()
>     if fse == 'utf-8b':
>         fse = 'utf-8'
>     ns = []
>     for fn in os.listdir(d):
>         bytes = fn.encode(fse, 'python-escape')
>         try:
>             ns.append(FName(bytes.decode(fse, 'strict')))
>         except UnicodeDecodeError:
>             ns.append(FName(bytes.decode('utf-8', 'python-escape'),
>                             failed_decode=True))
>     return ns
>
> (And I guess I could define listdir() like this only on the
> non-unicode-safe platforms, as above.)
>
> However, that strikes me as even more horrible than the previous
> "listdir()" work-around, in part because it means decoding, re-encoding,
> and re-decoding every name, so I think I would stick with the previous
> version.
  


The current unicode mode would skip the filenames you are interested in
(those that fail to decode correctly), so you would have been forced to
use the bytes mode. If you need access to the original bytes then you
should continue to do this. PEP 383 is entirely neutral for your use
case as far as I can see.


Michael


> Oh, one more note: for Tahoe's purposes you can, in all of the code
> above, replace ".decode('utf-8', 'python-escape')" with
> ".decode('windows-1252')" and it works just as well.  While UTF-8b seems
> like a really cool hack, and it would produce more legible results if
> utf-8-encoded strings were partially corrupted, I guess I should just
> use 'windows-1252' which is already implemented in Python 2 (as well as
> in all other software in the world).
>
> I guess this means that PEP 383, which I have approved of and liked so
> far in this discussion, would actually not help Tahoe at all and would
> in fact harm Tahoe -- I would have to remember to detect and work-around
> the automatic 'utf-8b' filesystem encoding when porting Tahoe to Python
> 3.
>
> If anyone else has a concrete, real use case which would be helped by
> PEP 383, I would like to hear about it.  Perhaps Tahoe can learn
> something from it.
>
> Oh, if this PEP could be extended to add a flag to each unicode object
> indicating whether it was created with the python-escape handler or not,
> then it would be useful to me.
>
> Regards,
>
> Zooko
>
> [1] http://mail.python.org/pipermail/python-dev/2009-April/089020.html
> [2] http://allmydata.org/trac/tahoe/attachment/ticket/534/fsencode.3.py
  



--
http://www.ironpythoninaction.com/
http://www.voidspace.org.uk/blog




Re: [Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Zooko O'Whielacronx
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
binary names from the filesystem and store them so that I can regenerate
the same byte string later, but it also requires that I *know* whether
what I got was a valid string in the expected encoding (which might be
utf-8) or whether it was not and I need to fall back to storing the
bytes.  So far, it looks like PEP 383 doesn't provide both of these
requirements, so I am going to have to continue working-around the
Python API even after PEP 383.  In fact, it might actually increase the
amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be
changed by PEP 383.  A new error handler is added, so .decode('utf-8',
'python-escape') performs the utf-8b decoding.  Am I right so far?
Therefore if I have a string of bytes, I can attempt to decode it with
'strict', and if that fails I can set the flag showing that it was not a
valid byte string in the expected encoding, and then I can invoke
.decode('utf-8', 'python-escape') on it.  So far, so good.

(Note that I never want to do .decode(expected_encoding,
'python-escape') -- if it wasn't a valid bytestring in the
expected_encoding, then I want to decode it with utf-8b, regardless of
what the expected encoding was.)

Anyway, I can use it like this:

import sys

class FName:
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(bytes):
    try:
        return FName(bytes.decode(sys.getfilesystemencoding(), 'strict'))
    except UnicodeDecodeError:
        # Not valid in the expected encoding: keep the bytes recoverable.
        return FName(bytes.decode('utf-8', 'python-escape'), failed_decode=True)
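
For reference, the error handler called 'python-escape' in this thread shipped in Python 3.1 under the name 'surrogateescape'. A minimal runnable sketch of the same strict-then-escape idea under that assumption follows; the helper name decode_name and its tuple return value are illustrative and are not Tahoe's code.

```python
def decode_name(raw: bytes, expected_encoding: str = "utf-8"):
    """Return (text, failed_decode) for a filename given as bytes."""
    try:
        return raw.decode(expected_encoding, "strict"), False
    except UnicodeDecodeError:
        # Fall back to UTF-8 with escaping so the original bytes stay recoverable.
        return raw.decode("utf-8", "surrogateescape"), True


good = decode_name("caf\u00e9".encode("utf-8"))
bad = decode_name(b"caf\xe9")  # Latin-1 bytes, not valid UTF-8

# The escaped form encodes back to exactly the bytes that were read.
roundtrip = bad[0].encode("utf-8", "surrogateescape")
```

The flag travels next to the string rather than inside it, which is the same design as the FName wrapper.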

And what about unicode-oriented APIs such as os.listdir()?  Uh-oh, the
PEP says that on systems with locale 'utf-8', it will automatically be
changed to 'utf-8b'.  This means I can't reliably find out whether the
entries in the directory *were* named with valid encodings in utf-8?
That's not acceptable for my use case.  I would have to refrain from
using the unicode-oriented os.listdir() on POSIX, and instead do
something like this:

import os
import platform

if platform.system() in ('Windows', 'Darwin'):
    def listdir(d):
        return [FName(n) for n in os.listdir(d)]
elif platform.system() in ('Linux', 'SunOS'):
    def listdir(d):
        # Passing bytes makes os.listdir() return bytes names.
        bytesd = d.encode(sys.getfilesystemencoding())
        return [fs_to_unicode(n) for n in os.listdir(bytesd)]
else:
    raise NotImplementedError(
        "Please classify platform.system() == %s "
        "as either unicode-safe or unicode-unsafe." % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when
*decoding* as well as encoding, then I would have to change my
fs_to_unicode() function to check for that and make sure to use strict
utf-8 in the first attempt:

def fs_to_unicode(bytes):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    try:
        return FName(bytes.decode(fse, 'strict'))
    except UnicodeDecodeError:
        return FName(bytes.decode('utf-8', 'python-escape'),
                     failed_decode=True)

Would it be possible for Python unicode objects to have a flag
indicating whether the 'python-escape' error handler was present?  That
would serve the same purpose as my "failed_decode" flag above, and would
basically allow me to use the Python APIs directly and make all this
work-around code disappear.
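
A rough sketch of how such a flag could be approximated without changing the unicode type. It assumes the escaping works as PEP 383 was eventually implemented: each undecodable byte 0x80-0xFF becomes a lone surrogate U+DC80-U+DCFF, and the handler is spelled 'surrogateescape'. The function name was_escaped is hypothetical.

```python
def was_escaped(name: str) -> bool:
    """True if name contains a lone surrogate of the kind byte escaping produces."""
    return any("\udc80" <= ch <= "\udcff" for ch in name)


clean = b"report.txt".decode("utf-8", "surrogateescape")
dirty = b"r\xe9sum\xe9.txt".decode("utf-8", "surrogateescape")

flags = (was_escaped(clean), was_escaped(dirty))
```

This only inspects the string's contents, so a string that acquired such surrogates some other way would be flagged too.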

Failing that, I can't see any way to use the os.listdir() in its
unicode-oriented mode to satisfy Tahoe's requirements.

If you take the above code and then add the fact that you want to use
the failed_decode flag when *encoding* the d argument to os.listdir(),
then you get this code: [2].

Oh, I just realized that I *could* use the PEP 383 os.listdir(), like
this:

def listdir(d):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    ns = []
    for fn in os.listdir(d):
        # Recover the original bytes, then re-check them strictly.
        bytes = fn.encode(fse, 'python-escape')
        try:
            ns.append(FName(bytes.decode(fse, 'strict')))
        except UnicodeDecodeError:
            ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                            failed_decode=True))
    return ns

(And I guess I could define listdir() like this only on the
non-unicode-safe platforms, as above.)

However, that strikes me as even more horrible than the previous
"listdir()" work-around, in part because it means decoding, re-encoding,
and re-decoding every name, so I think I would stick with the previous
version.

Oh, one more note: for Tahoe's purposes you can, in all of the code
above, replace ".decode('utf-8', 'python-escape')" with
".decode('windows-1252')" and it works just as well.  While UTF-8b seems
like a really cool hack, and it would produce more legible results if
utf-8-encoded strings were partially corrupted, I guess I should just
use 'windows-1252' which is already implemented in Python 2 (as well as
in all other software in the world).
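
One caveat that may be worth checking before relying on this: to my knowledge, CPython's cp1252 codec leaves a handful of byte values (0x81, for example) undefined, so a strict 'windows-1252' decode can still raise for arbitrary bytes, whereas 'latin-1' maps all 256 values. A small sketch that probes both codecs; the variable names are illustrative.

```python
# Collect every single-byte value the cp1252 codec refuses to decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

# latin-1 is a one-to-one mapping, so any byte string round-trips.
all_bytes = bytes(range(256))
latin1_roundtrip = all_bytes.decode("latin-1").encode("latin-1")
```

If undefined_in_cp1252 is non-empty on the interpreter in use, 'latin-1' is the safer fallback for the "store arbitrary bytes" requirement.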

I guess this means that PEP 383

Re: [Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Mike Klaas


On 30-Apr-09, at 7:39 AM, Guido van Rossum wrote:

> FWIW, I'm in agreement with this PEP (i.e. its status is now
> Accepted). Martin, you can update the PEP and start the
> implementation.

+1

Kudos to Martin for seeing this through with (imo) considerable
patience and dignity.


-Mike


Re: [Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Guido van Rossum
FWIW, I'm in agreement with this PEP (i.e. its status is now
Accepted). Martin, you can update the PEP and start the
implementation.

On Thu, Apr 30, 2009 at 2:12 AM, "Martin v. Löwis" wrote:
>> Did you use a name with other characters?  Were they displayed?  Both
>> before and after the surrogates?
>
> Yes, yes, and yes (IOW, I put the surrogate in the middle).
>
>> Did you use one or three half surrogates, to produce the three crossed
>> boxes?
>
> Only one, and it produced three boxes - probably one for each UTF-8 byte
> that pango considered invalid.
>
>> Did you use one or three half surrogates, to produce the single square box?
>
> Again, only one. Apparently, PyQt passes the Python Unicode string to Qt
> in a character-by-character representation, rather than going through UTF-8.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)


Re: [Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Martin v. Löwis
> Did you use a name with other characters?  Were they displayed?  Both
> before and after the surrogates?

Yes, yes, and yes (IOW, I put the surrogate in the middle).

> Did you use one or three half surrogates, to produce the three crossed
> boxes?

Only one, and it produced three boxes - probably one for each UTF-8 byte
that pango considered invalid.

> Did you use one or three half surrogates, to produce the single square box?

Again, only one. Apparently, PyQt passes the Python Unicode string to Qt
in a character-by-character representation, rather than going through UTF-8.
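
The "one box per byte" guess is consistent with how a lone surrogate looks once it is forced into UTF-8. A minimal sketch, assuming Python 3.1 or later, where the 'surrogatepass' error handler exists; whether the GTK bindings took exactly this path is not confirmed here.

```python
lone = "\udc80"  # a half surrogate of the kind PEP 383 escaping produces

# Strict UTF-8 refuses to encode a lone surrogate at all.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Forcing it through yields three bytes, none of which is valid UTF-8
# from the point of view of a strict consumer such as pango.
raw = lone.encode("utf-8", "surrogatepass")
```

A library that receives the string code point by code point, as PyQt apparently does, sees a single unpaired surrogate instead of three bytes.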

Regards,
Martin


Re: [Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Glenn Linderman
On approximately 4/30/2009 1:48 AM, came the following characters from
the keyboard of Martin v. Löwis:

> I checked how GUI libraries deal with half surrogates.
> In pygtk, a warning gets issued to the console
>
> /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to
> pango_layout_set_text()
>   self.window.show()
>
> and then the widget contains three crossed boxes.
>
> wxpython (in its wxgtk version) behaves the same way.
>
> PyQt displays a single square box.

Interesting.

Did you use a name with other characters?  Were they displayed?  Both
before and after the surrogates?

Did you use one or three half surrogates, to produce the three crossed
boxes?

Did you use one or three half surrogates, to produce the single square box?

--
Glenn -- http://nevcal.com/
===
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking


[Python-Dev] PEP 383 and GUI libraries

2009-04-30 Thread Martin v. Löwis
I checked how GUI libraries deal with half surrogates.
In pygtk, a warning gets issued to the console

/tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to
pango_layout_set_text()
  self.window.show()

and then the widget contains three crossed boxes.

wxpython (in its wxgtk version) behaves the same way.

PyQt displays a single square box.

Regards,
Martin