Folks:

Being new to the use of gmail, I accidentally sent the following only
to MvL and not to the list.  He promptly replied with a helpful
counterexample showing that my design can suffer collisions.  :-)

Regards,

Zooko


On Fri, May 1, 2009 at 10:38 AM, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>>
>> Requirement: either the unicode string or the bytes are faithfully
>> transmitted from one system to another.
>
> I don't understand this requirement very well, in particular not
> the "faithfully" part.
>
>> That is: if you read a filename from the filesystem, and transmit that
>> filename to another system and use it, then there are two cases:
>
> What do you mean by "use it"? Things like opening files? How does
> that work? In general, a file name valid on one system is invalid
> on a different system - or, at least, refers to a different file
> over there. This is independent of encodings.

Tahoe is a backup and filesharing program, so you might, for example,
execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of
your "Motörhead" directory to your Tahoe filesystem.  Later you or a
friend might execute "tahoe cp -r tahoe:Motörhead ." to copy
everything from that directory within your Tahoe filesystem to your
local filesystem.  So in this case the flow of information is
local_system_1 -> Tahoe -> local_system_2.

Requirement 1 is that each filename encountered which is validly
encoded in local_system_1 is decoded, and the resulting (unicode) name
is transmitted through the Tahoe filesystem and then written out into
local_system_2 in the expected way (i.e. just by using the Python
unicode APIs and passing the unicode object to them).

Requirement 2 is that for each filename encountered which is not
validly encoded in local_system_1, the original bytes are
transmitted through the Tahoe filesystem and then, if the target
system is a byte-oriented system such as Linux, the original bytes are
written into the target filesystem.  (If the target is not Linux then
we get mojibake! But we don't have to go into that now.)
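
To make the two requirements concrete, here is a minimal sketch in
Python 3 (where str is the unicode type).  The function names and the
("unicode" / "bytes") tagging are hypothetical, invented for this
illustration; they are not Tahoe's actual code.  The non-byte-oriented
case is deliberately left unimplemented, as above.

```python
def name_for_transmission(raw, src_encoding):
    """Classify a filename read as bytes from local_system_1.

    Hypothetical helper: returns ("unicode", str) when the bytes are a
    valid encoding (Requirement 1), else ("bytes", raw) (Requirement 2).
    """
    try:
        # Strict decoding: any invalid byte raises UnicodeDecodeError.
        return ("unicode", raw.decode(src_encoding))
    except UnicodeDecodeError:
        return ("bytes", raw)


def name_for_target(kind, value, target_is_byte_oriented):
    """Choose what to hand to the filesystem APIs on local_system_2."""
    if kind == "unicode":
        # Requirement 1: pass the unicode object to the unicode APIs.
        return value
    if target_is_byte_oriented:
        # Requirement 2: write the original bytes back unchanged.
        return value
    # The mojibake case is explicitly not covered here.
    raise NotImplementedError("undecodable name on a unicode-only target")


if __name__ == "__main__":
    print(name_for_transmission("Motörhead".encode("utf-8"), "utf-8"))
    print(name_for_transmission(b"Mot\xf6rhead", "utf-8"))
```

The second call uses Latin-1 bytes on a system assumed to expect
UTF-8, so the name travels as raw bytes rather than as unicode.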

Does that make sense?

> In all your descriptions, I'm puzzled as to where exactly you get
> the source bytes from. If you use the PEP 383 interfaces, you will
> start with character strings, not byte strings, always.

On Mac and Windows, we use the Python unicode APIs e.g.
os.listdir(u"Motörhead").  On Linux and Solaris, we use the Python
bytestring APIs e.g.
os.listdir("Motörhead".encode(sys.getfilesystemencoding())).
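
As a rough sketch of that platform split, written for Python 3 (where
os.listdir returns str for a str argument and bytes for a bytes
argument): list_names is a hypothetical helper name, and the
sys.platform test is my assumption about how one might tell the
platforms apart, not necessarily what Tahoe does.

```python
import os
import sys


def list_names(dirname):
    """List a directory via the unicode or bytestring API by platform."""
    if sys.platform in ("darwin", "win32"):
        # Mac and Windows: unicode in, unicode names out.
        return os.listdir(dirname)
    # Linux, Solaris, etc.: bytes in, raw bytes names out, so that
    # undecodable names are seen exactly as they are stored.
    return os.listdir(dirname.encode(sys.getfilesystemencoding()))


if __name__ == "__main__":
    for name in list_names("."):
        print(repr(name))
```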

>> Okay, I find it surprisingly easy to make subtle errors in this encoding
>> stuff, so please let me know if you spot one.  Is it true that
>> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
>> 'python-escape') will always produce srcbytes ?
>
> I think you mixed up bytes and unicode here: if srcbytes is indeed
> a bytes object, then you can't apply .encode to it.

Yep, I reversed the order of encode() and decode().  However, my whole
statement was utterly wrong and shows that I hadn't fully gotten it
yet.  I have flip-flopped again and currently think that PEP 383 is
useless for this use case and that my original plan [1] is still the
way to go.  Please let me know if you spot a flaw in my plan or a
ridiculousity in my requirements, or if you see a way that PEP 383 can
help me.
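
For reference, a small sketch of the corrected order (decode first,
then encode).  It assumes the error handler called 'python-escape' in
this thread is the one available in Python 3 under the name
'surrogateescape'.  The helper names are hypothetical.  The second
function shows one way the statement I asked about fails once the
order is fixed: decoding with one encoding and encoding with another
does not give back the source bytes.

```python
def round_trips(srcbytes, srcencoding):
    """Decode and re-encode with the SAME encoding and handler."""
    text = srcbytes.decode(srcencoding, "surrogateescape")
    return text.encode(srcencoding, "surrogateescape") == srcbytes


def mixed_round_trips(srcbytes, srcencoding):
    """Decode with srcencoding but encode as UTF-8, as I had proposed."""
    text = srcbytes.decode(srcencoding, "surrogateescape")
    return text.encode("utf-8", "surrogateescape") == srcbytes


if __name__ == "__main__":
    raw = b"Mot\xf6rhead"  # Latin-1 bytes, not valid UTF-8
    print(round_trips(raw, "utf-8"))          # same codec both ways
    print(mixed_round_trips(raw, "latin-1"))  # latin-1 in, utf-8 out
```

With the same codec on both sides the invalid byte is carried through
as a lone surrogate and restored on encoding; with mixed codecs the
byte 0xF6 is decoded as "ö" and comes back as two UTF-8 bytes.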

Thank you very much.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:47