paul_kon...@dell.com wrote:
> Perhaps I’m missing something.

I'm not sure you're missing anything. You're simply describing another implementation choice that could have been made. Both your scheme and the actual scheme have their merits.

> I’m used to Windows API calls that come in a foo_A and foo_W flavor, the only
> difference being that the _A flavor has ASCII arguments and the _W flavor has
> Unicode arguments (for those arguments that are, abstractly, strings).

Technically speaking, the _A flavor is MBCS: 8-bit units in an encoding that is not known in advance, where a single character can span multiple bytes.

> In Python 3, the “str” type is an abstract string; its character repertoire
> is Unicode but it doesn’t have an encoding. Instead, encoding and decoding
> is done when it is converted to/from external interfaces: files, external
> API calls, etc.

It doesn't have an encoding because it doesn't NEED an encoding.

> So... I would expect foo_A and foo_W to have “str” arguments, and the
> interface machinery between Python3 and those functions would run the
> appropriate encoding to generate the string representation expected.

The big problem here is determining what "the appropriate encoding" is. I'm not convinced there is any way for the Python COM machinery to know that definitively. It could make a guess, but any guess is going to be wrong some of the time. Absent that confidence, it seems to me that the correct solution is to deliver the MBCS string exactly as it arrived, and that's what the current implementation does. Leave it to the application to figure out how to decode it.

> For example, if a given API wants strings in ASCII form, it would be
> str.encode (“ascii”) or perhaps str.encode (“latin1”).

That assumes the API wants Latin-1. Does it? You don't know that. It varies from machine to machine, and even from run to run. That's the problem.

> If it wants MBCS data, it would be encode to that encoding.

I assume you understand that all 8-bit strings in Windows are MBCS. Latin-1 is just another MBCS.

> If 2-byte Unicode, it would be encode to ucs-2.

UTF-16, not UCS-2; that's the encoding Windows uses for Unicode.

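To make the "appropriate encoding" problem concrete, here is a minimal sketch that decodes one 8-bit string under several code pages. The byte string and the list of candidate code pages are made up for illustration; they are not taken from any real API or from the pywin32 implementation.

```python
# Hypothetical illustration: the bytes and the candidate list are invented
# for this example. Imagine "raw" came back from some _A-style call.
raw = b"caf\xe9"

# A few code pages a Windows machine might plausibly be configured with.
CANDIDATES = ["latin-1", "cp1252", "cp437", "cp1251"]


def decode_all(data, encodings):
    """Decode the same bytes under each candidate encoding."""
    return {enc: data.decode(enc) for enc in encodings}


results = decode_all(raw, CANDIDATES)

for enc, text in results.items():
    # ascii() keeps the output printable on any console encoding.
    print(f"{enc:8} -> {ascii(text)}")

# Several different answers from identical bytes: nothing in the data
# itself says which one the sender meant.
print(len(set(results.values())), "distinct decodings")
```

Latin-1 and cp1252 happen to agree on this byte, while cp437 and cp1251 each give a different character. Without outside knowledge of the source code page, middleware has no basis for picking one.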
> I would only want/expect to see “bytes” types when the values in question are
> binary data streams, or unknown format. But anytime we’re dealing with text
> strings, the Python 3 approach is that the Python code sees “str” type, and
> questions of encoding have been handled at the edge. This is where Python 3
> gets it right and Python 2 was a big muddle.

The muddle is not the fault of Python. It is the fault of the character encoding decisions made by Microsoft in the mists of antiquity. When you get an 8-bit string that includes the byte 0x9F, there is absolutely no way for you to answer the question "what character is that?" If that question cannot be answered, middleware should not be making a guess.

-- 
Tim Roberts, t...@probo.com
Providenza & Boekelheide, Inc.

_______________________________________________
python-win32 mailing list
python-win32@python.org
https://mail.python.org/mailman/listinfo/python-win32