> You say "that's fine" but my issue was one of usability, which hasn't
> been addressed.

I think this might be a good point to try listing the actual
use cases. I'll make a start and you can see if you find more, and how
important each of them is. There seems to be a use case for every possible
stance (which I'll enumerate as UTF-8 auto-conversion, ASCII
auto-conversion failing on non-ASCII data, and no automatic conversion),
so it is a matter of weighing the importance of each.

Interfacing with C code/libs:

- Language libraries (spell checking etc.). These will often work in one
specific encoding or allow you to specify the encoding the data is in;
typically, one would want to be specific about conversions in this case.
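To illustrate the "be specific" stance, here is a minimal Python sketch; `check_word` is a hypothetical stand-in for a wrapper around such a C library, and Latin-1 is an assumed example encoding:

```python
# Sketch: conscious, explicit conversion before handing text to a C
# library that expects a particular encoding (say, Latin-1).
def check_word(word_bytes):
    # stand-in for the real C call; only verifies it received bytes
    return isinstance(word_bytes, bytes)

word = u"bl\u00e5b\u00e6r"
# explicit encode -- no silent auto-conversion involved
assert check_word(word.encode("latin-1"))
```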

- Passing filenames. This seems to be a common case; open a file picker in
a Python GUI lib and pass the resulting filename to a library taking a
datafile parameter. Assuming the file picker returns a str/unicode (it
would be nice if it returned bytes, though), auto-conversion would be
nice to have; however, UTF-8 would be the wrong choice on many platforms
(including Windows, I think? Not sure about Vista.)
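The point that UTF-8 is often the wrong choice for filenames can be sketched in Python: the safe thing is to encode with the platform's filesystem encoding, which is frequently not UTF-8 (e.g. "mbcs" on Windows):

```python
import sys

# Encode a picked filename with the platform's filesystem encoding
# rather than assuming UTF-8.
filename = u"data.txt"
raw = filename.encode(sys.getfilesystemencoding())
assert isinstance(raw, bytes)
# round-trips back to the original name
assert raw.decode(sys.getfilesystemencoding()) == filename
```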

- Getting error messages. These are likely to be in either a hard-coded
encoding or the platform default; there is no guarantee of UTF-8, so this
requires encoding consciousness.
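As a sketch of what "encoding consciousness" means here: decode the message with the locale's preferred encoding rather than assuming UTF-8 (the `raw_error` bytes below are a made-up stand-in for a char* message from a C library):

```python
import locale

# Error text from a C library is typically in the locale's encoding.
enc = locale.getpreferredencoding(False)
raw_error = u"file not found".encode(enc)  # stand-in for a C char* message
message = raw_error.decode(enc)  # explicit, conscious decode
assert message == u"file not found"
```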

- Passing UI messages. Think writing a wrapper around a GUI lib. In that
case it is again usually platform default that is wanted, which is not
UTF-8 for very many users (not sure about newer Windows libs, in the old
libs one had the choice between 8-bit and 16-bit Windows codepages IIRC).
So encoding consciousness is needed.

- En-/decryption and (de)compression libs, binary serialization libs, etc.
Here, UTF-8 auto-conversion would be incredibly excellent (i.e. if one
wants to encrypt or compress strings and read them back into the same
environment they came from).
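The round-trip this bullet describes can be sketched in Python with zlib standing in for any such library; the string comes back unchanged, which is why implicit UTF-8 would be convenient here:

```python
import zlib

# Encode to UTF-8, compress, decompress, decode back -- a lossless
# round-trip within the same environment.
text = u"bl\u00e5b\u00e6rsyltet\u00f8y"
compressed = zlib.compress(text.encode("utf-8"))
assert zlib.decompress(compressed).decode("utf-8") == text
```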

- Text parsing/serialization libs: One would need to be conscious about
encoding one way or another; likely the encoding would have to be part of
the API, or in some cases one would deal with bytes in Cython.

Internal Cython usecases:

All in all, using Python strings seems better when not dealing with
external C code, and I've failed to find good use cases; perhaps someone
else has one?

- Using char* rather than unicode for optimization purposes. Early-binding
unicode objects:

ctypedef str s

should deal with some of these cases, if something like this doesn't
already happen, as with list. (Will it be as efficient as copying between
buffers with strcat and friends? I can imagine it being more efficient,
due to less copying potentially happening with a smarter string type...)

- Then there are cases where one wants to do some string modification
quickly, element by element. But almost all the cases I could think of
would fail on a UTF-8 char*: string reversal, palindrome creation, merging
strings character by character, alphabet-based ROT-13... all such things
would fail with a naive UTF-8 char*, and if one understands UTF-8 well
enough to do these properly, one should be able to convert explicitly as
well.
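A quick Python sketch of why naive byte-wise operations break on UTF-8: reversing the raw bytes of a multi-byte character produces invalid UTF-8, while reversing decoded code points works:

```python
text = u"bl\u00e5"
utf8 = text.encode("utf-8")  # 'å' occupies two bytes here

# Naive byte-wise reversal corrupts the multi-byte sequence:
reversed_bytes = utf8[::-1]
try:
    reversed_bytes.decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken

# Reversing code points after an explicit decode works as intended:
assert text[::-1] == u"\u00e5lb"
```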


Dag Sverre

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
