Hi,

I think it is a good idea to start this.

Dag Sverre Seljebotn wrote:
> - Language libraries (spell checking etc.). These will often work in one
> specific encoding or allow you to specify the encoding the data is in;
> typically, one would want to be specific about conversions in this case.

Right. No magic here.


> - Passing filenames. This seems to be a common case; open a file picker in
> a Python GUI lib and pass the resulting filename to a library taking a
> datafile parameter. Assuming the file picker returns a str/unicode (would
> be nice if it returned bytes though) then auto-conversion would be nice to
> have, however UTF-8 would be the wrong choice on many platforms (including
> Windows, I think? Not sure about Vista.)

Correct again. Trying to do magic here is futile.


> - Getting error messages. These are likely to either be in a hard-coded
> encoding or platform default, no guarantee for UTF-8 so require encoding
> consciousness.

Right again. They are either locale dependent or language dependent - and
they may even be translated in a function, as in

    raise TypeError(_("Wrong type"))

or

    printf(_("Wrong type"))

Cython shouldn't interfere here with anything beyond getting the input
string correctly decoded from the source.


> - Passing UI messages. Think writing a wrapper around a GUI lib. In that
> case it is again usually platform default that is wanted, which is not
> UTF-8 for very many users (not sure about newer Windows libs, in the old
> libs one had the choice between 8-bit and 16-bit Windows codepages IIRC).
> So encoding consciousness is needed.

Same case as libraries in general, I'd say.


> - En-/decryption and (de)compression libs, binary serialization libs, etc.
> Here, UTF-8 auto-conversion would be incredibly excellent (ie if one wants
> to encrypt or compress strings, and read them back again into the same
> environment they came from).

There are two cases here. Most likely, you are dealing with binary data,
not unicode strings, so there is not much to gain. And where unicode
strings are involved, auto-conversion is dangerous, as it may come
unexpectedly: you pass in a Python object and get a UTF-8 encoded byte
sequence at the end? Imagine you had Cython on the other end, too.
Wouldn't you be surprised to pass in a unicode object on one side and have
Cython return a bytes object on the other? It couldn't possibly know that
the original input was a unicode object.
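The round-trip asymmetry can be sketched in plain Python (zlib stands in
here for any compression or encryption library):

```python
import zlib

original = u"Grüße"  # a unicode string on the way in

# With implicit UTF-8 conversion, the library would effectively do this:
compressed = zlib.compress(original.encode("utf-8"))

# On the way out, all the library can return is bytes -- it cannot
# know that the original input was a unicode object.
restored = zlib.decompress(compressed)

assert isinstance(restored, bytes)           # not a unicode string!
assert restored != original                  # the type (and value) differ
assert restored.decode("utf-8") == original  # only explicit decoding restores it
```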


> - Text parsing/serialization libs: One would need to be conscious about
> encoding one way or another, likely encoding would have to be part of the
> API, or in some cases, one would deal with bytes in Cython.

Yes, encodings are crucial here, so again, not a big gain, just a source
of potential laziness bugs.


> - Using char* rather than unicode for optimization purposes. Early-binding
> unicode objects:
>
> cdef str s
>
> should deal with some of these cases, if something like this doesn't
> happen already like with list (will it be as efficient as copying between
> buffers with strcat and friends? I can imagine more efficient due to less
> copying potentially happening with a smarter string type...)

The most efficient way to deal with this is early or late conversion, i.e.
at the API level. And it's good to be explicit here to avoid common bugs
and potential API incompatibilities.


> - Then there are cases where one wants to do some string modification
> quickly, element by element. But almost all cases I could think of would
> fail on a UTF-8 char* (string reversal, palindrome creation, merging
> strings character by character, alphabet-based ROT-13... all such things
> would fail with a naive UTF-8 char*, and if one is conscious about
> understanding UTF-8 in order to do these properly one should be able to
> explicitly convert as well).

Again, I agree. If you want UTF-8, it's better to say so, as what you will
do with the result afterwards depends entirely on the encoding in use.
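For illustration, a byte-level string reversal in plain Python shows why a
naive UTF-8 char* fails for non-ASCII text:

```python
text = u"Örglümpf"
data = text.encode("utf-8")  # Ö and ü become two bytes each

# Character-level reversal works fine:
assert text[::-1] == u"fpmülgrÖ"

# Byte-level reversal tears the multi-byte sequences apart,
# leaving invalid UTF-8:
try:
    data[::-1].decode("utf-8")
    broken = False
except UnicodeDecodeError:
    broken = True
assert broken
```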

I find it easier to read

    def dostuff(str text):
        cdef bytes utf8 = text.encode("UTF-8")  # keep the bytes object alive
        cdef char* s = utf8
        # do UTF-8 handling stuff
        return s.decode("UTF-8")

than anything you could do with internal magic.

In lxml, for example, I try to be very explicit about the point where
stuff is converted to UTF-8. There is a utility function called "_utf8()"
that takes a Python object and returns a UTF-8 encoded byte string or
raises an exception if the input was neither an ASCII byte string nor a
unicode string. You will find this function at the beginning of almost all
API functions, and I am very happy to have it there: it makes explicit
what is happening and when, and it ensures that whatever string we use in
internal functions is a UTF-8 encoded byte string. I do not want Cython to
do that for me, as the conversion is part of lxml's API, and that includes
the semantics of its input checking.

Take this example again:

    def dostuff(char* input):
        # do some UTF-8 handling stuff
        return input

Now imagine you call it with a byte string like this:

    dostuff(u"Örglümpf".encode('iso-8859-1'))

This will not give you an API error, but it will most likely break the
function in one place or another, or it might even return an incorrect
result without any notice.
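The failure mode can be shown in plain Python: bytes produced by one codec
do not survive being interpreted as another.

```python
latin1_bytes = u"Örglümpf".encode("iso-8859-1")

# Code that assumes its char* input is UTF-8 either chokes on this ...
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
assert failed

# ... or, if it suppresses decoding errors, silently returns garbage:
assert latin1_bytes.decode("utf-8", "replace") != u"Örglümpf"
```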

I seriously doubt that there are many applications that would be fine with
a simple

    cdef char* s = some_unicode_string

and I'm quite confident that even the remaining applications would be
better off using explicit conversion than *ignoring* the fact that there
is a semantic difference between a sequence of characters and a sequence
of bytes. Making people aware of this difference is a good thing. Doing
magic to support laziness is not a good thing.

Stefan

_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev