On Apr 17, 2008, at 4:44 AM, Stefan Behnel wrote:
> Hi,
>
> I think it is a good idea to start this.
>
> Dag Sverre Seljebotn wrote:
>> - Language libraries (spell checking etc.). These will often work in
>> one specific encoding or allow you to specify the encoding the data
>> is in; typically, one would want to be specific about conversions in
>> this case.
>
> Right. No magic here.
>
>
>> - Passing filenames. This seems to be a common case; open a file
>> picker in a Python GUI lib and pass the resulting filename to a
>> library taking a datafile parameter. Assuming the file picker returns
>> a str/unicode (would be nice if it returned bytes though) then
>> auto-conversion would be nice to have, however UTF-8 would be the
>> wrong choice on many platforms (including Windows, I think? Not sure
>> about Vista.)
>
> Correct again. Trying to do magic here is futile.
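As a rough illustration of the filename point, here is a minimal Python sketch. It assumes a current Python 3, where `os.fsencode` and `os.fsdecode` apply the platform's filesystem encoding instead of a hard-coded UTF-8. The two wrapper names are hypothetical and exist only for this example.

```python
import os


def filename_for_c_library(path: str) -> bytes:
    """Encode a filename with the filesystem encoding of this platform.

    Hypothetical helper: a wrapper would call this right before handing
    the name to a C function that expects a char* filename.
    """
    return os.fsencode(path)


def filename_from_c_library(raw: bytes) -> str:
    """Decode a filename that a C library returned as bytes."""
    return os.fsdecode(raw)


# Usage: the conversion is explicit, and the encoding is whatever the
# platform uses, which is not necessarily UTF-8.
encoded = filename_for_c_library("data.txt")
decoded = filename_from_c_library(encoded)
```

The design choice is that the wrapper decides where the conversion happens, and the platform decides which encoding applies.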
>
>
>> - Getting error messages. These are likely to either be in a
>> hard-coded encoding or platform default, no guarantee for UTF-8 so
>> require encoding consciousness.
>
> Right again. They are either locale dependent or language dependent -
> and they may even be translated in a function, as in
>
> raise TypeError(_("Wrong type"))
>
> or
>
> printf(_("Wrong type"))
>
> Cython shouldn't interfere here any step beyond getting the input
> string correctly decoded from the source input.

Yes. This will not be interpreted as a C string anywhere along the way.

>> - Passing UI messages. Think writing a wrapper around a GUI lib. In
>> that case it is again usually platform default that is wanted, which
>> is not UTF-8 for very many users (not sure about newer Windows libs,
>> in the old libs one had the choice between 8-bit and 16-bit Windows
>> codepages IIRC). So encoding consciousness is needed.
>
> Same case as libraries in general, I'd say.
>
>
>> - En-/decryption and (de)compression libs, binary serialization libs,
>> etc. Here, UTF-8 auto-conversion would be incredibly excellent (ie if
>> one wants to encrypt or compress strings, and read them back again
>> into the same environment they came from).
>
> Two cases here. Most likely, you are dealing with binary data, not
> unicode strings, so there is not much to gain. Then, auto conversion
> here is dangerous, as it may come unexpected. You pass in a Python
> object and get a UTF-8 encoded byte sequence at the end???

No, if you try and turn any char* into an object, you get unicode. If
you are assuming it is null-terminated, you are assuming it is a
string. If it really is binary data, then one would need to specify
the length.

> Imagine you had Cython on the other end, too. Wouldn't you be
> surprised to pass in a unicode object on one side and have Cython
> return a bytes object on the other? Because it couldn't possibly know
> that the original input was a unicode object.

No, see above.

>> - Text parsing/serialization libs: One would need to be conscious
>> about encoding one way or another, likely encoding would have to be
>> part of the API, or in some cases, one would deal with bytes in
>> Cython.
>
> Yes, encodings are crucial here, so again, not a big gain, just a
> source for potential laziness bugs.
>
>
>> - Using char* rather than unicode for optimization purposes.
>> Early-binding unicode objects:
>>
>> cdef str s
>>
>> should deal with some of these cases, if something like this doesn't
>> happen already like with list (will it be as efficient as copying
>> between buffers with strcat and friends? I can imagine more efficient
>> due to less copying potentially happening with a smarter string
>> type...)
>
> The most efficient way to deal with this is early or late conversion,
> i.e. at the API level. And it's good to be explicit here to avoid
> common bugs and potential API incompatibilities.
>
>
>> - Then there are cases where one wants to do some string modification
>> quickly, element by element. But almost all cases I could think of
>> would fail on a UTF-8 char* (string reversal, palindrome creation,
>> merging strings character by character, alphabet-based ROT-13... all
>> such things would fail with a naive UTF-8 char*, and if one is
>> conscious about understanding UTF-8 in order to do these properly one
>> should be able to explicitly convert as well).
Strings are supposed to be immutable. The only heavy string-
processing I've done is a parser for mathematical expressions, and it
would work just fine with UTF-8 (as all the "special" characters are
ASCII, and it treats all other byte sequences as names).
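Both observations above can be checked with a small Python sketch. It is only an illustration: the function names are made up for this example, and the tokenizer is a toy stand-in, not the Sage expression parser mentioned above. Reversing the bytes of a UTF-8 string breaks multi-byte sequences, while splitting on ASCII operator bytes is safe because bytes below 0x80 never occur inside a multi-byte UTF-8 sequence.

```python
def reverse_bytes(text: str) -> bytes:
    """Reverse the UTF-8 encoding byte by byte, as naive char* code would."""
    return text.encode("utf-8")[::-1]


def reverse_chars(text: str) -> str:
    """Reverse the string code point by code point."""
    return text[::-1]


def split_on_ascii_operators(data: bytes, operators: bytes = b"+-*/()") -> list:
    """Split UTF-8 bytes on ASCII operator bytes (toy tokenizer).

    Every non-operator byte run is kept as an opaque name, so multi-byte
    sequences are never cut in the middle.
    """
    tokens = []
    current = bytearray()
    for byte in data:
        if byte in operators:
            if current:
                tokens.append(bytes(current))
                current = bytearray()
            tokens.append(bytes([byte]))
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens
```

For example, `reverse_bytes("Örglümpf")` yields bytes that are no longer valid UTF-8, whereas `split_on_ascii_operators("α+β*(γ)".encode("utf-8"))` returns tokens that each decode cleanly.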
> Again, I agree. If you want UTF-8, it's better to say so as the thing
> you will do with the result afterwards totally depends on the encoding
> in use.
>
> I find it easier to read
>
>     def dostuff(str text):
>         cdef char* s = text.encode("UTF-8")
>         # do UTF-8 handling stuff
>         return s.decode("UTF-8")
>
> than anything you could do with internal magic.
>
> In lxml, for example, I try to be very explicit about the point where
> stuff is converted to UTF-8. There is a utility function called
> "_utf8()" that takes a Python object and returns a UTF-8 encoded byte
> string or raises an exception if the input was neither an ASCII byte
> string nor a unicode string. You will find this function at the
> beginning of almost all API functions, and I am very happy to have it
> there. Because this makes it explicit what is happening and when, and
> it makes sure that whatever string we use in internal functions will
> be a UTF-8 encoded byte string. I do not want Cython to do that for
> me, as the conversion is part of lxml's API and this includes the
> semantics of its input checking.
This is because you *want* to think about encoding when you're
processing XML.
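A helper with the behaviour described for "_utf8()" might look roughly like the following in plain Python 3. This is a sketch written from the description in this thread, not lxml's actual code; the name `to_utf8` and the choice of exception types are assumptions.

```python
def to_utf8(value):
    """Return UTF-8 encoded bytes for text or plain-ASCII byte strings.

    Hypothetical sketch: anything else is rejected up front, so internal
    code can rely on always receiving UTF-8 encoded bytes.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, bytes):
        try:
            value.decode("ascii")  # only a check; the result is discarded
        except UnicodeDecodeError:
            raise ValueError(
                "byte strings must be plain ASCII; pass text as str"
            ) from None
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

Calling it at the top of each API function keeps the conversion point visible, which is the property argued for above.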
> Take this example again:
>
>     def dostuff(char* input):
>         # do some UTF-8 handling stuff
>         return input
>
> Now imagine you call it with a byte string like this:
>
> dostuff(u"Örglümpf".encode('iso-8859-1'))
>
> This will not give you an API error, but it will most likely break the
> function in one place or another or might even return an incorrect
> result without error notice.
No, it would work just fine, because it would specify utf-8 in the
decoding phase.
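What a strict UTF-8 decode on the way back does with such input can be sketched in plain Python. This assumes the proposed conversion behaves like `bytes.decode("utf-8")`, which is an assumption of this example and not something the thread settles. The function name is hypothetical.

```python
def return_as_text(raw: bytes) -> str:
    """Stand-in for converting a returned char* back to a text object."""
    return raw.decode("utf-8")


# Case 1: the Latin-1 bytes from the example are not valid UTF-8,
# so a strict decode rejects them instead of returning garbage.
latin1_input = "Örglümpf".encode("iso-8859-1")
try:
    return_as_text(latin1_input)
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "rejected"

# Case 2: some Latin-1 byte sequences happen to be valid UTF-8.
# These decode without any error, but to a different string.
ambiguous_input = "Ã¶".encode("iso-8859-1")   # the bytes C3 B6
silently_changed = return_as_text(ambiguous_input)
```

Under that assumption the specific call from the example fails loudly, while inputs that happen to form valid UTF-8 pass through with changed content.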
> I seriously doubt that there are many applications that would be fine
> with a simple
>
> cdef char* s = some_unicode_string
I tried finding an example in the Sage codebase where this would
cause problems, and wasn't able to do so (other than the fact that
some libraries would barf on bad input, but then some of them would
barf if some_unicode_string wasn't a decimal number.)
> and I'm quite confident that even the remaining applications would be
> better off using explicit conversion than *ignoring* the fact that
> there is a semantic difference between a sequence of characters and a
> sequence of bytes.
There is a semantic difference between a pointer to a byte, a null-
terminated sequence of bytes, and a sequence of bytes with specified
length. We are ignoring this distinction. I primarily see the bytes
object as binary data, not assumed to be null-terminated. In any
case, normal Python users are not going to want to pass/receive bytes
when they really want to be manipulating strings, so the burden is on
the Cython coder.
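The difference between a null-terminated sequence and a sequence with a specified length can be seen with the standard `ctypes` module. This is a small sketch in plain Python, not Cython; the variable names are only illustrative.

```python
import ctypes

# Binary payload with an embedded NUL byte.
payload = b"ab\x00cd"

# create_string_buffer copies the bytes and appends a terminating NUL.
buf = ctypes.create_string_buffer(payload)

# Read as a C string: everything after the first NUL is lost.
as_c_string = buf.value

# Read with an explicit length: all five bytes survive.
with_length = ctypes.string_at(buf, len(payload))
```

Here `as_c_string` holds only the first two bytes while `with_length` holds the whole payload, which is why binary data needs its length passed along.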
> Making people aware of this difference is a good thing.

I agree. Forcing the user to deal with it everywhere they want to use
a string is, in my opinion, not.

> Doing magic to support laziness is not a good thing.

No, magic is a very good thing. That's what makes Cython so much
better than writing against the Python/C API explicitly. Would you say
all the magic that converts between C and Python ints is a bad thing?

I can see that you are both convinced that forcing the user to
manually convert using an encoding via

    def dostuff(str text):
        cdef bytes tmp_text = text.encode("UTF-8")
        cdef char* s = tmp_text
        # do UTF-8 (often just ASCII) handling stuff
        # If one didn't use UTF-8, one may have to worry about
        # specifying the length too.
        cdef bytes another_tmp = s
        return another_tmp.decode("UTF-8")

is worth the price paid in usability, backwards compatibility, and
efficiency. And since no one else has spoken up I guess there aren't
any other strong opinions on the matter.
- Robert
_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev