>>    printf(_("Wrong type"))
>>
>> Cython shouldn't interfere here at any step beyond getting the input
>> string correctly decoded from the source input.
>
> Yes. This will not be interpreted as a C string anywhere along the way.

Odds are that the printf statement sits inside some C library that uses
char* and the standard libc translation machinery, so when a Chinese Big5
system translates the message into Big5-encoded bytes, you *do* get a
problem. Meanwhile, the original coder doesn't notice, because he/she is
not the one doing the Chinese translation.
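
To make the failure concrete, here is a minimal sketch in plain Python
(not Cython, and not taken from any real library). It assumes the
translation layer hands back Big5-encoded bytes; the message text and the
helper name decodes_as are made up for illustration.

```python
# Hypothetical translated message, as a Big5 locale might hand it back.
translated = "錯誤"                      # "error" in Traditional Chinese
big5_bytes = translated.encode("big5")   # what the C library would return as char*


def decodes_as(raw, encoding, expected):
    """Return True only if raw decodes cleanly to expected under encoding."""
    try:
        return raw.decode(encoding) == expected
    except UnicodeDecodeError:
        return False


# Assuming UTF-8 on these bytes cannot give back the original text.
utf8_ok = decodes_as(big5_bytes, "utf-8", translated)
# Decoding with the encoding that was actually used does.
big5_ok = decodes_as(big5_bytes, "big5", translated)
```

On an ASCII-only message both decodes would succeed, which is exactly why
the problem stays invisible on the original coder's machine.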

I think this is partially a culture thing: Stefan and I use non-ASCII
alphabets daily, and still (in 2008) have to live with lots of software
that just doesn't handle things properly or has small nags. This is a real
problem, and most coders don't bother with it.

In all the library interface cases I listed, auto-conversion can seriously
bite inexperienced coders, and UTF-8 is almost never what is wanted when
wrapping C libraries. This does not mean that you "have a data buffer
without known length" -- the reality is that when wrapping C libraries you
a) know that it is a string, but b) have no reason to assume anything
about the encoding.
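
As a rough illustration of "know it is a string, assume nothing about the
encoding", here is a small plain-Python sketch. The helper name
to_c_string is hypothetical; the only point is that the encoding is a
required argument supplied by whoever knows the wrapped library.

```python
def to_c_string(text, encoding):
    """Encode text for a C API that expects a NUL-terminated char*.

    The encoding is a required argument: the caller, not the wrapper
    generator, knows what the wrapped library expects.
    """
    raw = text.encode(encoding)
    if b"\x00" in raw:
        # A NUL byte would silently truncate the string on the C side.
        raise ValueError("encoded text contains a NUL byte")
    return raw + b"\x00"
```

The same text gives different bytes depending on the choice, e.g.
to_c_string("æ", "latin-1") versus to_c_string("æ", "utf-8"), and only the
author of the wrapper can say which one the library wants.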

(I did initially argue for conversion to the platform's run-time default,
because that would at least have a chance of working when wrapping C
libraries. But I've moved away from that now, at least under the name of
char*.)

Meanwhile, in Cython code, there's no reason you have to call your UTF-8
buffers "char*" except for the warm fuzzy C feeling. Make a "mutable_str"
type if the purpose is speeding things up; that leaves room for infinitely
more nice candy, and one can still generate char* from it. You can even
treat UTF-8 properly then (i.e. have [] potentially return something
greater than 255).
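
A very rough plain-Python sketch of what I mean by such a type. The class
MutableStr and its method as_char_buffer are hypothetical names, and a
real Cython version would not decode the whole buffer on every index; this
only shows the interface.

```python
class MutableStr:
    """Hypothetical stand-in for the "mutable_str" idea, in plain Python."""

    def __init__(self, text):
        # Internally the data is a mutable UTF-8 byte buffer.
        self._data = bytearray(text.encode("utf-8"))

    def __getitem__(self, index):
        # Index by character, not by byte, and return the code point.
        return ord(self._data.decode("utf-8")[index])

    def as_char_buffer(self):
        # The raw UTF-8 bytes, NUL-terminated, as a C function would want them.
        return bytes(self._data) + b"\x00"
```

With s = MutableStr("aø€"), s[2] is the code point of the euro sign, well
above 255, while as_char_buffer() still gives the plain bytes for C.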

> I can see that you are both convinced that forcing the user to
> manually convert using an encoding via
>
>      def dostuff(str text):
>          cdef bytes tmp_text = text.encode("UTF-8")
>          cdef char* s = tmp_text
>          # do UTF-8 (often just ASCII) handling stuff
>          # if one didn't use UTF-8 one may have to worry about
>          # specifying the length too
>          cdef bytes another_tmp = s
>          return another_tmp.decode("UTF-8")
>
> is worth the price paid in usability, backwards compatibility, and
> efficiency. And since no one else has spoken up I guess there aren't
> any other strong opinions on the matter.

I think the price paid in usability and backwards compatibility (which
Python 3 breaks anyway, and people will have scripts to add b"" to their
strings...) is more than outweighed by the price paid for subtle bugs
introduced by coders who didn't bother to learn about it properly because
it worked perfectly on their US system. C is the language where you are
allowed to shoot yourself in the foot, not Python.

Also, your example is unfair -- you take an example from the current
Cython compiler and compare it with a fully candied-up alternative. For
the UTF-8 autoconversion to work there must be some candy as well
(basically something like the above must be generated by Cython, right?),
so there shouldn't be any reason (or is there?) that one can't stop adding
candy just one layer earlier, i.e. by delaying the release of temporaries
assigned to char*, as one would have to with UTF-8 anyway:

      def dostuff(str text):
          cdef char* s = text.encode("UTF-8")
          # Do stuff
          return str(s, "UTF-8")

(OK, this is implying that char* auto-coerces to bytes using
null-termination, but that hardly seems to be an argument against it when
your candidate solution is _also_ assuming null-termination, it is just
assuming encoding in addition. Assuming only null-termination is fine, it
is a "middle ground".)
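
The two assumptions can be pulled apart in plain Python with ctypes, which
I use here only as a stand-in for a char* (this is not what Cython would
generate). The variable names are illustrative.

```python
import ctypes

# A C-style buffer holding UTF-8 text; ctypes appends the trailing NUL.
buf = ctypes.create_string_buffer("blåbær".encode("utf-8"))

# Step 1: char* -> bytes, assuming only NUL termination.
raw = buf.value

# Step 2: bytes -> str, with the encoding spelled out by the programmer.
text = raw.decode("utf-8")

# With an encoding that contains NUL bytes, NUL termination alone is not
# enough and the length has to be carried separately.
utf16_buf = ctypes.create_string_buffer("hi".encode("utf-16-le"))
truncated = utf16_buf.value   # stops at the first NUL byte
```

Step 1 never looks at the encoding, so nothing is guessed until the coder
writes the decode call in step 2.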

Finally some wisdom from the Zen of Python:
>>> import this
...
Explicit is better than implicit.
...
>>>


Phew. I'll try to make this my last one; this should rest a bit :-)

Dag Sverre
