On Apr 17, 2008, at 2:11 PM, Dag Sverre Seljebotn wrote:
>>>    printf(_("Wrong type"))
>>>
>>> Cython shouldn't interfere here any step beyond getting the input
>>> string correctly decoded from the source input.
>>
>> Yes. This will not be interpreted as a C string anywhere along the
>> way.
>
> Odds are that the printf statement is within some C library, using
> char*, using standard libc translation, meaning that when a Chinese
> Big5 system translates this into something crazy then you *do* get a
> problem. Meanwhile, the original coder doesn't notice, because he/she
> is not the one doing the Chinese translation.
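A quick plain-Python sketch of the failure mode described above: bytes produced under one encoding (Big5 here, standing in for the translated message catalog) turn to mojibake when read back under another. The Chinese string is a hypothetical translation of "Wrong type", chosen only for illustration.

```python
# Hypothetical Big5-encoded translation of "Wrong type" (illustrative only).
msg = "類型錯誤"
raw = msg.encode("big5")        # bytes as a translated catalog might store them
garbled = raw.decode("utf-8", errors="replace")  # read back under the wrong codec
assert garbled != msg           # the round trip does not recover the text
```

The original coder on a US-ASCII system never hits this path, because ASCII bytes are identical in both encodings.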
>
> I think this is partially a culture thing: Stefan and I live using
> non-ASCII alphabets daily, and still (in 2008) have to live with lots
> of software that just doesn't handle things properly or has small
> nags. This is a real problem, and most coders don't bother with it.

I agree that it could be partially a cultural issue. I speak French
and studied Chinese for several years, so it's not like I haven't
dealt with these issues, though not on a daily basis (anymore). Even
more significantly, the strings I deal with are almost all in a math
context, which rarely uses Unicode (TeX is the de facto standard for
typeset output).

> In all the library interface cases I listed, auto-conversion has the
> possibility to seriously bite non-experienced coders, and UTF-8 is
> almost never what is wanted when wrapping C libraries. This does not
> mean that you "have a data buffer without known length" -- you know
> that you have a string, but the reality is that when wrapping C
> libraries you a) know it is a string, b) have no reason to assume
> anything about the encoding.
>
> (I did initially argue for conversion to the platform's run-time
> default -- because that would have a possibility of working when
> wrapping C libraries. But I've gone away from that now, at least
> under the name of char*.)
>
> Meanwhile, in Cython code, there's no reason you have to call your
> UTF-8 buffers "char*" except for the warm fuzzy C feeling.

And the fact that that's what they really are (rather than making the
user learn a new type).

> Make a "mutable_str" type if the purpose is speeding things up; that
> will have the possibility for infinitely more nice candy, and one can
> still generate char*. You can even treat UTF-8 properly then (i.e.
> have [] return potentially something > 255).
>
>> I can see that you are both convinced that forcing the user to
>> manually convert using an encoding via
>>
>>      def dostuff(str text):
>>          cdef bytes tmp_text = text.encode("UTF-8")
>>          cdef char* s = tmp_text
>>          # do UTF-8 (often just ASCII) handling stuff
>>          # (if one didn't use UTF-8, one might have to worry about
>>          # specifying the length too)
>>          cdef bytes another_tmp = s
>>          return another_tmp.decode("UTF-8")
>>
>> is worth the price paid in usability, backwards compatibility, and
>> efficiency. And since no one else has spoken up, I guess there aren't
>> any other strong opinions on the matter.
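As a plain-Python sketch (no Cython required), the explicit pattern above amounts to spelling out both ends of the round trip; `upper()` here is only a stand-in for whatever the char*-level code would actually do:

```python
def dostuff(text: str) -> str:
    tmp = text.encode("UTF-8")   # what `cdef bytes tmp_text = ...` captures
    processed = tmp.upper()      # stand-in for the char*-level handling
    return processed.decode("UTF-8")

print(dostuff("wrong type"))     # -> WRONG TYPE
```

The cost being debated is exactly these two explicit boundary conversions on every call.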
>
> I think the price paid in usability and backwards compatibility (which
> Python 3 breaks anyway, and people will have scripts to add b"" to
> their strings...) is more than outweighed by the cost of subtle bugs
> introduced by coders who didn't bother to learn about it properly
> because it worked perfectly on their US system. C is the language
> where you are allowed to shoot yourself in the foot, not Python.
>
> Also, your example is unfair -- you take an example of the current
> Cython compiler and compare it with a fully candied-up alternative.
> For the UTF-8 autoconversion to work there must be some candy as well
> (basically something like the above must be generated by Cython,
> right?), so there shouldn't be any reason (or is there?)

I think you underestimate how complicated it would be to figure out
when it is safe to release the temp, and, if you're creating a copy,
whether or not you have to worry about freeing s. What you really want
is to use the buffer interface of unicode objects, and it is unclear
how to do that without using magic or the C API directly.
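A minimal Python-level sketch of the lifetime problem (names are illustrative): the bytes object returned by encode() owns the buffer a char* would alias, so it must stay referenced for as long as the pointer is in use.

```python
text = "héllo"

# Safe pattern: a named temporary keeps the buffer alive.
tmp = text.encode("UTF-8")     # in Cython: cdef bytes tmp = text.encode("UTF-8")
# ... a char* taken from tmp is valid exactly as long as tmp stays referenced.

# Unsafe pattern (Cython): cdef char* s = text.encode("UTF-8")
# The unreferenced temporary could be freed immediately, leaving s dangling --
# which is why deciding when to release such temporaries is the hard part.
assert tmp == b"h\xc3\xa9llo"  # UTF-8 adds a byte for the non-ASCII char
```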

> that one can't stop adding candy just one layer before, i.e. waiting
> with releasing temporaries assigned to char* as one would have to
> with UTF-8 anyway:
>
>       def dostuff(str text):
>           cdef char* s = text.encode("UTF-8")
>           # Do stuff
>           return str(s, "UTF-8")
>
> (OK, this implies that char* auto-coerces to bytes using
> null-termination, but that hardly seems to be an argument against it
> when your candidate solution is _also_ assuming null-termination; it
> just assumes an encoding in addition. Assuming only null-termination
> is fine; it is a "middle ground".)

I was just saying that we don't want to take the no-assumption route,
so it's a question of which assumptions to make. The code above will
still break for a lot of encodings (e.g. UCS-2 or UCS-4).
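To make the UCS-2 point concrete: wide encodings embed NUL bytes even in pure-ASCII text, so any scheme relying on null-termination truncates after the first character. A quick check in plain Python (utf-16-le used as a stand-in for UCS-2):

```python
data = "abc".encode("utf-16-le")   # UCS-2-style little-endian bytes
assert data == b"a\x00b\x00c\x00"  # a NUL byte follows every ASCII character
assert data.index(b"\x00") == 1    # a null-terminated char* would "end" here
```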

> Finally some wisdom from the Zen of Python:
>>>> import this
> ...
> Explicit is better than implicit

This is a good point. It's a step backwards in terms of usability.  
Worth it? I'm unconvinced but outvoted.

> ...
>
> Phew. I'll try to make this my last one; this should rest a bit :-)

Same here.

- Robert


_______________________________________________
Cython-dev mailing list
[email protected]
http://codespeak.net/mailman/listinfo/cython-dev
