Hi,
Dag Sverre Seljebotn wrote:
> I think this is partially a culture thing: Me and Stefan live using
> non-ASCII alphabets daily, and still (in 2008) have to live with lots of
> software that just doesn't handle things properly or have small nags. This
> is a real problem, and most coders don't bother with it.
Yes, I think that's the main problem here. There just is no such thing as a
default encoding. There is only Unicode and tons of different mappings to
bytes that all have their respective corner cases. Giving Cython a default
encoding would just ignore that fact.
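To make that concrete, here is a small Python 3 sketch. The sample bytes and the three codecs are picked purely for illustration. The same two bytes give different text, or no text at all, depending on which mapping is assumed:

```python
# The same byte sequence, interpreted under three different encodings.
data = b"\xc3\xa6"  # UTF-8 encoding of U+00E6 (LATIN SMALL LETTER AE)

as_utf8 = data.decode("utf-8")      # one character: U+00E6
as_latin1 = data.decode("latin-1")  # two characters: U+00C3, U+00A6

# ASCII has no mapping for bytes >= 0x80, so decoding fails outright.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

None of the three results is more "default" than the others; only the caller knows which one was meant.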
If you give inexperienced people tools that keep them from thinking about real
problems, they will not think about them. And even if you generally know what
you're doing, there may always be situations where you write a quick
    def f(char* s):
        ...
without caring about the implications, and it just breaks in a bizarre way
when a user from the other end of the world passes something really
unexpected. It doesn't even have to be that obvious: think of a call chain of
functions from the API down to some C-level string treatment. It is good design
to have a designated, explicit point in that chain where conversion takes
place. And it's worth bothering with that.
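A minimal Python 3 sketch of such a designated conversion point follows. The names api_set_title and _normalise are hypothetical and not taken from any real library. The entry point is the only place where bytes may become text, and it refuses to guess an encoding:

```python
def api_set_title(title, encoding=None):
    """Hypothetical public entry point: the one place where bytes become text."""
    if isinstance(title, bytes):
        if encoding is None:
            # No default encoding: the caller has to state what the bytes mean.
            raise TypeError("byte string passed without an explicit encoding")
        # Raises UnicodeDecodeError right here if the bytes do not match.
        title = title.decode(encoding)
    elif not isinstance(title, str):
        raise TypeError("expected str or bytes, got %s" % type(title).__name__)
    return _normalise(title)


def _normalise(text):
    """Hypothetical internal helper: handles text only, never raw bytes."""
    assert isinstance(text, str)
    return " ".join(text.split())
```

Everything below api_set_title can then rely on receiving text, so a failure shows up at the API boundary and not somewhere deep inside the call chain.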
lxml, for example, and its ancestor ElementTree, allow unicode strings and
byte strings in their interface wherever Unicode input makes sense. However,
if you pass a byte string, it must be a plain 7-bit ASCII string or it will be
explicitly rejected, very close to the API entry point. Allowing byte strings
is just for convenience, as many, many users work with ISO encodings, ASCII or
UTF-8, and most XML names in the world really are plain ASCII. However, as
soon as you allow any 8-bit data, users will run into the trap of accidentally
passing things as they receive them, without thinking. And they may not even
notice until much later, when things get decoded on the way out again and
break. And believe me, they will not say "oh, my bad", as it will take them
days to debug these things to figure out where the broken string really came
from. And then they will come to the mailing list and shout "why didn't your
software tell me?!".
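The kind of check described above can be sketched in a few lines of Python 3. This is not lxml's actual code, and to_text is a made-up name; it only illustrates the policy of accepting unicode strings and plain ASCII byte strings while rejecting everything else at the entry point:

```python
def to_text(value):
    """Hypothetical input check: accept text, or bytes that are 7-bit ASCII."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        try:
            # ASCII decoding fails on any byte >= 0x80, which is the point.
            return value.decode("ascii")
        except UnicodeDecodeError:
            raise ValueError(
                "byte strings must be plain 7-bit ASCII; "
                "pass a unicode string for anything else"
            ) from None
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)
```

With this in place, to_text(b"tag") passes, while to_text(b"t\xe6g") fails immediately with a message that names the problem, long before the value gets decoded on the way out.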
One thing I learned is that explicit input checking is worth it. And this is
definitely true in the string world.
Stefan
BTW, I might even decide to reject byte string input in lxml when it runs in
Py3 (except for XML byte streams, obviously). I think that would match the way
Py3 code works.
_______________________________________________
Cython-dev mailing list
http://codespeak.net/mailman/listinfo/cython-dev