On Sep 9, 2009, at 7:40 PM, Stefan Behnel wrote:

> Robert Bradshaw wrote:
>> One of the reasons I was so quick to discard this is
>> because I thought the usecase was that null characters needed to be
>> embedded, which is completely orthogonal, and I couldn't think of
>> anywhere I'd come across unsigned char* used for strings (but clearly
>> libxml2 is such a library).
>
> I actually never understood why people use plain char* in the first place
> (ok, apart from tradition, laziness and non-ASCII unawareness). Any 1-byte
> encoding table I've ever come across maps characters to the byte values
> 0-255 or 0x00-0xFF. I've never seen an encoded byte string represented with
> negative byte values. The habit of using char* for text goes so far that I
> wasn't even aware that char* was pointing to a signed value when I learned
> C.

It might not be signed, depends on the compiler. Unsigned char is just
so long to type for something so common, and that upper bit (or the
negative values) really aren't all that useful--once you leave ASCII
there's a whole host of bigger issues to deal with (if only everyone
used UTF-8 unicode...).

> Before I was made aware of it, I just unconsciously considered 'char' a
> special case in the language (which it actually is when you think about it).
>
>> Just out of curiosity, does it use char* for ASCII and unsigned char*
>> for utf-8 as a poor-man's typechecking for encoding?
>
> It's a form of type-checking, yes, but not in that way. It uses unsigned
> char* for text (tag names, text values, etc.) and char* for data sequences
> (e.g. file names and serialised XML). It even redefines "unsigned char" as
> "xmlChar" for that purpose, and a macro "BAD_CAST" that does exactly what
> it sounds like.
>
> I guess the historical reason to do that was that you can (or could?)
> switch the internal text encoding in libxml2, so you could use Latin-1
> instead of UTF-8, for example, and the xmlChar* would denote all strings
> encoded that way. Doesn't make much sense for XML nowadays and just little
> more for HTML, but it's still a nice way of documenting the API. And to me,
> it makes sense to use "unsigned char" anyway.

Ah.

- Robert

_______________________________________________
Cython-dev mailing list
http://codespeak.net/mailman/listinfo/cython-dev
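
For anyone skimming the thread later, the two points above (plain char may or may not be signed, and libxml2 separates text from data by using unsigned char) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `demo_xmlChar` and `DEMO_BAD_CAST` are local stand-ins that mirror how the thread describes libxml2's `xmlChar` and `BAD_CAST`, and the three helper functions are hypothetical names that do not come from libxml2 or Cython.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Local stand-ins mirroring libxml2's xmlChar and BAD_CAST as described
   in the thread; hypothetical names, so no libxml2 header is needed. */
typedef unsigned char demo_xmlChar;
#define DEMO_BAD_CAST (demo_xmlChar *)

/* Returns 1 if plain char is signed with this compiler, 0 otherwise.
   CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Reads one byte as a value in 0..255, whatever the signedness of char. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Counts the bytes >= 0x80 (i.e. outside ASCII) in a NUL-terminated
   text string. With unsigned bytes the comparison needs no casts. */
size_t count_non_ascii(const demo_xmlChar *text)
{
    size_t n = 0;

    assert(text != NULL);
    for (; *text != 0; ++text) {
        if (*text >= 0x80) {
            n++;
        }
    }
    return n;
}
```

With these definitions, `byte_value((char)0xE9)` should give 233 whether or not `plain_char_is_signed()` reports a signed char, and a call such as `count_non_ascii(DEMO_BAD_CAST "caf\xC3\xA9")` needs the cast because a string literal is an array of plain char.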
