On 3/22/07, Alexander Kjeldaas <[EMAIL PROTECTED]> wrote:
Python is *definitively* not utf16. Python can be compiled to use utf8, utf16 or utf32/ucs4.
UTF-16 or UTF-32. Not UTF-8. I'll ask around and see if the Python folks think this has been good, bad, or indifferent. My impression was that it's considered to have been a mistake, but I could be wrong. My thoughts on this topic actually come largely from Python's experience in this arena.
Python does not have a character type, avoiding the issue of whether there should be O(1) access to characters.
Um, this is a misunderstanding of how Python works. Python provides O(1) access to code units, so for example on a "ucs2" build (the default):

    >>> s = u'\U00012345'
    >>> len(s)
    2
    >>> s[0]
    u'\ud808'

On a "ucs4" build the same code gives different answers. No one in the Python camp exactly likes this, and I don't think we want it for Scheme. If R6RS exposes code units, it should either standardize on a representation everyone can live with, or set the code unit API aside in a separate library, maybe (r6rs string-code-units), so people won't naively trip over it.
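To make the distinction concrete, here is a small sketch runnable on modern Python 3, which (like a "ucs4" build) indexes by code point. Encoding the same string to UTF-16 recovers the two code units, including the high surrogate u'\ud808' shown above:

```python
# Code units vs. code points for a character outside the BMP.
# Python 3 counts code points; the UTF-16 encoding shows the code units.
s = '\U00012345'               # one code point, U+12345

code_points = len(s)           # 1 on Python 3 / a "ucs4" build
utf16 = s.encode('utf-16-be')  # UTF-16 encodes it as a surrogate pair
code_units = len(utf16) // 2   # 2 sixteen-bit code units

high = int.from_bytes(utf16[:2], 'big')  # 0xD808, the high surrogate
low = int.from_bytes(utf16[2:], 'big')   # 0xDF45, the low surrogate
print(code_points, code_units, hex(high), hex(low))
```

On a narrow ("ucs2") build, len(s) and s[0] would report the code units directly, which is exactly the discrepancy at issue.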
According to Guido van Rossum, python 3000 might use all three internal representations at the same time.
Well, it's possible. I think he mentioned it at PyCon. I'll gladly bet it doesn't change: too much work, and it would either complicate the Python C API (one of Python's major strings--er, strengths) or hurt performance, or both. I'll ask about this too.
Neither Xerces-C nor ICU specifies their internal representation as part of the interface AFAIK. On the other hand, since they deal with encodings they support lots of them.
Xerces-C: "String is represented by 'XMLCh*' which is a pointer to unsigned 16 bit type holding utf-16 values, null terminated."
http://xml.apache.org/xerces-c/ApacheDOMC++BindingL2.html

ICU: "In ICU, a Unicode string consists of 16-bit Unicode code units. A Unicode character may be stored with either one code unit (the most common case) or with a matched pair of special code units ("surrogates"). The data type for code units is UChar. [...] "Indexes and offsets into and lengths of strings always count code units, not code points."
http://www.icu-project.org/apiref/icu4c/classUnicodeString.html#_details

Regarding the rest of your comments: your experience and mine obviously differ. I wonder if you have profiled a system using both UTF-16 and UTF-32 strings; I have not. I think the rate-determining step is probably neither unaligned accesses nor processor cache but how much copying and transcoding you're forced to do. UTF-16 is a significant win in that regard.

-j

_______________________________________________
r6rs-discuss mailing list
[EMAIL PROTECTED]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
