On 3/22/07, Alexander Kjeldaas <[EMAIL PROTECTED]> wrote:
Python is *definitively* not utf16.  Python can be compiled to use
utf8, utf16 or utf32/ucs4.

UTF-16 or UTF-32.  Not UTF-8.

I'll ask around and see if the Python folks think this has been good,
bad, or indifferent.  My impression was that it's considered to have
been a mistake, but I could be wrong.

My thoughts on this topic actually come largely from Python's
experience in this arena.

Python does not have a character type,
avoiding the issue of whether there should be O(1) access to
characters.

Um, this is a misunderstanding of how Python works.  Python
provides O(1) access to code units, so for example on a
"ucs2" build (the default):

 >>> s = u'\U00012345'
 >>> len(s)
 2
 >>> s[0]
 u'\ud808'

On a "ucs4" build the same code gives different answers.  No one in
the Python camp is exactly happy about this, and I don't think we want
it for Scheme.  If R6RS exposes code units, it should either
standardize on a representation everyone can live with, or set the
code unit API aside in a separate library, maybe
(r6rs string-code-units), so people won't naively trip over it.
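To make the build-dependent difference concrete, here is a small sketch in modern Python (3.3 and later, where string indexing counts code points on every build); it recovers the code-unit counts of the old "ucs2" and "ucs4" builds by encoding to utf-16-le and utf-32-le and dividing by the unit size:

```python
s = '\U00012345'          # one code point outside the BMP

# Code points: always 1 on a modern Python.
assert len(s) == 1

# UTF-16 code units: a surrogate pair, so 2 units
# (this is what len(s) returned on a "ucs2" build).
utf16_units = len(s.encode('utf-16-le')) // 2
assert utf16_units == 2

# UTF-32 code units: exactly one unit per code point
# (this is what len(s) returned on a "ucs4" build).
utf32_units = len(s.encode('utf-32-le')) // 4
assert utf32_units == 1
```

The same source line thus answers "1" or "2" depending only on which unit you count, which is exactly the portability hazard for Scheme.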

According to Guido van Rossum, python 3000 might use all
three internal representations at the same time.

Well, it's possible.  I think he mentioned it at PyCon.  I'll gladly
bet it doesn't change: too much work, and it would either complicate
the Python C API (one of Python's major strings--er, strengths) or
hurt performance, or both.

I'll ask about this too.

Neither Xerces-C nor ICU specifies their internal representation as
part of the interface AFAIK.  On the other hand, since they deal
with encodings they support lots of them.

Xerces-C:

 "String is represented by 'XMLCh*' which is a pointer to unsigned
 16 bit type holding utf-16 values, null terminated."

 http://xml.apache.org/xerces-c/ApacheDOMC++BindingL2.html

ICU:

 "In ICU, a Unicode string consists of 16-bit Unicode code units.
 A Unicode character may be stored with either one code unit
 (the most common case) or with a matched pair of special
 code units ("surrogates"). The data type for code units is UChar.
 [...]

 "Indexes and offsets into and lengths of strings always count
 code units, not code points."

 http://www.icu-project.org/apiref/icu4c/classUnicodeString.html#_details
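To show what "indexes count code units, not code points" means in
practice, here is a hedged Python sketch that uses utf-16-le bytes as
stand-in storage for a UTF-16 code-unit array (this mirrors the ICU
convention, though the helper below is purely illustrative, not an ICU
API):

```python
import struct

s = 'a\U00012345b'                 # 3 code points, 4 UTF-16 code units
units = s.encode('utf-16-le')      # stand-in for a UChar array

def unit(i):
    """Read the 16-bit code unit at index i (little-endian)."""
    return struct.unpack_from('<H', units, 2 * i)[0]

assert len(units) // 2 == 4        # length counts units, not characters
assert unit(0) == ord('a')
assert 0xD800 <= unit(1) <= 0xDBFF # high surrogate of U+12345
assert 0xDC00 <= unit(2) <= 0xDFFF # low surrogate of U+12345
assert unit(3) == ord('b')
```

Index 1 or 2 lands inside a surrogate pair, so any API that hands out
raw unit offsets can point mid-character; that is the trade ICU makes
for O(1) indexing.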

Regarding the rest of your comments:  your experience and mine
obviously differ.  I wonder if you have profiled a system using both
UTF-16 and UTF-32 strings.  I have not.

I think the rate-determining step is probably neither unaligned
accesses nor processor cache but how much copying and transcoding
you're forced to do.  UTF-16 is a significant win in that regard.

-j

_______________________________________________
r6rs-discuss mailing list
[email protected]
http://lists.r6rs.org/cgi-bin/mailman/listinfo/r6rs-discuss
