There are two essentially separate issues going on in this discussion of
strings. One has to do with data structures, and the other has to do with
character encodings. Let me tackle them in turn:
*Data Structure Issues*
While I understand the motivation for a packed string encoding - one where
the character payload sequentially follows the object header in memory - it
is *not at all clear* that this is a good representation for strings. The
problem is compaction/copying. For short strings (up to 64 bytes), this isn't
really a factor, but for long strings it's a problem. A long,
reference-free object that needs relocation introduces several issues:
- It reduces the cache efficiency of the collector
- It takes a potentially unbounded amount of time to copy
- It doesn't necessarily improve performance
This is why *many* runtimes prefer to put string and array data into
separately allocated storage (sometimes called large object storage). What
this does is eliminate most of the situations in which the large object
might need to be relocated. It is usually true in those runtimes that array
payload compaction is handled specially. In some such runtimes it isn't
done at all.
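To make the two layouts concrete, here is a minimal C sketch (not BitC, and
not any particular runtime's real header layout; the type and field names are
made up) of a packed "tail payload" string versus a small header that points
at separately allocated payload:

#include <stddef.h>
#include <stdint.h>

/* Packed representation: the code units sit immediately after the header,
 * so the whole object, payload included, moves whenever the collector
 * relocates it. */
typedef struct {
    size_t length;        /* number of code units */
    uint16_t payload[];   /* flexible array member: data follows the header */
} packed_string;

/* Separately allocated representation: the header is a small, fixed-size
 * object; the payload lives in its own (large object) storage and normally
 * never moves, so only the small header participates in compaction. */
typedef struct {
    size_t length;        /* number of code units */
    uint16_t *payload;    /* pointer to separately allocated storage */
} boxed_string;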
In general, an array (of either sort) can only be unboxed when (a) its
size is statically known, or (b) it appears in tail position of a data
structure. In order to be in tail position, the containing data structure
must be final so that the tail position constraint is guaranteed.
Whether it is unboxed or not, a non-statically-sized array must carry a
length field. The reason that a separately stored payload isn't a big deal
is:
- The payload pointer can be loaded at the same time as the length
- On modern processors, the offset computation can be done concurrently
with the bounds check
- On modern processors, register renaming means that we don't need
to consider register pressure issues.
These statements increasingly apply to embedded processors as well.
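To make the access path concrete, here is a hedged sketch (again illustrative
C, not BitC), building on the boxed_string layout above. Case (a) is a
statically sized member that unboxes outright; the indexing function shows why
the separately stored payload costs little: the length and the payload pointer
come out of the same header, and the bounds check is independent of the
address computation.

#include <stdint.h>
#include <stdlib.h>   /* abort, standing in for a range error */

/* Case (a): a statically sized array unboxes directly into its container,
 * because its size is known at compile time and no length field is needed. */
typedef struct {
    uint32_t tag;
    uint16_t name[32];    /* unboxed fixed-size array */
} fixed_record;

/* Indexing the boxed_string sketched above: the length and the payload
 * pointer load from the same header (typically the same cache line), and
 * the bounds check does not depend on the address computation, so the two
 * can be overlapped by the processor. */
uint16_t code_unit_at(const boxed_string *s, size_t i)
{
    if (i >= s->length)      /* bounds check */
        abort();             /* stand-in for the language's range error */
    return s->payload[i];    /* base + scaled offset */
}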
The case where a packed "tail payload" is more compelling is something like
network packets, where a variable-length packed format is arriving from an
external source and you need a way to describe it as a language-level data
structure.
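A rough C sketch of that situation, with made-up field names for a
hypothetical wire format (byte order and alignment are ignored for the sake of
the sketch):

#include <stddef.h>
#include <stdint.h>

/* A hypothetical wire format: a fixed header followed by a variable-length
 * payload, exactly as it arrived off the network.  The flexible array
 * member lets the language describe the packed bytes in place rather than
 * copying them into separately allocated storage. */
typedef struct {
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t payload_len;   /* number of payload bytes that follow */
    uint8_t  payload[];     /* packed tail payload */
} datagram;

/* Validate a received buffer and view it as a datagram. */
const datagram *as_datagram(const uint8_t *buf, size_t buflen)
{
    const datagram *d = (const datagram *)buf;
    if (buflen < sizeof *d || buflen - sizeof *d < d->payload_len)
        return NULL;
    return d;
}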
*Character Encoding*
Bennie wrote:
> Now you say i want the 132rd letter in a string this is not meaningful and
> incorrect in some asian languages since they use multiple unicode chars as
> ascii for encoding as discussed ie word[5] may be the 2nd character...
This is mistaken. For starters, Bennie's comment confuses code points with
characters and then muddles some encoding issues. Once that is cleaned up,
we still need to consider normalizations and modifiers.
In a UNICODE string, it is *perfectly* sensible to request either the 132nd
*code point* or the 132nd *code unit*. Obtaining the 132nd code unit is an
O(1) operation, but the result may not correspond to a code point or even
to the *beginning* of a code point. Obtaining the 132nd code point may or
may not be an O(1) operation depending on the encoding selected (UTF-8,
UTF-16, UTF-32) and the character planes that are in use. In most cases,
the extended character planes can be ignored. Under that assumption, UTF-16
encoding can be implemented in UCS-2 code units and O(1) indexing is
achieved. Yes, that incurs memory overhead.
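Here is a hedged C sketch of the difference (the helper names are invented for
illustration): fetching the Nth UTF-16 code unit is a single array access,
while locating the Nth code point in UTF-8 requires scanning past every
earlier code point:

#include <stddef.h>
#include <stdint.h>

/* Nth code unit of a UTF-16 string: O(1), but the result may be half of a
 * surrogate pair rather than a whole code point. */
static uint16_t utf16_code_unit_at(const uint16_t *s, size_t n)
{
    return s[n];
}

/* Byte offset of the Nth code point in a UTF-8 string: O(n), because every
 * preceding code point's length must be examined.  Continuation bytes have
 * the form 10xxxxxx, so only bytes that begin a code point are counted. */
static size_t utf8_code_point_offset(const uint8_t *s, size_t len, size_t n)
{
    size_t i = 0;
    while (i < len) {
        if ((s[i] & 0xC0) != 0x80) {   /* start of a code point */
            if (n == 0)
                return i;
            n--;
        }
        i++;
    }
    return len;                        /* past the end */
}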
Having a code point does not mean you have a character or even the
beginning of a character. There are so-called "combining sequences" in
UNICODE that involve multiple code points. For example, there are multiple
valid ways of representing a capital C with a cedilla. See here
<http://unicode.org/reports/tr15/#Canon_Compat_Equivalence> in the
applicable standard.
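For a concrete instance, using the code points involved in that example: the
precomposed form U+00C7 and the sequence U+0043 followed by combining U+0327
are canonically equivalent spellings of "Ç", yet they are different code point
sequences of different lengths, so anything operating below the normalization
layer sees two distinct strings. A small illustrative C fragment:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* LATIN CAPITAL LETTER C WITH CEDILLA as a single precomposed code
     * point... */
    const uint32_t precomposed[] = { 0x00C7 };
    /* ...and as a base letter followed by COMBINING CEDILLA. */
    const uint32_t decomposed[]  = { 0x0043, 0x0327 };

    /* Both sequences denote the same character, but they differ in length
     * and content, so any comparison done below the normalization layer
     * (memcmp, code point by code point, indexing) treats them as distinct. */
    printf("precomposed: %zu code point(s)\n",
           sizeof precomposed / sizeof precomposed[0]);
    printf("decomposed:  %zu code point(s)\n",
           sizeof decomposed / sizeof decomposed[0]);
    return 0;
}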
What this mainly serves to reveal is that the proper handling of
international character data is a nightmarishly complex business, and the
entire *concept* of a fixed-length character needs to be discarded in order
to understand how international text really works. Once you realize that,
your whole point of view on indexing into encodings changes, because getting an
O(1) indexing operation at the encoding layer doesn't really help you that
much.
In fact - and I say this having done it myself - if you're still looking
for an O(1) way to extract a "character" from a UNICODE string, it's fair
to say that you probably don't understand what a character is.
I'm only a little bit ahead of some of you. I've dug into UNICODE far
enough to recognize that I don't know what a character is, but not far
enough to feel like I know how to really drive UNICODE. On the bright side,
I seem to have a lot of company. :-)
CLR includes a 'char' datatype largely for legacy compatibility reasons,
and because you need *something* to hold the result of indexing a string.
But if you dig into the CLR internationalization APIs (or the Java
equivalents), you'll find that most of them are framed in terms of
operations on strings rather than operations on "characters". That's
because there ain't no such thing as a fixed-length character.
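The same shape shows up in C-level libraries. As a hedged sketch, assuming
ICU4C's break iterator interface (ubrk.h, not anything BitC defines): the API
walks grapheme-cluster boundaries over a whole string, and a "character" falls
out as a substring between two boundaries rather than a fixed-width value:

/* Compile against ICU4C, e.g.: cc example.c $(pkg-config --cflags --libs icu-uc) */
#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void)
{
    /* "Ç" spelled as C + COMBINING CEDILLA, followed by "a": two
     * user-visible characters, three UTF-16 code units. */
    const UChar text[] = { 0x0043, 0x0327, 0x0061, 0x0000 };
    UErrorCode status = U_ZERO_ERROR;

    UBreakIterator *bi = ubrk_open(UBRK_CHARACTER, "en", text,
                                   u_strlen(text), &status);
    if (U_FAILURE(status))
        return 1;

    /* The iterator yields boundaries between grapheme clusters; the "i'th
     * character" is whatever lies between two boundaries, not a slot of
     * fixed width. */
    int32_t prev = ubrk_first(bi);
    for (int32_t pos = ubrk_next(bi); pos != UBRK_DONE; pos = ubrk_next(bi)) {
        printf("grapheme cluster: code units [%d, %d)\n", prev, pos);
        prev = pos;
    }
    ubrk_close(bi);
    return 0;
}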
Jonathan