RE: XMLCh & wchar_t conversion on multiple platforms

Arnold, Curt Mon, 14 May 2001 16:25:34 -0700
> That wouldn't work. I assume you meant <= 255, but anyway, it still
> probably wouldn't work. Just because 8859-1 can *hold* any one byte
> encoding, the meaning of those code points would be lost. You would
> have to use something like UTF-8, i.e. something that is a transfer
> encoding and can represent Unicode code points, since that's the only
> way you can retain the semantics.

I did mean <= 255.  How my DOMString() internally holds the data is known to the
DOMString implementation, but is hidden from external users.

When one of my DOMStrings is initialized from an XMLCh*,
the XMLCh* is analyzed to determine the appropriate internal
representation for that particular DOMString.
If the XMLCh* only contains code points <= 255, then the internal
representation is marked as ISO-8859-1 (or US-ASCII if
all code points are <= 127). If it contains code points > 255,
then it will choose UTF-8 or UTF-16 depending on their relative sizes.
There are a lot of nasty switch statements within the DOMString
class that direct you to the appropriate implementation of
DOMString::operator+(), for example, depending on the internal
representations of the participating strings.  However, the
ISO-8859-1 implementations are more efficient since they can
directly convert character offsets into byte offsets, which
is not possible with UTF-8.
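
To make the selection step concrete, here is a minimal sketch of the
analysis described above.  It is not my actual DOMString code: the names
XMLCh16, Repr, utf8Length and chooseRepr are made up for illustration,
XMLCh is assumed to be a 16-bit UTF-16 code unit, and "depending on
relative sizes" is assumed to mean "take whichever encoding is smaller".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for XMLCh, assumed to be a UTF-16 code unit.
using XMLCh16 = char16_t;

// Hypothetical tag for the internal representation of one string.
enum class Repr { UsAscii, Latin1, Utf8, Utf16 };

// Number of bytes the UTF-16 input would occupy when encoded as UTF-8.
std::size_t utf8Length(const XMLCh16* s, std::size_t n) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint32_t c = s[i];
        if (c < 0x80) {
            bytes += 1;
        } else if (c < 0x800) {
            bytes += 2;
        } else if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;  // surrogate pair: one code point, four UTF-8 bytes
            ++i;         // skip the low surrogate
        } else {
            bytes += 3;
        }
    }
    return bytes;
}

// Scan the input once and pick the narrowest representation that fits.
Repr chooseRepr(const XMLCh16* s, std::size_t n) {
    assert(s != nullptr || n == 0);
    XMLCh16 maxUnit = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] > maxUnit) maxUnit = s[i];
    }
    if (maxUnit <= 0x7F) return Repr::UsAscii;
    if (maxUnit <= 0xFF) return Repr::Latin1;
    // Assumption: prefer UTF-8 only when it is strictly smaller than UTF-16.
    return utf8Length(s, n) < 2 * n ? Repr::Utf8 : Repr::Utf16;
}
```

With this rule, mostly-ASCII text with a few characters above 255 ends up
as UTF-8, while text that is mostly CJK stays UTF-16, since each such
character costs three bytes in UTF-8 but only two in UTF-16.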

No one gets at the buffer directly; you have to do a
DOMString::copy to transfer the data from the
DOMString to your buffer.  That would even allow some simple
dictionary compression to be used to keep the data size down
(again at some cost in processing).
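
As a rough illustration of the copy-out idea for the ISO-8859-1 case,
the sketch below stores one byte per character and widens on copy.  The
class name Latin1String and the signature of copy() are hypothetical,
not the real DOMString interface, and XMLCh is again assumed to be a
16-bit code unit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for XMLCh, assumed to be a UTF-16 code unit.
using XMLCh16 = char16_t;

// Hypothetical string that keeps its data as ISO-8859-1 bytes and never
// exposes the buffer; callers must copy the data out as XMLCh16.
class Latin1String {
public:
    // Precondition: every input code unit is <= 0xFF.
    Latin1String(const XMLCh16* s, std::size_t n) {
        bytes_.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            assert(s[i] <= 0xFF);
            bytes_.push_back(static_cast<unsigned char>(s[i]));
        }
    }

    std::size_t length() const { return bytes_.size(); }

    // Copies up to `capacity` characters, beginning at character offset
    // `start`, into `dest`.  Returns the number of code units written.
    std::size_t copy(XMLCh16* dest, std::size_t capacity,
                     std::size_t start = 0) const {
        if (start >= bytes_.size()) return 0;
        const std::size_t n = std::min(capacity, bytes_.size() - start);
        for (std::size_t i = 0; i < n; ++i) {
            // ISO-8859-1 bytes map one-to-one onto U+0000..U+00FF, so
            // widening each byte recovers the original code point.
            dest[i] = static_cast<XMLCh16>(bytes_[start + i]);
        }
        return n;
    }

private:
    std::vector<unsigned char> bytes_;
};
```

Note that `start` is used directly as an index into the byte buffer.  A
UTF-8 variant would first have to scan from the beginning to find the
byte where character `start` begins.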

> I was discussing a Unicode aware application, in which it would
> *always* be demanded, because that's the only format the program
> works in. This is the future, and this is what should be targeted.

That the representation is "compressed" within the implementation
of DOMString in no way compromises the rest of the application,
since it is either doing DOMString manipulations (in which case it is
not exposed to the internal representation) or explicitly
copying the data to XMLCh.

Basically, I'm saving memory at the cost of increasing the complexity
of the DOMString implementation and a slightly higher initialization time.

It might not be everyone's cup of tea, but it is addressing my needs
quite well.  I'm not trying to impose my design decisions on anyone;
I'm just saying that it would be beneficial to have ICU provide what I'm
currently doing with code inside of my DOMString representation.

