> Jeff Clites <[EMAIL PROTECTED]> wrote:
> 
>>On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

I'm replying for Jeff since I've been burned by the same questions
over and over again :-)

> 
>>So internally, strings don't have an associated encoding (or chartype
>>or anything)
> 
> 
> How do you handle EBCDIC? UTF8 for Ponie?


All character sets (like EBCDIC) and encodings (like UTF-8) are
"normalized" to the Unicode character set, held in our own *internal*
encoding (the 8/16/32-bit one).
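A minimal sketch of the idea behind that 8/16/32 internal encoding
(hypothetical Python, not Parrot's actual code -- `unit_width` is an
invented name): pick the narrowest fixed-width unit that holds every
codepoint of the string.

```python
def unit_width(s: str) -> int:
    """Smallest fixed-width unit (8/16/32 bits) that can hold
    every codepoint of s -- the idea behind the internal encoding."""
    m = max(map(ord, s), default=0)
    if m < 0x100:
        return 8    # every codepoint fits in one byte
    if m < 0x10000:
        return 16   # fits in 16-bit units (BMP minus Latin-1 range)
    return 32       # needs full 32-bit units

print(unit_width("abc"))          # -> 8
print(unit_width("\u20ac"))       # -> 16 (EURO SIGN)
print(unit_width("\U0001F600"))   # -> 32 (outside the BMP)
```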

> Not used *yet* - what about:
> 
>    use German;
>    print uc("i");
>    use Turkish;
>    print uc("i");

That is implementable (and already implemented by ICU), but by something
higher-level than a "string".
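To illustrate the language dependence with the Turkish "i" (a hedged
Python sketch, not the Parrot or ICU API -- the `uc` helper and its
mapping table are invented for illustration):

```python
# In Turkish, uppercase of "i" is U+0130 (I WITH DOT ABOVE),
# and uppercase of dotless "\u0131" is plain "I".
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}

def uc(s: str, lang: str = "en") -> str:
    """Toy locale-sensitive uppercase; real code would call ICU."""
    if lang == "tr":
        return "".join(TURKISH_UPPER.get(ch, ch.upper()) for ch in s)
    return s.upper()

print(uc("i"))        # -> "I"
print(uc("i", "tr"))  # -> "\u0130" (LATIN CAPITAL LETTER I WITH DOT ABOVE)
```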

> And if one is working with two different languages at a time?

One goes mad.  As Jeff demonstrated, there is no silver bullet here:
one quickly gets into situations where there is provably NO correct
solution.  So we shouldn't try to build the impossible into the lowest
level of the string implementation.

> when comparing graphemes or letters. The latter might depend on the
> language too.
> 
> We'll basically need 4 levels of string support:
> 
> ,--[ Larry Wall ]--------------------------------------------------------
> |  level 0    byte == character, "use bytes" basically
> |  level 1    codepoint == character, what we seem to be aiming for, vaguely
> |  level 2    grapheme == character, what the user usually wants
> |  level 3    letter == character, what the "current language" wants
> `------------------------------------------------------------------------

Jeff's solution gives us level 1, and I assume that level 0 is trivially
derivable from that.  Note, however, that not all string operations
(especially a set of string ops as rich as Perl's) can even be
defined for all those levels: e.g. bitwise boolean string ops are rather
insane at levels higher than zero.

> The N-th character depends on the level. In the above examples C<.length> gives
> either 2 or 1, when the user queries at level 1 or 2. The same problem
> arises with positions. The current level also depends on the scope where the
> string came from. (s. example WRT turkish letter "i")

The levels 2 and 3 depend on something higher level, like the higher
levels of ICU.  I believe we have everything we need (and even more) in
ICU.  Let's get the levels 0 and 1 working first.
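The level-1 vs. level-2 difference in C<.length> can be seen with a
combining character (Python sketch; counting non-combining codepoints is
only an approximation of real grapheme segmentation, which ICU provides):

```python
import unicodedata

s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one grapheme, two codepoints

# Level 1: codepoint == character
print(len(s))  # -> 2

# Level 2 (approximation): count only base (non-combining) codepoints
level2 = sum(1 for ch in s if unicodedata.combining(ch) == 0)
print(level2)  # -> 1
```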

>>>- What's the plan towards all the transcode opcodes? (And leaving these
>>>  as a noop would have been simpler)
> 
> 
>>Basically there's no need for a transcode op on a string--it no longer
>>makes sense, there's nothing to transcode.
> 
> 
> I can't imagine that. I've an ASCII string and want to convert it to UTF8
> and UTF16 and write it into a file. How do I do that?

IIUC the old "transcoding" stuff transcoded at run time so that two
encoding-marked strings could be compared.  The new scheme "normalizes"
(not to be confused with Unicode normalization) all strings to Unicode.
If you want transformations like the one you describe above, you either
call an explicit transcoding interface (which ICU no doubt has) or your
I/O layers do it implicitly (functionality PIO does not yet have, if I
understood Jeff correctly).
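In Python terms the two routes look like this (illustrative only;
Parrot's interfaces would go through ICU and PIO):

```python
import io

s = "an ASCII string"  # internally just codepoints, no attached encoding

# Route 1: explicit transcoding interface
utf8_bytes = s.encode("utf-8")
utf16_bytes = s.encode("utf-16-le")
print(len(utf16_bytes))  # 2 bytes per BMP codepoint

# Route 2: the I/O layer transcodes implicitly on write
buf = io.BytesIO()
writer = io.TextIOWrapper(buf, encoding="utf-16-le")
writer.write(s)
writer.flush()
print(buf.getvalue() == utf16_bytes)  # -> True
```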

Maybe it's good to refresh on the 'character hierarchy' as defined by
Unicode (and IETF, and W3C).

ACR - Abstract Character Repertoire: an unordered collection of abstract
characters, like "UPPERCASE A" or "LOWERCASE B" or "DECIMAL DIGIT SEVEN".

CCS - Coded Character Set: an ordered (numbered) list of characters,
like 65 -> "UPPERCASE A".  For example: ASCII and EBCDIC.

CEF - Character Encoding Form: mapping the numbers of the CCS character
codes to platform-specific numbers like bytes or integers.

CES - Character Encoding Scheme: mapping the CEF numbers to serialized
bytes, possibly adding synchronization metadata like shift codes or byte
order markers.

The great confusion exists mostly because in the old world (like
ASCII or Latin-1) all four of these levels were conflated into one.

ISO 8859-1 (which is a CCS) has an eight-bit CEF.  UTF-8 is both a CEF
and a CES.  UTF-16 is a CEF, while UTF-16LE is a CES.  ISO 2022-{JP,KR}
are CES.
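The CEF/CES distinction is visible from Python's codecs (illustrative;
the hex dumps are the serialized CES bytes):

```python
s = "A\u20ac"  # 'A' plus EURO SIGN (U+20AC)

# UTF-8: CEF and CES coincide, bytes are the serialization
print(s.encode("utf-8").hex(" "))      # -> 41 e2 82 ac

# UTF-16LE: a CES, byte order fixed by the scheme, no BOM
print(s.encode("utf-16-le").hex(" "))  # -> 41 00 ac 20

# "utf-16": a CES that prepends a BOM as byte-order metadata
# (ff fe on a little-endian machine)
print(s.encode("utf-16").hex(" "))
```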

(Outside of Unicode) there is TES (Transfer Encoding Syntax), too, which
is application-level encoding like base64 or gzip.
