Re: ICU incorporation and string changes heads-up

Leopold Toetsch Sat, 10 Apr 2004 01:22:58 -0700

Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:


> So internally, strings don't have an associated encoding (or chartype
> or anything)

How do you handle EBCDIC? UTF8 for Ponie?

>> - Where is string->language?

> I removed it from the string struct because I think that's the wrong
> place for it (and it wasn't actually being used anywhere yet,
> fortunately).

Not used *yet* - what about:

   use German;
   print uc("i");
   use Turkish;
   print uc("i");
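
To make the point concrete, here is a minimal sketch in Python (not Parrot code). The helper name upper_for_language and the language tags "de" and "tr" are made up for illustration, and only the Turkish dotted/dotless "i" rule is special-cased:

```python
# Hypothetical helper: uppercase a string under a language-specific rule.
# Only the Turkish "i" mappings are special-cased; everything else uses
# the default (language-independent) uppercase mapping.
def upper_for_language(text, language):
    if language == "tr":
        # Turkish: "i" -> U+0130 (capital I with dot above),
        #          dotless U+0131 -> plain "I"
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()

german = upper_for_language("i", "de")   # default mapping
turkish = upper_for_language("i", "tr")  # Turkish mapping
```

The same input yields two different results, so whatever does the uppercasing needs to know which language is in effect.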

> language-dependent (sorting, for example), the operation doesn't depend
> on the language of the strings involved, but rather on the locale of
> the reader.

And if one is working with two different languages at a time?

>> With this string type how do we deal with anything beyond codepoints?

> Hmm, what do you mean?

 "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
 "\N{LATIN CAPITAL LETTER A WITH ACUTE}"

when comparing graphemes or letters. The latter might depend on the
language too.
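
A rough illustration in Python (an assumption-laden sketch, not Parrot code): the two strings differ when compared codepoint by codepoint, but compare equal once both are normalized. NFC normalization is used here as a stand-in for a real grapheme-level comparison:

```python
import unicodedata

decomposed = "A\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"   # LATIN CAPITAL LETTER A WITH ACUTE

# Codepoint-level comparison: two codepoints vs. one, so not equal.
codepoint_equal = decomposed == precomposed

# Compare after canonical composition (NFC) of both sides.
grapheme_equal = (unicodedata.normalize("NFC", decomposed)
                  == unicodedata.normalize("NFC", precomposed))
```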

We'll basically need 4 levels of string support:

,--[ Larry Wall ]--------------------------------------------------------
|  level 0      byte == character, "use bytes" basically
|  level 1      codepoint == character, what we seem to be aiming for, vaguely
|  level 2      grapheme == character, what the user usually wants
|  level 3      letter == character, what the "current language" wants
`------------------------------------------------------------------------

> or ask arbitrary strings what their N-th character

The N-th character depends on the level. For the example above,
C<.length> gives either 2 or 1, depending on whether the user queries
at level 1 or level 2. The same problem arises with positions. The
current level also depends on the scope the string came from (see the
Turkish letter "i" example above).
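
A small Python sketch of how the length changes with the level. The function length_at_level is hypothetical; level 0 assumes a UTF-8 byte representation, and level 2 is approximated by counting only non-combining codepoints, which is cruder than real grapheme segmentation:

```python
import unicodedata

def length_at_level(text, level):
    if level == 0:
        # bytes: assumes UTF-8 as the underlying representation
        return len(text.encode("utf-8"))
    if level == 1:
        # codepoints
        return len(text)
    if level == 2:
        # approximate graphemes: skip combining marks
        return sum(1 for ch in text if unicodedata.combining(ch) == 0)
    raise ValueError("level 3 needs language-specific rules")

s = "A\u0301"  # A followed by COMBINING ACUTE ACCENT
lengths = [length_at_level(s, n) for n in (0, 1, 2)]
```

Level 3 is left out on purpose, since what counts as a letter depends on the language.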

>> - What happened to external constant strings?

> They should still work (or could).

,--[ string.c:714 ]--------------------------------------------------
|   else /* even if buffer is "external", we won't use it directly */
`--------------------------------------------------------------------

>> - What's the plan towards all the transcode opcodes? (And leaving these
>>   as a noop would have been simpler)

> Basically there's no need for a transcode op on a string--it no longer
> makes sense, there's nothing to transcode.

I can't imagine that. I've an ASCII string and want to convert it to UTF8
and UTF16 and write it into a file. How do I do that?
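
For comparison, this is roughly what the task looks like in Python, where the encoding is chosen at the point of output. It is only a sketch of the use case, not a proposal for the Parrot API; the file name and the choice of little-endian UTF-16 without a BOM are assumptions:

```python
import os
import tempfile

text = "hello"                            # pure ASCII content
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")         # identical to ASCII for this text
utf16_bytes = text.encode("utf-16-le")    # explicit byte order, no BOM

# Write the UTF-16 form to a file and read it back.
path = os.path.join(tempfile.mkdtemp(), "out.utf16")
with open(path, "wb") as fh:
    fh.write(utf16_bytes)
with open(path, "rb") as fh:
    round_trip = fh.read().decode("utf-16-le")
```

Even if strings carry no encoding internally, some operation still has to produce these byte sequences on request.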

>> - hash_string seems not to deal with mixed encodings anymore.

> Yep, since we're hashing based on characters rather than bytes, there's
> no such thing as mixed encodings.

See above.

> JEff

leo
