Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:
> So internally, strings don't have an associated encoding (or chartype
> or anything)
How do you handle EBCDIC? UTF8 for Ponie?
>> - Where is string->language?
> I removed it from the string struct because I think that's the wrong
> place for it (and it wasn't actually being used anywhere yet,
> fortunately).
Not used *yet* - what about:
use German;
print uc("i");
use Turkish;
print uc("i");
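
To make the difference concrete, here is a minimal sketch in Python (not Parrot ops). The function name uc and the language tags "de" and "tr" are made up for illustration; the only rule special-cased is the well-known Turkish/Azeri dotted "i".

```python
# Hypothetical language-aware uppercase: the result for "i" depends on
# the language in effect, not on the codepoints alone.
DOTTED_I_LANGUAGES = {"tr", "az"}  # assumption: tags chosen for this sketch

def uc(s, language):
    if language in DOTTED_I_LANGUAGES:
        # In Turkish, lowercase "i" uppercases to U+0130
        # (LATIN CAPITAL LETTER I WITH DOT ABOVE).
        s = s.replace("i", "\u0130")
    # Default (language-neutral) Unicode uppercasing for everything else.
    return s.upper()
```

With this sketch, uc("i", "de") and uc("i", "tr") return different strings, so something has to carry the language along.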
> language-dependent (sorting, for example), the operation doesn't depend
> on the language of the strings involved, but rather on the locale of
> the reader.
And if one is working with two different languages at a time?
>> With this string type how do we deal with anything beyond codepoints?
> Hmm, what do you mean?
"\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
"\N{LATIN CAPITAL LETTER A WITH ACUTE}"
when comparing graphemes or letters. The latter might depend on the
language too.
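
A small Python sketch of the comparison meant here. It uses NFC normalization as a stand-in for grapheme-level equality, which is an approximation; the helper name grapheme_eq is made up for illustration.

```python
import unicodedata

decomposed = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
precomposed = "\N{LATIN CAPITAL LETTER A WITH ACUTE}"

def grapheme_eq(a, b):
    # Compare after canonical composition, so that equivalent codepoint
    # sequences compare equal even though their codepoints differ.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

At the codepoint level the two strings differ (2 codepoints versus 1); after normalization they compare equal.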
We'll basically need 4 levels of string support:
,--[ Larry Wall ]--------------------------------------------------------
| level 0 byte == character, "use bytes" basically
| level 1 codepoint == character, what we seem to be aiming for, vaguely
| level 2 grapheme == character, what the user usually wants
| level 3 letter == character, what the "current language" wants
`------------------------------------------------------------------------
> or ask arbitrary strings what their N-th character
The N-th character depends on the level. In the example above, C<.length>
gives either 2 or 1, depending on whether the user queries at level 1 or
level 2. The same problem arises with positions. The current level also
depends on the scope where the string came from (see the example with the
Turkish letter "i").
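
As a rough illustration of how the length changes with the level, in Python. The function name length_at_level is hypothetical, level 0 assumes a UTF-8 byte representation, and level 2 only counts non-combining codepoints, which is a crude approximation of real grapheme clustering. Level 3 is left out because it needs language data.

```python
import unicodedata

def length_at_level(s, level):
    if level == 0:
        # byte == character (assumes UTF-8 storage for this sketch)
        return len(s.encode("utf-8"))
    if level == 1:
        # codepoint == character
        return len(s)
    if level == 2:
        # grapheme == character, approximated by skipping combining marks
        return sum(1 for ch in s if unicodedata.combining(ch) == 0)
    raise ValueError("level 3 (letters) needs language information")

s = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
```

For the string s above, the three levels give three different lengths.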
>> - What happened to external constant strings?
> They should still work (or could).
,--[ string.c:714 ]--------------------------------------------------
| else /* even if buffer is "external", we won't use it directly */
`--------------------------------------------------------------------
>> - What's the plan towards all the transcode opcodes? (And leaving these
>> as a noop would have been simpler)
> Basically there's no need for a transcode op on a string--it no longer
> makes sense, there's nothing to transcode.
I can't imagine that. I've an ASCII string and want to convert it to UTF8
and UTF16 and write it into a file. How do I do that?
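
For concreteness, this is the operation meant, sketched in Python rather than as Parrot ops. The function names are made up, and "utf-16-le" (no byte order mark) is an arbitrary choice for the example.

```python
import os
import tempfile

def transcode(raw, src, dst):
    # Decode the bytes from the source encoding, re-encode in the target.
    return raw.decode(src).encode(dst)

def write_transcoded(raw, src, dst, path):
    # Write the transcoded bytes to a file, untouched by any I/O layer.
    with open(path, "wb") as fh:
        fh.write(transcode(raw, src, dst))

def demo():
    # Write an ASCII string as UTF-8 and UTF-16 and read the bytes back.
    raw = b"abc"
    out = {}
    with tempfile.TemporaryDirectory() as tmp:
        for enc in ("utf-8", "utf-16-le"):
            path = os.path.join(tmp, "out." + enc)
            write_transcoded(raw, "ascii", enc, path)
            with open(path, "rb") as fh:
                out[enc] = fh.read()
    return out
```

Whether this lives in a transcode op or in the I/O layer, the conversion has to be reachable from somewhere.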
>> - hash_string seems not to deal with mixed encodings anymore.
> Yep, since we're hashing based on characters rather than bytes, there's
> no such thing as mixed encodings.
See above.
> Jeff
leo