Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 9, 2004, at 7:19 AM, Leopold Toetsch wrote:

> So internally, strings don't have an associated encoding (or chartype
> or anything)

How do you handle EBCDIC? UTF8 for Ponie?

>> - Where is string->language?

> I removed it from the string struct because I think that's the wrong
> place for it (and it wasn't actually being used anywhere yet,
> fortunately).

Not used *yet* - what about:

  use German;
  print uc("i");
  use Turkish;
  print uc("i");

> language-dependent (sorting, for example), the operation doesn't depend
> on the language of the strings involved, but rather on the locale of
> the reader.

And if one is working with two different languages at a time?

>> With this string type how do we deal with anything beyond codepoints?

> Hmm, what do you mean?

  "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}" eq
  "\N{LATIN CAPITAL LETTER A WITH ACUTE}"

when comparing graphemes or letters. The latter might depend on the
language too.

We'll basically need 4 levels of string support:

,--[ Larry Wall ]--------------------------------------------------------
| level 0    byte == character, "use bytes" basically
| level 1    codepoint == character, what we seem to be aiming for, vaguely
| level 2    grapheme == character, what the user usually wants
| level 3    letter == character, what the "current language" wants
`------------------------------------------------------------------------

> or ask arbitrary strings what their N-th character

The N-th character depends on the level. For the example above, C<.length>
gives either 2 or 1, depending on whether the user queries at level 1 or 2.
The same problem arises with positions.

The current level also depends on the scope the string came from
(see the example with the Turkish letter "i").

>> - What happened to external constant strings?

> They should still work (or could).

,--[ string.c:714 ]--------------------------------------------------
|   else /* even if buffer is "external", we won't use it directly */
`--------------------------------------------------------------------

>> - What's the plan towards all the transcode opcodes? (And leaving these
>>   as a noop would have been simpler)

> Basically there's no need for a transcode op on a string--it no longer
> makes sense, there's nothing to transcode.

I can't imagine that. I have an ASCII string and want to convert it to UTF8
and UTF16 and write it into a file. How do I do that?

>> - hash_string seems not to deal with mixed encodings anymore.

> Yep, since we're hashing based on characters rather than bytes, there's
> no such thing as mixed encodings.

See above.

> JEff

leo
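
To make the level distinction above concrete, here is a rough sketch in Python (not Parrot code). The helper names length_at_level, grapheme_equal and uc are hypothetical and exist only for this illustration. Level 2 is approximated by skipping combining marks, which is much cruder than real grapheme segmentation, and the Turkish rule is a single hard-coded special case rather than real language data.

```python
import unicodedata

def length_at_level(s, level, encoding="utf-8"):
    """Length of s at level 0 (bytes), 1 (code points) or 2 (approx. graphemes)."""
    if level == 0:
        # Byte length only exists relative to some encoding.
        return len(s.encode(encoding))
    if level == 1:
        # Python strings are sequences of code points.
        return len(s)
    if level == 2:
        # Assumption: a grapheme is a base character plus any combining marks
        # that follow it, so only non-combining code points are counted.
        return sum(1 for ch in s if unicodedata.combining(ch) == 0)
    raise ValueError("level 3 needs language data that this sketch does not model")

def grapheme_equal(a, b):
    """Compare after canonical composition (NFC) instead of code point by code point."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def uc(s, language):
    """Hypothetical language-aware uppercase; only Turkish 'i' is special-cased."""
    if language == "tr":
        s = s.replace("i", "\N{LATIN CAPITAL LETTER I WITH DOT ABOVE}")
    return s.upper()

decomposed = "\N{LATIN CAPITAL LETTER A}\N{COMBINING ACUTE ACCENT}"
composed = "\N{LATIN CAPITAL LETTER A WITH ACUTE}"

print(length_at_level(decomposed, 0))               # 3 bytes in UTF-8
print(length_at_level(decomposed, 0, "utf-16-le"))  # 4 bytes in UTF-16
print(length_at_level(decomposed, 1))               # 2 code points
print(length_at_level(decomposed, 2))               # 1 grapheme (approximate)
print(decomposed == composed)                       # False at level 1
print(grapheme_equal(decomposed, composed))         # True after NFC
print(uc("i", "de") == uc("i", "tr"))               # False: "I" vs. dotted capital I
```

The same two-codepoint string reports a length of 3, 4, 2 or 1 depending on the level and, at level 0, on the encoding, which is why a length or position query needs to know which level it is being asked at.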