At 12:01 PM 3/5/2001 -0800, Hong Zhang wrote:
> > struct perl_string {
> >     void *string_buffer;
> >     UV length;
> >     UV allocated;
> >     UV flags;
> > };
> >
> > The low three bits of the flags field are reserved for the type of the
> > string. The various types are:
> >
> > =over 4
> >
> > =item BINARY (0)
> >
> > =item ASCII (1)
> >
> > =item EBCDIC (2)
> >
> > =item UTF_8 (3)
> >
> > =item UTF_32 (4)
> >
> > =item NATIVE_1 (5) through NATIVE_3 (7)
>
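
To make the quoted layout concrete, here is a minimal sketch of reading and writing the type held in the low three bits of the flags field. The UV typedef, the PS_* names, the mask, and both helper functions are assumptions made for illustration; only the struct fields and the numeric type values come from the quoted proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t UV;  /* stand-in for perl's UV; an assumption */

struct perl_string {
    void *string_buffer;
    UV length;
    UV allocated;
    UV flags;
};

/* Type values as listed in the proposal; the PS_ names are hypothetical. */
enum perl_string_type {
    PS_BINARY = 0,
    PS_ASCII = 1,
    PS_EBCDIC = 2,
    PS_UTF_8 = 3,
    PS_UTF_32 = 4,
    PS_NATIVE_1 = 5,
    PS_NATIVE_2 = 6,
    PS_NATIVE_3 = 7
};

/* The low three bits of flags hold the string type. */
#define PS_TYPE_MASK ((UV)0x7)

/* Read the string type without disturbing the other flag bits. */
enum perl_string_type ps_get_type(const struct perl_string *s)
{
    assert(s != NULL);
    return (enum perl_string_type)(s->flags & PS_TYPE_MASK);
}

/* Replace the string type, leaving every higher flag bit as it was. */
void ps_set_type(struct perl_string *s, enum perl_string_type t)
{
    assert(s != NULL);
    s->flags = (s->flags & ~PS_TYPE_MASK) | ((UV)t & PS_TYPE_MASK);
}
```

Keeping the type behind a mask means the remaining bits of flags stay free for whatever else the proposal ends up putting there.
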
>Some thoughts about string encoding. Because of Unicode normalization
>and canonical equivalence, some characters that take one code point
>in one encoding may take two or more code points in another encoding,
>mainly vowels with diacritics. In that sense, substr() may give
>different results depending on the string's current encoding.
As would ord, potentially. And how substr returns its data is also open to
interpretation. (Should it return a single code point, or should it return
a full grapheme?)
>Here is an example: "résumé" takes 6 characters in Latin-1, but
>could take 8 characters in Unicode if each accent is stored as a
>separate combining mark. All Perl functions that deal directly with
>character position and length will be sensitive to encoding.
>I wonder how we should handle this case.
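
The 6-versus-8 difference, and the effect on ord, can be sketched in a few lines of C. The two byte strings below are "résumé" in UTF-8, once with precomposed U+00E9 and once as "e" followed by combining acute U+0301. The function and constant names are made up for this example, and the code assumes well-formed, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "résumé" with precomposed U+00E9 (two UTF-8 bytes each). */
const char RESUME_COMPOSED[] = "r\xC3\xA9sum\xC3\xA9";
/* "résumé" as plain "e" followed by combining acute U+0301. */
const char RESUME_DECOMPOSED[] = "re\xCC\x81sume\xCC\x81";

/* Count code points: every byte that is not a continuation byte
   (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Decode the code point at a given code point index, roughly what
   ord(substr($s, index, 1)) would see. Returns 0 past the end. */
uint32_t utf8_codepoint_at(const char *s, size_t index)
{
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int extra;

    assert(s != NULL);
    /* Skip `index` whole code points (lead byte plus continuations). */
    while (*p && index > 0) {
        p++;
        while ((*p & 0xC0) == 0x80)
            p++;
        index--;
    }
    if (*p == 0)
        return 0;

    /* The lead byte tells us how many continuation bytes follow. */
    if (*p < 0x80)                { cp = *p;        extra = 0; }
    else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
    else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
    else                          { cp = *p & 0x07; extra = 3; }

    while (extra-- > 0) {
        p++;
        cp = (cp << 6) | (uint32_t)(*p & 0x3F);
    }
    return cp;
}
```

With these, utf8_codepoints() gives 6 for the composed string and 8 for the decomposed one, and utf8_codepoint_at(..., 1) gives U+00E9 for the first but U+0065 for the second, so both length and ord depend on which form the data happens to be in.
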
My first inclination is to force normalization on any data we manipulate.
This looks like a legit thing to do per the Unicode standard, but we'd need
to be careful to not do it if we don't need to, both for speed and user
expectation reasons. ("while (<>){print}" should spit out exactly what it
got in if the input and output streams are both Unicode enabled.)
Of course that brings up the question of which normalization form--do we
decompose everything or combine everything? Combining is best for most of
perl's installed base, since generally they don't deal with text that
can't be combined into single characters. (I'm pretty sure all the European
languages can make do without combining characters, and I think katakana
and hiragana can as well.) Decomposition's a better general solution,
though it's more computationally expensive to process. (On the other hand,
you can do substitutions on combining characters or base characters, which
is kinda neat if not always useful.)
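
For a rough picture of what "combine everything" means, here is a toy composition pass over a UTF-32 buffer. It is not real Unicode normalization: the three-entry table is a hand-picked excerpt, and the code ignores combining classes, stacked marks, and Hangul, all of which a real implementation would have to drive from the Unicode data files. All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct compose_pair {
    uint32_t base;      /* base character */
    uint32_t mark;      /* combining mark that follows it */
    uint32_t composed;  /* precomposed equivalent */
};

/* Tiny illustrative excerpt of canonical compositions. */
static const struct compose_pair COMPOSE_TABLE[] = {
    { 0x0065, 0x0301, 0x00E9 },  /* e + combining acute     -> e-acute     */
    { 0x0061, 0x0300, 0x00E0 },  /* a + combining grave     -> a-grave     */
    { 0x0075, 0x0308, 0x00FC },  /* u + combining diaeresis -> u-diaeresis */
};

/* Compose base+mark pairs in place; returns the new length in code points. */
size_t compose_in_place(uint32_t *buf, size_t len)
{
    const size_t table_len = sizeof COMPOSE_TABLE / sizeof COMPOSE_TABLE[0];
    size_t out = 0;

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint32_t cp = buf[i];
        int merged = 0;

        /* Try to fold this code point into the previous output character. */
        if (out > 0) {
            for (size_t t = 0; t < table_len; t++) {
                if (buf[out - 1] == COMPOSE_TABLE[t].base &&
                    cp == COMPOSE_TABLE[t].mark) {
                    buf[out - 1] = COMPOSE_TABLE[t].composed;
                    merged = 1;
                    break;
                }
            }
        }
        if (!merged)
            buf[out++] = cp;
    }
    return out;
}
```

Run over the 8 code point decomposed "résumé", this yields the 6 code point composed form. Going the other way, decomposition is a table lookup per character that can only grow the buffer, which is one reason the choice of form matters for an in-place string representation.
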
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even
teddy bears get drunk