Re: PDD 4: Internal data types

Dan Sugalski Mon, 05 Mar 2001 13:56:14 -0800
At 12:01 PM 3/5/2001 -0800, Hong Zhang wrote:
> >    struct perl_string {
> >      void *string_buffer;
> >      UV length;
> >      UV allocated;
> >      UV flags;
> >    }
> >
> > The low three bits of the flags field are reserved for the type of the
> > string. The various types are:
> >
> > =over 4
> >
> > =item BINARY (0)
> >
> > =item ASCII (1)
> >
> > =item EBCDIC (2)
> >
> > =item UTF_8 (3)
> >
> > =item UTF_32 (4)
> >
> > =item NATIVE_1 (5) through NATIVE_3 (7)
>
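
For reference, the flags layout quoted above could be read and written roughly like this. This is only a sketch: the enum names, the mask macro, and the two helper functions are made up for illustration and are not part of the PDD, and UV is stood in for by unsigned long.

```c
#include <stddef.h>

/* Assumption: UV is Perl's unsigned integer type; unsigned long stands in. */
typedef unsigned long UV;

struct perl_string {
    void *string_buffer;
    UV length;
    UV allocated;
    UV flags;
};

/* Type codes as listed in the PDD excerpt; the names are hypothetical. */
enum perl_string_type {
    PERL_STR_BINARY   = 0,
    PERL_STR_ASCII    = 1,
    PERL_STR_EBCDIC   = 2,
    PERL_STR_UTF_8    = 3,
    PERL_STR_UTF_32   = 4,
    PERL_STR_NATIVE_1 = 5,
    PERL_STR_NATIVE_2 = 6,
    PERL_STR_NATIVE_3 = 7
};

/* The low three bits of flags hold the string type. */
#define PERL_STR_TYPE_MASK ((UV)0x7)

/* Extract the type from the low three bits. */
enum perl_string_type perl_string_type_of(const struct perl_string *s)
{
    return (enum perl_string_type)(s->flags & PERL_STR_TYPE_MASK);
}

/* Replace the type bits, leaving every other flag bit untouched. */
void perl_string_set_type(struct perl_string *s, enum perl_string_type t)
{
    s->flags = (s->flags & ~PERL_STR_TYPE_MASK) | ((UV)t & PERL_STR_TYPE_MASK);
}
```

Masking on write as well as on read keeps a type change from disturbing whatever the upper flag bits end up meaning.
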
>Some thoughts about string encoding. Because of Unicode normalization
>and canonical equivalence, some characters that take one code point
>in one encoding may take two or more code points in another encoding,
>mainly vowels with diacritics. In that sense, substr() may give
>different results depending on the string's current encoding.

As would ord, potentially. And how substr returns its data is also open to 
interpretation. (Should it return a single code point, or should it return 
a full grapheme?)

>Here is an example, "re`sume`" takes 6 characters in Latin-1, but
>could take 8 characters in Unicode. All Perl functions that directly
>deal with character position and length will be sensitive to encoding.
>I wonder how we should handle this case.
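
To make the 6-versus-8 difference concrete, here is a small sketch that counts code points in a UTF-8 string. It assumes the Unicode side is stored as valid UTF-8 and that the accents are written as U+0301 (combining acute accent) in the decomposed form; the function name is made up for illustration.

```c
#include <stddef.h>

/*
 * Count code points in a NUL-terminated UTF-8 string.
 * Every byte that is not a continuation byte (10xxxxxx) starts a new
 * code point. Assumes the input is valid UTF-8; no validation is done.
 */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With the precomposed spelling "r\xC3\xA9sum\xC3\xA9" (U+00E9 twice) this gives 6, and with the decomposed spelling "re\xCC\x81sume\xCC\x81" (e followed by U+0301, twice) it gives 8, even though both render as the same word.
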

My first inclination is to force normalization on any data we manipulate. 
This looks like a legit thing to do per the Unicode standard, but we'd need 
to be careful to not do it if we don't need to, both for speed and user 
expectation reasons. ("while (<>){print}" should spit out exactly what it 
got in if the input and output streams are both Unicode enabled.)

Of course that brings up the question of which normalization form--do we 
decompose everything or combine everything? Combining is best for most of 
perl's installed base since generally they don't deal with encodings that 
can't be combined into single characters. (I'm pretty sure all the European 
languages can make do without combining characters, and I think katakana 
and hiragana can as well.) Decomposition's a better general solution, 
though it's more computationally expensive to process. (On the other hand, 
you can do substitutions on combining characters or base characters, which 
is kinda neat if not always useful)
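
As a rough illustration of operating on combining characters separately from base characters, the sketch below drops the marks in the Combining Diacritical Marks block (U+0300 through U+036F) from a decomposed UTF-8 string. It assumes valid, already-decomposed UTF-8 input, covers only that one block, and the function name is hypothetical.

```c
#include <stddef.h>

/*
 * Copy src to dst, skipping code points U+0300..U+036F.
 * In UTF-8 these are the two-byte sequences 0xCC 0x80..0xBF
 * (U+0300..U+033F) and 0xCD 0x80..0xAF (U+0340..U+036F).
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the NUL.
 */
size_t strip_combining_marks(const char *src, char *dst)
{
    size_t out = 0;
    while (*src != '\0') {
        unsigned char b0 = (unsigned char)src[0];
        /* Safe to read: src[0] is not NUL, so src[1] is at worst the NUL. */
        unsigned char b1 = (unsigned char)src[1];
        int is_mark = (b0 == 0xCC && b1 >= 0x80 && b1 <= 0xBF) ||
                      (b0 == 0xCD && b1 >= 0x80 && b1 <= 0xAF);
        if (is_mark)
            src += 2;               /* skip the combining mark */
        else
            dst[out++] = *src++;    /* keep everything else byte for byte */
    }
    dst[out] = '\0';
    return out;
}
```

On precomposed input this does nothing useful, since the accent is baked into a single code point; with decomposed data the base letters and the marks can be handled independently.
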

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk