Re: Unicode handling

Dan Sugalski Fri, 23 Mar 2001 09:20:29 -0800
At 11:41 PM 3/22/2001 +0000, Nicholas Clark wrote:
>On Thu, Mar 22, 2001 at 04:10:28PM -0500, Dan Sugalski wrote:
> > 1) All Unicode data perl does regular expressions against will be in
> > Normalization Form C, except for...
> > 2) Regexes tagged to run against a decomposed form will instead be run
> > against data in Normalization Form D. (What the tag is at the perl level
> > is up for grabs. I'd personally choose a D suffix)
> > 3) Perl won't otherwise force any normalization on data already in Unicode
> > format.
>
>So if I understand that correctly, running a regexp against a scalar will
>cause that scalar to become normalized in a defined way (C or D, depending
>on regexp)

It will be run against a normalized version of the data in the scalar, yes. 
Whether that forces the scalar to be normalized or not is something I 
hadn't thought of. If we do, then something as simple as this:

   while (<IN>) {
         $count++ if /bar/;
         print OUT $_;
   }

would potentially result in the output file being rather different from the 
input file. Equivalent, yes, but different. Whether that's bad or not is an 
open question.
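
A minimal sketch of the distinction above, using today's Unicode::Normalize module rather than anything the proposal itself defines. The variable names are made up for illustration. It matches against a normalized copy, leaves the original alone, and then shows that a normalized copy is canonically equivalent to the input without being identical to it:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Decomposed input: "e" followed by COMBINING ACUTE ACCENT (U+0301).
my $line = "cafe\x{301} bar";

# Match against a normalized copy; $line itself is not modified.
my $count = 0;
$count++ if NFC($line) =~ /bar/;

# If the scalar were normalized in place, this is what would be written out.
my $normalized = NFC($line);

# Codepoint-for-codepoint the two strings differ...
my $identical = ($normalized eq $line) ? 1 : 0;

# ...but they are canonically equivalent: same result once both are in NFD.
my $equivalent = (NFD($normalized) eq NFD($line)) ? 1 : 0;
```

Printing $normalized instead of $line is the "equivalent, yes, but different" output file: one character shorter here, because e plus accent has been composed into a single codepoint.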

> > 5) Any character-based call (ord, substr, whatever) will deal with whatever
> > code-points are at the location specified. If the string is LATIN SMALL
> > LETTER A, COMBINING ACUTE ACCENT and someone does a substr($foo, 1, 1) on
> > it, you get back the single character COMBINING ACUTE ACCENT, and an ord
> > would return the value 769.
>
>So if you do (ord, substr, whatever) on a scalar without knowing where it has
>been, you have no idea whether you're working on normalised or not.

Potentially, yes. If it's important, you force normalization on it.
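
As a rough illustration of rule 5 and of forcing normalization by hand, here is a sketch that assumes the Unicode::Normalize module (not part of the proposal under discussion). COMBINING ACUTE ACCENT is U+0301, which is 769 in decimal:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# LATIN SMALL LETTER A followed by COMBINING ACUTE ACCENT.
my $foo = "a\x{301}";

# Character-based calls see the raw codepoints, whatever they are.
my $second = substr($foo, 1, 1);   # the combining accent alone
my $code   = ord($second);         # 769 (U+0301)

# Forcing Normalization Form C composes the pair into a single codepoint.
my $composed      = NFC($foo);
my $composed_len  = length($composed);   # 1
my $composed_code = ord($composed);      # 225 (U+00E1)
```

After NFC the same substr($foo, 1, 1) would return an empty string, which is why the result depends on where the scalar has been.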

>And in fact the same scalar may become denormalised:
>
>   $bar = substr $foo, 3, 1;
>   &frob ($foo);
>   $baz = substr $foo, 3, 1;
>
>[so $bar and $baz differ] if someone runs it against a regular expression
>[in this case inside the subroutine &frob. Hmm, but currently you can
>make changes to parameters as they are pass-by-reference]
>
>   $bar = substr $foo, 3, 1;
>   $foo =~ /foo/;        # This is not read only in perl6
>   $baz = substr $foo, 3, 1;
>
>But this is documented - if you want (ord, substr, whatever) on a string
>to make sense, you must explicitly normalize it to the form you want
>beforehand, and not use any of the documented-as-normalizing operators on it
>without normalizing it again.

It's generally safe to say that if you want data to make sense period, you 
need to make sure it's sensible first. Unicode with combining characters 
does tend to exacerbate things, but it's not a new problem.

>And by implication of the above (particularly rule 3), eq compares
>codepoints, not normalized forms.

I hadn't thought about eq, gt, or lt. (Or sort, for that matter)

eq probably ought to work against codepoints, and be done with it. 
gt/lt/sort should normalize and use the Unicode sorting stuff to determine 
where things stand. I don't think they should alter the data, as we may be 
promoting non-unicode data to unicode for comparisons. (If we're comparing 
ASCII and Unicode scalars, or even something odd like Shift-JIS and EBCDIC 
scalars)
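
A sketch of how that split looks with modules that exist today. It assumes Unicode::Normalize and Unicode::Collate, with the collator left at its default settings; neither is what the proposal specifies. eq compares codepoints, while the collator compares by the Unicode collation rules and works on its own copies, leaving the data unchanged:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Unicode::Collate;

my $composed   = "\x{e9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";   # e + COMBINING ACUTE ACCENT

# eq works on codepoints, so equivalent strings can compare unequal.
my $eq_raw  = ($composed eq $decomposed) ? 1 : 0;            # 0
my $eq_norm = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;  # 1

# The collator normalizes internally and returns 0 for equivalent strings.
my $collator = Unicode::Collate->new;
my $cmp = $collator->cmp($composed, $decomposed);            # 0

my @words = ("zebra", "\x{e9}clair", "apple");

# Plain sort orders by codepoint, so U+00E9 lands after "z".
my @by_codepoint = sort @words;

# Collation sort places the accented word with the other e-words.
my @by_collation = $collator->sort(@words);
```

Neither sort modifies the strings in @words; only the ordering of the returned list differs.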

                                        Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                         have teddy bears and even
                                      teddy bears get drunk